Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 212]
- cs.CV [Total: 244]
- cs.AI [Total: 79]
- cs.SD [Total: 23]
- cs.LG [Total: 296]
- cs.MA [Total: 9]
- cs.MM [Total: 3]
- eess.AS [Total: 21]
- eess.IV [Total: 14]
cs.CL
[1] A Novel Differential Feature Learning for Effective Hallucination Detection and Classification
Wenkai Wang, Vincent Lee, Yizhen Zheng
Main category: cs.CL
TL;DR: The paper proposes a dual-model architecture with Projected Fusion and Differential Feature Learning to detect LLM hallucinations, finding that hallucination signals are highly concentrated in sparse feature subsets, enabling efficient detection with minimal feature usage.
Details
Motivation: Large language model hallucination is a critical challenge where outputs deviate from factual accuracy due to training data biases. While prior work identified layer differences in hallucination signals, precise localization remains unclear, limiting efficient detection method development.
Method: Dual-model architecture integrating Projected Fusion block for adaptive inter-layer feature weighting and Differential Feature Learning mechanism that identifies discriminative features by computing differences between parallel encoders learning complementary representations from identical inputs.
Result: Achieved significant accuracy improvements on question answering and dialogue tasks. Analysis revealed a hierarchical ‘funnel pattern’ where shallow layers show high feature diversity while deep layers demonstrate concentrated usage, enabling detection with only 1% of feature dimensions with minimal performance degradation.
Conclusion: Hallucination signals are more concentrated than previously assumed, offering a pathway toward computationally efficient detection systems that could reduce inference costs while maintaining accuracy.
Abstract: Large language model hallucination represents a critical challenge where outputs deviate from factual accuracy due to distributional biases in training data. While recent investigations establish that specific hidden layers exhibit differences between hallucinatory and factual content, the precise localization of hallucination signals within layers remains unclear, limiting the development of efficient detection methods. We propose a dual-model architecture integrating a Projected Fusion (PF) block for adaptive inter-layer feature weighting and a Differential Feature Learning (DFL) mechanism that identifies discriminative features by computing differences between parallel encoders learning complementary representations from identical inputs. Through systematic experiments across HaluEval’s question answering, dialogue, and summarization datasets, we demonstrate that hallucination signals concentrate in highly sparse feature subsets, achieving significant accuracy improvements on question answering and dialogue tasks. Notably, our analysis reveals a hierarchical “funnel pattern” where shallow layers exhibit high feature diversity while deep layers demonstrate concentrated usage, enabling detection performance to be maintained with minimal degradation using only 1% of feature dimensions. These findings suggest that hallucination signals are more concentrated than previously assumed, offering a pathway toward computationally efficient detection systems that could reduce inference costs while maintaining accuracy.
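To make the core mechanism concrete, here is a minimal sketch of the differential-feature idea: two parallel encoders read the same hidden states, and a classifier consumes their feature difference. All names, shapes, and layer choices are illustrative assumptions, not the authors’ implementation (which also includes the Projected Fusion block for inter-layer weighting).

```python
import torch
import torch.nn as nn

class DifferentialDetector(nn.Module):
    """Toy sketch of Differential Feature Learning: two parallel encoders
    see identical inputs; their feature difference feeds the classifier."""
    def __init__(self, d_in: int = 768, d_feat: int = 256):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_in, d_feat), nn.GELU())
        self.enc_b = nn.Sequential(nn.Linear(d_in, d_feat), nn.GELU())
        self.classifier = nn.Linear(d_feat, 2)  # factual vs. hallucinated

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The encoders learn complementary views of the same input; their
        # difference isolates the sparse, discriminative features.
        diff = self.enc_a(h) - self.enc_b(h)
        return self.classifier(diff)

detector = DifferentialDetector()
hidden_states = torch.randn(4, 768)  # stand-in for pooled LLM hidden states
print(detector(hidden_states).shape)  # torch.Size([4, 2])
```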
[2] Influence Guided Context Selection for Effective Retrieval-Augmented Generation
Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, Linpeng Huang
Main category: cs.CL
TL;DR: The paper introduces Contextual Influence Value (CI value) to improve RAG by quantifying context quality through performance degradation when removing contexts, eliminating hyperparameter tuning by retaining only contexts with positive CI values.
Details
Motivation: Standard RAG suffers from poor-quality retrieved contexts containing irrelevant/noisy information, and existing context selection approaches show limited gains due to failure in holistically utilizing available information (query, context list, and generator).
Method: Reconceptualize context quality assessment as inference-time data valuation using CI value metric, develop parameterized surrogate model with hierarchical architecture capturing local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback.
Result: Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate significant outperformance over state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information.
Conclusion: The CI value approach provides a comprehensive solution for context quality assessment in RAG systems, eliminating complex hyperparameter tuning and achieving superior performance through holistic utilization of available information.
Abstract: Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge, but its effectiveness is compromised by poor-quality retrieved contexts containing irrelevant or noisy information. While existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics, they show limited gains over standard RAG. We attribute this limitation to their failure in holistically utilizing available information (query, context list, and generator) for comprehensive quality assessment. Inspired by recent advances in data selection, we reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list, effectively integrating query-aware relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI value eliminates complex selection hyperparameter tuning by simply retaining contexts with positive CI values. To address practical challenges of label dependency and computational overhead, we develop a parameterized surrogate model for CI value prediction during inference. The model employs a hierarchical architecture that captures both local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback. Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information. Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
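Because the CI value is a leave-one-out quantity, its oracle form fits in a few lines. The sketch below assumes a black-box score(query, contexts) that returns generator quality; the names are placeholders, and at inference the paper replaces this oracle with the trained surrogate model to avoid label dependency and repeated generator calls.

```python
def contextual_influence(score, query, contexts):
    """Oracle CI value of each context: the performance drop when that
    context is removed from the list (leave-one-out data valuation)."""
    full = score(query, contexts)
    return [full - score(query, contexts[:i] + contexts[i + 1:])
            for i in range(len(contexts))]

def select_contexts(score, query, contexts):
    # Selection rule from the paper: keep contexts with positive CI value,
    # which removes the need for a tuned top-k or threshold hyperparameter.
    ci = contextual_influence(score, query, contexts)
    return [c for c, v in zip(contexts, ci) if v > 0]
```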
[3] Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs
Norman Paulsen
Main category: cs.CL
TL;DR: The paper reveals that the maximum effective context window (MECW) of LLMs is significantly smaller than advertised maximum context windows (MCW), with models falling short by up to 99% and showing degradation in accuracy even with small context sizes.
Details
Motivation: To test the real-world effectiveness of LLM context windows and address the discrepancy between advertised maximum context window sizes and actual usable context capacity.
Method: Defined maximum effective context window concept, formulated testing methods for context window effectiveness across various sizes and problem types, and created standardized comparison methods to identify failure points.
Result: Found significant differences between reported MCW and MECW across multiple models, with most models showing severe accuracy degradation by 1000 tokens and some failing with as few as 100 tokens. MECW varies based on problem type.
Conclusion: Maximum effective context window is drastically different from maximum context window and shifts based on problem type, providing actionable insights for improving model accuracy and reducing hallucinations.
Abstract: Large language model (LLM) providers boast big numbers for maximum context window sizes. To test the real-world use of context windows, we 1) define a concept of maximum effective context window, 2) formulate a testing method of a context window’s effectiveness over various sizes and problem types, and 3) create a standardized way to compare model efficacy for increasingly larger context window sizes to find the point of failure. We collected hundreds of thousands of data points across several models and found significant differences between reported Maximum Context Window (MCW) size and Maximum Effective Context Window (MECW) size. Our findings show that the MECW is not only drastically different from the MCW but also shifts based on the problem type. A few top-of-the-line models in our test group failed with as little as 100 tokens in context; most had severe degradation in accuracy by 1000 tokens in context. All models fell far short of their Maximum Context Window by as much as 99 percent. Our data reveals that the Maximum Effective Context Window shifts based on the type of problem provided, offering clear and actionable insights into how to improve model accuracy and decrease model hallucination rates.
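As a mental model of the testing method, one can probe a fixed problem set at growing context sizes and record where accuracy collapses. Everything below is illustrative: the sizes, the 80% floor, and the run_task callable are assumptions, not the paper’s protocol.

```python
def find_mecw(run_task, sizes=(100, 1_000, 10_000, 100_000), floor=0.8):
    """Return the largest context size whose accuracy stays above `floor`.

    `run_task(n_tokens)` is assumed to evaluate the model on a fixed
    problem set embedded in an n-token context and return accuracy.
    """
    mecw = 0
    for n in sizes:
        acc = run_task(n)
        print(f"{n:>7} tokens -> accuracy {acc:.2%}")
        if acc < floor:
            break  # point of failure: effective window ends before n
        mecw = n
    return mecw
```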
[4] How Large Language Models Need Symbolism
Xiaotie Deng, Hanyu Li
Main category: cs.CL
TL;DR: AI needs human-crafted symbols as a compass to guide LLMs beyond scaling for genuine discovery.
Details
Motivation: Current AI development focuses too much on scaling, but this alone cannot unlock genuine discovery. LLMs have powerful intuition but lack direction.
Method: Propose using human-crafted symbols as a compass to guide large language models’ intuition.
Result: Not specified in the abstract; this is a position paper proposing a conceptual framework.
Conclusion: The future of AI requires more than scaling; it needs human-crafted symbols to provide direction and enable genuine discovery by guiding LLMs’ powerful but blind intuition.
Abstract: We argue that AI’s future requires more than scaling. To unlock genuine discovery, large language models need a compass: human-crafted symbols to guide their powerful but blind intuition.
[5] One Model, Many Morals: Uncovering Cross-Linguistic Misalignments in Computational Moral Reasoning
Sualeha Farid, Jayden Lin, Zean Chen, Shivani Kumar, David Jurgens
Main category: cs.CL
TL;DR: LLMs show inconsistent moral judgments across different languages, revealing cultural misalignment and highlighting the need for more culturally-aware AI systems.
Details
Motivation: To investigate how language mediates moral decision-making in LLMs and understand their ability to generalize moral reasoning across diverse linguistic and cultural contexts, given their predominant English-language pretraining.
Method: Translated two established moral reasoning benchmarks into five culturally and typologically diverse languages for multilingual zero-shot evaluation, combined with carefully constructed research questions to analyze underlying drivers of moral judgment disparities.
Result: Revealed significant inconsistencies in LLMs’ moral judgments across languages, often reflecting cultural misalignment, and identified various drivers including disagreements and reasoning strategy differences.
Conclusion: Developed a structured typology of moral reasoning errors that emphasizes the need for more culturally-aware AI systems to address cross-lingual moral judgment inconsistencies.
Abstract: Large Language Models (LLMs) are increasingly deployed in multilingual and multicultural environments where moral reasoning is essential for generating ethically appropriate responses. Yet, the dominant pretraining of LLMs on English-language data raises critical concerns about their ability to generalize judgments across diverse linguistic and cultural contexts. In this work, we systematically investigate how language mediates moral decision-making in LLMs. We translate two established moral reasoning benchmarks into five culturally and typologically diverse languages, enabling multilingual zero-shot evaluation. Our analysis reveals significant inconsistencies in LLMs’ moral judgments across languages, often reflecting cultural misalignment. Through a combination of carefully constructed research questions, we uncover the underlying drivers of these disparities, ranging from disagreements to reasoning strategies employed by LLMs. Finally, through a case study, we link the role of pretraining data in shaping an LLM’s moral compass. Through this work, we distill our insights into a structured typology of moral reasoning errors that calls for more culturally-aware AI.
[6] LLM-Based Support for Diabetes Diagnosis: Opportunities, Scenarios, and Challenges with GPT-5
Gaurav Kumar Gupta, Nirajan Acharya, Pranal Pande
Main category: cs.CL
TL;DR: GPT-5 shows strong performance in diabetes diagnosis and management across five clinical scenarios, aligning well with ADA standards and demonstrating potential as a dual-purpose tool for clinicians and patients.
Details
Motivation: Diabetes affects over half a billion people globally with rising prevalence, but early recognition remains challenging due to vague symptoms, borderline lab values, gestational complexity, and long-term monitoring demands. LLMs offer opportunities to enhance decision support with structured, interpretable outputs.
Method: Evaluated GPT-5 using a simulation framework with synthetic cases aligned with ADA Standards of Care 2025, inspired by public datasets (NHANES, Pima Indians, EyePACS, MIMIC-IV). Tested five scenarios: symptom recognition, lab interpretation, gestational diabetes screening, remote monitoring, and multimodal complication detection.
Result: GPT-5 showed strong alignment with ADA-defined criteria across all tested scenarios. It successfully classified cases, generated clinical rationales, produced patient explanations, and output structured JSON summaries.
Conclusion: GPT-5 may function as a dual-purpose tool for clinicians and patients in diabetes care, while underscoring the importance of reproducible evaluation frameworks for responsibly assessing LLMs in healthcare.
Abstract: Diabetes mellitus is a major global health challenge, affecting over half a billion adults worldwide with prevalence projected to rise. Although the American Diabetes Association (ADA) provides clear diagnostic thresholds, early recognition remains difficult due to vague symptoms, borderline laboratory values, gestational complexity, and the demands of long-term monitoring. Advances in large language models (LLMs) offer opportunities to enhance decision support through structured, interpretable, and patient-friendly outputs. This study evaluates GPT-5, the latest generative pre-trained transformer, using a simulation framework built entirely on synthetic cases aligned with ADA Standards of Care 2025 and inspired by public datasets including NHANES, Pima Indians, EyePACS, and MIMIC-IV. Five representative scenarios were tested: symptom recognition, laboratory interpretation, gestational diabetes screening, remote monitoring, and multimodal complication detection. For each, GPT-5 classified cases, generated clinical rationales, produced patient explanations, and output structured JSON summaries. Results showed strong alignment with ADA-defined criteria, suggesting GPT-5 may function as a dual-purpose tool for clinicians and patients, while underscoring the importance of reproducible evaluation frameworks for responsibly assessing LLMs in healthcare.
[7] Multi-Objective Reinforcement Learning for Large Language Model Optimization: Visionary Perspective
Lingxiao Kong, Cong Yang, Oya Deniz Beyan, Zeyd Boukhers
Main category: cs.CL
TL;DR: This paper presents a comprehensive analysis of Multi-Objective Reinforcement Learning (MORL) for Large Language Model optimization, proposing a taxonomy, benchmarking framework, and future research directions focusing on meta-policy approaches.
Details
Motivation: To address the challenges and opportunities in optimizing multiple objectives in Large Language Models (LLMs) using MORL, recognizing the need for efficient and flexible approaches that support personalization and handle LLM complexities.
Method: Introduce a MORL taxonomy, examine advantages and limitations of various MORL methods for LLM optimization, and propose a MORL benchmarking framework to evaluate different methods’ effects on diverse objective relationships.
Result: The paper identifies the need for improved MORL approaches for LLMs and proposes a vision for future research focusing on meta-policy MORL development with bi-level learning paradigms to enhance efficiency and flexibility.
Conclusion: Developing meta-policy MORL approaches that improve LLM performance through bi-level learning is the most promising direction, with key research questions and potential solutions identified for advancing MORL applications in LLM optimization.
Abstract: Multi-Objective Reinforcement Learning (MORL) presents significant challenges and opportunities for optimizing multiple objectives in Large Language Models (LLMs). We introduce a MORL taxonomy and examine the advantages and limitations of various MORL methods when applied to LLM optimization, identifying the need for efficient and flexible approaches that accommodate personalization functionality and inherent complexities in LLMs and RL. We propose a vision for a MORL benchmarking framework that addresses the effects of different methods on diverse objective relationships. As future research directions, we focus on meta-policy MORL development that can improve efficiency and flexibility through its bi-level learning paradigm, highlighting key research questions and potential solutions for improving LLM performance.
[8] Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes
Guangliang Liu, Bocheng Chen, Xitong Zhang, Kristen Marie Johnson
Main category: cs.CL
TL;DR: Current fairness objectives for mitigating gender stereotypes in language models cause excessive overall forgetting that degrades downstream task performance, and standard forgetting mitigation techniques are ineffective.
Details
Motivation: To understand why moral alignment through fairness objectives degrades downstream task performance when mitigating gender stereotypes in pretrained language models.
Method: Analyzed the mechanisms of performance trade-off through the lens of forgetting and fairness objectives, examining how selective forgetting of stereotypes affects overall forgetting levels.
Result: Found that downstream performance is driven by overall forgetting level, selective stereotype forgetting increases overall forgetting, and general forgetting mitigation solutions are ineffective.
Conclusion: Current fairness objectives have limitations in achieving performance trade-offs due to their tendency to cause excessive overall forgetting that cannot be effectively mitigated.
Abstract: Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning or model editing on curated datasets. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget stereotypical knowledge through carefully designed fairness objectives, while preserving their helpfulness. In this short paper, we investigate the underlying mechanisms of the performance trade-off in the context of mitigating gender stereotypes, through the lens of forgetting and the fairness objective. Our analysis reveals the limitations of current fairness objective in achieving trade-off by demonstrating that: (1) downstream task performance is primarily driven by the overall forgetting level; (2) selective forgetting of stereotypes tends to increase overall forgetting; and (3) general solutions for mitigating forgetting are ineffective at reducing overall forgetting and fail to improve downstream task performance.
[9] A State-of-the-Art SQL Reasoning Model using RLVR
Alnur Ali, Ashutosh Baheti, Jonathan Chang, Ta-Chung Chi, Brandon Cui, Andrew Drozdov, Jonathan Frankle, Abhay Gupta, Pallavi Koppol, Sean Kulinski, Jonathan Li, Dipendra Misra, Krista Opsahl-Ong, Jose Javier Gonzalez Ortiz, Matei Zaharia, Yue Zhang
Main category: cs.CL
TL;DR: The paper presents a Reinforcement Learning with Verifiable Rewards (RLVR) approach for enterprise AI, achieving state-of-the-art results on the BIRD benchmark for SQL generation from natural language queries using a simple training recipe.
Details
Motivation: To develop custom reasoning models that can incorporate organization-specific knowledge for enterprise problems, particularly focusing on RL with Verifiable Rewards (RLVR) settings where reward functions are verifiable.
Method: A simple training recipe involving careful prompt and model selection, warm-up using offline RL approach called TAO, followed by rigorous online RLVR training. No additional training data beyond BIRD training set and no proprietary models used.
Result: Achieved state-of-the-art accuracy on BIRD private test set: 73.56% without self-consistency and 75.68% with self-consistency. Required fewer generations than the second-best approach.
Conclusion: The simplicity of the RLVR framework makes it broadly applicable to enterprise domains like business intelligence, data science, and coding, demonstrating strong potential for custom reasoning models in enterprise settings.
Abstract: Developing custom reasoning models via Reinforcement Learning (RL) that can incorporate organization-specific knowledge has great potential to address problems faced by enterprise customers. In many of these problems, the reward function is verifiable, a setting termed RL with Verifiable Rewards (RLVR). We apply RLVR to a popular data science benchmark called BIRD that measures the ability of an AI agent to convert a natural language query for a database to SQL executions. We apply a simple and general-purpose training recipe involving careful prompt and model selection, a warm-up stage using our offline RL approach called TAO, followed by rigorous online RLVR training. With no additional training data beyond the BIRD training set and no use of proprietary models, our very first submission to the BIRD leaderboard reached state-of-the-art accuracy on the private test set: 73.56% without self-consistency and 75.68% with self-consistency. In the latter case, our model also required fewer generations than the second-best approach. While BIRD is only a proxy task, the simplicity of our framework makes it broadly applicable to enterprise domains such as business intelligence, data science, and coding.
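What makes BIRD amenable to RLVR is that the reward can be checked by execution: run the predicted SQL and the reference SQL and compare result sets. A minimal execution-match reward using Python’s built-in sqlite3 (an illustrative stand-in; the official BIRD evaluator differs in detail):

```python
import sqlite3

def execution_reward(db_path: str, pred_sql: str, gold_sql: str) -> float:
    """Verifiable reward: 1.0 iff the predicted SQL returns the same rows
    as the reference SQL (order-insensitive), else 0.0."""
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(pred_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # malformed or failing SQL earns no reward
    finally:
        conn.close()
    return float(sorted(map(repr, pred)) == sorted(map(repr, gold)))
```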
[10] Learning to Reason with Mixture of Tokens
Adit Jain, Brendan Rappazzo
Main category: cs.CL
TL;DR: RLVR methods for LLM reasoning discard distributional token information. MoT-G preserves this by operating in continuous mixture space, achieving 5-35% gains on reasoning tasks with better efficiency.
Details
Motivation: Current RLVR methods sample discrete tokens, losing the rich distributional information in the model’s probability distribution over tokens and unnecessarily constraining the reasoning search space.
Method: Propose mixture-of-token generation (MoT-G) in RLVR, extending it to operate directly in continuous mixture space for chain-of-thought generation, including training-free methods using weighted sums over token embeddings.
Result: MoT-G methods achieve 5-35% gains on 7 out of 10 reasoning tasks compared to standard decoding, reaching comparable accuracy with half the trajectories, suggesting improved training efficiency.
Conclusion: MoT-G’s benefits stem from maintaining higher hidden-state entropy and promoting exploration in token space, demonstrating the value of preserving distributional information in RLVR.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading approach for improving large language model (LLM) reasoning capabilities. Most current methods follow variants of Group Relative Policy Optimization, which samples multiple reasoning completions, scores them relative to each other, and adjusts the policy accordingly. However, these approaches invariably sample discrete tokens at each reasoning step, discarding the rich distributional information in the model’s probability distribution over candidate tokens. While preserving and utilizing this distributional information has proven beneficial in non-RL settings, current RLVR methods seem to be unnecessarily constraining the reasoning search space by not using this information. To address this limitation, we investigate mixture-of-token generation (MoT-G) in RLVR. We present a unified framework that generalizes existing MoT-G approaches, including existing training-free methods that construct mixture embeddings as weighted sums over token embeddings, and extend RLVR to operate directly in this continuous mixture space for generating chain-of-thought. Evaluating two MoT-G variants on Reasoning-Gym, a suite of reasoning-intensive language tasks, we find that MoT-G methods achieve substantial improvements (5-35% gains on 7 out of 10 tasks) compared to standard decoding with the Qwen2.5-1.5B model, while reaching comparable accuracy with half the number of trajectories, suggesting improved training efficiency. Through comprehensive hidden-state and token-level analyses, we provide evidence that MoT-G’s benefits may stem from its ability to maintain higher hidden-state entropy throughout the reasoning process and promote exploration in token space.
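The training-free variant mentioned in the abstract reduces to a probability-weighted sum over the embedding table: instead of feeding back one sampled token’s embedding, feed back the expected embedding under the full distribution. A numpy sketch with toy sizes (all names hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab, d_model = 50_000, 1_024
rng = np.random.default_rng(0)
embed = rng.normal(size=(vocab, d_model))  # token embedding table
logits = rng.normal(size=vocab)            # model's next-token logits

# Standard decoding: feed back the embedding of ONE chosen token.
hard = embed[int(np.argmax(logits))]

# Mixture-of-token generation: feed back the distribution's expected
# embedding, preserving the distributional information RLVR discards.
soft = softmax(logits) @ embed             # (d_model,) mixture embedding
print(hard.shape, soft.shape)
```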
[11] Following the TRACE: A Structured Path to Empathetic Response Generation with Multi-Agent Models
Ziqi Liu, Ziyang Zhou, Yilin Li, Haiyang Zhang, Yangbin Chen
Main category: cs.CL
TL;DR: TRACE is a novel framework for empathetic response generation that decomposes the task into analysis and synthesis phases, combining deep understanding with expressive generation to outperform existing methods.
Details
Motivation: Address the trade-off between analytical depth of specialized models and generative fluency of LLMs in empathetic response generation by modeling empathy as a structured cognitive process.
Method: Task-decomposed reasoning framework that breaks empathy into a pipeline for analysis (building comprehensive understanding) and synthesis (generation), uniting deep analysis with expressive generation.
Result: Significantly outperforms strong baselines in both automatic and LLM-based evaluations, demonstrating the effectiveness of structured decomposition for empathetic agents.
Conclusion: Structured decomposition is a promising paradigm for creating more capable and interpretable empathetic agents, successfully bridging the gap between analytical depth and generative fluency.
Abstract: Empathetic response generation is a crucial task for creating more human-like and supportive conversational agents. However, existing methods face a core trade-off between the analytical depth of specialized models and the generative fluency of Large Language Models (LLMs). To address this, we propose TRACE, Task-decomposed Reasoning for Affective Communication and Empathy, a novel framework that models empathy as a structured cognitive process by decomposing the task into a pipeline for analysis and synthesis. By building a comprehensive understanding before generation, TRACE unites deep analysis with expressive generation. Experimental results show that our framework significantly outperforms strong baselines in both automatic and LLM-based evaluations, confirming that our structured decomposition is a promising paradigm for creating more capable and interpretable empathetic agents. Our code is available at https://anonymous.4open.science/r/TRACE-18EF/README.md.
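Concretely, the analysis-then-synthesis decomposition can be pictured as a two-stage prompt pipeline: an analysis pass builds a structured reading of the dialogue, and a synthesis pass generates the reply conditioned on it. The sketch below uses a placeholder llm callable and invented prompts; it shows the shape of the pipeline, not the paper’s exact stages.

```python
def trace_respond(llm, dialogue: str) -> str:
    # Stage 1 (analysis): build a comprehensive understanding first.
    analysis = llm(
        "Analyze this dialogue. Identify the speaker's emotion, its likely "
        f"cause, and the kind of support they need.\n\n{dialogue}"
    )
    # Stage 2 (synthesis): generate the reply conditioned on the analysis,
    # uniting analytical depth with the LLM's generative fluency.
    return llm(
        f"Dialogue:\n{dialogue}\n\nAnalysis:\n{analysis}\n\n"
        "Write an empathetic response consistent with the analysis."
    )
```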
[12] Dual-Head Reasoning Distillation: Improving Classifier Accuracy with Train-Time-Only Reasoning
Jillian Xu, Dylan Zhou, Vinay Shukla, Yang Yang, Junrui Ruan, Shuhuai Lin, Wenfei Zou, Yinxiao Liu, Karthik Lakshmanan
Main category: cs.CL
TL;DR: Dual-Head Reasoning Distillation (DHRD) improves classification accuracy like Chain-of-Thought prompting but eliminates the throughput penalty by using a dual-head architecture where only the classification head is active during inference.
Details
Motivation: To resolve the trade-off between improved accuracy from Chain-of-Thought prompting and its significant throughput penalty due to rationale generation.
Method: Adds a pooled classification head and a reasoning head supervised by teacher rationales, trained with weighted sum of label cross-entropy and token-level LM loss over input-plus-rationale sequences, then disables reasoning head at test time.
Result: Achieves 0.65-5.47% relative gains over pooled baselines on seven SuperGLUE tasks, with larger gains on entailment/causal tasks, while matching pooled classifier throughput and exceeding CoT decoding by 96-142 times in QPS.
Conclusion: DHRD successfully decouples reasoning benefits from inference costs, enabling accuracy improvements without throughput penalties.
Abstract: Chain-of-Thought (CoT) prompting often improves classification accuracy, but it introduces a significant throughput penalty with rationale generation (Wei et al., 2022; Cheng and Van Durme, 2024). To resolve this trade-off, we introduce Dual-Head Reasoning Distillation (DHRD), a simple training method for decoder-only language models (LMs) that adds (i) a pooled classification head used during training and inference and (ii) a reasoning head supervised by teacher rationales used only in training. We train with a loss function that is a weighted sum of label cross-entropy and token-level LM loss over input-plus-rationale sequences. On seven SuperGLUE tasks, DHRD yields relative gains of 0.65-5.47% over pooled baselines, with notably larger gains on entailment/causal tasks. Since we disable the reasoning head at test time, inference throughput matches pooled classifiers and exceeds CoT decoding on the same backbones by 96-142 times in QPS.
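The DHRD objective is exactly the weighted sum described above; at test time the reasoning head is never called, so the decoding cost disappears. A condensed PyTorch sketch of the loss under assumed shapes (alpha and the -100 masking convention are assumptions, not the paper’s settings):

```python
import torch.nn.functional as F

def dhrd_loss(cls_logits, labels, lm_logits, rationale_ids, alpha=0.5):
    """Weighted sum of label cross-entropy (pooled classification head)
    and token-level LM loss on teacher rationales (reasoning head)."""
    cls_loss = F.cross_entropy(cls_logits, labels)  # (B, C) vs (B,)
    lm_loss = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)),     # (B*T, vocab)
        rationale_ids.view(-1),                     # (B*T,)
        ignore_index=-100,  # mask prompt/padding positions
    )
    # At inference only cls_logits are computed, so throughput matches a
    # plain pooled classifier: no rationale tokens are ever generated.
    return alpha * cls_loss + (1 - alpha) * lm_loss
```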
[13] On Code-Induced Reasoning in LLMs
Abdul Waheed, Zhen Wu, Carolyn Rosé, Daphne Ippolito
Main category: cs.CL
TL;DR: LLMs are more vulnerable to structural than semantic code perturbations, with pseudocode/flowcharts being as effective as actual code, and corrupted code with surface regularities remaining competitive.
Details
Motivation: To understand which aspects of code most enhance LLM reasoning capabilities, as it remains unclear despite evidence that code data improves reasoning.
Method: Systematic framework using parallel instruction datasets in 10 programming languages with controlled perturbations disrupting structural or semantic properties, finetuning LLMs from 5 families and 8 scales, and evaluating them on natural language, math, and code tasks across 3,331 experiments.
Result: Structural perturbations hurt performance more than semantic ones, especially on math/code tasks; pseudocode/flowcharts work as well as code; corrupted code with surface regularities remains competitive; different languages favor different tasks (Python for natural language, Java/Rust for math).
Conclusion: Different code properties influence reasoning in distinct ways, providing insights for designing training data to enhance LLM reasoning capabilities.
Abstract: Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We construct parallel instruction datasets in ten programming languages and apply controlled perturbations that selectively disrupt structural or semantic properties of code. We then finetune LLMs from five model families and eight scales on each variant and evaluate their performance on natural language, math, and code tasks. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones, particularly on math and code tasks. Appropriate abstractions like pseudocode and flowcharts can be as effective as code, while encoding the same information with fewer tokens without adhering to original syntax can often retain or even improve performance. Remarkably, even corrupted code with misleading signals remains competitive when surface-level regularities persist. Finally, syntactic styles also shape task-specific gains with Python favoring natural language reasoning and lower-level languages such as Java and Rust favoring math. Through our systematic framework, we aim to provide insight into how different properties of code influence reasoning and inform the design of training data for enhancing LLM reasoning capabilities.
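Two toy perturbations illustrate the structural/semantic contrast the paper studies: shuffling lines destroys control flow and data flow (structural), while renaming identifiers keeps the structure intact but strips surface semantics. These are simplified stand-ins for the paper’s controlled perturbations.

```python
import random

CODE = """def area(width, height):
    result = width * height
    return result"""

def structural_perturb(src: str, seed: int = 0) -> str:
    # Shuffle lines: syntax and execution order no longer make sense.
    lines = src.splitlines()
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)

def semantic_perturb(src: str) -> str:
    # Rename identifiers to opaque tokens: structure survives,
    # human-meaningful semantics do not.
    for old, new in [("area", "f"), ("width", "a"),
                     ("height", "b"), ("result", "c")]:
        src = src.replace(old, new)
    return src

print(structural_perturb(CODE), semantic_perturb(CODE), sep="\n---\n")
```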
[14] Agribot: agriculture-specific question answer system
Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande, Jayesh Choudhari
Main category: cs.CL
TL;DR: An agricultural chatbot for Indian farmers using Kisan Call Center data, achieving 86% accuracy with entity extraction and synonym elimination.
Details
Motivation: India's agro-based economy requires proper agricultural information for optimal growth. Farmers need accessible answers to queries about weather, market rates, plant protection, and government schemes.
Method: Built a chatbot using a sentence embedding model over the Kisan Call Center dataset, then incorporated entity extraction and eliminated synonyms to improve accuracy.
Result: Initial accuracy of 56% with sentence embedding model. After improvements, accuracy increased to 86%. System provides 24/7 access through any electronic device.
Conclusion: The chatbot enables easier access to farming information, improves agricultural output, and reduces workload for call center staff by redirecting their efforts to more valuable tasks.
Abstract: India is an agro-based economy and proper information about agricultural practices is the key to optimal agricultural growth and output. In order to answer the queries of the farmer, we have built an agricultural chatbot based on the dataset from the Kisan Call Center. This system is robust enough to answer queries related to weather, market rates, plant protection and government schemes. This system is available 24/7, can be accessed through any electronic device, and the information is delivered in an easy-to-understand form. The system is based on a sentence embedding model which gives an accuracy of 56%. After eliminating synonyms and incorporating entity extraction, the accuracy jumps to 86%. With such a system, farmers can gain easier access to information about farming-related practices and hence a better agricultural output. The job of the Call Center workforce would be made easier, and their efforts can be redirected toward more valuable goals.
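At its core, such a system is embedding-based retrieval: embed the farmer’s question, find the nearest Kisan Call Center question, and return its stored answer. A minimal numpy sketch with a placeholder embed function; the reported jump from 56% to 86% came from layering entity extraction and synonym elimination on top of this.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def answer_query(query, embed, kcc_answers, kcc_vectors):
    """Return the stored KCC answer whose question embeds closest to the
    query. `embed` is any sentence-embedding function (placeholder)."""
    q = embed(query)
    best = int(np.argmax([cosine(q, v) for v in kcc_vectors]))
    return kcc_answers[best]
```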
[15] Domain-Aware Speaker Diarization On African-Accented English
Chibuzor Okocha, Kelechi Ezema, Christan Grant
Main category: cs.CL
TL;DR: This study examines domain effects in speaker diarization for African-accented English, finding consistent performance penalties for clinical speech across models due to short turns and frequent overlap.
Details
Motivation: To investigate domain effects in speaker diarization specifically for African-accented English, evaluating performance differences between general and clinical dialogues under strict protocols.
Method: Evaluated multiple production and open systems using strict DER protocol with overlap scoring. Conducted error analysis and tested lightweight domain adaptation by fine-tuning segmentation module on accent-matched data.
Result: Consistent domain penalty appears for clinical speech across all models, attributed to false alarms and missed detections from short turns and frequent overlap. Domain adaptation reduces error but doesn’t eliminate the performance gap.
Conclusion: Results point to overlap-aware segmentation and balanced clinical resources as practical next steps for improving speaker diarization in clinical domains.
Abstract: This study examines domain effects in speaker diarization for African-accented English. We evaluate multiple production and open systems on general and clinical dialogues under a strict DER protocol that scores overlap. A consistent domain penalty appears for clinical speech and remains significant across models. Error analysis attributes much of this penalty to false alarms and missed detections, aligning with short turns and frequent overlap. We test lightweight domain adaptation by fine-tuning a segmentation module on accent-matched data; it reduces error but does not eliminate the gap. Our contributions include a controlled benchmark across domains, a concise approach to error decomposition and conversation-level profiling, and an adaptation recipe that is easy to reproduce. Results point to overlap-aware segmentation and balanced clinical resources as practical next steps.
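For reference, the DER metric behind these comparisons is a simple ratio; under the strict protocol, overlapped speech is scored rather than excluded, which is exactly where short clinical turns hurt. A direct implementation of the standard definition:

```python
def diarization_error_rate(false_alarm: float, missed: float,
                           confusion: float, total_speech: float) -> float:
    """DER = (false alarm + missed detection + speaker confusion) time,
    divided by total reference speech time (all durations in seconds)."""
    return (false_alarm + missed + confusion) / total_speech

# The paper attributes most of the clinical-domain penalty to the first
# two terms, driven by short turns and frequent overlap.
print(diarization_error_rate(3.2, 5.1, 1.0, 60.0))  # 0.155
```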
[16] Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution
Yash Saxena, Raviteja Bommireddy, Ankur Padia, Manas Gaur
Main category: cs.CL
TL;DR: The paper compares two citation paradigms for LLMs: Generation-Time Citation (G-Cite) that produces answers and citations together, and Post-hoc Citation (P-Cite) that adds citations after drafting. P-Cite achieves better coverage with competitive correctness, while G-Cite prioritizes precision but sacrifices coverage and speed.
Details
Motivation: Trustworthy LLMs must cite verifiable sources in high-stakes domains like healthcare and law where errors have severe consequences. Practitioners need guidance on whether to generate citations during decoding or add them after drafting.
Method: Comprehensive evaluation of both citation paradigms from zero-shot to advanced retrieval-augmented methods across four attribution datasets, analyzing trade-offs between coverage and citation correctness.
Result: P-Cite methods achieve high coverage with competitive correctness and moderate latency, while G-Cite methods prioritize precision at the cost of coverage and speed. Retrieval is the main driver of attribution quality in both paradigms.
Conclusion: Recommend retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings like strict claim verification.
Abstract: Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or let models draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our codes and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/
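Schematically, the two paradigms differ only in where retrieval and citation sit relative to drafting. The sketch below uses placeholder llm and retrieve callables and naive sentence splitting; it shows the pipeline shapes, not the paper’s implementations.

```python
def g_cite(llm, retrieve, question):
    # Generation-Time Citation: answer and citations in one pass.
    docs = retrieve(question)
    return llm(f"Sources: {docs}\nQuestion: {question}\n"
               "Answer with an inline citation after every claim.")

def p_cite(llm, retrieve, question):
    # Post-hoc Citation: draft first, then attach a citation per claim.
    draft = llm(f"Answer the question: {question}")
    cited = []
    for claim in draft.split(". "):  # naive claim segmentation
        docs = retrieve(claim)       # retrieval drives attribution quality
        cited.append(f"{claim} [{docs[0]}]" if docs else claim)
    return ". ".join(cited)
```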
[17] Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
Zhen Xiong, Yujun Cai, Zhecheng Li, Junsong Yuan, Yiwei Wang
Main category: cs.CL
TL;DR: Thinking-with-Sound (TwS) framework equips Large Audio-Language Models with Audio Chain-of-Thought reasoning to handle complex acoustic scenarios using acoustic tools like noise suppression and source separation.
Details
Motivation: Current Large Audio-Language Models (LALMs) perform well on basic audio understanding tasks but struggle with challenging audio reasoning in complex acoustic environments, lacking access to essential acoustic tools.
Method: TwS combines linguistic reasoning with on-the-fly audio-domain analysis, enabling models to actively think with audio signals through numerical analysis and digital manipulation via multimodal reasoning.
Result: On the MELD-Hard1k benchmark with acoustic perturbations, TwS achieved substantial robustness improvements: small models gained 24.73% absolute accuracy, with improvements scaling up to 36.61% for larger models, while baseline LALMs suffered over 50% performance degradation.
Conclusion: Audio Chain-of-Thought reasoning can significantly enhance audio model robustness without retraining, opening new directions for developing more robust audio understanding systems.
Abstract: Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain 24.73% absolute accuracy, with improvements scaling consistently up to 36.61% for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.
[18] Comparative Personalization for Multi-document Summarization
Haoyuan Li, Snigdha Chaturvedi
Main category: cs.CL
TL;DR: ComPSum is a personalized multi-document summarization framework that identifies fine-grained user preference differences through comparative analysis and uses structured user analysis to guide personalized summary generation.
Details
Motivation: To meet individual user preferences for writing style and content focus in summaries by identifying fine-grained differences between users' preferences through comparative analysis.
Method: ComPSum generates structured analysis of users by comparing their preferences with other users’ preferences, then uses this analysis to guide personalized summary generation. Also proposes AuthorMap evaluation framework and constructs PerMSum dataset.
Result: ComPSum outperforms strong baselines on the PerMSum dataset when evaluated using the AuthorMap framework.
Conclusion: Comparative analysis of user preferences enables effective personalization in multi-document summarization, and the proposed ComPSum framework with AuthorMap evaluation provides a robust approach for personalized MDS.
Abstract: Personalized multi-document summarization (MDS) is essential for meeting individual user preferences of writing style and content focus for summaries. In this paper, we propose that for effective personalization, it is important to identify fine-grained differences between users’ preferences by comparing the given user’s preferences with other users’ preferences. Motivated by this, we propose ComPSum, a personalized MDS framework. It first generates a structured analysis of a user by comparing their preferences with other users’ preferences. The generated structured analysis is then used to guide the generation of personalized summaries. To evaluate the performance of ComPSum, we propose AuthorMap, a fine-grained reference-free evaluation framework for personalized MDS. It evaluates the personalization of a system based on the authorship attribution between two personalized summaries generated for different users. For robust evaluation of personalized MDS, we construct PerMSum, a personalized MDS dataset in the review and news domain. We evaluate the performance of ComPSum on PerMSum using AuthorMap, showing that it outperforms strong baselines.
[19] Vision Language Models Cannot Plan, but Can They Formalize?
Muyu He, Yuxi Zheng, Yuchen Liu, Zijian An, Bill Cai, Jiani Huang, Lifeng Zhou, Feng Liu, Ziyang Li, Li Zhang
Main category: cs.CL
TL;DR: VLMs translate multimodal environments into PDDL for formal planning instead of directly generating action sequences, outperforming end-to-end approaches but struggling with vision-based object relation extraction.
Details
Motivation: Current VLMs handle simple multimodal planning but fail at long-horizon tasks requiring complex action sequences. Text-only planning improved by using LLMs as formalizers, but multimodal VLM-as-formalizer research is limited with oversimplified setups.
Method: Developed five VLM-as-formalizer pipelines for one-shot, open-vocabulary, multimodal PDDL formalization. Evaluated on existing benchmark and introduced two new benchmarks with authentic, multi-view, low-quality images. Used intermediate representations like captions and scene graphs.
Result: VLM-as-formalizer significantly outperforms end-to-end plan generation. Vision is the main bottleneck - VLMs often miss necessary object relations. Intermediate representations provide partial improvement but inconsistent gains.
Conclusion: VLM-as-formalizer approach is promising for multimodal planning but vision capabilities need improvement. Future research should focus on better multimodal planning formalization methods.
Abstract: The advancement of vision language models (VLMs) has empowered embodied agents to accomplish simple multimodal planning tasks, but not long-horizon ones requiring long sequences of actions. In text-only simulations, long-horizon planning has seen significant improvement brought by repositioning the role of LLMs. Instead of directly generating action sequences, LLMs translate the planning domain and problem into a formal planning language like the Planning Domain Definition Language (PDDL), which can call a formal solver to derive the plan in a verifiable manner. In multimodal environments, research on VLM-as-formalizer remains scarce, usually involving gross simplifications such as predefined object vocabulary or overly similar few-shot examples. In this work, we present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization. We evaluate those on an existing benchmark while presenting another two that for the first time account for planning with authentic, multi-view, and low-quality images. We conclude that VLM-as-formalizer greatly outperforms end-to-end plan generation. We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations. While generating intermediate, textual representations such as captions or scene graphs partially compensate for the performance, their inconsistent gain leaves headroom for future research directions on multimodal planning formalization.
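The VLM-as-formalizer pipeline has a simple shape: the VLM emits PDDL text from the images and instruction, and a classical solver derives the plan verifiably. The sketch below uses placeholder vlm and solve callables (assumptions; the paper compares five such pipelines, some inserting captions or scene graphs between the two steps).

```python
def plan_via_formalization(vlm, solve, images, goal: str):
    """Formalize-then-solve: the VLM never writes the plan itself."""
    # Step 1: formalization (the paper finds the bottleneck is here, in
    # vision -- VLMs often miss necessary object relations in the scene).
    domain = vlm(images, "Write a PDDL domain for the objects and actions.")
    problem = vlm(images, f"Write a PDDL problem whose goal is: {goal}")
    # Step 2: a formal solver derives the plan in a verifiable manner.
    return solve(domain, problem)
```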
[20] “Be My Cheese?”: Assessing Cultural Nuance in Multilingual LLM Translations
Madison Van Doren, Cory Holland
Main category: cs.CL
TL;DR: This pilot study evaluates multilingual AI models’ ability to translate figurative language like idioms and puns, finding that while grammatically correct, culturally nuanced translations often require human refinement.
Details
Motivation: To address the gap in existing LLM translation research that focuses on grammatical accuracy but overlooks cultural appropriateness and localization quality for real-world applications like marketing.
Method: Evaluated 87 LLM-generated translations of e-commerce marketing emails across 24 regional dialects of 20 languages using human reviewers who provided quantitative ratings and qualitative feedback on faithfulness to tone, meaning, and intended audience.
Result: Leading models produce grammatically correct translations but struggle with culturally nuanced language, even in high-resource languages. Figurative expressions and wordplay were frequently mistranslated, requiring substantial human refinement.
Conclusion: Cultural appropriateness is a key determinant of multilingual LLM performance, challenging the assumption that data volume alone predicts translation quality. Current systems have limitations for real-world localization, highlighting the need for expanded research in culturally diverse contexts.
Abstract: This pilot study explores the localisation capabilities of state-of-the-art multilingual AI models when translating figurative language, such as idioms and puns, from English into a diverse range of global languages. It expands on existing LLM translation research and industry benchmarks, which emphasise grammatical accuracy and token-level correctness, by focusing on cultural appropriateness and overall localisation quality - critical factors for real-world applications like marketing and e-commerce. To investigate these challenges, this project evaluated a sample of 87 LLM-generated translations of e-commerce marketing emails across 24 regional dialects of 20 languages. Human reviewers fluent in each target language provided quantitative ratings and qualitative feedback on faithfulness to the original’s tone, meaning, and intended audience. Findings suggest that, while leading models generally produce grammatically correct translations, culturally nuanced language remains a clear area for improvement, often requiring substantial human refinement. Notably, even high-resource global languages, despite topping industry benchmark leaderboards, frequently mistranslated figurative expressions and wordplay. This work challenges the assumption that data volume is the most reliable predictor of machine translation quality and introduces cultural appropriateness as a key determinant of multilingual LLM performance - an area currently underexplored in existing academic and industry benchmarks. As a proof of concept, this pilot highlights limitations of current multilingual AI systems for real-world localisation use cases. Results of this pilot support the opportunity for expanded research at greater scale to deliver generalisable insights and inform deployment of reliable machine translation workflows in culturally diverse contexts.
[21] VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing
Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, Hongsheng Li
Main category: cs.CL
TL;DR: VoiceAssistant-Eval is a comprehensive benchmark with 10,497 examples across 13 task categories to evaluate AI assistants’ listening, speaking, and viewing capabilities, revealing that proprietary models don’t always outperform open-source ones and that smaller models can rival larger ones.
Details
Motivation: Existing benchmarks are inadequate for evaluating the full range of capabilities of voice-first AI assistants, especially with growing capabilities of large language models and multimodal systems.
Method: Created VoiceAssistant-Eval benchmark with 10,497 curated examples spanning 13 task categories covering listening (natural sounds, music, spoken dialogue), speaking (multi-turn dialogue, role-play imitation, various scenarios), and viewing (highly heterogeneous images). Evaluated 21 open-source models and GPT-4o-Audio.
Result: Three key findings: (1) proprietary models don’t universally outperform open-source models; (2) most models excel at speaking but lag in audio understanding; (3) well-designed smaller models can rival much larger ones (Step-Audio-2-mini 7B achieved more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual). Challenges remain in multimodal input and role-play voice imitation.
Conclusion: VoiceAssistant-Eval identifies gaps in current AI assistants and establishes a rigorous framework for evaluating and guiding development of next-generation AI assistants, with significant work needed in robustness and safety alignment.
Abstract: The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems’ capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/ .
[22] OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja’s Rule
Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen
Main category: cs.CL
TL;DR: OjaKV is a novel KV-cache compression framework that combines hybrid storage (preserving first/recent tokens in full-rank) with online subspace adaptation using Oja’s algorithm, enabling memory-efficient long-context inference without model fine-tuning.
Details
Motivation: The KV cache for long-context LLMs creates significant memory bottlenecks (e.g., 16GB for 32K tokens), and existing static compression methods perform poorly under data distribution shifts.
Method: Hybrid storage policy preserves first and most recent tokens in full-rank while compressing intermediate tokens using low-rank projection with online subspace adaptation via Oja’s algorithm. Includes comprehensive updates during prefilling and lightweight periodic updates during decoding.
Result: Maintains or improves zero-shot accuracy at high compression ratios, with strongest gains on very long-context benchmarks requiring complex reasoning. Fully compatible with modern attention modules like FlashAttention.
Conclusion: OjaKV provides a practical, plug-and-play solution for memory-efficient long-context inference that dynamically tracks context shifts through online adaptation, eliminating the need for model fine-tuning.
Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model’s weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja’s algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
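Oja’s rule, the method’s namesake, is a one-line stochastic update for online PCA: nudge the basis toward each new vector’s projection, then re-orthonormalize. A numpy sketch of the subspace form (shapes and learning rate are illustrative; OjaKV applies such updates to the compressed KV subspace, comprehensively during prefill and periodically during decoding).

```python
import numpy as np

def oja_update(W: np.ndarray, x: np.ndarray, lr: float = 1e-3) -> np.ndarray:
    """One step of Oja's subspace rule for online PCA.

    W: (d, r) current orthonormal basis of the rank-r subspace.
    x: (d,)   new observation (e.g., an incoming key/value vector).
    """
    y = W.T @ x                            # project x onto current subspace
    W = W + lr * (np.outer(x, y) - W @ np.outer(y, y))  # Oja's rule
    Q, _ = np.linalg.qr(W)                 # re-orthonormalize the basis
    return Q

d, r = 64, 8
rng = np.random.default_rng(0)
W = np.linalg.qr(rng.normal(size=(d, r)))[0]
for _ in range(1_000):                     # stream of context vectors
    W = oja_update(W, rng.normal(size=d))
print(np.allclose(W.T @ W, np.eye(r)))     # True: basis stays orthonormal
```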
[23] Towards Transparent AI: A Survey on Explainable Language Models
Avash Palikhe, Zichong Wang, Zhipeng Yin, Rui Guo, Qiang Duan, Jie Yang, Wenbin Zhang
Main category: cs.CL
TL;DR: This survey comprehensively reviews explainable AI (XAI) techniques for language models, organizing them by transformer architectures (encoder-only, decoder-only, encoder-decoder) and evaluating them through plausibility and faithfulness metrics.
Details
Motivation: Language models' black-box nature raises critical interpretability concerns, especially for high-stakes applications. Existing XAI methods face limitations when applied to LMs due to their complex architectures, large training corpora, and broad generalization abilities.
Method: The survey organizes XAI techniques according to transformer architectures and analyzes how methods are adapted to each architecture type while assessing their strengths and limitations. Evaluation is done through dual lenses of plausibility and faithfulness.
Result: The paper provides a structured perspective on XAI technique effectiveness for different LM architectures, identifying how methods perform across various transformer model types.
Conclusion: The survey identifies open research challenges and outlines future directions to guide development of robust, transparent, and interpretable XAI methods specifically tailored for language models.
Abstract: Language Models (LMs) have significantly advanced natural language processing and enabled remarkable progress across diverse domains, yet their black-box nature raises critical concerns about the interpretability of their internal mechanisms and decision-making processes. This lack of transparency is particularly problematic for adoption in high-stakes domains, where stakeholders need to understand the rationale behind model outputs to ensure accountability. On the other hand, while explainable artificial intelligence (XAI) methods have been well studied for non-LMs, they face many limitations when applied to LMs due to their complex architectures, considerable training corpora, and broad generalization abilities. Although various surveys have examined XAI in the context of LMs, they often fail to capture the distinct challenges arising from the architectural diversity and evolving capabilities of these models. To bridge this gap, this survey presents a comprehensive review of XAI techniques with a particular emphasis on LMs, organizing them according to their underlying transformer architectures: encoder-only, decoder-only, and encoder-decoder, and analyzing how methods are adapted to each while assessing their respective strengths and limitations. Furthermore, we evaluate these techniques through the dual lenses of plausibility and faithfulness, offering a structured perspective on their effectiveness. Finally, we identify open research challenges and outline promising future directions, aiming to guide ongoing efforts toward the development of robust, transparent, and interpretable XAI methods for LMs.
[24] ReviewScore: Misinformed Peer Review Detection with Large Language Models
Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang
Main category: cs.CL
TL;DR: The paper proposes ReviewScore to detect misinformed peer reviews in AI conferences by identifying incorrect premises in weaknesses and already-answered questions, and shows LLMs can moderately automate this evaluation.
Details
Motivation: Peer review quality is degrading in AI conferences due to exploding submissions, creating a need for reliable detection of low-quality reviews.
Method: Define misinformed review points as weaknesses with incorrect premises or questions already answered by the paper. Build human-annotated dataset and test LLMs on premise-level factuality evaluation.
Result: 15.2% of weaknesses and 26.4% of questions are misinformed. LLMs show moderate agreement with humans on ReviewScore evaluation, with premise-level factuality showing higher agreement than weakness-level.
Conclusion: Automated ReviewScore evaluation using LLMs is feasible, with premise-level analysis being more reliable than weakness-level evaluation.
Abstract: Peer review serves as a backbone of academic research, but in most AI conferences, the review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either “weaknesses” in a review that contain incorrect premises, or “questions” in a review that can be already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore indicating if a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
[25] HiddenBench: Assessing Collective Reasoning in Multi-Agent LLMs via Hidden Profile Tasks
Yuxuan Li, Aoi Naito, Hirokazu Shirado
Main category: cs.CL
TL;DR: HiddenBench is the first benchmark for evaluating collective reasoning in multi-agent LLMs, based on the Hidden Profile paradigm from social psychology, revealing persistent limitations in distributed knowledge integration.
Details
Motivation: Multi-agent LLM systems promise enhanced problem-solving but may replicate human collective reasoning failures, yet lack theory-grounded benchmarks for systematic evaluation.
Method: Built on Hidden Profile paradigm with 65 tasks from custom designs, human studies, and automatic generation; evaluated 15 LLMs across four model families using asymmetric information integration tasks.
Result: GPT-4.1 groups fail to integrate distributed knowledge, showing human-like collective reasoning failures; some models (Gemini-2.5-Flash/Pro) perform better, but scale and reasoning are not reliable indicators of collective reasoning strength.
Conclusion: HiddenBench provides the first reproducible benchmark for collective reasoning in multi-agent LLMs, offering diagnostic insights and foundation for future artificial collective intelligence research.
Abstract: Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving through distributed information integration, but may also replicate collective reasoning failures observed in human groups. Yet the absence of a theory-grounded benchmark makes it difficult to systematically evaluate and improve such reasoning. We introduce HiddenBench, the first benchmark for evaluating collective reasoning in multi-agent LLMs. It builds on the Hidden Profile paradigm from social psychology, where individuals each hold asymmetric pieces of information and must communicate to reach the correct decision. To ground the benchmark, we formalize the paradigm with custom tasks and show that GPT-4.1 groups fail to integrate distributed knowledge, exhibiting human-like collective reasoning failures that persist even with varied prompting strategies. We then construct the full benchmark, spanning 65 tasks drawn from custom designs, prior human studies, and automatic generation. Evaluating 15 LLMs across four model families, HiddenBench exposes persistent limitations while also providing comparative insights: some models (e.g., Gemini-2.5-Flash/Pro) achieve higher performance, yet scale and reasoning are not reliable indicators of stronger collective reasoning. Our work delivers the first reproducible benchmark for collective reasoning in multi-agent LLMs, offering diagnostic insight and a foundation for future research on artificial collective intelligence.
[26] GRAB: A Risk Taxonomy–Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures
Ying Li, Tiejun Ma
Main category: cs.CL
TL;DR: GRAB is a finance-specific benchmark for evaluating unsupervised topic models on 10-K risk disclosures, featuring 1.61M sentences with automated span-grounded labels using FinBERT, YAKE, and taxonomy matching.
Details
Motivation: No public benchmark exists to evaluate unsupervised topic models for risk categorization in 10-K risk disclosures, which is important for oversight and investment.
Method: Created GRAB benchmark with 1.61M sentences from 8,247 filings using automated labeling combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware collocation matching anchored in a risk taxonomy of 193 terms mapped to 21 fine-grained types.
Result: Developed a unified evaluation framework with fixed dataset splits and robust metrics including Accuracy, Macro-F1, Topic BERTScore, and Effective Number of Topics.
Conclusion: GRAB enables reproducible, standardized comparison across various topic models on financial disclosures, providing dataset, labels, and code for the research community.
Abstract: Risk categorization in 10-K risk disclosures matters for oversight and investment, yet no public benchmark evaluates unsupervised topic models for this task. We present GRAB, a finance-specific benchmark with 1.61M sentences from 8,247 filings and span-grounded sentence labels produced without manual annotation by combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware collocation matching. Labels are anchored in a risk taxonomy mapping 193 terms to 21 fine-grained types nested under five macro classes; the 21 types guide weak supervision, while evaluation is reported at the macro level. GRAB unifies evaluation with fixed dataset splits and robust metrics: Accuracy, Macro-F1, Topic BERTScore, and the entropy-based Effective Number of Topics. The dataset, labels, and code enable reproducible, standardized comparison across classical, embedding-based, neural, and hybrid topic models on financial disclosures.
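The taxonomy-aware matching signal can be pictured with a short sketch. The terms and type names below are illustrative placeholders, not entries from the actual 193-term taxonomy, and the FinBERT-attention and YAKE keyphrase signals that GRAB combines with this matcher are omitted.

```python
import re

# Illustrative stand-in for the risk taxonomy (macro class, fine-grained type).
TAXONOMY = {
    "interest rate": ("Market", "InterestRate"),
    "cybersecurity breach": ("Operational", "Cyber"),
    "supply chain disruption": ("Operational", "SupplyChain"),
}

def match_risk_types(sentence):
    """Span-grounded taxonomy matching: one of the three weak-label signals."""
    hits = []
    lowered = sentence.lower()
    for term, (macro, fine) in TAXONOMY.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((macro, fine, term))   # matched term provides the span
    return hits

print(match_risk_types("Rising interest rate exposure may hurt our margins."))
# [('Market', 'InterestRate', 'interest rate')]
```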
[27] Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval
Xiaojun Wu, Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Yuanliang Sun, Hui Xiong, Jia Li, Jian Guo
Main category: cs.CL
TL;DR: ToG-3 introduces a multi-agent framework with dual-evolution mechanism for dynamic graph construction in Graph-based RAG, overcoming limitations of static graph indexes and enabling precise reasoning with lightweight LLMs.
Details
Motivation: Existing Graph-based RAG methods face fundamental trade-offs: manually constructed knowledge graphs are expensive to scale, while automatically extracted graphs are limited by LLM extractor performance, especially with smaller models. Static graph construction without query adaptation is a critical limitation.
Method: Proposes Think-on-Graph 3.0 (ToG-3) with Multi-Agent Context Evolution and Retrieval (MACER) mechanism. Features dynamic construction of Chunk-Triplets-Community heterogeneous graph index with dual-evolution of Evolving Query and Evolving Sub-Graph. Uses multi-agent system (Constructor, Retriever, Reflector, Responser) for iterative evidence retrieval, answer generation, and graph refinement.
Result: Extensive experiments show ToG-3 outperforms baselines on both deep and broad reasoning benchmarks. Ablation studies confirm the efficacy of MACER framework components.
Conclusion: ToG-3 successfully addresses limitations of static graph construction in Graph-based RAG by enabling adaptive, targeted graph index building during reasoning, allowing deep precise reasoning even with lightweight LLMs.
Abstract: Retrieval-Augmented Generation (RAG) and Graph-based RAG have become important paradigms for enhancing Large Language Models (LLMs) with external knowledge. However, existing approaches face a fundamental trade-off: graph-based methods are inherently dependent on high-quality graph structures, yet manually constructed knowledge graphs are prohibitively expensive to scale, while graphs automatically extracted from corpora are limited by the performance of the underlying LLM extractors, especially when using smaller, locally deployed models. This paper presents Think-on-Graph 3.0 (ToG-3), a novel framework that introduces a Multi-Agent Context Evolution and Retrieval (MACER) mechanism to overcome these limitations. Our core innovation is the dynamic construction and refinement of a Chunk-Triplets-Community heterogeneous graph index, which incorporates a dual-evolution mechanism of Evolving Query and Evolving Sub-Graph for precise evidence retrieval. This approach addresses a critical limitation of prior Graph-based RAG methods, which typically construct a static graph index in a single pass without adapting to the actual query. A multi-agent system, comprising Constructor, Retriever, Reflector, and Responser agents, collaboratively engages in an iterative process of evidence retrieval, answer generation, sufficiency reflection, and, crucially, query and subgraph evolution. This dual-evolving multi-agent system allows ToG-3 to adaptively build a targeted graph index during reasoning, mitigating the inherent drawbacks of static, one-time graph construction and enabling deep, precise reasoning even with lightweight LLMs. Extensive experiments demonstrate that ToG-3 outperforms the compared baselines on both deep and broad reasoning benchmarks, and ablation studies confirm the efficacy of the components of the MACER framework.
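A rough pseudocode reading of the MACER loop described above might look as follows. Every interface here (the `agents` bundle, `build_initial`, `check`, `evolved_query`, and so on) is a hypothetical name invented for illustration; the abstract does not expose the actual agent APIs.

```python
def macer_loop(query, agents, max_rounds=4):
    """Sketch of ToG-3's dual-evolving retrieve-reflect cycle (assumed API)."""
    graph = agents.constructor.build_initial(query)   # chunk-triplet-community index
    answer = None
    for _ in range(max_rounds):
        evidence = agents.retriever.retrieve(graph, query)
        answer = agents.responser.answer(query, evidence)
        verdict = agents.reflector.check(query, evidence, answer)
        if verdict.sufficient:                        # sufficiency reflection
            break
        query = verdict.evolved_query                 # evolving query
        graph = agents.constructor.expand(graph, query)  # evolving sub-graph
    return answer
```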
[28] ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon, Yohan Jo, Edward Choi
Main category: cs.CL
TL;DR: ProPerSim: A framework for developing proactive, personalized AI assistants that learn user preferences through feedback in home scenarios.
Details
Motivation: As LLMs become more integrated into daily life, there's growing demand for AI assistants that are both proactive and personalized, but current approaches haven't sufficiently combined these two aspects.
Method: ProPerSim simulation framework where a user agent with rich persona interacts with the assistant and provides preference ratings. ProPerAssistant uses retrieval-augmented learning and continuously adapts through user feedback.
Result: Experiments across 32 diverse personas show ProPerAssistant successfully adapts its strategy and steadily improves user satisfaction over time.
Conclusion: The framework demonstrates the promise of combining proactivity and personalization in AI assistants, enabling them to make timely, personalized recommendations that align with user preferences.
Abstract: As large language models (LLMs) become increasingly integrated into daily life, there is growing demand for AI assistants that are not only reactive but also proactive and personalized. While recent advances have pushed forward proactivity and personalization individually, their combination remains underexplored. To bridge this gap, we introduce ProPerSim, a new task and simulation framework for developing assistants capable of making timely, personalized recommendations in realistic home scenarios. In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context. The assistant’s goal is to use these ratings to learn and adapt to achieve higher scores over time. Built on ProPerSim, we propose ProPerAssistant, a retrieval-augmented, preference-aligned assistant that continually learns and adapts through user feedback. Experiments across 32 diverse personas show that ProPerAssistant adapts its strategy and steadily improves user satisfaction, highlighting the promise of uniting proactivity and personalization.
[29] Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen
Main category: cs.CL
TL;DR: Fine-tuning a voice conversion model on English dysarthric speech to generate non-English dysarthric-like speech for improving multilingual ASR performance on dysarthric speech.
Details
Motivation: Address data scarcity in non-English dysarthric speech recognition by leveraging English dysarthric data through voice conversion.
Method: Fine-tune voice conversion model on English dysarthric speech (UASpeech) to encode speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into dysarthric-like speech. Use generated data to fine-tune multilingual ASR model (MMS).
Result: VC with both speaker and prosody conversion significantly outperforms off-the-shelf MMS and conventional augmentation techniques (speed/tempo perturbation) on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) datasets. Generated speech simulates dysarthric characteristics confirmed by objective and subjective analyses.
Conclusion: Voice conversion approach effectively generates non-English dysarthric-like speech, enabling improved multilingual ASR performance for dysarthric speech recognition.
Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
[30] How Accurate Are LLMs at Multi-Question Answering on Conversational Transcripts?
Xiliang Zhu, Shi Zong, David Rossouw
Main category: cs.CL
TL;DR: Fine-tuned public LLMs with up to 8B parameters can outperform GPT-4o in accuracy for multi-question answering over long contexts, offering cost-effective deployment alternatives.
Details
Motivation: Deploying LLMs for QA over lengthy contexts faces high computational costs and latency challenges, especially when answering multiple questions from the same context in industrial settings.
Method: Conducted extensive experiments benchmarking both proprietary and public LLMs on answering multiple questions based on the same conversational context, including fine-tuning public models.
Result: Strong proprietary LLMs like GPT-4o achieve best overall performance, but fine-tuned public LLMs with up to 8B parameters can surpass GPT-4o in accuracy.
Conclusion: Fine-tuned public LLMs demonstrate potential for transparent and cost-effective deployment in real-world applications, providing viable alternatives to proprietary models.
Abstract: Deploying Large Language Models (LLMs) for question answering (QA) over lengthy contexts is a significant challenge. In industrial settings, this process is often hindered by high computational costs and latency, especially when multiple questions must be answered based on the same context. In this work, we explore the capabilities of LLMs to answer multiple questions based on the same conversational context. We conduct extensive experiments and benchmark a range of both proprietary and public models on this challenging task. Our findings highlight that while strong proprietary LLMs like GPT-4o achieve the best overall performance, fine-tuned public LLMs with up to 8 billion parameters can surpass GPT-4o in accuracy, which demonstrates their potential for transparent and cost-effective deployment in real-world applications.
[31] Self-Speculative Biased Decoding for Faster Live Translation
Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang
Main category: cs.CL
TL;DR: Self-Speculative Biased Decoding enables efficient streaming translation by using recent outputs as drafts and biasing verification towards them, achieving 1.7x speedup and 80% flicker reduction without quality loss.
Details
Motivation: LLMs struggle with streaming applications where output must update continuously as input grows, while maintaining low computational cost for latency requirements.
Method: Uses most recent output as draft for growing input context, biases verification towards draft tokens for higher acceptance rate, and continues from divergence point after verification.
Result: Achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without quality compromise, and reduces flickering by 80% using display-only mask-k technique.
Conclusion: Provides model-agnostic, plug-and-play solution for streaming applications that eliminates draft computations and significantly improves performance.
Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities in various text generation tasks. However, it remains challenging to use them off-the-shelf in streaming applications (such as live translation), where the output must continually update as the input context expands, while still maintaining a reasonable computational cost to meet the latency requirement. In this work, we reexamine the re-translation approach to simultaneous translation and propose Self-Speculative Biased Decoding, a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream. We propose using the most recent output as a draft for the current growing input context. During the verification stage, the output is biased towards the draft tokens for a higher draft acceptance rate. This strategy not only minimizes flickering that might distract users but also leads to higher speedups. Conventional decoding then takes over from the point of divergence after draft verification and continues until the end condition is met. Unlike existing speculative decoding strategies, our approach eliminates the need for draft computations, making it a model-agnostic and plug-and-play solution for accelerating latency-sensitive streaming applications. Experimental results on simultaneous text-to-text re-translation demonstrate that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality. Additionally, it significantly reduces flickering by 80% by incorporating the display-only mask-k technique.
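The verification step lends itself to a small sketch. The additive logit bonus is one plausible reading of the biased verification described above; the logits tensor is assumed to come from a single parallel forward pass over the updated context plus the reused draft.

```python
import numpy as np

def biased_verify(next_token_logits, draft, bias=2.0):
    """Verify the previous output as a draft, biased toward acceptance.

    next_token_logits : (len(draft), vocab) logits, where row t conditions on
                        the new input context plus draft[:t]
    draft             : token ids reused from the previous re-translation
    bias              : additive logit bonus for the draft token (assumption)
    """
    accepted = []
    for t, tok in enumerate(draft):
        scores = next_token_logits[t].copy()
        scores[tok] += bias                    # nudge verification toward draft
        if int(np.argmax(scores)) == tok:
            accepted.append(tok)               # draft token confirmed
        else:
            break                              # divergence: stop accepting here
    return accepted   # conventional decoding resumes after this prefix
```

Accepted tokens never change on screen, which is how the same bias that raises the speedup also suppresses flicker.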
[32] SynerGen: Contextualized Generative Recommender for Unified Search and Recommendation
Vianne R. Gao, Chen Xue, Marc Versage, Xie Zhou, Zhongruo Wang, Chao Li, Yeon Seonwoo, Nan Chen, Zhen Ge, Gourab Kundu, Weiqi Zhang, Tian Wang, Qingjun Cui, Trishul Chilimbi
Main category: cs.CL
TL;DR: SynerGen is a unified generative recommender model that bridges personalized search and recommendation using a single decoder-only Transformer backbone, achieving superior performance on both retrieval and ranking tasks.
Details
Motivation: Current retrieve-then-rank pipelines suffer from mis-calibration and engineering overhead due to architectural splits and differing optimization objectives. Existing generative models typically address either personalized search or query-free recommendation, with performance trade-offs when unifying both.
Method: Uses a decoder-only Transformer trained on behavioral sequences with joint optimization: InfoNCE for retrieval and hybrid pointwise-pairwise loss for ranking. Introduces novel time-aware rotary positional embedding to incorporate time information into attention mechanism.
Result: Achieves significant improvements on widely adopted recommendation and search benchmarks compared to strong generative recommender and joint search and recommendation baselines.
Conclusion: Demonstrates the viability of a single generative foundation model for industrial-scale unified information access, allowing semantic signals from search to improve recommendation and vice versa.
Abstract: The dominant retrieve-then-rank pipeline in large-scale recommender systems suffers from mis-calibration and engineering overhead due to its architectural split and differing optimization objectives. While recent generative sequence models have shown promise in unifying retrieval and ranking by auto-regressively generating ranked items, existing solutions typically address either personalized search or query-free recommendation, often exhibiting performance trade-offs when attempting to unify both. We introduce SynerGen, a novel generative recommender model that bridges this critical gap by providing a single generative backbone for both personalized search and recommendation, while simultaneously excelling at retrieval and ranking tasks. Trained on behavioral sequences, our decoder-only Transformer leverages joint optimization with InfoNCE for retrieval and a hybrid pointwise-pairwise loss for ranking, allowing semantic signals from search to improve recommendation and vice versa. We also propose a novel time-aware rotary positional embedding to effectively incorporate time information into the attention mechanism. SynerGen achieves significant improvements on widely adopted recommendation and search benchmarks compared to strong generative recommender and joint search and recommendation baselines. This work demonstrates the viability of a single generative foundation model for industrial-scale unified information access.
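The abstract names a time-aware rotary positional embedding without giving its form; one plausible instantiation drives the standard RoPE rotation angles with event timestamps instead of integer positions, as in this numpy sketch.

```python
import numpy as np

def time_aware_rope(x, timestamps, base=10000.0):
    """Rotary embedding whose angles come from timestamps, not token indices.

    x          : (seq, d) query/key vectors, d even
    timestamps : (seq,) event times, e.g., seconds since the first interaction
    """
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) frequency ladder
    angles = np.outer(timestamps, inv_freq)          # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # rotate each 2-D pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```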
[33] Navigating the Impact of Structured Output Format on Large Language Models through the Compass of Causal Inference
Han Yuan, Yue Zhao, Li Zhang, Wuqiong Luo, Zheng Ma
Main category: cs.CL
TL;DR: Causal analysis reveals structured output has minimal causal impact on LLM generation quality, contradicting prior conflicting findings.
Details
Motivation: To resolve conflicting prior findings about structured output's effects on LLM generation quality using rigorous causal inference methods.
Method: Used causal inference with five potential causal structures across eight reasoning tasks to analyze structured output’s impact on GPT-4o.
Result: Found no causal impact in 43/48 scenarios; only 5 showed effects, with 3 involving multifaceted causal structures influenced by instructions.
Conclusion: Structured output generally has minimal causal impact on LLM generation quality, with effects mainly emerging through complex instruction interactions.
Abstract: Structured output from large language models (LLMs) has enhanced efficiency in processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs’ generation quality, often presenting one-way findings. Some suggest that structured format enhances completeness and factual accuracy, while others argue that it restricts the reasoning capacity of LLMs and leads to reductions in standard evaluation metrics. Potential limitations of these assessments include restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs’ generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public reasoning tasks and one newly developed task, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o’s generation. However, causal inference reveals no causal impact in 43 out of 48 scenarios. Of the remaining 5 scenarios, 3 involve multifaceted causal structures influenced by concrete instructions.
[34] Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment
Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang
Main category: cs.CL
TL;DR: CARB is a benchmark for evaluating cultural awareness in reward models, revealing deficiencies in current models and proposing Think-as-Locals with RLVR to improve cultural understanding.
Details
Motivation: Existing reward model evaluations lack culturally relevant datasets, making it difficult to assess cultural awareness needed for global alignment of LLMs.
Method: Proposed CARB benchmark covering 10 cultures across 4 domains, and Think-as-Locals approach using reinforcement learning from verifiable rewards to elicit culturally grounded reasoning.
Result: Evaluation shows current RMs have deficiencies in cultural awareness modeling and rely on surface-level features rather than authentic cultural understanding. Think-as-Locals effectively mitigates spurious correlations.
Conclusion: CARB enables better evaluation of cultural awareness in RMs, and Think-as-Locals with RLVR advances culture-aware reward modeling by reducing reliance on superficial features.
Abstract: Reward models (RMs) are crucial for aligning large language models (LLMs) with diverse cultures. Consequently, evaluating their cultural awareness is essential for further advancing global alignment of LLMs. However, existing RM evaluations fall short in assessing cultural awareness due to the scarcity of culturally relevant evaluation datasets. To fill this gap, we propose the Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains. Our extensive evaluation of state-of-the-art RMs reveals their deficiencies in modeling cultural awareness and demonstrates a positive correlation between performance on CARB and downstream multilingual cultural alignment tasks. Further analysis identifies spurious correlations within culture-aware reward modeling, wherein an RM’s scoring relies predominantly on surface-level features rather than an authentic understanding of cultural nuance. To address these, we propose Think-as-Locals to elicit deeper culturally grounded reasoning from generative RMs via reinforcement learning from verifiable rewards (RLVR) and employ well-designed rewards to ensure accurate preference judgments and high-quality structured evaluation criteria generation. Experimental results validate its efficacy in mitigating interference from spurious features and advancing culture-aware reward modeling.
[35] Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies
Qianen Zhang, Satoshi Nakamura
Main category: cs.CL
TL;DR: This paper extends Simultaneous Machine Translation (SiMT) with four adaptive actions (SENTENCE_CUT, DROP, PARTIAL_SUMMARIZATION, PRONOMINALIZATION) in a decoder-only LLM framework, achieving better translation quality and lower latency than traditional approaches.
Details
Motivation: Traditional encoder-decoder SiMT policies with only READ/WRITE actions cannot fully address the real-time constraints and quality requirements of simultaneous translation.
Method: Implemented four adaptive actions in a decoder-only LLM framework with action-aware prompting for training, and developed a latency-aware TTS pipeline for realistic timing evaluation.
Result: Experiments on ACL60/60 benchmarks show consistent improvements in semantic metrics (COMET-KIWI) and lower delay (Average Lagging) compared to reference translations and salami-based baselines, with DROP and SENTENCE_CUT combination yielding the best balance.
Conclusion: Enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
Abstract: Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional encoder-decoder policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: SENTENCE_CUT, DROP, PARTIAL_SUMMARIZATION and PRONOMINALIZATION, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We implement these actions in a decoder-only large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and latency, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese and English-German benchmarks show that our framework consistently improves semantic metrics (e.g., COMET-KIWI) and achieves lower delay (measured by Average Lagging) compared to reference translations and salami-based baselines. Notably, combining DROP and SENTENCE_CUT yields the best overall balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
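The extended action space can be made concrete with a small sketch. The action semantics in the comments paraphrase the abstract; the `state` object and its handler methods are hypothetical stand-ins, since the paper's control flow is not given.

```python
from enum import Enum, auto

class SiMTAction(Enum):
    """READ/WRITE are the classic SiMT actions; the rest are the four
    adaptive additions proposed in the paper."""
    READ = auto()
    WRITE = auto()
    SENTENCE_CUT = auto()           # restructure: close the sentence early
    DROP = auto()                   # omit low-information source content
    PARTIAL_SUMMARIZATION = auto()  # compress a span instead of rendering it fully
    PRONOMINALIZATION = auto()      # replace a long mention with a pronoun

def apply_action(action, state):
    """Dispatch sketch; all handlers are assumed interfaces."""
    handlers = {
        SiMTAction.SENTENCE_CUT: state.cut_sentence,
        SiMTAction.DROP: state.drop_pending,
        SiMTAction.PARTIAL_SUMMARIZATION: state.summarize_pending,
        SiMTAction.PRONOMINALIZATION: state.pronominalize,
    }
    return handlers.get(action, state.default_step)()
```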
[36] Towards Minimal Causal Representations for Human Multimodal Language Understanding
Menghua Jiang, Yuncheng Jiang, Haifeng Hu, Sijie Mai
Main category: cs.CL
TL;DR: CaMIB is a causal multimodal information bottleneck model that improves out-of-distribution generalization by disentangling causal features from shortcut features using information bottleneck and causal principles.
Details
Motivation: Existing multimodal learning methods that maximize mutual information between data and labels are vulnerable to dataset biases, causing models to conflate statistical shortcuts with genuine causal features and resulting in poor out-of-distribution generalization.
Method: Applies information bottleneck to filter unimodal inputs, uses parameterized mask generator to disentangle multimodal representations into causal and shortcut subrepresentations, incorporates instrumental variable constraint for global consistency, and adopts backdoor adjustment by randomly recombining features.
Result: Extensive experiments on multimodal sentiment analysis, humor detection, and sarcasm detection with OOD test sets demonstrate the effectiveness of CaMIB, showing improved generalization performance.
Conclusion: CaMIB provides an effective causal approach for multimodal language understanding that enhances out-of-distribution generalization while maintaining interpretability and theoretical soundness.
Abstract: Human Multimodal Language Understanding (MLU) aims to infer human intentions by integrating related cues from heterogeneous modalities. Existing works predominantly follow a “learning to attend” paradigm, which maximizes mutual information between data and labels to enhance predictive performance. However, such methods are vulnerable to unintended dataset biases, causing models to conflate statistical shortcuts with genuine causal features and resulting in degraded out-of-distribution (OOD) generalization. To alleviate this issue, we introduce a Causal Multimodal Information Bottleneck (CaMIB) model that leverages causal principles rather than traditional likelihood. Concretely, we first apply the information bottleneck to filter unimodal inputs, removing task-irrelevant noise. A parameterized mask generator then disentangles the fused multimodal representation into causal and shortcut subrepresentations. To ensure global consistency of causal features, we incorporate an instrumental variable constraint, and further adopt backdoor adjustment by randomly recombining causal and shortcut features to stabilize causal estimation. Extensive experiments on multimodal sentiment analysis, humor detection, and sarcasm detection, along with OOD test sets, demonstrate the effectiveness of CaMIB. Theoretical and empirical analyses further highlight its interpretability and soundness.
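Two of the moving parts, the mask generator and the backdoor-adjustment recombination, can be sketched briefly. The sigmoid gate and the additive recombination are assumptions; the abstract specifies only a parameterized mask generator and random recombination of causal and shortcut features.

```python
import torch
import torch.nn as nn

class CausalMask(nn.Module):
    """Split a fused multimodal feature into causal and shortcut parts (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, z):
        m = self.gate(z)                      # soft mask in (0, 1)
        return m * z, (1 - m) * z             # causal, shortcut subrepresentations

def backdoor_recombine(z_causal, z_shortcut):
    """Pair each causal feature with a randomly drawn shortcut feature so the
    predictor cannot exploit any fixed shortcut (backdoor adjustment)."""
    idx = torch.randperm(z_shortcut.size(0))
    return z_causal + z_shortcut[idx]         # additive fusion is an assumption
```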
[37] Can LLMs Solve and Generate Linguistic Olympiad Puzzles?
Neh Majmudar, Elena Filatova
Main category: cs.CL
TL;DR: LLMs outperform humans on most linguistic puzzles except writing systems and understudied languages, enabling automated puzzle generation to promote linguistics education.
Details
Motivation: To explore LLM capabilities in solving linguistic puzzles from Olympiads and use insights to automate puzzle generation for expanding interest in linguistics.
Method: Extend existing benchmark for linguistic puzzles, test LLMs including OpenAI’s o1 across various linguistic topics, analyze performance gaps, and apply findings to puzzle generation.
Result: LLMs outperform humans on most puzzle types except writing systems and understudied languages, demonstrating potential for automated puzzle generation.
Conclusion: Automated linguistic puzzle generation can promote linguistics education and disseminate knowledge about rare languages, making it an important research task.
Abstract: In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI’s o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzle types, except for those centered on writing systems and those involving understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.
[38] ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Guojun Yin, Wei Lin, Ran He
Main category: cs.CL
TL;DR: ResT is a new method that reshapes policy gradients for LLM tool-use tasks through entropy-informed token reweighting, achieving state-of-the-art performance by stabilizing training and improving efficiency.
Details
Motivation: Current RL approaches for LLM tool-use rely on sparse outcome rewards and ignore task particularities, leading to high policy-gradient variance and inefficient training. The paper establishes a theoretical link between policy entropy and training stability.
Method: ResT reshapes policy gradients through entropy-informed token reweighting, progressively upweighting reasoning tokens during training. This enables a smooth transition from structural correctness to semantic reasoning.
Result: ResT achieves state-of-the-art results on BFCL and API-Bank benchmarks, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B LLM, it surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn tasks.
Conclusion: The entropy-aware token reweighting scheme in ResT effectively stabilizes convergence in multi-turn tool-use tasks and demonstrates superior performance compared to existing methods and even large commercial models like GPT-4o.
Abstract: Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.
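A minimal sketch of entropy-informed token reweighting follows. The linear interpolation schedule is an assumption; the abstract says only that reasoning tokens are progressively upweighted as training proceeds, shifting emphasis from low-entropy structural tokens to high-entropy reasoning tokens.

```python
import torch

def entropy_reweighted_pg_loss(logits, actions, advantages, progress):
    """Token-level policy-gradient loss with entropy-based weights (sketch).

    logits     : (T, V) policy logits over the generated tokens
    actions    : (T,)   sampled token ids (int64)
    advantages : (T,)   per-token advantage estimates
    progress   : float in [0, 1], fraction of training completed
    """
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    entropy = -(logp.exp() * logp).sum(-1)            # per-token policy entropy
    ent_norm = entropy / (entropy.max() + 1e-8)
    # Early: emphasize low-entropy structural tokens; late: reasoning tokens.
    weights = (1 - progress) * (1 - ent_norm) + progress * ent_norm
    return -(weights * advantages * token_logp).mean()
```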
[39] Semantic Agreement Enables Efficient Open-Ended LLM Cascades
Duncan Soiffer, Steven Kolawole, Virginia Smith
Main category: cs.CL
TL;DR: Semantic agreement between ensemble outputs serves as a training-free signal for reliable deferral in LLM cascade systems, enabling cost reduction of 40% and latency reduction of up to 60% while maintaining quality.
Details
Motivation: Cascade systems face challenges in open-ended text generation where determining output reliability is difficult due to the continuous spectrum of generation quality and multiple valid responses.
Method: Propose semantic agreement (meaning-level consensus between ensemble outputs) as a training-free signal for reliable deferral, which works without model internals and across black-box APIs.
Result: Semantic cascades match or surpass target-model quality at 40% of the cost, reduce latency by up to 60%, and remain robust to model updates.
Conclusion: Semantic agreement provides a practical baseline for real-world LLM deployment, offering stronger reliability signals than token-level confidence while being training-free and model-agnostic.
Abstract: Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement (meaning-level consensus between ensemble outputs) as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluating models from 500M to 70B parameters, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
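Since the signal is training-free and model-agnostic, the deferral rule fits in a few lines. Cosine similarity over sentence embeddings is used here as a simple proxy for meaning-level consensus; the `embed` function and the threshold are stand-ins, as the abstract does not specify the agreement measure.

```python
import itertools
import numpy as np

def should_defer(answers, embed, threshold=0.85):
    """Escalate to the large model when small-model samples disagree semantically.

    answers : list of strings sampled from the small model
    embed   : text -> unit-norm vector (any sentence encoder; hypothetical here)
    """
    vecs = [embed(a) for a in answers]
    sims = [float(np.dot(u, v)) for u, v in itertools.combinations(vecs, 2)]
    return float(np.mean(sims)) < threshold    # low consensus -> defer

# Cascade step (sketch):
#   answers = [small_model(q) for _ in range(4)]
#   result = large_model(q) if should_defer(answers, embed) else answers[0]
```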
[40] KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues
Junhao Chen, Yu Huang, Siyuan Li, Rui Yao, Hanqian Li, Hanyu Zhang, Jungang Li, Jian Chen, Bowen Wang, Xuming Hu
Main category: cs.CL
TL;DR: KnowMT-Bench is the first benchmark for evaluating multi-turn long-form question answering in knowledge-intensive domains, revealing that multi-turn contexts degrade factual accuracy and information efficiency, but RAG can mitigate this degradation.
Details
Motivation: Existing benchmarks are limited to single-turn dialogue or assess orthogonal capabilities rather than knowledge-intensive factuality in multi-turn settings, creating a critical gap in evaluating LLMs for real-world knowledge applications.
Method: Created KnowMT-Bench with dynamic evaluation where models generate their own multi-turn dialogue histories from progressive question sequences, then evaluated final-turn answers using human-validated automated pipeline across medicine, finance, and law domains.
Result: Multi-turn contexts degrade performance: factual capability declines due to contextual noise from self-generated histories, and information efficiency drops as models become more verbose with longer dialogues.
Conclusion: RAG can effectively alleviate factual degradation, highlighting the importance of KnowMT-Bench for evaluating and enhancing conversational factual capabilities of LLMs in real-world knowledge-intensive applications.
Abstract: Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. However, existing benchmarks are limited to single-turn dialogue, while multi-turn dialogue benchmarks typically assess other orthogonal capabilities rather than knowledge-intensive factuality. To bridge this critical gap, we introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields, including medicine, finance, and law. To faithfully assess the model’s real-world performance, KnowMT-Bench employs a dynamic evaluation setting where models generate their own multi-turn dialogue histories given logically progressive question sequences. The factual capability and information delivery efficiency of the final-turn answer are then evaluated using a human-validated automated pipeline. Our experiments reveal that multi-turn contexts degrade performance: factual capability declines due to the contextual noise from self-generated histories, while information efficiency drops as models become more verbose with increasing dialogue length. We then investigate mitigation strategies, demonstrating that retrieval-augmented generation (RAG) can effectively alleviate and even reverse this factual degradation. These findings underscore the importance of our benchmark in evaluating and enhancing the conversational factual capabilities of LLMs in real-world knowledge-intensive applications. Code is available at https://github.com/hardenyu21/KnowMT-Bench.
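The dynamic evaluation setting reduces to a short loop in which the model's own answers become the history for later turns, and only the final turn is scored. The `model` interface below is a hypothetical stand-in.

```python
def dynamic_multiturn_eval(model, questions):
    """KnowMT-Bench-style dynamic setting (sketch): self-generated histories."""
    history = []
    for q in questions:                 # logically progressive question sequence
        answer = model(history, q)      # hypothetical chat interface
        history.append((q, answer))
    return history[-1][1]               # only the final-turn answer is evaluated
```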
[41] Enhancing Low-Rank Adaptation with Structured Nonlinear Transformations
Guanzhi Deng, Mingyang Liu, Dapeng Wu, Yinqiao Li, Linqi Song
Main category: cs.CL
TL;DR: LoRAN extends LoRA with non-linear transformations and introduces Sinter, a sine-based activation, improving performance over QLoRA in summarization and classification tasks.
Details
Motivation: The linear nature of LoRA limits its expressiveness, prompting the development of a non-linear extension to enhance fine-tuning capabilities for large language models.
Method: LoRAN applies lightweight non-linear transformations to low-rank updates and uses Sinter, a sine-based activation that adds structured perturbations without increasing parameters.
Result: Experiments show LoRAN consistently outperforms QLoRA across summarization and classification tasks, with Sinter surpassing standard activations like Sigmoid, ReLU, and Tanh.
Conclusion: The study highlights the importance of activation design in low-rank tuning, demonstrating that non-linear extensions and specialized activations can significantly improve parameter-efficient fine-tuning.
Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models. However, its linear nature limits expressiveness. We propose LoRAN, a non-linear extension of LoRA that applies lightweight transformations to the low-rank updates. We further introduce Sinter, a sine-based activation that adds structured perturbations without increasing parameter count. Experiments across summarization and classification tasks show that LoRAN consistently improves over QLoRA. Ablation studies reveal that Sinter outperforms standard activations such as Sigmoid, ReLU, and Tanh, highlighting the importance of activation design in low-rank tuning.
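The forward pass is easy to sketch. The exact Sinter formula is not given in the abstract, so `z + sin(z)` below is an assumed stand-in that adds a structured, parameter-free perturbation; the rank, scaling, and initialization follow common LoRA practice rather than the paper.

```python
import torch
import torch.nn as nn

class LoRANLinear(nn.Module):
    """LoRA with a nonlinear transform on the low-rank path (sketch)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    @staticmethod
    def sinter(z):
        return z + torch.sin(z)   # assumed sine-based activation, no new params

    def forward(self, x):
        delta = self.sinter(x @ self.A.T) @ self.B.T  # nonlinear low-rank update
        return self.base(x) + self.scale * delta
```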
[42] LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals
Min-Hsuan Yeh, Yixuan Li, Tanwi Mallick
Main category: cs.CL
TL;DR: LUMINA is a novel framework that detects hallucinations in RAG systems by quantifying context-knowledge signals through distributional distance for external context and token evolution tracking across transformer layers for internal knowledge.
Details
Motivation: RAG-based LLMs still hallucinate even with correct context due to imbalance between external context and internal knowledge usage, and existing detection methods require extensive hyperparameter tuning, limiting generalizability.
Method: Quantifies external context utilization via distributional distance, measures internal knowledge utilization by tracking predicted token evolution across transformer layers, and introduces statistical validation framework.
Result: Achieves consistently high AUROC and AUPRC scores, outperforming prior methods by up to +13% AUROC on HalluRAG benchmark, and remains robust under relaxed assumptions about retrieval quality and model matching.
Conclusion: LUMINA offers an effective and practical solution for hallucination detection in RAG systems, combining statistical validation with robust performance across different scenarios.
Abstract: Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations in large language models (LLMs) by grounding responses in retrieved documents. Yet, RAG-based LLMs still hallucinate even when provided with correct and sufficient context. A growing line of work suggests that this stems from an imbalance between how models use external context and their internal knowledge, and several approaches have attempted to quantify these signals for hallucination detection. However, existing methods require extensive hyperparameter tuning, limiting their generalizability. We propose LUMINA, a novel framework that detects hallucinations in RAG systems through context-knowledge signals: external context utilization is quantified via distributional distance, while internal knowledge utilization is measured by tracking how predicted tokens evolve across transformer layers. We further introduce a framework for statistically validating these measurements. Experiments on common RAG hallucination benchmarks and four open-source LLMs show that LUMINA achieves consistently high AUROC and AUPRC scores, outperforming prior utilization-based methods by up to +13% AUROC on HalluRAG. Moreover, LUMINA remains robust under relaxed assumptions about retrieval quality and model matching, offering both effectiveness and practicality.
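The external-context signal can be illustrated with a concrete distributional distance. Jensen-Shannon divergence between the next-token distributions with and without the retrieved documents is one natural choice; the paper's exact distance and the layer-wise internal-knowledge tracking are not shown here.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two next-token distributions."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Context-utilization signal (sketch):
#   p_with    = softmax(model(query + retrieved_docs))   # next-token dist.
#   p_without = softmax(model(query))
#   A small js_divergence(p_with, p_without) suggests the retrieved context
#   barely moved the prediction, a potential hallucination indicator.
```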
[43] No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang
Main category: cs.CL
TL;DR: RL-ZVP is a new reinforcement learning algorithm that extracts learning signals from zero-variance prompts (where all responses get the same reward), improving LLM reasoning by rewarding correctness and penalizing errors without requiring contrasting responses.
Details
Motivation: Current RLVR methods like GRPO ignore zero-variance prompts where all responses receive the same reward, treating them as useless. The authors argue these prompts can provide meaningful feedback for policy optimization.
Method: RL-ZVP extracts learning signals from zero-variance prompts by directly rewarding correctness and penalizing errors even without contrasting responses, using token-level characteristics to modulate feedback and preserve nuanced signals.
Result: Across six math reasoning benchmarks, RL-ZVP achieved significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, consistently outperforming baselines that filter out zero-variance prompts.
Conclusion: Zero-variance prompts have untapped potential for learning in RLVR, and RL-ZVP successfully leverages them to improve LLM reasoning abilities.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward - so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
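One way to read the advantage shaping is sketched below: when a prompt's response group has zero reward variance, the usual group-normalized advantage vanishes, so a signed, entropy-modulated signal is substituted. The specific modulation is an assumption; the abstract says only that feedback is modulated with token-level characteristics.

```python
import torch

def shaped_advantages(rewards, token_entropy, eps=1e-8):
    """Advantage shaping for one prompt's response group (sketch).

    rewards       : (G,)   scalar reward per response, e.g., 0/1 correctness
    token_entropy : (G, T) per-token policy entropy for each response
    """
    if rewards.std() > eps:                       # ordinary GRPO-style prompt
        adv = (rewards - rewards.mean()) / (rewards.std() + eps)
        return adv.unsqueeze(-1).expand_as(token_entropy)
    # Zero-variance prompt: reward correctness (+) or penalize errors (-),
    # modulated by normalized token-level entropy (assumed form).
    sign = 1.0 if float(rewards[0]) > 0 else -1.0
    return sign * token_entropy / (token_entropy.mean() + eps)
```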
[44] QoNext: Towards Next-generation QoE for Foundation Models
Yijin Guo, Ye Shen, Farong Wen, Junying Wang, Zicheng Zhang, Qi Jia, Guangtao Zhai
Main category: cs.CL
TL;DR: QoNext is a framework that adapts Quality of Experience (QoE) principles to evaluate foundation models, focusing on user interaction experience rather than just output correctness.
Details
Motivation: Current evaluation methods for foundation models focus only on output correctness and fail to capture user experience during interaction, which is crucial for user satisfaction.
Method: QoNext identifies experiential factors, conducts controlled experiments with human ratings under varied configurations, builds a QoE-oriented database, and trains predictive models to estimate user experience from system parameters.
Result: QoNext enables proactive and fine-grained evaluation of foundation models and provides actionable guidance for optimizing productized services.
Conclusion: The QoNext framework successfully bridges the gap in foundation model evaluation by focusing on user experience through QoE principles, offering practical optimization guidance.
Abstract: Existing evaluations of foundation models, including recent human-centric approaches, fail to capture what truly matters: the user’s experience during interaction. Current methods treat evaluation as a matter of output correctness alone, overlooking that user satisfaction emerges from the interplay between response quality and interaction, which limits their ability to account for the mechanisms underlying user experience. To address this gap, we introduce QoNext, the first framework that adapts Quality of Experience (QoE) principles from networking and multimedia to the assessment of foundation models. QoNext identifies experiential factors that shape user experience and incorporates them into controlled experiments, where human ratings are collected under varied configurations. From these studies we construct a QoE-oriented database and train predictive models that estimate perceived user experience from measurable system parameters. Our results demonstrate that QoNext not only enables proactive and fine-grained evaluation but also provides actionable guidance for optimizing productized foundation-model services in practice.
[45] Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Main category: cs.CL
TL;DR: EMoE enables Mixture-of-Experts models to scale activated experts at inference without extra training, expanding performance range 2-3x while improving peak performance.
Details
Motivation: Standard MoE models degrade when activating more experts at inference due to lack of learned collaboration among experts, despite expected performance improvements.
Method: Elastic Mixture-of-Experts (EMoE) trains experts to collaborate in diverse combinations and improves router selection quality without additional training overhead.
Result: EMoE significantly expands performance-scaling range to 2-3x training-time k and pushes peak performance to higher levels across various MoE settings.
Conclusion: EMoE effectively addresses expert collaboration issues in MoE models, enabling flexible inference-time scaling while maintaining robust performance across computational budgets.
Abstract: Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. Intuitively, activating more experts at inference $k'$ (where $k' > k$) means engaging a larger set of model parameters for the computation and thus is expected to improve performance. However, contrary to this intuition, we find the scaling range to be so narrow that performance begins to degrade rapidly after only a slight increase in the number of experts. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across computational budgets at inference. We conduct extensive experiments on various MoE settings. Our results show that EMoE significantly expands the effective performance-scaling range, extending it to as much as 2-3$\times$ the training-time $k$, while also pushing the model’s peak performance to a higher level.
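The inference-time knob EMoE targets is easy to see in a minimal top-k MoE layer where the number of active experts is a forward-pass argument; the EMoE training recipe itself (diverse expert combinations, router encouragement) is not reproduced in this sketch.

```python
import torch
import torch.nn.functional as F

class TopKMoE(torch.nn.Module):
    """Minimal MoE layer whose number of active experts can be changed
    at inference time (the regime EMoE trains to be robust in)."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList(
            [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.k = k  # training-time number of active experts

    def forward(self, x, k=None):
        k = k or self.k                       # pass k' > k to scale up
        weights, idx = self.router(x).topk(k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(k):                 # route tokens expert by expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
x = torch.randn(4, 64)
y_train = moe(x)        # k = 2, as at training time
y_scaled = moe(x, k=4)  # k' = 4: engage more experts at inference
```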
[46] A Large-Scale Dataset and Citation Intent Classification in Turkish with LLMs
Kemal Sami Karaca, Bahaeddin Eravcı
Main category: cs.CL
TL;DR: This paper introduces a systematic methodology and dataset for Turkish citation intent classification, using the DSPy framework for automated prompt optimization and a stacked ensemble to achieve 91.3% accuracy.
Details
Motivation: Understanding citation intent is crucial for academic research assessment, but poses unique challenges for agglutinative languages like Turkish where existing methods are limited.
Method: Created a new Turkish citation intent dataset, evaluated standard ICL with LLMs, introduced the DSPy framework for automated prompt optimization, and used a stacked ensemble with an XGBoost meta-model for classification.
Result: Achieved state-of-the-art accuracy of 91.3% using the optimized ensemble approach, significantly outperforming standard in-context learning methods.
Conclusion: Provides Turkish NLP community with foundational dataset and robust classification framework for qualitative citation studies, paving way for future research in this domain.
Abstract: Understanding the qualitative intent of citations is essential for a comprehensive assessment of academic research, a task that poses unique challenges for agglutinative languages like Turkish. This paper introduces a systematic methodology and a foundational dataset to address this problem. We first present a new, publicly available dataset of Turkish citation intents, created with a purpose-built annotation tool. We then evaluate the performance of standard In-Context Learning (ICL) with Large Language Models (LLMs), demonstrating that its effectiveness is limited by inconsistent results caused by manually designed prompts. To address this core limitation, we introduce a programmable classification pipeline built on the DSPy framework, which automates prompt optimization systematically. For final classification, we employ a stacked generalization ensemble to aggregate outputs from multiple optimized models, ensuring stable and reliable predictions. This ensemble, with an XGBoost meta-model, achieves a state-of-the-art accuracy of 91.3%. Ultimately, this study provides the Turkish NLP community and the broader academic circles with a foundational dataset and a robust classification framework paving the way for future qualitative citation studies.
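The final stacking stage is straightforward to sketch with scikit-learn and XGBoost; the simple classifiers below stand in for the DSPy-optimized LLM pipelines whose outputs the paper aggregates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Toy stand-in for citation-intent features; 3 intent classes.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],            # stand-ins for base models
    final_estimator=XGBClassifier(n_estimators=100),  # XGBoost meta-model
    stack_method="predict_proba",  # meta-model sees base class probabilities
)
stack.fit(X_tr, y_tr)
print("held-out accuracy:", stack.score(X_te, y_te))
```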
[47] AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition
Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu, Xiaoming Zhai
Main category: cs.CL
TL;DR: AutoSCORE is a multi-agent LLM framework that improves automated scoring by extracting rubric-aligned components from student responses before assigning final scores, enhancing accuracy and interpretability.
Details
Motivation: Current LLM-based automated scoring faces challenges like low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment, hindering practical implementation in educational assessment.
Method: A two-agent framework: the first agent extracts rubric-relevant components and encodes them into a structured representation, the second agent uses this to assign final scores, mimicking the human grading process.
Result: AutoSCORE consistently improved scoring accuracy, human-machine agreement (QWK, correlations), and error metrics (MAE, RMSE) across diverse tasks and rubrics, with strong benefits on complex rubrics and smaller LLMs.
Conclusion: Structured component recognition combined with multi-agent design provides a scalable, reliable, and interpretable solution for automated scoring in education.
Abstract: Automated scoring plays a crucial role in education by reducing the reliance on human raters, offering scalable and immediate evaluation of student work. While large language models (LLMs) have shown strong potential in this task, their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment. These issues hinder the implementation of LLM-based automated scoring in assessment practice. To address the limitations, we propose AutoSCORE, a multi-agent LLM framework enhancing automated scoring via rubric-aligned Structured COmponent REcognition. With two agents, AutoSCORE first extracts rubric-relevant components from student responses and encodes them into a structured representation (i.e., Scoring Rubric Component Extraction Agent), which is then used to assign final scores (i.e., Scoring Agent). This design ensures that model reasoning follows a human-like grading process, enhancing interpretability and robustness. We evaluate AutoSCORE on four benchmark datasets from the ASAP benchmark, using both proprietary and open-source LLMs (GPT-4o, LLaMA-3.1-8B, and LLaMA-3.1-70B). Across diverse tasks and rubrics, AutoSCORE consistently improves scoring accuracy, human-machine agreement (QWK, correlations), and error metrics (MAE, RMSE) compared to single-agent baselines, with particularly strong benefits on complex, multi-dimensional rubrics, and especially large relative gains on smaller LLMs. These results demonstrate that structured component recognition combined with multi-agent design offers a scalable, reliable, and interpretable solution for automated scoring.
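A minimal sketch of the two-agent flow, assuming a generic prompt-to-completion callable `llm`; the prompts and JSON schema here are illustrative, not the paper's.

```python
import json

def auto_score(response_text, rubric, llm):
    """Two-stage AutoSCORE-style pipeline sketch."""
    # Agent 1 (component extraction): pull rubric-relevant pieces of the
    # student response into a structured representation.
    extraction = llm(
        f"Rubric: {rubric}\nStudent response: {response_text}\n"
        "Return JSON: [{\"component\": ..., \"evidence\": ..., "
        "\"satisfied\": true/false}]")
    components = json.loads(extraction)
    # Agent 2 (scoring): grade from the structured representation rather
    # than the raw response, mirroring a human grader's workflow.
    verdict = llm(
        f"Rubric: {rubric}\nExtracted components: {json.dumps(components)}\n"
        "Assign the final integer score. Return only the number.")
    return int(verdict.strip()), components
```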
[48] SimulSense: Sense-Driven Interpreting for Efficient Simultaneous Speech Translation
Haotian Tan, Hiroki Ouchi, Sakriani Sakti
Main category: cs.CL
TL;DR: SimulSense is a novel framework for simultaneous speech translation that mimics human interpreters by continuously reading input speech and triggering write decisions when new sense units are perceived, achieving superior quality-latency tradeoff and 9.6x faster decision-making than state-of-the-art baselines.
Details
Motivation: Current SimulST systems require specialized interleaved training data and rely on computationally expensive LLM inference for decision-making, which limits their real-time efficiency and practical deployment.
Method: Proposes SimulSense framework that continuously reads input speech and triggers write decisions when new sense units are perceived, mimicking human interpreter behavior without requiring specialized training data or expensive LLM inference.
Result: Achieves superior quality-latency tradeoff compared to two state-of-the-art baseline systems, with decision-making up to 9.6x faster than the baselines.
Conclusion: SimulSense provides a more efficient and practical approach to simultaneous speech translation by mimicking human interpreter decision-making processes, substantially improving real-time efficiency while maintaining translation quality.
Abstract: How can simultaneous speech translation (SimulST) systems make human-interpreter-like read/write decisions? Current state-of-the-art systems formulate SimulST as a multi-turn dialogue task, requiring specialized interleaved training data and relying on computationally expensive large language model (LLM) inference for decision-making. In this paper, we propose SimulSense, a novel framework for SimulST that mimics human interpreters by continuously reading input speech and triggering write decisions to produce translation when a new sense unit is perceived. Experiments against two state-of-the-art baseline systems demonstrate that our proposed method achieves a superior quality-latency tradeoff and substantially improved real-time efficiency, where its decision-making is up to 9.6x faster than the baselines.
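The read/write policy reduces to a short loop once a sense-unit detector is available; `is_new_sense_unit` and `translate` below are hypothetical stand-ins for the paper's components.

```python
def simul_translate(speech_chunks, is_new_sense_unit, translate):
    """Sense-driven simultaneous translation loop (sketch)."""
    buffer, outputs = [], []
    for chunk in speech_chunks:            # READ: consume input continuously
        buffer.append(chunk)
        if is_new_sense_unit(buffer):      # a complete sense unit perceived
            outputs.append(translate(buffer))  # WRITE: emit its translation
            buffer = []                    # start accumulating the next unit
    if buffer:                             # flush any trailing partial unit
        outputs.append(translate(buffer))
    return outputs
```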
[49] Why Chain of Thought Fails in Clinical Text Understanding
Jiageng Wu, Kevin Xie, Bowen Gu, Nils Krüger, Kueiyu Joshua Lin, Jie Yang
Main category: cs.CL
TL;DR: Chain-of-thought prompting degrades performance for most LLMs in clinical text tasks, creating a paradox where interpretability improves but reliability decreases.
Details
Motivation: To systematically evaluate CoT prompting effectiveness in clinical contexts, particularly for EHRs which are lengthy, fragmented, and noisy, given the critical need for accuracy and transparent reasoning in clinical care.
Method: Large-scale study assessing 95 advanced LLMs on 87 real-world clinical text tasks across 9 languages and 8 task types, with fine-grained analyses of reasoning length, medical concept alignment, and error profiles using both LLM-as-a-judge and clinical expert evaluation.
Result: 86.3% of models suffered consistent performance degradation with CoT; more capable models remained relatively robust while weaker ones declined substantially.
Conclusion: CoT enhances interpretability but may undermine reliability in clinical text tasks, highlighting a critical paradox and the need for transparent and trustworthy approaches in clinical reasoning.
Abstract: Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly in the context of electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.
[50] Debiasing Large Language Models in Thai Political Stance Detection via Counterfactual Calibration
Kasidit Sermsri, Teerapong Panboonyuen
Main category: cs.CL
TL;DR: ThaiFACTUAL is a lightweight calibration framework that reduces political bias in LLMs for Thai political stance detection without fine-tuning, using counterfactual data augmentation and rationale-based supervision.
Details
Motivation: LLMs show systematic biases in Thai political stance detection due to indirect language, polarized figures, and entangled sentiment-stance relationships, undermining fairness and reliability.
Method: Uses counterfactual data augmentation and rationale-based supervision to disentangle sentiment from stance, with a model-agnostic framework that doesn’t require fine-tuning.
Result: Significantly reduces spurious correlations, enhances zero-shot generalization, and improves fairness across multiple LLMs. Also releases the first high-quality Thai political stance dataset.
Conclusion: Highlights the importance of culturally grounded debiasing techniques for underrepresented languages like Thai.
Abstract: Political stance detection in low-resource and culturally complex settings poses a critical challenge for large language models (LLMs). In the Thai political landscape - marked by indirect language, polarized figures, and entangled sentiment and stance - LLMs often display systematic biases such as sentiment leakage and favoritism toward entities. These biases undermine fairness and reliability. We present ThaiFACTUAL, a lightweight, model-agnostic calibration framework that mitigates political bias without requiring fine-tuning. ThaiFACTUAL uses counterfactual data augmentation and rationale-based supervision to disentangle sentiment from stance and reduce bias. We also release the first high-quality Thai political stance dataset, annotated with stance, sentiment, rationales, and bias markers across diverse entities and events. Experimental results show that ThaiFACTUAL significantly reduces spurious correlations, enhances zero-shot generalization, and improves fairness across multiple LLMs. This work highlights the importance of culturally grounded debiasing techniques for underrepresented languages.
[51] MotivGraph-SoIQ: Integrating Motivational Knowledge Graphs and Socratic Dialogue for Enhanced LLM Ideation
Xinping Lei, Tong Zhou, Yubo Chen, Kang Liu, Jun Zhao
Main category: cs.CL
TL;DR: MotivGraph-SoIQ integrates motivational knowledge graphs and Socratic dialogue to enhance LLM ideation by providing grounding and mitigating confirmation bias.
Details
Motivation: Large Language Models have potential for academic ideation but face challenges in grounding ideas and mitigating confirmation bias for refinement.
Method: Combines Motivational Knowledge Graph (storing problem, challenge, solution nodes) with dual-agent Socratic Ideator using questioning to refine ideas.
Result: Outperforms state-of-the-art approaches on ICLR25 paper topics dataset across LLM-based scoring, ELO ranking, and human evaluation metrics.
Conclusion: The framework successfully addresses LLM ideation limitations by providing essential grounding and practical improvement steps through motivational graphs and Socratic questioning.
Abstract: Large Language Models (LLMs) hold substantial potential for accelerating academic ideation but face critical challenges in grounding ideas and mitigating confirmation bias for further refinement. We propose integrating motivational knowledge graphs and Socratic dialogue to address these limitations in enhanced LLM ideation (MotivGraph-SoIQ). This novel framework provides essential grounding and practical idea improvement steps for LLM ideation by integrating a Motivational Knowledge Graph (MotivGraph) with a Q-Driven Socratic Ideator. The MotivGraph structurally stores three key node types (problem, challenge, and solution) to offer motivation grounding for the LLM ideation process. The Ideator is a dual-agent system utilizing Socratic questioning, which facilitates a rigorous refinement process that mitigates confirmation bias and improves idea quality across novelty, experimental rigor, and motivational rationality dimensions. On the ICLR25 paper topics dataset, MotivGraph-SoIQ exhibits clear advantages over existing state-of-the-art approaches across LLM-based scoring, ELO ranking, and human evaluation metrics.
[52] Black-Box Hallucination Detection via Consistency Under the Uncertain Expression
Seongho Joo, Kyungmin Min, Jahyun Koo, Kyomin Jung
Main category: cs.CL
TL;DR: A black-box hallucination detection method that uses uncertainty expression to identify when LLMs generate non-factual responses, without requiring internal model access or external resources.
Details
Motivation: LLMs like GPT3 often generate non-factual responses (hallucinations), but existing detection methods need external resources or internal model states. Black-box approaches are urgently needed due to restricted API access and limited external resources.
Method: Propose a simple black-box metric based on analyzing LLM behavior under uncertainty expression. Found that LLMs generate consistent responses for factual content but inconsistent responses for non-factual content.
Result: The proposed metric is more predictive of factuality in model responses than baselines that use internal knowledge of LLMs.
Conclusion: Expression of uncertainty can effectively detect hallucinations in black-box settings, providing a practical solution without requiring internal model access.
Abstract: Despite the great advances in language modeling in recent years, large language models (LLMs) such as GPT-3 are notorious for generating non-factual responses, the so-called “hallucination” problem. Existing methods for detecting and alleviating this problem require external resources or access to the internal state of the LLM, such as the output probability of each token. Given the restricted API availability of many LLMs and the limited scope of external resources, there is an urgent need for black-box approaches to hallucination detection. In this work, we investigate the behavior of LLMs under expressions of uncertainty. Our comprehensive analysis reveals that LLMs generate consistent responses when presenting factual content, but inconsistent responses otherwise. Based on this analysis, we propose a simple and efficient black-box hallucination detection metric built on the expression of uncertainty. Our experiments demonstrate that this metric is more predictive of the factuality of model responses than baselines that use internal knowledge of LLMs.
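In code, the metric amounts to sampling several answers under an uncertainty-inviting prompt and scoring their agreement. The prompt wording and the majority-vote agreement below are illustrative choices, with `llm_sample` a hypothetical stochastic sampler.

```python
from collections import Counter

def consistency_score(question, llm_sample, n=8):
    """Black-box factuality signal: agreement among sampled answers."""
    prompt = (f"{question}\nIf you are not sure, it is fine to say "
              "'I am not certain'. Answer briefly.")
    answers = [llm_sample(prompt).strip().lower() for _ in range(n)]
    # Factual content tends to be answered the same way repeatedly;
    # hallucinated content scatters across samples.
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / n  # near 1.0 -> likely factual; low -> suspect

# flagged = consistency_score(q, sampler) < 0.5  # threshold tuned on dev data
```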
[53] GraphSearch: An Agentic Deep Searching Workflow for Graph Retrieval-Augmented Generation
Cehao Yang, Xiaojun Wu, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Yuanliang Sun, Jia Li, Hui Xiong, Jian Guo
Main category: cs.CL
TL;DR: GraphSearch is a novel agentic deep searching workflow with dual-channel retrieval that addresses limitations in existing GraphRAG approaches by enabling comprehensive evidence retrieval and efficient utilization of structural graph data.
Details
Motivation: Existing GraphRAG approaches face limitations with shallow retrieval that fails to surface all critical evidence and inefficient utilization of pre-constructed structural graph data, hindering effective reasoning from complex queries.
Method: GraphSearch organizes retrieval into a modular framework with six modules for multi-turn interactions and iterative reasoning, plus a dual-channel retrieval strategy that issues semantic queries over chunk-based text data and relational queries over structural graph data.
Result: Experimental results across six multi-hop RAG benchmarks demonstrate that GraphSearch consistently improves answer accuracy and generation quality over traditional strategies.
Conclusion: GraphSearch represents a promising direction for advancing graph retrieval-augmented generation by enabling comprehensive utilization of both text and graph modalities through its dual-channel retrieval approach.
Abstract: Graph Retrieval-Augmented Generation (GraphRAG) enhances factual reasoning in LLMs by structurally modeling knowledge through graph-based representations. However, existing GraphRAG approaches face two core limitations: shallow retrieval that fails to surface all critical evidence, and inefficient utilization of pre-constructed structural graph data, which hinders effective reasoning from complex queries. To address these challenges, we propose GraphSearch, a novel agentic deep searching workflow with dual-channel retrieval for GraphRAG. GraphSearch organizes the retrieval process into a modular framework comprising six modules, enabling multi-turn interactions and iterative reasoning. Furthermore, GraphSearch adopts a dual-channel retrieval strategy that issues semantic queries over chunk-based text data and relational queries over structural graph data, enabling comprehensive utilization of both modalities and their complementary strengths. Experimental results across six multi-hop RAG benchmarks demonstrate that GraphSearch consistently improves answer accuracy and generation quality over the traditional strategy, confirming GraphSearch as a promising direction for advancing graph retrieval-augmented generation.
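One dual-channel retrieval round could look roughly like the sketch below, with `vector_index`, `graph`, and `embed` as placeholders for whichever stores and encoder a deployment uses; the six-module agentic loop around it is omitted.

```python
def dual_channel_retrieve(query, vector_index, graph, embed, hops=1):
    """One retrieval round: semantic channel over text chunks plus a
    relational channel over the graph (sketch with placeholder APIs)."""
    # Channel 1 (semantic): nearest text chunks to the query embedding.
    chunks = vector_index.search(embed(query), top_k=5)
    # Channel 2 (relational): expand the graph neighborhoods of entities
    # mentioned by the semantic hits.
    entities = {e for chunk in chunks for e in chunk.entities}
    triples = []
    for entity in entities:
        triples.extend(graph.neighbors(entity, hops=hops))
    return chunks, triples  # both modalities feed the reasoning modules
```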
[54] From Outliers to Topics in Language Models: Anticipating Trends in News Corpora
Evangelia Zve, Benjamin Icard, Alice Breton, Lila Sainero, Gauvain Bourgne, Jean-Gabriel Ganascia
Main category: cs.CL
TL;DR: Outliers in topic modeling can serve as early indicators of emerging topics, evolving into coherent topics over time across different models and languages.
Details
Motivation: To challenge the common practice of dismissing outliers as noise in topic modeling and explore their potential as weak signals for emerging topics in dynamic news corpora.
Method: Using vector embeddings from state-of-the-art language models and a cumulative clustering approach to track outlier evolution over time in French and English news datasets on corporate social responsibility and climate change.
Result: Outliers consistently evolve into coherent topics over time across both models and languages, revealing a consistent pattern of topic emergence.
Conclusion: Outliers should not be dismissed as noise but rather recognized as valuable indicators of emerging topics that can provide early insights into topic evolution in dynamic text corpora.
Abstract: This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track their evolution over time in French and English news datasets focused on corporate social responsibility and climate change. The results reveal a consistent pattern: outliers tend to evolve into coherent topics over time across both models and languages.
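A cumulative-clustering loop of this kind is easy to prototype. DBSCAN (whose label -1 marks outliers) stands in below for whichever clustering the authors use, on random toy embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def track_outliers(embeddings_by_month, eps=0.5):
    """Re-cluster all documents seen so far at each step and report which
    earlier outliers have been absorbed into a coherent topic."""
    dim = embeddings_by_month[0].shape[1]
    seen, outliers = np.empty((0, dim)), set()
    for t, batch in enumerate(embeddings_by_month):
        seen = np.vstack([seen, batch])
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(seen)
        absorbed = {i for i in outliers if labels[i] != -1}
        if absorbed:
            print(f"month {t}: {len(absorbed)} past outliers joined topics")
        outliers = {i for i, lab in enumerate(labels) if lab == -1}
    return labels

# Toy run: 6 months of 50 random 16-d "document embeddings" each.
track_outliers([np.random.randn(50, 16) for _ in range(6)])
```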
[55] Taxonomy of Comprehensive Safety for Clinical Agents
Jean Seo, Hyunkyung Lee, Gibaeg Kim, Wooseok Han, Jaehyo Yoo, Seungseop Lim, Kihun Shin, Eunho Yang
Main category: cs.CL
TL;DR: TACOS is a comprehensive safety taxonomy for clinical chatbots that integrates safety filtering and tool selection into a single user intent classification step, addressing nuanced clinical domain demands.
Details
Motivation: Existing safety methods like guardrails and tool calling are insufficient for clinical chatbots where inaccurate responses can have serious consequences, requiring more nuanced safety approaches.
Method: Developed TACOS, a fine-grained 21-class taxonomy that explicitly models varying safety thresholds and external tool dependencies, integrating safety filtering and tool selection into unified intent classification.
Result: Created a TACOS-annotated dataset and experiments showed the value of specialized taxonomy for clinical agents, revealing insights about training data distribution and base model pretrained knowledge.
Conclusion: TACOS provides an effective framework for comprehensive safety in clinical chatbots, demonstrating the importance of domain-specific safety taxonomies that integrate multiple safety considerations.
Abstract: Safety is a paramount concern in clinical chatbot applications, where inaccurate or harmful responses can lead to serious consequences. Existing methods, such as guardrails and tool calling, often fall short in addressing the nuanced demands of the clinical domain. In this paper, we introduce TACOS (TAxonomy of COmprehensive Safety for Clinical Agents), a fine-grained, 21-class taxonomy that integrates safety filtering and tool selection into a single user intent classification step. TACOS covers a wide spectrum of clinical and non-clinical queries, explicitly modeling varying safety thresholds and external tool dependencies. To validate our framework, we curate a TACOS-annotated dataset and perform extensive experiments. Our results demonstrate the value of a new taxonomy specialized for clinical agent settings, and reveal useful insights about training data distribution and the pretrained knowledge of base models.
[56] Fuzzy Reasoning Chain (FRC): An Innovative Reasoning Framework from Fuzziness to Clarity
Ping Chen, Xiang Liu, Zhaoxiang Liu, Zezhou Chen, Xingpeng Zhang, Huan Hu, Zipeng Wang, Kai Wang, Shuming Shi, Shiguo Lian
Main category: cs.CL
TL;DR: The paper introduces Fuzzy Reasoning Chain (FRC), a framework that combines LLM semantic priors with fuzzy membership degrees to handle ambiguity and uncertainty in text processing, improving interpretability and robustness.
Details
Motivation: Despite progress in NLP with LLMs, challenges remain in handling ambiguous, polysemous, or uncertain texts that traditional probability-based methods struggle with.
Method: FRC integrates LLM semantic priors with continuous fuzzy membership degrees, creating explicit interaction between probability-based reasoning and fuzzy membership reasoning to gradually transform ambiguous inputs into clear decisions.
Result: Validated on sentiment analysis tasks, FRC ensures stable reasoning and facilitates knowledge transfer across different model scales, capturing conflicting or uncertain signals that traditional methods miss.
Conclusion: FRC provides a general mechanism for managing subtle and ambiguous expressions with improved interpretability and robustness in NLP tasks.
Abstract: With the rapid advancement of large language models (LLMs), natural language processing (NLP) has achieved remarkable progress. Nonetheless, significant challenges remain in handling texts with ambiguity, polysemy, or uncertainty. We introduce the Fuzzy Reasoning Chain (FRC) framework, which integrates LLM semantic priors with continuous fuzzy membership degrees, creating an explicit interaction between probability-based reasoning and fuzzy membership reasoning. This transition allows ambiguous inputs to be gradually transformed into clear and interpretable decisions while capturing conflicting or uncertain signals that traditional probability-based methods cannot. We validate FRC on sentiment analysis tasks, where both theoretical analysis and empirical results show that it ensures stable reasoning and facilitates knowledge transfer across different model scales. These findings indicate that FRC provides a general mechanism for managing subtle and ambiguous expressions with improved interpretability and robustness.
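As a toy illustration of fuzzy aggregation in this spirit: each clue carries continuous membership degrees over sentiment classes (in FRC these would come from LLM semantic priors), clues are fused with a soft OR, and the result is defuzzified into a crisp label. The operators are illustrative choices, not the paper's.

```python
import numpy as np

def fuzzy_sentiment(clue_memberships):
    """Fuse per-clue fuzzy memberships over {neg, neu, pos}, then defuzzify."""
    fused = np.zeros(3)
    for m in clue_memberships:
        m = np.asarray(m, dtype=float)
        fused = fused + m - fused * m   # probabilistic sum, stays in [0, 1]
    labels = ["negative", "neutral", "positive"]
    return labels[int(fused.argmax())], fused

# "great food" strongly positive; "but slow service" mildly negative.
print(fuzzy_sentiment([[0.1, 0.2, 0.9], [0.6, 0.3, 0.1]]))
```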
[57] RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media
Yudong Li, Yufei Sun, Yuhan Yao, Peiru Yang, Wanyue Li, Jiajun Zou, Yongfeng Huang, Linlin Shen
Main category: cs.CL
TL;DR: RedNote-Vibe is the first longitudinal dataset for social media AIGT analysis spanning 5 years, and PLAD is a psycholinguistic framework that achieves superior AIGT detection while revealing relationships between linguistic features and user engagement.
Details
Motivation: Existing AIGT datasets are static and don't capture the temporal dynamics and user engagement patterns of AI-generated content on social media platforms.
Method: Created RedNote-Vibe dataset from Xiaohongshu platform with 5 years of data including user engagement metrics, and proposed PLAD framework using psycholinguistic features for interpretable AIGT detection.
Result: PLAD achieves superior detection performance and provides insights into signatures distinguishing human vs AI content, revealing complex relationships between linguistic features and social media engagement.
Conclusion: The RedNote-Vibe dataset enables temporal AIGT analysis, and PLAD offers an effective interpretable approach for social media AIGT detection with insights into linguistic-engagement relationships.
Abstract: The proliferation of Large Language Models (LLMs) has led to widespread AI-Generated Text (AIGT) on social media platforms, creating unique challenges where content dynamics are driven by user engagement and evolve over time. However, existing datasets mainly support static AIGT detection. In this work, we introduce RedNote-Vibe, the first longitudinal (5-year) dataset for social media AIGT analysis. This dataset is sourced from the Xiaohongshu platform, containing user engagement metrics (e.g., likes, comments) and timestamps spanning from the pre-LLM period to July 2025, which enables research into the temporal dynamics and user interaction patterns of AIGT. Furthermore, to detect AIGT in the context of social media, we propose the PsychoLinguistic AIGT Detection Framework (PLAD), an interpretable approach that leverages psycholinguistic features. Our experiments show that PLAD achieves superior detection performance and provides insights into the signatures distinguishing human and AI-generated content. More importantly, it reveals the complex relationship between these linguistic features and social media engagement. The dataset is available at https://github.com/testuser03158/RedNote-Vibe.
[58] The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems
Anya Belz, Simon Mille, Craig Thomson
Main category: cs.CL
TL;DR: QCET creates a standardized taxonomy for NLP evaluation quality criteria to address the problem of misleading comparability between evaluations that use the same criterion names but measure different aspects.
Details
Motivation: NLP evaluations often use the same quality criterion names (e.g., Fluency) but measure different aspects, making comparisons unreliable and hindering scientific progress in the field.
Method: Derived a standard set of quality criterion names and definitions from three surveys of NLP evaluations, structured into a hierarchy where parent nodes capture common aspects of child nodes.
Result: Developed QCET (Quality Criteria for Evaluation Taxonomy) - a standardized taxonomy that provides clear definitions and hierarchical structure for NLP quality criteria.
Conclusion: QCET enables establishing comparability of existing evaluations, guiding new evaluation designs, and assessing regulatory compliance, addressing a long-standing issue in NLP evaluation methodology.
Abstract: Prior work has shown that two NLP evaluation experiments that report results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality, and the comparability implied by the name can be misleading. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw reliable conclusions about system quality on the basis of multiple, independently conducted evaluations. This in turn hampers the ability of the field to progress scientifically as a whole, a pervasive issue in NLP since its beginning (Sparck Jones, 1981). It is hard to see how the issue of unclear comparability can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the several hundred quality criterion names actually in use in the field can be mapped to, and grounded in. Taking a strictly descriptive approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of quality criterion names and definitions from three surveys of evaluations reported in NLP, and structures them into a hierarchy where each parent node captures common aspects of its child nodes. We present QCET and the resources it consists of, and discuss its three main uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulatory compliance.
[59] Fine-tuning Done Right in Model Editing
Wanli Yang, Fei Sun, Rui Tang, Hongyu Zang, Du Su, Qi Cao, Jingang Wang, Huawei Shen, Xueqi Cheng
Main category: cs.CL
TL;DR: Fine-tuning is actually effective for model editing when using breadth-first (epoch-based) pipeline with mini-batch optimization instead of the traditional depth-first approach, and when combined with optimal tuning locations.
Details
Motivation: To challenge the long-standing belief that fine-tuning is ineffective for model editing, arguing that previous failures were due to suboptimal pipeline design rather than inherent limitations of fine-tuning.
Method: Proposed LocFT-BF method that restores fine-tuning to standard breadth-first pipeline with mini-batch optimization and systematically analyzes tuning locations for localized editing.
Result: Outperforms state-of-the-art methods by large margins, sustains 100K edits and 72B-parameter models (10x beyond prior practice) without sacrificing general capabilities.
Conclusion: Fine-tuning can be advanced from an underestimated baseline to a leading method for model editing by using proper pipeline design and localized tuning strategy.
Abstract: Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing. Here, we challenge this belief, arguing that the reported failure arises not from the inherent limitation of fine-tuning itself, but from adapting it to the sequential nature of the editing task, a single-pass depth-first pipeline that optimizes each sample to convergence before moving on. While intuitive, this depth-first pipeline coupled with sample-wise updating over-optimizes each edit and induces interference across edits. Our controlled experiments reveal that simply restoring fine-tuning to the standard breadth-first (i.e., epoch-based) pipeline with mini-batch optimization substantially improves its effectiveness for model editing. Moreover, fine-tuning in editing also suffers from suboptimal tuning parameter locations inherited from prior methods. Through systematic analysis of tuning locations, we derive LocFT-BF, a simple and effective localized editing method built on the restored fine-tuning framework. Extensive experiments across diverse LLMs and datasets demonstrate that LocFT-BF outperforms state-of-the-art methods by large margins. Notably, to our knowledge, it is the first to sustain 100K edits and 72B-parameter models, 10x beyond prior practice, without sacrificing general capabilities. By clarifying a long-standing misconception and introducing a principled localized tuning strategy, we advance fine-tuning from an underestimated baseline to a leading method for model editing, establishing a solid foundation for future research.
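The pipeline distinction at the heart of the paper fits in a few lines; `model.sgd_step` below is a hypothetical hook for one optimizer step over a batch of edits.

```python
def depth_first_edit(model, edits, steps=50):
    """Prior editing pipeline: optimize each edit to convergence before
    moving on (the pattern the paper identifies as over-optimizing)."""
    for edit in edits:
        for _ in range(steps):
            model.sgd_step([edit])

def breadth_first_edit(model, edits, epochs=10, batch_size=8):
    """Restored standard fine-tuning: epochs of mini-batches drawn across
    all edits, which the paper finds far more effective for editing."""
    for _ in range(epochs):
        for i in range(0, len(edits), batch_size):
            model.sgd_step(edits[i:i + batch_size])
```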
[60] COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Dmitriy Shopkhoev, Denis Makhov, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
Main category: cs.CL
TL;DR: CoSpaDi is a training-free LLM compression method that uses structured sparse dictionary learning instead of low-rank approximation, achieving better accuracy preservation through union-of-subspaces representation and data-aware optimization.
Details
Motivation: Low-rank weight approximation for LLM compression imposes rigid structural constraints that lead to noticeable accuracy drops, necessitating a more flexible approach.
Method: Replaces low-rank decomposition with structured sparse factorization using a dense dictionary and column-sparse coefficient matrix, optimized with calibration data to minimize functional reconstruction error.
Result: Consistently superior to state-of-the-art low-rank methods in accuracy and perplexity across Llama and Qwen models at 20-50% compression ratios.
Conclusion: Structured sparse dictionary learning is a powerful alternative to conventional low-rank approaches for efficient LLM deployment, offering better expressiveness and model fidelity.
Abstract: Post-training compression of large language models (LLMs) largely relies on low-rank weight approximation, which represents each column of a weight matrix in a shared low-dimensional subspace. While this is a computationally efficient strategy, the imposed structural constraint is rigid and can lead to a noticeable model accuracy drop. In this work, we propose CoSpaDi (Compression via Sparse Dictionary Learning), a novel training-free compression framework that replaces low-rank decomposition with a more flexible structured sparse factorization in which each weight matrix is represented with a dense dictionary and a column-sparse coefficient matrix. This formulation enables a union-of-subspaces representation: different columns of the original weight matrix are approximated in distinct subspaces spanned by adaptively selected dictionary atoms, offering greater expressiveness than a single invariant basis. Crucially, CoSpaDi leverages a small calibration dataset to optimize the factorization such that the output activations of compressed projection layers closely match those of the original ones, thereby minimizing functional reconstruction error rather than mere weight approximation. This data-aware strategy preserves better model fidelity without any fine-tuning under reasonable compression ratios. Moreover, the resulting structured sparsity allows efficient sparse-dense matrix multiplication and is compatible with post-training quantization for further memory and latency gains. We evaluate CoSpaDi across multiple Llama and Qwen models under per-layer and per-group settings at 20-50% compression ratios, demonstrating consistent superiority over state-of-the-art data-aware low-rank methods both in accuracy and perplexity. Our results establish structured sparse dictionary learning as a powerful alternative to conventional low-rank approaches for efficient LLM deployment.
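The factorization structure (dense dictionary, column-sparse coefficients) can be prototyped with scikit-learn's dictionary learning applied to a weight matrix's columns. Note this plain version fits the weights only, whereas CoSpaDi additionally fits against calibration activations (minimizing ||XW - XDC||), which the sketch omits.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))   # toy projection-layer weight matrix

# Sparse-code the columns of W: W ~= D @ C with a dense dictionary D and
# a column-sparse C, so different columns live in different small subspaces.
n_atoms, atoms_per_col = 32, 4
dl = DictionaryLearning(n_components=n_atoms,
                        fit_algorithm="lars",
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=atoms_per_col,
                        max_iter=20)
C_t = dl.fit_transform(W.T)          # (64, 32): each row is 4-sparse
D = dl.components_.T                 # (128, 32): dense dictionary
W_hat = D @ C_t.T
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```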
[61] Multilingual Dialogue Generation and Localization with Dialogue Act Scripting
Justin Vasselli, Eunike Andriani Kardinata, Yusuke Sakai, Taro Watanabe
Main category: cs.CL
TL;DR: DAS is a framework for generating multilingual dialogues from abstract intent representations, avoiding translation artifacts and improving cultural appropriateness.
Details
Motivation: Non-English dialogue datasets are scarce, and using translated dialogues introduces artifacts that reduce naturalness and cultural appropriateness.
Method: Proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations instead of direct translation.
Result: Human evaluations across Italian, German, and Chinese show DAS-generated dialogues outperform both machine and human translators on cultural relevance, coherence, and situational appropriateness.
Conclusion: DAS enables generation of culturally appropriate multilingual dialogues by using structured dialogue act representations, mitigating translationese and improving fluency.
Abstract: Non-English dialogue datasets are scarce, and models are often trained or evaluated on translations of English-language dialogues, an approach which can introduce artifacts that reduce their naturalness and cultural appropriateness. This work proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations. Rather than translating dialogue utterances directly, DAS enables the generation of new dialogues in the target language that are culturally and contextually appropriate. By using structured dialogue act representations, DAS supports flexible localization across languages, mitigating translationese and enabling more fluent, naturalistic conversations. Human evaluations across Italian, German, and Chinese show that DAS-generated dialogues consistently outperform those produced by both machine and human translators on measures of cultural relevance, coherence, and situational appropriateness.
[62] S2J: Bridging the Gap Between Solving and Judging Ability in Generative Reward Models
Shaoning Sun, Jiachen Yu, Zongqi Wang, Xuewei Yang, Tianle Gu, Yujiu Yang
Main category: cs.CL
TL;DR: The paper identifies a ‘solve-to-judge gap’ where generative reward models (GRMs) can solve problems but fail to make correct judgments on them, and proposes a Solve-to-Judge (S2J) approach that simultaneously leverages solving and judging capabilities to bridge this gap.
Details
Motivation: To address the significant solve-to-judge gap observed in generative reward models, where models struggle to make correct judgments on queries they can actually solve (14%-37% failure rate), despite the general understanding that stronger problem-solving capabilities should lead to better judgment abilities.
Method: Proposes Solve-to-Judge (S2J) approach that simultaneously leverages both solving and judging capabilities on a single GRM’s output for supervision, explicitly linking problem-solving and evaluation abilities during model optimization to narrow the solve-to-judge gap.
Result: S2J effectively reduces the solve-to-judge gap by 16.2% and enhances judgment performance by 5.8%, achieving state-of-the-art performance among GRMs built on the same base model while using significantly smaller training datasets, accomplished through self-evolution without external model distillation.
Conclusion: The Solve-to-Judge approach successfully bridges the solve-to-judge gap in generative reward models by explicitly connecting problem-solving and judgment capabilities during optimization, leading to improved performance with less data and no reliance on external models.
Abstract: With the rapid development of large language models (LLMs), generative reward models (GRMs) have been widely adopted for reward modeling and evaluation. Previous studies have primarily focused on training specialized GRMs by optimizing them on preference datasets with the judgment correctness as supervision. While it’s widely accepted that GRMs with stronger problem-solving capabilities typically exhibit superior judgment abilities, we first identify a significant solve-to-judge gap when examining individual queries. Specifically, the solve-to-judge gap refers to the phenomenon where GRMs struggle to make correct judgments on some queries (14%-37%), despite being fully capable of solving them. In this paper, we propose the Solve-to-Judge (S2J) approach to address this problem. Specifically, S2J simultaneously leverages both the solving and judging capabilities on a single GRM’s output for supervision, explicitly linking the GRM’s problem-solving and evaluation abilities during model optimization, thereby narrowing the gap. Our comprehensive experiments demonstrate that S2J effectively reduces the solve-to-judge gap by 16.2%, thereby enhancing the model’s judgment performance by 5.8%. Notably, S2J achieves state-of-the-art (SOTA) performance among GRMs built on the same base model while utilizing a significantly smaller training dataset. Moreover, S2J accomplishes this through self-evolution without relying on more powerful external models for distillation.
[63] Think Right, Not More: Test-Time Scaling for Numerical Claim Verification
Primakov Chungkham, V Venktesh, Vinay Setty, Avishek Anand
Main category: cs.CL
TL;DR: This paper introduces VERIFIERFC, a system that uses test-time scaling with multiple reasoning paths and a verifier model to improve fact-checking of complex numerical claims, achieving 18.8% performance improvement over single-shot methods.
Details
Motivation: Current LLMs struggle with fact-checking real-world numerical claims due to difficulties with compositional and numerical reasoning, understanding numerical nuances, and reasoning drift issues where models misinterpret information.
Method: The approach uses test-time compute scaling to elicit multiple reasoning paths from LLMs, trains a verifier model (VERIFIERFC) to select the best reasoning path, and introduces an adaptive mechanism for selective TTS based on claim complexity.
Result: The method achieves 1.8x higher efficiency than standard TTS and delivers 18.8% performance improvement over single-shot claim verification methods, while effectively mitigating reasoning drift issues.
Conclusion: Test-time compute scaling with verifier models and adaptive mechanisms significantly improves fact-checking of complex numerical claims by addressing reasoning drift and enhancing computational efficiency.
Abstract: Fact-checking real-world claims, particularly numerical claims, is an inherently complex task that requires multistep and numerical reasoning to verify the diverse aspects of a claim. Although large language models (LLMs), including reasoning models, have made tremendous advances, they still fall short on fact-checking real-world claims that require a combination of compositional and numerical reasoning. They are unable to understand the nuances of numerical aspects, and are also susceptible to the reasoning drift issue, where the model is unable to contextualize diverse information, resulting in misinterpretation and backtracking of the reasoning process. In this work, we systematically explore scaling test-time compute (TTS) for LLMs on the task of fact-checking complex numerical claims, which entails eliciting multiple reasoning paths from an LLM. We train a verifier model (VERIFIERFC) to navigate this space of possible reasoning paths and select one that could lead to the correct verdict. We observe that TTS helps mitigate the reasoning drift issue, leading to significant performance gains for fact-checking numerical claims. To improve compute efficiency in TTS, we introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach achieves 1.8x higher efficiency than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods. Our code and data can be found at https://github.com/VenkteshV/VerifierFC
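Operationally, the selective test-time scaling reduces to the skeleton below, where `sampler`, `verifier`, and `judge_complexity` are hypothetical callables standing in for the paper's trained components.

```python
def verify_claim(claim, evidence, sampler, verifier, judge_complexity,
                 n_paths=8):
    """Selective TTS: scale only claims judged complex, then let the
    verifier pick the best of several sampled reasoning paths."""
    if not judge_complexity(claim):              # simple claim: single shot
        return sampler(claim, evidence)["verdict"]
    paths = [sampler(claim, evidence) for _ in range(n_paths)]
    best = max(paths, key=lambda p: verifier(claim, evidence, p["steps"]))
    return best["verdict"]                       # verdict of best-scored path
```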
[64] Universal Legal Article Prediction via Tight Collaboration between Supervised Classification Model and LLM
Xiao Chi, Wenlin Zhong, Yiquan Wu, Wei Wang, Kun Kuang, Fei Wu, Minghui Xiong
Main category: cs.CL
TL;DR: Uni-LAP is a universal framework for legal article prediction that combines supervised classification models and large language models through tight collaboration to overcome limitations of existing methods.
Details
Motivation: Existing methods struggle with legal article prediction - supervised models can't capture complex fact patterns well, while LLMs perform poorly due to the abstract nature of legal articles. Most approaches are also jurisdiction-specific and lack broader applicability.
Method: Uni-LAP integrates SCMs and LLMs: SCM uses novel Top-K loss function to generate candidate articles, while LLM employs syllogism-inspired reasoning to refine final predictions.
Result: Empirical evaluation on multiple jurisdiction datasets shows Uni-LAP consistently outperforms existing baselines, demonstrating effectiveness and generalizability.
Conclusion: The proposed Uni-LAP framework successfully addresses limitations of existing methods by combining strengths of SCMs and LLMs, showing improved performance and broader applicability across different legal systems.
Abstract: Legal Article Prediction (LAP) is a critical task in legal text classification, leveraging natural language processing (NLP) techniques to automatically predict relevant legal articles based on the fact descriptions of cases. As a foundational step in legal decision-making, LAP plays a pivotal role in determining subsequent judgments, such as charges and penalties. Despite its importance, existing methods face significant challenges in addressing the complexities of LAP. Supervised classification models (SCMs), such as CNN and BERT, struggle to fully capture intricate fact patterns due to their inherent limitations. Conversely, large language models (LLMs), while excelling in generative tasks, perform suboptimally in predictive scenarios due to the abstract and ID-based nature of legal articles. Furthermore, the diversity of legal systems across jurisdictions exacerbates the issue, as most approaches are tailored to specific countries and lack broader applicability. To address these limitations, we propose Uni-LAP, a universal framework for legal article prediction that integrates the strengths of SCMs and LLMs through tight collaboration. Specifically, in Uni-LAP, the SCM is enhanced with a novel Top-K loss function to generate accurate candidate articles, while the LLM employs syllogism-inspired reasoning to refine the final predictions. We evaluated Uni-LAP on datasets from multiple jurisdictions, and empirical results demonstrate that our approach consistently outperforms existing baselines, showcasing its effectiveness and generalizability.
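The abstract does not spell out the Top-K loss, but one plausible reading is a hinge that asks the gold article to rank inside the top-K candidates handed to the LLM; the sketch below shows that illustrative form, not the paper's exact objective.

```python
import torch

def top_k_hinge_loss(logits, target, k=5, margin=1.0):
    """Penalize the SCM only insofar as the gold article fails to beat
    the k-th ranked candidate by a margin (illustrative formulation)."""
    topk_vals = logits.topk(k, dim=-1).values            # (B, k), sorted
    gold = logits.gather(-1, target.unsqueeze(-1))       # (B, 1)
    slack = topk_vals[:, -1:] - gold + margin            # vs k-th largest
    return torch.clamp(slack, min=0).mean()

logits = torch.randn(8, 120)            # scores over 120 candidate articles
target = torch.randint(0, 120, (8,))
print(top_k_hinge_loss(logits, target))
```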
[65] Multilingual Vision-Language Models, A Survey
Andrei-Alexandru Manea, Jindřich Libovický
Main category: cs.CL
TL;DR: Survey of 31 multilingual vision-language models and 21 benchmarks, revealing tension between language neutrality (consistent representations) and cultural awareness (context adaptation), with current methods favoring neutrality and evaluation gaps.
Details
Motivation: To examine multilingual vision-language models that process text and images across languages and understand the trade-offs between language neutrality and cultural awareness in these systems.
Method: Comprehensive review of 31 models and 21 benchmarks, analyzing encoder-only and generative architectures, training methods (contrastive learning), and evaluation approaches (translation-based vs culturally grounded).
Result: Found current training methods favor language neutrality over cultural awareness, with two-thirds of benchmarks using translation-based approaches. Identified discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
Conclusion: There is a fundamental tension between language neutrality and cultural awareness in multilingual vision-language models, with current approaches prioritizing neutrality through contrastive learning, while cultural awareness remains underdeveloped and evaluation methods need better alignment with real-world cultural contexts.
Abstract: This survey examines multilingual vision-language models that process text and images across languages. We review 31 models and 21 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
[66] FoodSEM: Large Language Model Specialized in Food Named-Entity Linking
Ana Gjorgjevikj, Matej Martinc, Gjorgjina Cenikj, Sašo Džeroski, Barbara Koroušić Seljak, Tome Eftimov
Main category: cs.CL
TL;DR: FoodSEM is a fine-tuned LLM for food named-entity linking that achieves state-of-the-art performance (up to 98% F1) by linking food entities to ontologies like FoodOn, SNOMED-CT, and Hansard taxonomy.
Details
Motivation: Food NEL cannot be accurately solved by general-purpose LLMs or existing domain-specific models, creating a need for specialized food entity linking capabilities.
Method: Uses instruction-response fine-tuning on food-annotated corpora to link food entities to multiple ontologies, comparing against zero-shot, one-shot, and few-shot LLM baselines.
Result: Achieves state-of-the-art performance with F1 scores reaching 98% on some ontologies and datasets, significantly outperforming non-fine-tuned versions.
Conclusion: Provides three main contributions: food-annotated corpora in IR format, a robust food semantic understanding model, and a strong baseline for future food NEL benchmarking.
Abstract: This paper introduces FoodSEM, a state-of-the-art fine-tuned open-source large language model (LLM) for named-entity linking (NEL) to food-related ontologies. To the best of our knowledge, food NEL is a task that cannot be accurately solved by state-of-the-art general-purpose (large) language models or custom domain-specific models/systems. Through an instruction-response (IR) scenario, FoodSEM links food-related entities mentioned in a text to several ontologies, including FoodOn, SNOMED-CT, and the Hansard taxonomy. The FoodSEM model achieves state-of-the-art performance compared to related models/systems, with F1 scores even reaching 98% on some ontologies and datasets. The presented comparative analyses against zero-shot, one-shot, and few-shot LLM prompting baselines further highlight FoodSEM’s superior performance over its non-fine-tuned version. By making FoodSEM and its related resources publicly available, this article (1) publishes food-annotated corpora in an IR format suitable for LLM fine-tuning/evaluation, (2) publishes a robust model to advance the semantic understanding of text in the food domain, and (3) provides a strong baseline on food NEL for future benchmarking.
[67] R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning
Hongyu Shan, Mingyang Song, Chang Dai, Di Liang, Han Chen
Main category: cs.CL
TL;DR: R-Capsule is a hybrid reasoning framework that compresses high-level plans into learned latent tokens (capsules) while keeping execution steps lightweight, balancing efficiency, accuracy, and interpretability.
Details
Motivation: CoT prompting increases latency and memory usage due to verbosity and may propagate early errors across long chains. There's a need to combine efficiency of latent reasoning with transparency of explicit CoT.
Method: Uses Information Bottleneck principle to compress high-level plans into minimal yet sufficient latent tokens (Reasoning Capsules). Employs dual objective: primary task loss for accuracy and auxiliary plan-reconstruction loss to ground latent space and improve interpretability.
Result: Reduces visible token footprint of reasoning while maintaining or improving accuracy on complex benchmarks. Achieves better balance between efficiency, accuracy, and interpretability compared to standard CoT.
Conclusion: R-Capsule framework successfully combines efficiency of latent reasoning with transparency of explicit CoT, offering a practical solution to CoT’s verbosity and error propagation issues while maintaining interpretability.
Abstract: Chain-of-Thought (CoT) prompting helps Large Language Models (LLMs) tackle complex reasoning by eliciting explicit step-by-step rationales. However, CoT’s verbosity increases latency and memory usage and may propagate early errors across long chains. We propose the Reasoning Capsule (R-Capsule), a framework that aims to combine the efficiency of latent reasoning with the transparency of explicit CoT. The core idea is to compress the high-level plan into a small set of learned latent tokens (a Reasoning Capsule) while keeping execution steps lightweight or explicit. This hybrid approach is inspired by the Information Bottleneck (IB) principle, where we encourage the capsule to be approximately minimal yet sufficient for the task. Minimality is encouraged via a low-capacity bottleneck, which helps improve efficiency. Sufficiency is encouraged via a dual objective: a primary task loss for answer accuracy and an auxiliary plan-reconstruction loss that encourages the capsule to faithfully represent the original textual plan. The reconstruction objective helps ground the latent space, thereby improving interpretability and reducing the use of uninformative shortcuts. Our framework strikes a balance between efficiency, accuracy, and interpretability, thereby reducing the visible token footprint of reasoning while maintaining or improving accuracy on complex benchmarks. Our code is available at: https://anonymous.4open.science/r/Reasoning-Capsule-7BE0
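The dual objective is simple to express; the sketch below assumes token-level logits for both the answer and the plan reconstruction, with `beta` an illustrative trade-off weight.

```python
import torch
import torch.nn.functional as F

def r_capsule_loss(task_logits, answer_ids, recon_logits, plan_ids, beta=0.1):
    """Primary task loss (answer accuracy) plus auxiliary reconstruction
    of the textual plan from the latent capsule (grounding term)."""
    task_loss = F.cross_entropy(task_logits.flatten(0, 1),
                                answer_ids.flatten())
    recon_loss = F.cross_entropy(recon_logits.flatten(0, 1),
                                 plan_ids.flatten())
    return task_loss + beta * recon_loss

# Shapes: (batch, seq, vocab) logits against (batch, seq) token ids.
logits = torch.randn(2, 5, 100)
ids = torch.randint(0, 100, (2, 5))
print(r_capsule_loss(logits, ids, logits.clone(), ids))
```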
[68] Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou
Main category: cs.CL
TL;DR: GTO aligns draft model training with speculative decoding’s tree policy using Draft Tree Reward and group-based optimization, increasing acceptance length by 7.4% and speedup by 7.7% over prior methods.
Details
Motivation: Existing training objectives optimize only a single greedy draft path, while decoding uses a tree policy that verifies multiple branches, creating draft policy misalignment that limits achievable speedups.
Method: Group Tree Optimization (GTO) with two components: (1) Draft Tree Reward - sampling-free objective measuring expected acceptance length; (2) Group-based Draft Policy Training - stable optimization contrasting trees from current and reference models using debiased advantages and PPO-style updates.
Result: Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K) tasks with multiple LLMs, GTO increases acceptance length by 7.4% and yields additional 7.7% speedup over prior state-of-the-art EAGLE-3.
Conclusion: By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference through better alignment between training objectives and decoding-time tree policies.
Abstract: Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B), GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference.
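A minimal sketch of the two training ingredients named in the abstract, group-standardized advantages over tree rewards and a PPO-style clipped surrogate; the clipping value and the per-prompt grouping are assumptions, as the abstract gives no hyperparameters:

```python
import torch

def group_standardized_advantages(tree_rewards, eps=1e-8):
    """Center and scale rewards (expected acceptance lengths of draft trees)
    within a group of trees sampled for the same prompt."""
    r = torch.as_tensor(tree_rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

def ppo_surrogate(logp_current, logp_reference, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate along the accepted tokens; `clip_eps` is
    an assumed value, not reported in the abstract."""
    ratio = (logp_current - logp_reference).exp()
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```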
[69] NFDI4DS Shared Tasks for Scholarly Document Processing
Raia Abu Ahmad, Rana Abdulla, Tilahun Abedissa Taffa, Soeren Auer, Hamed Babaei Giglou, Ekaterina Borisova, Zongxiong Chen, Stefan Dietze, Jennifer DSouza, Mayra Elwes, Genet-Asefa Gesese, Shufan Jiang, Ekaterina Kutafina, Philipp Mayr, Georg Rehm, Sameer Sadruddin, Sonja Schimmler, Daniel Schneider, Kanishka Silva, Sharmila Upadhyaya, Ricardo Usbeck
Main category: cs.CL
TL;DR: Overview of 12 shared tasks developed under NFDI4DS consortium for advancing FAIR and reproducible research in scholarly document processing.
Details
Motivation: To promote findable, accessible, interoperable, and reusable (FAIR) research practices through community-based standardized evaluation.
Method: Development and hosting of twelve shared tasks covering diverse challenges in scholarly document processing, integrated into the consortium’s research data infrastructure.
Result: Fostered methodological innovations and contributed open-access datasets, models, and tools for the broader research community.
Conclusion: Shared tasks are powerful tools for advancing research through community-based standardized evaluation and promoting FAIR research practices.
Abstract: Shared tasks are powerful tools for advancing research through community-based standardised evaluation. As such, they play a key role in promoting findable, accessible, interoperable, and reusable (FAIR), as well as transparent and reproducible research practices. This paper presents an updated overview of twelve shared tasks developed and hosted under the German National Research Data Infrastructure for Data Science and Artificial Intelligence (NFDI4DS) consortium, covering a diverse set of challenges in scholarly document processing. Hosted at leading venues, the tasks foster methodological innovations and contribute open-access datasets, models, and tools for the broader research community, which are integrated into the consortium’s research data infrastructure.
[70] From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement
Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Zike Yuan, Yang Xiang, Buzhou Tang
Main category: cs.CL
TL;DR: MACC is a framework that progressively compresses Chain-of-Thought reasoning via multiround refinement, leveraging token elasticity to reduce latency while improving accuracy.
Details
Motivation: Chain-of-Thought reasoning improves complex task performance but introduces significant inference latency due to verbosity, creating a need for efficient compression methods.
Method: Multiround Adaptive Chain-of-Thought Compression (MACC) uses token elasticity phenomenon and multiround refinement to progressively compress CoTs, adaptively determining optimal compression depth for each input.
Result: Achieves 5.6% average accuracy improvement over state-of-the-art baselines, reduces CoT length by 47 tokens on average, significantly lowers latency, and enables reliable performance prediction using interpretable features.
Conclusion: CoT compression is both effective and predictable, enabling efficient model selection and forecasting without repeated fine-tuning across different models.
Abstract: Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to verbosity. We propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon, where overly small token budgets can paradoxically increase output length, to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6 percent over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance (accuracy and token length) can be reliably predicted using interpretable features like perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released at https://github.com/Leon221220/MACC.
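The refinement loop implied by the abstract can be sketched in a few lines: keep compressing while the trace shrinks, and stop once a round makes it longer, which is the token-elasticity signal. `compress_fn` is an assumed interface (e.g., an LLM prompted to shorten a trace), and word count stands in for token count:

```python
def macc_compress(cot, compress_fn, max_rounds=5):
    """Multiround CoT compression sketch; stops when compression backfires."""
    best = cot
    for _ in range(max_rounds):
        candidate = compress_fn(best)
        # Token elasticity: an overly aggressive round can lengthen the output.
        if len(candidate.split()) >= len(best.split()):
            break
        best = candidate
    return best
```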
[71] Mixture of Detectors: A Compact View of Machine-Generated Text Detection
Sai Teja Lekkala, Yadagiri Annepaka, Arun Kumar Challa, Samatha Reddy Machireddy, Partha Pakray, Chukhu Chunka
Main category: cs.CL
TL;DR: This paper investigates machine-generated text detection across multiple scenarios including document classification, generator attribution, sentence segmentation, and adversarial attacks, introducing the BMAS English dataset.
Details
Motivation: To address critical questions about the authenticity of human work and preservation of creativity in the face of increasingly creative LLMs, and to improve machine-generated text detection capabilities.
Method: Introduces BMAS English dataset for binary classification (human vs machine text), multiclass classification (identifying specific generators), sentence-level segmentation (detecting human-AI collaborative text boundaries), and adversarial attack scenarios.
Result: A comprehensive dataset and framework for machine-generated text detection that addresses multiple detection scenarios including classification, attribution, segmentation, and adversarial settings.
Conclusion: This work aims to address previous limitations in Machine-Generated Text Detection (MGTD) in a more meaningful way by providing a multi-faceted approach to detecting and analyzing machine-generated content.
Abstract: Large Language Models (LLMs) may be poised to surpass human creativity, a claim that deserves careful scrutiny and that raises critical questions about the authenticity of human work and the preservation of human creativity and innovation. This paper addresses machine-generated text detection across several scenarios: document-level binary and multiclass classification (generator attribution), sentence-level segmentation to differentiate human-AI collaborative text, and adversarial attacks aimed at reducing the detectability of machine-generated text. We introduce BMAS English, an English-language dataset supporting binary classification of human and machine text; multiclass classification, which not only identifies machine-generated text but also attempts to determine its generator; adversarial-attack settings, a common tactic for evading detection; and sentence-level segmentation, for predicting the boundaries between human- and machine-generated text. We believe this paper addresses the limitations of previous work in Machine-Generated Text Detection (MGTD) in a more meaningful way.
[72] Context Parametrization with Compositional Adapters
Josip Jukić, Martin Tutek, Jan Šnajder
Main category: cs.CL
TL;DR: CompAs is a meta-learning framework that translates context into compositional adapter parameters, enabling algebraic merging of multiple information chunks without reprocessing long prompts.
Details
Motivation: Address limitations of in-context learning (inefficient with many demonstrations) and supervised fine-tuning (training overhead, loss of flexibility) by generating adapters from context.
Method: Meta-learning framework that translates context into adapter parameters with compositional structure, enabling algebraic merging of adapters from different inputs.
Result: Outperforms ICL and prior generator-based methods on multiple-choice and extractive QA tasks, especially when scaling to more inputs. Provides lower inference cost, robustness to long-context instability, and reversible encoding.
Conclusion: Composable adapter generation is a practical and efficient alternative for scaling LLM deployment, offering benefits in cost, robustness, and handling context window limitations.
Abstract: Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, both of these approaches face key limitations: ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAs, a meta-learning framework that translates context into adapter parameters with a compositional structure. Adapters generated this way can be merged algebraically, enabling instructions, demonstrations, or retrieved passages to be seamlessly combined without reprocessing long prompts. Critically, this approach yields three benefits: lower inference cost, robustness to long-context instability, and a principled solution when input exceeds the model’s context window. Furthermore, CompAs encodes information into adapter parameters in a reversible manner, enabling recovery of input context through a decoder, facilitating safety and security. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.
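The abstract states that generated adapters "can be merged algebraically" but not which operation is used; a plausible minimal sketch sums the per-tensor parameters produced from separate context chunks:

```python
import torch

def merge_adapters(adapter_params_list):
    """Sum adapter tensors name-by-name across context chunks. Summation is an
    assumption here; the paper may use a different composition rule."""
    merged = {}
    for params in adapter_params_list:
        for name, tensor in params.items():
            merged[name] = merged.get(name, torch.zeros_like(tensor)) + tensor
    return merged
```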
[73] When Does Reasoning Matter? A Controlled Study of Reasoning’s Contribution to Model Performance
Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo
Main category: cs.CL
TL;DR: Reasoning models consistently outperform or match larger instruction fine-tuned models, becoming increasingly valuable at larger scales for reasoning-intensive and open-ended tasks.
Details
Motivation: To understand when reasoning becomes effective in LLMs and compare its performance and costs against instruction fine-tuning across different model sizes and tasks.
Method: Used synthetic data distillation framework to conduct large-scale supervised study comparing Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on math-centric and general-purpose tasks in multiple-choice and open-ended formats.
Result: Reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.
Conclusion: While IFT remains Pareto-optimal in training and inference costs, reasoning capabilities provide significant performance benefits, especially at larger model scales and for reasoning-intensive tasks.
Abstract: Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite its empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes, on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.
[74] The Outputs of Large Language Models are Meaningless
Anandi Hattiangadi, Anders J. Schoubye
Main category: cs.CL
TL;DR: LLM outputs are meaningless because they lack the necessary intentions for literal meaning, though they can still appear meaningful and enable knowledge acquisition.
Details
Motivation: To demonstrate that large language models cannot produce meaningful outputs due to their inability to form the required intentions for literal meaning.
Method: Argument based on two premises: (1) specific intentions are needed for literal meaning, and (2) LLMs lack these intentions. Defense against semantic externalist and internalist counterarguments.
Result: The paper concludes that LLM outputs are indeed meaningless in the literal sense, despite their apparent meaningfulness and utility.
Conclusion: While LLM outputs lack literal meaning due to absence of proper intentions, they can still appear meaningful and serve as tools for acquiring true beliefs and knowledge.
Abstract: In this paper, we offer a simple argument for the conclusion that the outputs of large language models (LLMs) are meaningless. Our argument is based on two key premises: (a) that certain kinds of intentions are needed in order for LLMs’ outputs to have literal meanings, and (b) that LLMs cannot plausibly have the right kinds of intentions. We defend this argument from various types of responses, for example, the semantic externalist argument that deference can be assumed to take the place of intentions and the semantic internalist argument that meanings can be defined purely in terms of intrinsic relations between concepts, such as conceptual roles. We conclude the paper by discussing why, even if our argument is sound, the outputs of LLMs nevertheless seem meaningful and can be used to acquire true beliefs and even knowledge.
[75] Question-Driven Analysis and Synthesis: Building Interpretable Thematic Trees with LLMs for Text Clustering and Controllable Generation
Tiago Fernandes Tavares
Main category: cs.CL
TL;DR: RTP is a novel framework that uses LLMs to build interpretable binary tree taxonomies through natural language questions, outperforming traditional keyword-based topic models like BERTopic in interpretability and downstream task utility.
Details
Motivation: Traditional topic models produce hard-to-interpret keyword clusters that require manual effort and lack semantic coherence, creating an interpretability gap in unsupervised text analysis.
Method: Recursive Thematic Partitioning (RTP) leverages LLMs to interactively build a binary tree where each node is a natural language question that semantically partitions the data, creating interpretable taxonomies.
Result: RTP’s question-driven hierarchy is more interpretable than BERTopic’s keyword-based topics and serves as powerful features in downstream classification tasks, especially when themes correlate with task labels.
Conclusion: RTP shifts text analysis from statistical pattern discovery to knowledge-driven thematic analysis and enables structured, controllable prompts for generative models, transforming analysis into synthesis tools.
Abstract: Unsupervised analysis of text corpora is challenging, especially in data-scarce domains where traditional topic models struggle. While these models offer a solution, they typically describe clusters with lists of keywords that require significant manual effort to interpret and often lack semantic coherence. To address this critical interpretability gap, we introduce Recursive Thematic Partitioning (RTP), a novel framework that leverages Large Language Models (LLMs) to interactively build a binary tree. Each node in the tree is a natural language question that semantically partitions the data, resulting in a fully interpretable taxonomy where the logic of each cluster is explicit. Our experiments demonstrate that RTP’s question-driven hierarchy is more interpretable than the keyword-based topics from a strong baseline like BERTopic. Furthermore, we establish the quantitative utility of these clusters by showing they serve as powerful features in downstream classification tasks, particularly when the data’s underlying themes correlate with the task labels. RTP introduces a new paradigm for data exploration, shifting the focus from statistical pattern discovery to knowledge-driven thematic analysis. Furthermore, we demonstrate that the thematic paths from the RTP tree can serve as structured, controllable prompts for generative models. This transforms our analytical framework into a powerful tool for synthesis, enabling the consistent imitation of specific characteristics discovered in the source corpus.
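The recursive construction is straightforward to sketch. `ask_llm` (propose a yes/no question for a set of documents) and `answer_fn` (answer it for one document) are assumed interfaces, not the authors' actual API; the stopping thresholds are likewise illustrative:

```python
def rtp_build(docs, ask_llm, answer_fn, min_size=10, depth=0, max_depth=5):
    """Recursive Thematic Partitioning sketch: each internal node stores the
    natural-language question that splits its documents."""
    if len(docs) < min_size or depth >= max_depth:
        return {"leaf": docs}
    question = ask_llm(docs)  # e.g., "Does this text discuss pricing?"
    yes = [d for d in docs if answer_fn(question, d)]
    no = [d for d in docs if not answer_fn(question, d)]
    if not yes or not no:
        return {"leaf": docs}  # degenerate split; stop here
    return {
        "question": question,
        "yes": rtp_build(yes, ask_llm, answer_fn, min_size, depth + 1, max_depth),
        "no": rtp_build(no, ask_llm, answer_fn, min_size, depth + 1, max_depth),
    }
```

The path of questions from root to leaf is then exactly the kind of thematic description the paper reuses as a structured, controllable prompt for generation.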
[76] StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou
Main category: cs.CL
TL;DR: StableToken is a robust speech tokenizer that uses multi-branch architecture and bit-wise voting to maintain stable token sequences under acoustic perturbations, improving downstream SpeechLLM performance.
Details
Motivation: Existing semantic speech tokenizers are fragile to meaning-irrelevant acoustic perturbations, causing drastic token sequence changes even at high SNRs where speech remains intelligible, which increases learning burden for downstream LLMs.
Method: Proposes StableToken with multi-branch architecture that processes audio in parallel and merges representations through bit-wise voting mechanism to form stable token sequences.
Result: Sets new SOTA in token stability, drastically reducing Unit Edit Distance under diverse noise conditions, and significantly improves robustness of SpeechLLMs on various tasks.
Conclusion: StableToken’s consensus-driven mechanism effectively addresses tokenizer instability, providing foundational stability that directly benefits downstream speech language models.
Abstract: Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
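The abstract describes the consensus step only as a "bit-wise voting mechanism" over parallel branches; a minimal sketch under the assumption that each branch emits a binary code per frame:

```python
import torch

def bitwise_vote(branch_codes):
    """Majority vote per bit across branches. Shapes and the binary quantizer
    are assumptions; the abstract does not detail them."""
    stacked = torch.stack(branch_codes)  # (n_branches, n_frames, n_bits) in {0, 1}
    votes = stacked.float().mean(dim=0)  # fraction of branches voting 1
    return (votes > 0.5).long()          # consensus code per frame
```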
[77] Thinking in Many Modes: How Composite Reasoning Elevates Large Language Model Performance with Limited Data
Zishan Ahmad, Saisubramaniam Gopalakrishnan
Main category: cs.CL
TL;DR: Composite Reasoning (CR) enables LLMs to dynamically combine multiple reasoning styles (deductive, inductive, abductive, causal) for better problem-solving, outperforming Chain-of-Thought and DeepSeek-R1 on scientific and medical QA benchmarks.
Details
Motivation: Current LLMs rely on singular reasoning paradigms, limiting their performance on complex problems requiring diverse cognitive strategies.
Method: Introduces Composite Reasoning (CR) approach that allows LLMs to dynamically explore and combine multiple reasoning styles adaptively based on domain requirements.
Result: Outperforms existing baselines (CoT, DeepSeek-R1) on scientific and medical QA benchmarks, with superior sample efficiency and token usage. Adaptively prioritizes different reasoning styles for different domains.
Conclusion: Cultivating internal reasoning style diversity enables LLMs to develop more robust, adaptive, and efficient problem-solving abilities.
Abstract: Large Language Models (LLMs), despite their remarkable capabilities, rely on singular, predominant reasoning paradigms, hindering their performance on intricate problems that demand diverse cognitive strategies. To address this, we introduce Composite Reasoning (CR), a novel reasoning approach empowering LLMs to dynamically explore and combine multiple reasoning styles like deductive, inductive, and abductive for more nuanced problem-solving. Evaluated on scientific and medical question-answering benchmarks, our approach outperforms existing baselines like Chain-of-Thought (CoT) and also surpasses the accuracy of DeepSeek-R1 style reasoning (SR) capabilities, while demonstrating superior sample efficiency and adequate token usage. Notably, CR adaptively emphasizes domain-appropriate reasoning styles. It prioritizes abductive and deductive reasoning for medical question answering, but shifts to causal, deductive, and inductive methods for scientific reasoning. Our findings highlight that by cultivating internal reasoning style diversity, LLMs acquire more robust, adaptive, and efficient problem-solving abilities.
[78] In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners
Jaehoon Kim, Kwangwook Seo, Dongha Lee
Main category: cs.CL
TL;DR: Reverse Speculative Decoding (RSD) enables effective reasoning transfer from large to small language models by filtering out low-probability tokens that exceed the student model’s representation capacity, achieving 4.9% improvement instead of 20.5% degradation.
Details
Motivation: Standard supervised fine-tuning fails to transfer reasoning capabilities from larger to smaller models due to distributional misalignment, where teacher-generated reasoning traces contain tokens that are low probability under the student's distribution.
Method: Proposed Reverse Speculative Decoding (RSD) where the teacher model proposes candidate tokens but the student model determines acceptance based on its own probability distributions, filtering out low probability tokens to create student-friendly reasoning traces.
Result: Direct distillation degraded performance by 20.5%, while RSD-generated reasoning traces achieved 4.9% improvement across major reasoning benchmarks. RSD traces are model-specific and must be tailored for each student architecture.
Conclusion: Low probability tokens constitute the critical bottleneck in reasoning ability transfer, and distributional alignment through methods like RSD must be specifically tailored for each student model’s unique internal representation.
Abstract: Transferring reasoning capabilities from larger language models to smaller ones through supervised fine-tuning often fails counterintuitively, with performance degrading despite access to high-quality teacher demonstrations. We identify that this failure stems from distributional misalignment: reasoning traces from larger models contain tokens that are low probability under the student’s distribution, exceeding the internal representation capacity of smaller architectures and creating learning barriers rather than helpful guidance. We propose Reverse Speculative Decoding (RSD), a mechanism for generating student-friendly reasoning traces in which the teacher model proposes candidate tokens but the student model determines acceptance based on its own probability distributions, filtering low probability tokens. When applied to Qwen3-0.6B, direct distillation of s1K-1.1 reasoning trace data degrades average performance across major reasoning benchmarks by 20.5%, while the same model trained on RSD-generated reasoning traces achieves meaningful improvements of 4.9%. Our analysis reveals that low probability tokens constitute the critical bottleneck in reasoning ability transfer. However, cross-model experiments demonstrate that RSD traces are model-specific rather than universally applicable, indicating that distributional alignment must be tailored for each student architecture’s unique internal representation.
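The acceptance rule can be sketched as a simple decoding loop. The threshold `tau` and the fallback (taking the student's own top token) are assumptions; the abstract says only that the student "determines acceptance based on its own probability distributions":

```python
import torch

def reverse_speculative_decode(teacher_step, student_probs, prefix,
                               max_new_tokens=256, tau=0.05):
    """RSD sketch: the teacher proposes each token, the student accepts it only
    if it is sufficiently probable under the student's own distribution."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        proposal = teacher_step(tokens)    # teacher's candidate token id
        p_student = student_probs(tokens)  # student's distribution over vocab
        if p_student[proposal] >= tau:
            tokens.append(proposal)        # token is learnable for the student
        else:
            tokens.append(int(torch.argmax(p_student)))  # assumed fallback
    return tokens
```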
[79] FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding
Haorui Chen, Chengze Li, Jia Li
Main category: cs.CL
TL;DR: FeatBench is a new benchmark for evaluating “vibe coding” capabilities of LLMs, focusing on feature implementation through pure natural language prompts with comprehensive testing and diverse domains.
Details
Motivation: Existing benchmarks are misaligned with the "vibe coding" paradigm as they require code-level specifications or focus narrowly on issue-solving, neglecting feature implementation scenarios.
Method: FeatBench uses pure natural language prompts without code hints, employs a multi-level filtering pipeline for quality, includes F2P and P2P tests for verification, and covers diverse application domains.
Result: Evaluation shows feature implementation in vibe coding is challenging, with the highest success rate only 29.94%. Analysis reveals “aggressive implementation” strategy that causes failures but leads to superior software design.
Conclusion: FeatBench addresses the gap in evaluating vibe coding capabilities and reveals the significant challenge of feature implementation in this paradigm, with the benchmark and tools released for community research.
Abstract: The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as “vibe coding,” where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent’s vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for “aggressive implementation,” a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.
[80] FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction
Yuan Ge, Saihan Chen, Jingqi Xiao, Xiaoqian Liu, Tong Xiao, Yan Xiang, Zhengtao Yu, Jingbo Zhu
Main category: cs.CL
TL;DR: FLEXI is the first benchmark for full-duplex LLM-human spoken interaction that evaluates model interruption in emergency scenarios, revealing performance gaps between open source and commercial models.
Details
Motivation: Full-duplex speech-to-speech LLMs are foundational for natural human-computer interaction, but benchmarking and modeling these systems remains challenging.
Method: Introduced FLEXI benchmark with six diverse human-LLM interaction scenarios to systematically evaluate latency, quality, and conversational effectiveness, focusing on model interruption in emergencies.
Result: Revealed significant gaps between open source and commercial models in emergency awareness, turn terminating, and interaction latency.
Conclusion: Next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
Abstract: Full-Duplex Speech-to-Speech Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling real-time spoken dialogue systems. However, benchmarking and modeling these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction that explicitly incorporates model interruption in emergency scenarios. FLEXI systematically evaluates the latency, quality, and conversational effectiveness of real-time dialogue through six diverse human-LLM interaction scenarios, revealing significant gaps between open source and commercial models in emergency awareness, turn terminating, and interaction latency. Finally, we suggest that next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
[81] Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance
Wenbin Hu, Huihao Jing, Haochen Shi, Haoran Li, Yangqiu Song
Main category: cs.CL
TL;DR: This paper proposes a legal compliance approach to LLM safety, using EU AI Act and GDPR as standards. They create a safety compliance benchmark and develop Compliance Reasoner using Group Policy Optimization, achieving significant performance improvements.
Details
Motivation: Existing LLM safety methods lack systematic protection and rely on ad-hoc taxonomy, failing to ensure safety for complex LLM behaviors. The authors aim to address this by approaching safety from legal compliance perspectives.
Method: 1) Develop a new benchmark for safety compliance by generating realistic LLM safety scenarios seeded with legal statutes; 2) Align Qwen3-8B using Group Policy Optimization (GRPO) to construct Compliance Reasoner, which aligns LLMs with legal standards.
Result: Compliance Reasoner achieves superior performance on the new benchmark, with average improvements of +10.45% for EU AI Act and +11.85% for GDPR compliance.
Conclusion: The legal compliance approach provides a rigorous, systematic framework for LLM safety, effectively bridging the gap between LLM safety and legal standards through the proposed Compliance Reasoner.
Abstract: Large Language Models (LLMs) have proliferated and demonstrated remarkable capabilities, elevating the critical importance of LLM safety. However, existing safety methods rely on ad-hoc taxonomy and lack a rigorous, systematic protection, failing to ensure safety for the nuanced and complex behaviors of modern LLM systems. To address this problem, we approach LLM safety from legal compliance perspectives, which we term safety compliance. In this work, we posit relevant established legal frameworks as safety standards for defining and measuring safety compliance, including the EU AI Act and GDPR, which serve as core legal frameworks for AI safety and data security in Europe. To bridge the gap between LLM safety and legal compliance, we first develop a new benchmark for safety compliance by generating realistic LLM safety scenarios seeded with legal statutes. Subsequently, we align Qwen3-8B using Group Policy Optimization (GRPO) to construct a safety reasoner, Compliance Reasoner, which effectively aligns LLMs with legal standards to mitigate safety risks. Our comprehensive experiments demonstrate that the Compliance Reasoner achieves superior performance on the new benchmark, with average improvements of +10.45% for the EU AI Act and +11.85% for GDPR.
[82] Beyond Textual Context: Structural Graph Encoding with Adaptive Space Alignment to alleviate the hallucination of LLMs
Yifang Zhang, Pengfei Duan, Yiwen Yang, Shengwu Xiong
Main category: cs.CL
TL;DR: SSKG-LLM is a novel architecture that integrates both structural and semantic information from Knowledge Graphs into LLMs to address hallucination issues, using specialized modules for retrieval, encoding, and adaptation.
Details
Motivation: Current LLMs treat KGs as plain text, extracting only semantic information and missing crucial structural aspects, while also facing embedding space gaps between KG encoders and LLMs that hinder effective knowledge integration.
Method: Proposes SSKG-LLM with three modules: Knowledge Graph Retrieval (KGR) and Knowledge Graph Encoding (KGE) to preserve semantics while utilizing structure, and Knowledge Graph Adaptation (KGA) to enable LLMs to understand KG embeddings.
Result: Extensive experiments show that incorporating structural information from KGs enhances the factual reasoning abilities of LLMs, though specific quantitative results are not provided in the abstract.
Conclusion: SSKG-LLM successfully bridges the gap between KG structural information and LLM reasoning processes, providing an effective solution to reduce hallucination in LLMs through better KG integration.
Abstract: Currently, the main approach for Large Language Models (LLMs) to tackle the hallucination issue is incorporating Knowledge Graphs (KGs). However, LLMs typically treat KGs as plain text, extracting only semantic information and limiting their use of the crucial structural aspects of KGs. Another challenge is the gap between the embedding spaces of KG encoders and LLM text embeddings, which hinders the effective integration of structured knowledge. To overcome these obstacles, we put forward SSKG-LLM, an innovative model architecture that is designed to efficiently integrate both the Structural and Semantic information of KGs into the reasoning processes of LLMs. SSKG-LLM incorporates the Knowledge Graph Retrieval (KGR) module and the Knowledge Graph Encoding (KGE) module to preserve semantics while utilizing structure. Then, the Knowledge Graph Adaptation (KGA) module is incorporated to enable LLMs to understand KG embeddings. We conduct extensive experiments and provide a detailed analysis to explore how incorporating the structural information of KGs can enhance the factual reasoning abilities of LLMs. Our code is available at https://github.com/yfangZhang/SSKG-LLM.
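The abstract names the KGA module but not its architecture; a two-layer MLP that projects KG-encoder embeddings into the LLM's hidden space is one plausible minimal reading:

```python
import torch.nn as nn

class KGAdapter(nn.Module):
    """Hypothetical Knowledge Graph Adaptation (KGA) sketch: map KG embeddings
    into the LLM embedding space. Dimensions and depth are assumptions."""
    def __init__(self, kg_dim=200, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(kg_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, kg_embeddings):
        return self.proj(kg_embeddings)
```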
[83] Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
Yifan Wang, Mayank Jobanputra, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg
Main category: cs.CL
TL;DR: This paper systematically studies the relationship between explainability and fairness in hate speech detection, finding that input-based explanations can detect biased predictions and help reduce bias during training, but are unreliable for selecting fair models.
Details
Motivation: NLP models often replicate social bias from training data, and their black-box nature makes it difficult to recognize biased predictions and mitigate them effectively. While some studies suggest explanations can help detect bias, others question their reliability.
Method: Conducted the first systematic study of explainability-fairness relationship in hate speech detection using both encoder- and decoder-only models. Examined three dimensions: identifying biased predictions, selecting fair models, and mitigating bias during training.
Result: Input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
Conclusion: While explanations show promise for bias detection and mitigation in hate speech detection, they should not be solely relied upon for model selection due to their unreliability in this context.
Abstract: Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
[84] Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs
Felix Vossel, Till Mossakowski, Björn Gehrke
Main category: cs.CL
TL;DR: Fine-tuned Flan-T5-XXL achieves 70% accuracy in translating natural language to first-order logic, outperforming GPT-4o and symbolic systems, with predicate availability being crucial for performance.
Details
Motivation: Automating natural language to first-order logic translation is crucial for knowledge representation and formal methods but remains challenging, requiring systematic evaluation of LLM approaches.
Method: Systematic evaluation of fine-tuned LLMs using MALLS and Willow datasets, comparing encoder-decoder vs decoder-only architectures, with techniques like vocabulary extension, predicate conditioning, and multilingual training.
Result: Fine-tuned Flan-T5-XXL achieves 70% accuracy with predicate lists, outperforming GPT-4o and DeepSeek-R1-0528, and shows 15-20% performance boost from predicate availability. Models generalize to unseen logical arguments without specific training.
Conclusion: Structural logic translation is robust, but predicate extraction remains the main bottleneck. T5 models surpass larger decoder-only LLMs, and predicate availability significantly boosts performance.
Abstract: Automating the translation of natural language to first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques like vocabulary extension, predicate conditioning, and multilingual training, introducing metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy with predicate lists, outperforming GPT-4o, the CoT-capable DeepSeek-R1-0528 model, and symbolic systems such as ccg2lambda. Key findings show: (1) predicate availability boosts performance by 15-20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (FOLIO dataset) without specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
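Of the three metrics the paper introduces, predicate alignment is the easiest to illustrate; the exact definition is not given in the abstract, so the Jaccard overlap below is only an illustrative proxy:

```python
import re

def _predicates(formula):
    """Collect predicate symbols: identifiers immediately followed by '('."""
    return set(re.findall(r"([A-Za-z_]\w*)\(", formula))

def predicate_alignment(predicted_fol, gold_fol):
    """Illustrative proxy: Jaccard overlap of predicate symbols between the
    predicted and gold FOL formulas."""
    p, g = _predicates(predicted_fol), _predicates(gold_fol)
    return len(p & g) / len(p | g) if (p | g) else 1.0

# Example: shares 'Dog' but not 'Animal'/'Mammal', so the score is 1/3.
score = predicate_alignment("forall x (Dog(x) -> Animal(x))",
                            "forall x (Dog(x) -> Mammal(x))")
```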
[85] Transformers Can Learn Connectivity in Some Graphs but Not Others
Amit Roy, Abulhair Saparov
Main category: cs.CL
TL;DR: Transformers can learn transitive relations (graph connectivity) on grid-like directed graphs but struggle with graphs containing many disconnected components. Model scaling improves generalization for grid graphs, while graph dimensionality strongly predicts learning difficulty.
Details
Motivation: To investigate transformers' capability to learn transitive relations from training data (rather than in-context examples) and understand how scaling affects this ability, particularly for causal inference applications.
Method: Generated directed graphs to train transformer models of varying sizes, evaluating their ability to infer transitive relations (connectivity) for different graph sizes and structures, focusing on grid-like graphs vs. graphs with disconnected components.
Result: Transformers successfully learn connectivity on low-dimensional grid graphs where nodes can be embedded in low-dimensional subspaces. Higher-dimensional grid graphs are more challenging. Model scaling improves generalization for grid graphs, but transformers struggle with graphs containing many disconnected components.
Conclusion: Transformers can learn transitive relations effectively for structured grid-like graphs but face limitations with complex graph structures containing many disconnected components, with graph dimensionality being a key predictor of learning difficulty.
Abstract: Reasoning capability is essential to ensure the factual correctness of the responses of transformer-based Large Language Models (LLMs), and robust reasoning about transitive relations is instrumental in many settings, such as causal inference. Hence, it is essential to investigate the capability of transformers in the task of inferring transitive relations (e.g., knowing A causes B and B causes C, then A causes C). The task of inferring transitive relations is equivalent to the task of connectivity in directed graphs (e.g., knowing there is a path from A to B, and there is a path from B to C, then there is a path from A to C). Past research focused on whether transformers can learn to infer transitivity from in-context examples provided in the input prompt. However, transformers’ capability to infer transitive relations from training examples and how scaling affects the ability is unexplored. In this study, we seek to answer this question by generating directed graphs to train transformer models of varying sizes and evaluate their ability to infer transitive relations for various graph sizes. Our findings suggest that transformers are capable of learning connectivity on “grid-like” directed graphs where each node can be embedded in a low-dimensional subspace, and connectivity is easily inferable from the embeddings of the nodes. We find that the dimensionality of the underlying grid graph is a strong predictor of transformers’ ability to learn the connectivity task, where higher-dimensional grid graphs pose a greater challenge than low-dimensional grid graphs. In addition, we observe that increasing the model scale leads to increasingly better generalization to infer connectivity over grid graphs. However, if the graph is not a grid graph and contains many disconnected components, transformers struggle to learn the connectivity task, especially when the number of components is large.
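For intuition, a "grid-like" digraph of this kind can be generated in a few lines; this is a generic lattice construction under stated assumptions, not the authors' exact data pipeline:

```python
import itertools

def grid_digraph_edges(n, dims=2):
    """Directed lattice sketch: nodes are points of an n**dims grid, with an
    edge to the +1 neighbour along each axis."""
    edges = []
    for node in itertools.product(range(n), repeat=dims):
        for ax in range(dims):
            if node[ax] + 1 < n:
                nxt = list(node)
                nxt[ax] += 1
                edges.append((node, tuple(nxt)))
    return edges

def connected(a, b):
    """In this monotone lattice, b is reachable from a iff b dominates a
    coordinate-wise; a convenient way to label connectivity examples."""
    return all(x <= y for x, y in zip(a, b))
```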
[86] The InviTE Corpus: Annotating Invectives in Tudor English Texts for Computational Modeling
Sophie Spliethoff, Sanne Hoeken, Silke Schwandt, Sina Zarrieß, Özge Alaçam
Main category: cs.CL
TL;DR: The paper introduces the InviTE corpus of 2000 Early Modern English sentences with expert annotations for invective language, and compares fine-tuned BERT models with zero-shot LLMs for invective detection.
Details
Motivation: To apply NLP techniques to historical research, specifically studying religious invectives during the Protestant Reformation in Tudor England.
Method: Created workflow from raw data to annotation, built InviTE corpus with expert annotations, and compared fine-tuned BERT models with zero-shot prompted LLMs.
Result: Models pre-trained on historical data and fine-tuned for invective detection performed better than zero-shot LLMs.
Conclusion: Fine-tuned historical BERT models are superior to zero-shot LLMs for detecting invective language in historical texts.
Abstract: In this paper, we aim at the application of Natural Language Processing (NLP) techniques to historical research endeavors, particularly addressing the study of religious invectives in the context of the Protestant Reformation in Tudor England. We outline a workflow spanning from raw data, through pre-processing and data selection, to an iterative annotation process. As a result, we introduce the InviTE corpus – a corpus of almost 2000 Early Modern English (EModE) sentences, which are enriched with expert annotations regarding invective language throughout 16th-century England. Subsequently, we assess and compare the performance of fine-tuned BERT-based models and zero-shot prompted instruction-tuned large language models (LLMs), which highlights the superiority of models pre-trained on historical data and fine-tuned to invective detection.
[87] Conversational Implicatures: Modelling Relevance Theory Probabilistically
Christoph Unger, Hendrik Buschmeier
Main category: cs.CL
TL;DR: This paper explores applying Bayesian probability theory to relevance-theoretic pragmatics, focusing on how implicit meaning is communicated through conversational implicatures.
Details
Motivation: Recent advances in Bayesian probability theory and computational tools have enabled a 'probabilistic turn' in pragmatics. While Rational Speech Act theory has successfully modeled Gricean pragmatic phenomena in Bayesian terms, there's a need to extend this approach to relevance-theoretic pragmatics.
Method: The study applies Bayesian framework to relevance-theoretic pragmatics by examining paradigmatic pragmatic phenomena, particularly the communication of implicit meaning through conversational implicatures.
Result: The paper demonstrates how Bayesian approaches can be extended beyond Gricean frameworks to model relevance-theoretic accounts of pragmatic communication.
Conclusion: Bayesian probability theory provides a promising computational framework for modeling relevance-theoretic pragmatics, particularly for understanding how implicit meaning is conveyed through conversational implicatures.
Abstract: Recent advances in Bayesian probability theory and its application to cognitive science in combination with the development of a new generation of computational tools and methods for probabilistic computation have led to a ‘probabilistic turn’ in pragmatics and semantics. In particular, the framework of Rational Speech Act theory has been developed to model broadly Gricean accounts of pragmatic phenomena in Bayesian terms, starting with fairly simple reference games and covering ever more complex communicative exchanges such as verbal syllogistic reasoning. This paper explores how a similar Bayesian approach might be applied to relevance-theoretic pragmatics (Sperber & Wilson, 1995) by studying a paradigmatic pragmatic phenomenon: the communication of implicit meaning by way of (conversational) implicatures.
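For readers unfamiliar with the Rational Speech Act framework the paper departs from, a minimal numeric sketch of the standard RSA recursion on a toy reference game follows; the two-utterance lexicon, uniform priors, and the rationality parameter `alpha` are all illustrative choices:

```python
import numpy as np

lexicon = np.array([   # rows: utterances, cols: referents (1 = literally true)
    [1.0, 1.0, 0.0],   # "has glasses" is true of r0 and r1
    [0.0, 1.0, 1.0],   # "has a hat"   is true of r1 and r2
])

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

alpha = 1.0                            # assumed speaker rationality
L0 = normalize(lexicon, axis=1)        # literal listener  P(referent | utterance)
S1 = normalize(L0.T ** alpha, axis=1)  # pragmatic speaker P(utterance | referent)
L1 = normalize(S1.T, axis=1)           # pragmatic listener P(referent | utterance)
print(L1)  # "has glasses" now favours r0: implicature-like strengthening
```

Extending this kind of recursion with relevance-theoretic notions of processing effort and cognitive effect is, in essence, the modelling question the paper takes up.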
[88] CHRONOBERG: Capturing Language Evolution and Temporal Awareness in Foundation Models
Niharika Hegde, Subarnaduti Paul, Lars Joel-Frey, Manuel Brack, Kristian Kersting, Martin Mundt, Patrick Schramowski
Main category: cs.CL
TL;DR: CHRONOBERG is a temporally structured corpus of English books spanning 250 years, designed to study linguistic change and improve LLMs’ ability to handle diachronic language variation.
Details
Motivation: Existing corpora lack long-term temporal structure, limiting LLMs' ability to contextualize semantic and normative evolution of language and capture diachronic variation.
Method: Curated books from Project Gutenberg with temporal annotations, used time-sensitive Valence-Arousal-Dominance analysis to quantify lexical semantic change, and constructed historically calibrated affective lexicons.
Result: Modern LLM-based tools struggle to detect discriminatory language and contextualize sentiment across different time periods, and language models trained on CHRONOBERG have difficulty encoding diachronic meaning shifts.
Conclusion: There is a need for temporally aware training and evaluation pipelines, and CHRONOBERG serves as a scalable resource for studying linguistic change and temporal generalization.
Abstract: Large language models (LLMs) excel at operating at scale by leveraging social media and various data crawled from the web. While existing corpora are diverse, they frequently lack long-term temporal structure, which may limit an LLM’s ability to contextualize semantic and normative evolution of language and to capture diachronic variation. To support analysis and training for the latter, we introduce CHRONOBERG, a temporally structured corpus of English book texts spanning 250 years, curated from Project Gutenberg and enriched with a variety of temporal annotations. First, the edited nature of books enables us to quantify lexical semantic change through time-sensitive Valence-Arousal-Dominance (VAD) analysis and to construct historically calibrated affective lexicons to support temporally grounded interpretation. With the lexicons at hand, we demonstrate a need for modern LLM-based tools to better situate their detection of discriminatory language and contextualization of sentiment across various time-periods. In fact, we show how language models trained sequentially on CHRONOBERG struggle to encode diachronic shifts in meaning, emphasizing the need for temporally aware training and evaluation pipelines, and positioning CHRONOBERG as a scalable resource for the study of linguistic change and temporal generalization. Disclaimer: This paper includes language and display of samples that could be offensive to readers. Open Access: Chronoberg is available publicly on HuggingFace at (https://huggingface.co/datasets/spaul25/Chronoberg). Code is available at (https://github.com/paulsubarna/Chronoberg).
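A toy version of the time-binned affect scoring that such a corpus enables: average lexicon valence per period. The 50-year bins, the (year, tokens) input format, and the lexicon interface are assumptions for illustration, not the authors' pipeline:

```python
from collections import defaultdict

def valence_by_period(documents, valence_lexicon, bin_years=50):
    """Mean lexicon valence of in-vocabulary tokens, grouped by time bin.
    `documents` is an assumed list of (year, tokens) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for year, tokens in documents:
        period = (year // bin_years) * bin_years
        for tok in tokens:
            if tok in valence_lexicon:
                sums[period] += valence_lexicon[tok]
                counts[period] += 1
    return {p: sums[p] / counts[p] for p in sums}
```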
[89] Exploratory Semantic Reliability Analysis of Wind Turbine Maintenance Logs using Large Language Models
Max Malyi, Jonathan Shek, Andre Biscaya
Main category: cs.CL
TL;DR: This paper introduces an LLM-based framework for deep semantic analysis of wind turbine maintenance logs, moving beyond simple classification to perform complex reasoning tasks like failure mode identification and causal inference.
Details
Motivation: Traditional quantitative reliability analysis cannot access the operational intelligence locked in unstructured wind turbine maintenance logs, and existing ML approaches only perform basic classification without deeper reasoning.
Method: An exploratory framework using large language models (LLMs) to perform deep semantic analysis through four analytical workflows: failure mode identification, causal chain inference, comparative site analysis, and data quality auditing.
Result: LLMs successfully functioned as “reliability co-pilots,” synthesizing textual information to generate actionable, expert-level hypotheses beyond simple labeling.
Conclusion: The work provides a novel methodology for using LLMs as reasoning tools to unlock insights from unstructured data, offering a new pathway to enhance operational intelligence in the wind energy sector.
Abstract: A wealth of operational intelligence is locked within the unstructured free-text of wind turbine maintenance logs, a resource largely inaccessible to traditional quantitative reliability analysis. While machine learning has been applied to this data, existing approaches typically stop at classification, categorising text into predefined labels. This paper addresses the gap in leveraging modern large language models (LLMs) for more complex reasoning tasks. We introduce an exploratory framework that uses LLMs to move beyond classification and perform deep semantic analysis. We apply this framework to a large industrial dataset to execute four analytical workflows: failure mode identification, causal chain inference, comparative site analysis, and data quality auditing. The results demonstrate that LLMs can function as powerful “reliability co-pilots,” moving beyond labelling to synthesise textual information and generate actionable, expert-level hypotheses. This work contributes a novel and reproducible methodology for using LLMs as a reasoning tool, offering a new pathway to enhance operational intelligence in the wind energy sector by unlocking insights previously obscured in unstructured data.
[90] What Is The Political Content in LLMs’ Pre- and Post-Training Data?
Tanise Ceron, Dmitry Nikolaev, Dominik Stammbach, Debora Nozza
Main category: cs.CL
TL;DR: Analysis of OLMO2’s training data reveals left-leaning political bias predominance, with pre-training corpora containing more politically engaged content than post-training data, and strong correlation between training data stance and model political biases.
Details
Motivation: Large language models generate politically biased text, but how these biases arise from training data remains unclear, particularly the political content analysis of training corpora.
Method: Analyzed OLMO2’s pre- and post-training corpora by drawing large random samples, automatically annotating documents for political orientation, and examining source domains and content, then correlating with model stance on policy issues.
Result: Left-leaning documents predominate across datasets, pre-training corpora contain more politically engaged content, left/right documents frame topics through distinct values and legitimacy sources, and training data stance strongly correlates with model political biases.
Conclusion: Political content analysis should be integrated into data curation pipelines and filtering strategies should be thoroughly documented for transparency.
Abstract: Large language models (LLMs) are known to generate politically biased text, yet how such biases arise remains unclear. A crucial step toward answering this question is the analysis of training data, whose political content remains largely underexplored in current LLM research. To address this gap, we present in this paper an analysis of the pre- and post-training corpora of OLMO2, the largest fully open-source model released together with its complete dataset. From these corpora, we draw large random samples, automatically annotate documents for political orientation, and analyze their source domains and content. We then assess how political content in the training data correlates with models’ stance on specific policy issues. Our analysis shows that left-leaning documents predominate across datasets, with pre-training corpora containing significantly more politically engaged content than post-training data. We also find that left- and right-leaning documents frame similar topics through distinct values and sources of legitimacy. Finally, the predominant stance in the training data strongly correlates with models’ political biases when evaluated on policy issues. These findings underscore the need to integrate political content analysis into future data curation pipelines as well as in-depth documentation of filtering strategies for transparency.
[91] Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding
Ziheng Chi, Yifan Hou, Chenxi Pang, Shaobo Cui, Mubashara Akhtar, Mrinmaya Sachan
Main category: cs.CL
TL;DR: Chimera is a comprehensive test suite with 7,500 Wikipedia diagrams to evaluate VLMs’ genuine diagram comprehension, revealing that current models rely heavily on shortcuts rather than true understanding.
Details
Motivation: Current VLMs appear to perform well on diagram benchmarks but may rely on knowledge, reasoning, or modality shortcuts rather than genuine diagram comprehension, creating a need for more robust evaluation.
Method: Created Chimera test suite with 7,500 Wikipedia diagrams annotated with semantic triples and multi-level questions assessing entity recognition, relation understanding, knowledge grounding, and visual reasoning. Evaluated 15 VLMs from 7 families for three shortcut types.
Result: VLMs’ strong performance largely stems from shortcut behaviors: visual-memorization shortcuts have slight impact, knowledge-recall shortcuts play moderate role, and Clever-Hans shortcuts contribute significantly.
Conclusion: Current VLMs have critical limitations in genuine diagram comprehension and rely heavily on shortcuts, underscoring the need for more robust evaluation protocols that benchmark true understanding of complex visual inputs.
Abstract: Diagrams convey symbolic information in a visual format rather than a linear stream of words, making them especially challenging for AI models to process. While recent evaluations suggest that vision-language models (VLMs) perform well on diagram-related benchmarks, their reliance on knowledge, reasoning, or modality shortcuts raises concerns about whether they genuinely understand and reason over diagrams. To address this gap, we introduce Chimera, a comprehensive test suite comprising 7,500 high-quality diagrams sourced from Wikipedia; each diagram is annotated with its symbolic content represented by semantic triples along with multi-level questions designed to assess four fundamental aspects of diagram comprehension: entity recognition, relation understanding, knowledge grounding, and visual reasoning. We use Chimera to measure the presence of three types of shortcuts in visual question answering: (1) the visual-memorization shortcut, where VLMs rely on memorized visual patterns; (2) the knowledge-recall shortcut, where models leverage memorized factual knowledge instead of interpreting the diagram; and (3) the Clever-Hans shortcut, where models exploit superficial language patterns or priors without true comprehension. We evaluate 15 open-source VLMs from 7 model families on Chimera and find that their seemingly strong performance largely stems from shortcut behaviors: visual-memorization shortcuts have slight impact, knowledge-recall shortcuts play a moderate role, and Clever-Hans shortcuts contribute significantly. These findings expose critical limitations in current VLMs and underscore the need for more robust evaluation protocols that benchmark genuine comprehension of complex visual inputs (e.g., diagrams) rather than question-answering shortcuts.
[92] Detecting (Un)answerability in Large Language Models with Linear Directions
Maor Juliet Lavi, Tova Milo, Mor Geva
Main category: cs.CL
TL;DR: A method using activation space directions to detect unanswerable questions in extractive QA, outperforming existing approaches and generalizing to other unanswerability types.
Details
Motivation: LLMs often provide confident but hallucinated answers when lacking information, creating a need for reliable unanswerability detection in extractive QA.
Method: Identify a direction in the model’s activation space that captures unanswerability through activation additions during inference, then use projection onto this direction for classification.
Result: Method effectively detects unanswerable questions, generalizes better across datasets than prompt-based and classifier-based approaches, and extends to other unanswerability types like lack of consensus and subjectivity.
Conclusion: Activation space directions provide a reliable way to detect unanswerability and can control model abstention behavior through causal interventions.
Abstract: Large language models (LLMs) often respond confidently to questions even when they lack the necessary information, leading to hallucinated answers. In this work, we study the problem of (un)answerability detection, focusing on extractive question answering (QA) where the model should determine if a passage contains sufficient information to answer a given question. We propose a simple approach for identifying a direction in the model’s activation space that captures unanswerability and uses it for classification. This direction is selected by applying activation additions during inference and measuring their impact on the model’s abstention behavior. We show that projecting hidden activations onto this direction yields a reliable score for (un)answerability classification. Experiments on two open-weight LLMs and four extractive QA benchmarks show that our method effectively detects unanswerable questions and generalizes better across datasets than existing prompt-based and classifier-based approaches. Moreover, the obtained directions extend beyond extractive QA to unanswerability that stems from factors, such as lack of scientific consensus and subjectivity. Last, causal interventions show that adding or ablating the directions effectively controls the abstention behavior of the model.
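The classification step reduces to a dot product. Below is a minimal sketch, assuming a precomputed direction vector and a threshold calibrated on held-out data (both hypothetical stand-ins; the paper selects the direction via activation additions and their effect on abstention):

```python
# Hypothetical sketch: score (un)answerability by projecting a hidden
# activation onto a given "unanswerability" direction. The direction,
# layer choice, and threshold are assumptions, not the paper's values.
import numpy as np

def unanswerability_score(hidden: np.ndarray, direction: np.ndarray) -> float:
    """Project a hidden state onto the unit-normalized direction."""
    d = direction / np.linalg.norm(direction)
    return float(hidden @ d)

def is_unanswerable(hidden: np.ndarray, direction: np.ndarray,
                    threshold: float) -> bool:
    """Flag as unanswerable when the projection exceeds the threshold."""
    return unanswerability_score(hidden, direction) > threshold

# Toy usage with random vectors standing in for real activations.
rng = np.random.default_rng(0)
print(is_unanswerable(rng.normal(size=4096), rng.normal(size=4096), 0.0))
```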
[93] Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning
Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, Nezihe Merve Gürel
Main category: cs.CL
TL;DR: This paper evaluates LLaMA and Gemini LLMs on multilingual legal tasks, finding significant performance gaps in legal reasoning (below 50% accuracy) compared to general tasks (over 70%), with Gemini outperforming LLaMA by 24 percentage points on average.
Details
Motivation: To understand LLM capabilities and limitations in high-stakes legal applications, especially in multilingual, jurisdictionally diverse, and adversarial contexts where performance remains insufficiently explored.
Method: Used multilingual legal and non-legal benchmarks with adversarial robustness testing through character/word-level perturbations. Employed LLM-as-a-Judge approach for evaluation and developed an open-source modular pipeline for multilingual benchmarking of legal tasks including classification, summarization, and reasoning.
Result: Legal tasks pose significant challenges with accuracies often below 50% on legal reasoning benchmarks vs over 70% on general tasks. English provides more stable but not always higher accuracy. Gemini outperforms LLaMA by ~24 percentage points. Performance correlates with syntactic similarity to English. Prompt sensitivity and adversarial vulnerability persist across languages.
Conclusion: Despite improvements in newer LLMs, significant challenges remain for reliable deployment in critical, multilingual legal applications, highlighting the need for continued research and development in this domain.
Abstract: In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta’s LLaMA, OpenAI’s ChatGPT, Google’s Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks, and assesses their adversarial robustness in legal tasks through character- and word-level perturbations. We use an LLM-as-a-Judge approach for human-aligned evaluation. We moreover present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with a particular focus on legal tasks, including classification, summarization, open questions, and general reasoning. Our findings confirm that legal tasks pose significant challenges for LLMs, with accuracies often below 50% on legal reasoning benchmarks such as LEXam, compared to over 70% on general-purpose tasks like XNLI. In addition, while English generally yields more stable results, it does not always lead to higher accuracy. Prompt sensitivity and adversarial vulnerability are also shown to persist across languages. Finally, a correlation is found between the performance of a language and its syntactic similarity to English. We also observe that LLaMA is weaker than Gemini, with the latter showing an average advantage of about 24 percentage points across the same tasks. Despite improvements in newer LLMs, challenges remain in deploying them reliably for critical, multilingual legal applications.
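The robustness tests rest on simple input corruptions. As a rough illustration, the sketch below implements two plausible operators, a character-level neighbor swap and a word-level drop; the paper's exact perturbation set is not detailed in this summary, so treat these as assumptions:

```python
# Illustrative character- and word-level perturbations for robustness
# testing; the operators and rates are assumptions.
import random

def char_swap(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent characters at a fraction of positions."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_drop(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly delete a fraction of words."""
    rng = random.Random(seed)
    return " ".join(w for w in text.split() if rng.random() >= rate)

print(char_swap("The defendant breached the contract."))
print(word_drop("The defendant breached the contract."))
```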
[94] NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use
Yuqing Zhang, Ecesu Ürker, Tessa Verhoef, Gemma Boleda, Arianna Bisazza
Main category: cs.CL
TL;DR: NeLLCom-Lex is a neural-agent framework that simulates semantic change by grounding agents in real lexical systems and manipulating their communicative needs, using color naming tasks to study how agents develop human-like lexicons and change behavior.
Details
Motivation: Existing methods for studying lexical semantic change have limitations: observational methods cannot identify causal mechanisms, and experimental methods are difficult to apply to extended diachronic processes.
Method: A neural-agent framework that first grounds agents in real lexical systems (e.g., English), then systematically manipulates communicative needs using color naming tasks, with different supervised and reinforcement learning pipelines.
Result: Neural agents trained to ‘speak’ an existing language can reproduce human-like patterns in color naming to a remarkable extent, simulating the evolution of lexical systems within a single generation.
Conclusion: The framework supports further use of NeLLCom-Lex to elucidate the mechanisms of semantic change, showing that agents can develop human-like naming behavior and change lexicons according to communicative needs.
Abstract: Lexical semantic change has primarily been investigated with observational and experimental methods; however, observational methods (corpus analysis, distributional semantic modeling) cannot get at causal mechanisms, and experimental paradigms with humans are hard to apply to semantic change due to the extended diachronic processes involved. This work introduces NeLLCom-Lex, a neural-agent framework designed to simulate semantic change by first grounding agents in a real lexical system (e.g. English) and then systematically manipulating their communicative needs. Using a well-established color naming task, we simulate the evolution of a lexical system within a single generation, and study which factors lead agents to: (i) develop human-like naming behavior and lexicons, and (ii) change their behavior and lexicons according to their communicative needs. Our experiments with different supervised and reinforcement learning pipelines show that neural agents trained to ‘speak’ an existing language can reproduce human-like patterns in color naming to a remarkable extent, supporting the further use of NeLLCom-Lex to elucidate the mechanisms of semantic change.
[95] Exploring Solution Divergence and Its Effect on Large Language Model Problem Solving
Hang Li, Kaiqi Yang, Yucheng Chu, Hui Liu, Jiliang Tang
Main category: cs.CL
TL;DR: Solution divergence in LLMs is positively correlated with problem-solving ability and can be used as a metric to improve training and evaluation.
Details
Motivation: To explore a new perspective on improving LLM performance by examining solution divergence rather than traditional supervised fine-tuning or reinforcement learning approaches.
Method: Propose solution divergence as a novel metric that supports both SFT and RL strategies, tested across three representative problem domains.
Result: Using solution divergence consistently improves success rates in problem-solving tasks across various models.
Conclusion: Solution divergence is a simple but effective tool for advancing LLM training and evaluation.
Abstract: Large language models (LLMs) have been widely used for problem-solving tasks. Most recent work improves their performance through supervised fine-tuning (SFT) with labeled data or reinforcement learning (RL) from task feedback. In this paper, we study a new perspective: the divergence in solutions generated by LLMs for a single problem. We show that higher solution divergence is positively related to better problem-solving abilities across various models. Based on this finding, we propose solution divergence as a novel metric that can support both SFT and RL strategies. We test this idea on three representative problem domains and find that using solution divergence consistently improves success rates. These results suggest that solution divergence is a simple but effective tool for advancing LLM training and evaluation.
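This summary does not pin down the exact divergence formula, so the sketch below uses mean pairwise Jaccard distance over sampled solutions as one plausible instantiation of solution divergence:

```python
# Hedged sketch of a solution-divergence metric: average pairwise
# Jaccard distance between token sets of sampled solutions. The paper's
# actual definition may differ.
from itertools import combinations

def jaccard_distance(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def solution_divergence(solutions: list[str]) -> float:
    """Mean pairwise distance across all sampled solutions."""
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

samples = ["x = 2 because 2 + 2 = 4",
           "subtract 2 from both sides to get x = 2",
           "x = 2"]
print(round(solution_divergence(samples), 3))
```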
[96] JGU Mainz’s Submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: MT and QA
Hossain Shaikh Saadi, Minh Duc Bui, Mario Sanz-Guerrero, Katharina von der Wense
Main category: cs.CL
TL;DR: The JGU Mainz team submitted models for WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages, achieving performance improvements over baseline in both machine translation and question answering for Ukrainian, Upper Sorbian, and Lower Sorbian.
Details
Motivation: To address the challenge of developing effective language models for Slavic languages with limited resources, specifically focusing on machine translation and question answering tasks.
Method: Jointly fine-tuned Qwen2.5-3B-Instruct models using parameter-efficient finetuning for both tasks, integrated additional translation and QA data, used retrieval-augmented generation for Ukrainian QA, and applied ensembling for Upper and Lower Sorbian QA.
Result: The developed models outperformed the baseline on both machine translation and question answering tasks across all three Slavic languages.
Conclusion: The approach demonstrates that parameter-efficient finetuning combined with task-specific enhancements like retrieval-augmented generation and ensembling can effectively improve performance for low-resource Slavic languages in both translation and QA tasks.
Abstract: This paper presents the JGU Mainz submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: Machine Translation and Question Answering, focusing on Ukrainian, Upper Sorbian, and Lower Sorbian. For each language, we jointly fine-tune a Qwen2.5-3B-Instruct model for both tasks with parameter-efficient finetuning. Our pipeline integrates additional translation and multiple-choice question answering (QA) data. For Ukrainian QA, we further use retrieval-augmented generation. We also apply ensembling for QA in Upper and Lower Sorbian. Experiments show that our models outperform the baseline on both tasks.
[97] Representing LLMs in Prompt Semantic Task Space
Idan Kashani, Avi Mendelson, Yaniv Nemcovsky
Main category: cs.CL
TL;DR: A training-free method to represent LLMs as linear operators in semantic task space for interpretable model selection and performance prediction.
Details
Motivation: Existing LLM representation methods have limited scalability, require costly retraining, and produce non-interpretable representations, making it challenging to select the best LLM for specific tasks from expanding public repositories.
Method: Uses closed-form computation of geometrical properties to represent LLMs as linear operators within prompts’ semantic task space, providing an efficient training-free approach.
Result: Achieves competitive or state-of-the-art results on success prediction and model selection tasks, with notable performance in out-of-sample scenarios.
Conclusion: The proposed method offers highly interpretable LLM representations with exceptional scalability and real-time adaptability to dynamically expanding model repositories.
Abstract: Large language models (LLMs) achieve impressive results over various tasks, and ever-expanding public repositories contain an abundance of pre-trained models. Therefore, identifying the best-performing LLM for a given task is a significant challenge. Previous works have suggested learning LLM representations to address this. However, these approaches present limited scalability and require costly retraining to encompass additional models and datasets. Moreover, the produced representation utilizes distinct spaces that cannot be easily interpreted. This work presents an efficient, training-free approach to representing LLMs as linear operators within the prompts’ semantic task space, thus providing a highly interpretable representation of the models’ application. Our method utilizes closed-form computation of geometrical properties and ensures exceptional scalability and real-time adaptability to dynamically expanding repositories. We demonstrate our approach on success prediction and model selection tasks, achieving competitive or state-of-the-art results with notable performance in out-of-sample scenarios.
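As a loose analogy for the "linear operator in prompt space" idea, the sketch below fits a closed-form ridge map from prompt embeddings to observed success labels and scores unseen prompts. The paper's geometric construction is richer, so this is an assumption-laden illustration, not the authors' method:

```python
# Closed-form, training-free flavor: represent a model by a linear map
# w fitted from prompt embeddings E to per-prompt success y.
import numpy as np

def fit_linear_operator(E: np.ndarray, y: np.ndarray,
                        lam: float = 1e-2) -> np.ndarray:
    """Ridge solution w = (E^T E + lam I)^{-1} E^T y."""
    d = E.shape[1]
    return np.linalg.solve(E.T @ E + lam * np.eye(d), E.T @ y)

rng = np.random.default_rng(1)
E = rng.normal(size=(200, 64))       # embeddings of evaluated prompts
y = (E[:, 0] > 0).astype(float)      # toy success labels
w = fit_linear_operator(E, y)
print(E[:3] @ w)                     # predicted success for new prompts
```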
[98] We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong
Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: AMBS is a two-stage 1-to-N framework for multi-objective LLM alignment that prevents catastrophic forgetting and inference fragmentation by using shared representations and parallel steering branches.
Details
Motivation: Current steering methods for LLM alignment either cause catastrophic forgetting (1-to-1 approaches) or inference fragmentation across objectives (naive 1-to-N approaches), requiring a unified solution.
Method: Two-stage framework: Stage I computes shared post-attention hidden states; Stage II clones these into parallel branches with policy-reference steering for objective-specific control while maintaining consistency.
Result: On DeepSeek-7B, AMBS improves average HHH alignment scores by +32.4%, reduces unsafe outputs by 11.0% vs naive 1-to-N baseline, and remains competitive with SOTA methods.
Conclusion: AMBS provides an effective approach for unified multi-objective LLM alignment that maintains consistency across objectives while improving performance.
Abstract: Alignment of Large Language Models (LLMs) along multiple objectives-helpfulness, harmlessness, and honesty (HHH)-is critical for safe and reliable deployment. Prior work has used steering vector-small control signals injected into hidden states-to guide LLM outputs, typically via one-to-one (1-to-1) Transformer decoders. In this setting, optimizing a single alignment objective can inadvertently overwrite representations learned for other objectives, leading to catastrophic forgetting. More recent approaches extend steering vectors via one-to-many (1-to-N) Transformer decoders. While this alleviates catastrophic forgetting, naive multi-branch designs optimize each objective independently, which can cause inference fragmentation-outputs across HHH objectives may become inconsistent. We propose Adaptive Multi-Branch Steering (AMBS), a two-stage 1-to-N framework for unified and efficient multi-objective alignment. In Stage I, post-attention hidden states of the Transformer layer are computed once to form a shared representation. In Stage II, this representation is cloned into parallel branches and steered via a policy-reference mechanism, enabling objective-specific control while maintaining cross-objective consistency. Empirical evaluations on Alpaca, BeaverTails, and TruthfulQA show that AMBS consistently improves HHH alignment across multiple 7B LLM backbones. For example, on DeepSeek-7B, AMBS improves average alignment scores by +32.4% and reduces unsafe outputs by 11.0% compared to a naive 1-to-N baseline, while remaining competitive with state-of-the-art methods.
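A minimal sketch of the two-stage 1-to-N pattern, assuming placeholder steering vectors and omitting the policy-reference mechanism and branch fusion:

```python
# Stage I: compute the shared post-attention hidden state once.
# Stage II: clone it into parallel branches and add an objective-specific
# steering vector to each. Vector values and alpha are placeholders.
import numpy as np

OBJECTIVES = ["helpful", "harmless", "honest"]

def steer_branches(shared_hidden: np.ndarray,
                   steering: dict[str, np.ndarray],
                   alpha: float = 0.5) -> dict[str, np.ndarray]:
    return {obj: shared_hidden + alpha * steering[obj] for obj in OBJECTIVES}

rng = np.random.default_rng(2)
shared = rng.normal(size=4096)                       # Stage I output
vectors = {obj: rng.normal(size=4096) for obj in OBJECTIVES}
print({k: v.shape for k, v in steer_branches(shared, vectors).items()})
```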
[99] InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Yuan Xie, Hongxia Yang
Main category: cs.CL
TL;DR: An end-to-end FP8 training recipe that enables lossless LLM training with substantial efficiency gains (22% faster training, 14% lower memory, 19% higher throughput) compared to BF16 baseline.
Details
Motivation: The immense computational cost of training LLMs is a major barrier to innovation, and while FP8 offers theoretical efficiency gains, there's no comprehensive open-source training recipe available.
Method: Fine-grained, hybrid-granularity quantization strategy for FP8 training that seamlessly integrates continual pre-training and supervised fine-tuning while maintaining numerical fidelity.
Result: The FP8 recipe is remarkably stable and essentially lossless, achieving performance on par with BF16 baseline across reasoning benchmarks, with 22% reduction in training time, 14% decrease in peak memory usage, and 19% increase in throughput.
Conclusion: FP8 is established as a practical and robust alternative to BF16 for large-scale model training, with code release to democratize access.
Abstract: The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including continued pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
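The core bookkeeping in fine-grained FP8 quantization is per-block scaling so every block fits the E4M3 range (largest finite value 448). The toy sketch below computes block scales and clipped values only; real recipes cast to an FP8 dtype in hardware, and the block size here is an assumption:

```python
# Toy block-wise FP8-style scaling: one scale per block of 128 values.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_blockwise(x: np.ndarray, block: int = 128):
    """Return scaled blocks (ready for an FP8 cast) and their scales."""
    x = x.reshape(-1)
    x = np.pad(x, (0, (-len(x)) % block))  # pad to a whole number of blocks
    blocks = x.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(blocks / scales, -E4M3_MAX, E4M3_MAX)
    return q, scales

x = np.random.default_rng(3).normal(scale=10.0, size=1000)
q, s = quantize_blockwise(x)
print(q.shape, s.shape)  # (8, 128) (8, 1)
```

Dequantization is then `q * scales`, which bounds the rounding error per block rather than per tensor; this is the sense in which finer granularity preserves numerical fidelity.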
[100] Think Socially via Cognitive Reasoning
Jinfeng Zhou, Zheyu Chen, Shuai Wang, Quanyu Dai, Zhenhua Dong, Hongning Wang, Minlie Huang
Main category: cs.CL
TL;DR: Introduces Cognitive Reasoning and CogFlow framework to enhance LLMs’ social cognition by modeling human interpretive processes through structured cognitive flows and reinforcement learning.
Details
Motivation: Current LLMs excel at logical reasoning but struggle with social situations that require interpretive analysis of ambiguous cues rather than definitive answers.
Method: Proposes CogFlow framework: curates cognitive flow dataset via tree-structured planning, uses supervised fine-tuning for basic capability, then reinforcement learning with multi-objective rewards for self-improvement.
Result: Extensive experiments show CogFlow effectively enhances social cognitive capabilities of LLMs and even humans, leading to more effective social decision-making.
Conclusion: Cognitive Reasoning paradigm and CogFlow framework successfully bridge the gap between logical reasoning and social cognition in AI systems.
Abstract: LLMs trained for logical reasoning excel at step-by-step deduction to reach verifiable answers. However, this paradigm is ill-suited for navigating social situations, which induce an interpretive process of analyzing ambiguous cues that rarely yield a definitive outcome. To bridge this gap, we introduce Cognitive Reasoning, a paradigm modeled on human social cognition. It formulates the interpretive process into a structured cognitive flow of interconnected cognitive units (e.g., observation or attribution), which combine adaptively to enable effective social thinking and responses. We then propose CogFlow, a complete framework that instills this capability in LLMs. CogFlow first curates a dataset of cognitive flows by simulating the associative and progressive nature of human thought via tree-structured planning. After instilling the basic cognitive reasoning capability via supervised fine-tuning, CogFlow adopts reinforcement learning to enable the model to improve itself via trial and error, guided by a multi-objective reward that optimizes both cognitive flow and response quality. Extensive experiments show that CogFlow effectively enhances the social cognitive capabilities of LLMs, and even humans, leading to more effective social decision-making.
[101] Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation
Wenyuan Chen, Fateme Nateghi Haredasht, Kameron C. Black, Francois Grolleau, Emily Alsentzer, Jonathan H. Chen, Stephen P. Ma
Main category: cs.CL
TL;DR: Developed a retrieval-augmented evaluation pipeline (RAEC) using LLMs to assess draft responses to patient messages, with improved error detection through historical message-response pairs.
Details
Motivation: Asynchronous patient-clinician messaging via EHR portals increases clinician workload, and LLM-generated draft responses may contain clinical inaccuracies, omissions, or tone mismatches that require robust evaluation.
Method: Created a clinically grounded error ontology (5 domains, 59 codes), developed RAEC pipeline using retrieval of similar historical message-response pairs, and implemented two-stage DSPy prompting for hierarchical error detection.
Result: Retrieval context improved error identification in clinical completeness and workflow appropriateness. Human validation showed superior agreement (50% vs 33%) and performance (F1=0.500 vs 0.256) compared to baseline.
Conclusion: The RAEC pipeline serves as effective AI guardrails for patient messaging, demonstrating that context-enhanced evaluation significantly improves error detection in LLM-generated clinical responses.
Abstract: Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a retrieval-augmented evaluation pipeline (RAEC) that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.
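The retrieval step can be sketched as nearest-neighbor search over embedded historical messages followed by prompt assembly for the judge. The embedding source and prompt wording below are assumptions; the actual pipeline wraps this in a two-stage DSPy program:

```python
# Retrieve the k most similar historical message-response pairs by
# cosine similarity, then splice them into the evaluation prompt.
import numpy as np

def top_k_similar(query_vec: np.ndarray, archive_vecs: np.ndarray, k: int = 3):
    q = query_vec / np.linalg.norm(query_vec)
    A = archive_vecs / np.linalg.norm(archive_vecs, axis=1, keepdims=True)
    return np.argsort(-(A @ q))[:k]

def build_judge_prompt(message: str, draft: str, exemplars) -> str:
    context = "\n".join(f"[Past case] {m} -> {r}" for m, r in exemplars)
    return (f"Assess the draft reply for clinical errors.\n{context}\n"
            f"[Message] {message}\n[Draft] {draft}")

rng = np.random.default_rng(6)
idx = top_k_similar(rng.normal(size=32), rng.normal(size=(100, 32)))
print(idx)
```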
[102] Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs
Yehonatan Pesiakhovsky, Zorik Gekhman, Yosi Mass, Liat Ein-Dor, Roi Reichart
Main category: cs.CL
TL;DR: This paper studies using LLMs to detect context-grounded hallucinations in model outputs, creates a benchmark for evaluating hallucination localization, and analyzes LLM performance on this challenging task.
Details
Motivation: To provide a more practical alternative to complex evaluation pipelines for detecting hallucinations where model outputs contain unverifiable information not present in source texts.
Method: Constructed a benchmark with 1,000+ human-annotated examples, proposed free-form textual descriptions for error representation, and evaluated four large-scale LLMs using optimal prompting strategies.
Result: The benchmark proved challenging: the best model achieved an F1 score of only 0.67. Key challenges identified: LLMs incorrectly flag missing details as inconsistent and struggle with factually correct outputs that align with parametric knowledge but aren’t verifiable from source.
Conclusion: LLMs show promise for hallucination localization but face significant challenges, particularly with distinguishing between actual inconsistencies and missing details, and handling outputs that align with their parametric knowledge but lack source verification.
Abstract: Context-grounded hallucinations are cases where model outputs contain information not verifiable against the source text. We study the applicability of LLMs for localizing such hallucinations, as a more practical alternative to existing complex evaluation pipelines. In the absence of established benchmarks for meta-evaluation of hallucinations localization, we construct one tailored to LLMs, involving a challenging human annotation of over 1,000 examples. We complement the benchmark with an LLM-based evaluation protocol, verifying its quality in a human evaluation. Since existing representations of hallucinations limit the types of errors that can be expressed, we propose a new representation based on free-form textual descriptions, capturing the full range of possible errors. We conduct a comprehensive study, evaluating four large-scale LLMs, which highlights the benchmark’s difficulty, as the best model achieves an F1 score of only 0.67. Through careful analysis, we offer insights into optimal prompting strategies for the task and identify the main factors that make it challenging for LLMs: (1) a tendency to incorrectly flag missing details as inconsistent, despite being instructed to check only facts in the output; and (2) difficulty with outputs containing factually correct information absent from the source - and thus not verifiable - due to alignment with the model’s parametric knowledge.
[103] ArabJobs: A Multinational Corpus of Arabic Job Ads
Mo El-Haj
Main category: cs.CL
TL;DR: ArabJobs is a public corpus of 8,500+ Arabic job ads from 4 Arab countries, used for analyzing gender representation, occupational structure, dialectal variation, and applications like salary estimation and bias detection.
Details
Motivation: To create a comprehensive dataset capturing linguistic, regional, and socio-economic variation in the Arab labor market for fairness-aware Arabic NLP research.
Method: Collected over 8,500 job advertisements from Egypt, Jordan, Saudi Arabia, and UAE, comprising 550,000+ words, and applied analyses on gender representation, occupational structure, and dialectal variation.
Result: The dataset enables analyses of gender bias, profession classification, salary estimation, and job category normalization using large language models, demonstrating utility for labor market research.
Conclusion: ArabJobs provides valuable resources for fairness-aware Arabic NLP and labor market studies, with the dataset publicly available on GitHub for future research.
Abstract: ArabJobs is a publicly available corpus of Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the United Arab Emirates. Comprising over 8,500 postings and more than 550,000 words, the dataset captures linguistic, regional, and socio-economic variation in the Arab labour market. We present analyses of gender representation and occupational structure, and highlight dialectal variation across ads, which offers opportunities for future research. We also demonstrate applications such as salary estimation and job category normalisation using large language models, alongside benchmark tasks for gender bias detection and profession classification. The findings show the utility of ArabJobs for fairness-aware Arabic NLP and labour market research. The dataset is publicly available on GitHub: https://github.com/drelhaj/ArabJobs.
[104] From Formal Language Theory to Statistical Learning: Finite Observability of Subregular Languages
Katsuhiko Hayashi, Hidetaka Kamigaito
Main category: cs.CL
TL;DR: All standard subregular language classes are linearly separable when represented by their deciding predicates, enabling learnability with simple linear models.
Details
Motivation: To establish that subregular language classes provide a rigorous and interpretable foundation for modeling natural language structure.
Method: Representing subregular language classes by their deciding predicates and using linear models for learning, with experiments on synthetic data and English morphology.
Result: Perfect separability under noise-free conditions in synthetic experiments, and learned features aligning with well-known linguistic constraints in real-data experiments on English morphology.
Conclusion: The subregular hierarchy provides a rigorous and interpretable foundation for modeling natural language structure, with demonstrated linear separability and learnability.
Abstract: We prove that all standard subregular language classes are linearly separable when represented by their deciding predicates. This establishes finite observability and guarantees learnability with simple linear models. Synthetic experiments confirm perfect separability under noise-free conditions, while real-data experiments on English morphology show that learned features align with well-known linguistic constraints. These results demonstrate that the subregular hierarchy provides a rigorous and interpretable foundation for modeling natural language structure. Our code used in real-data experiments is available at https://github.com/UTokyo-HayashiLab/subregular.
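To see why predicate features yield linear separability, consider a strictly 2-local language decided by a forbidden factor: featurize each string by which length-2 factors it contains, and a perceptron separates members from non-members. The alphabet, k=2, and the toy language below are illustrative assumptions:

```python
# Factor-predicate features for a strictly local (SL-2) toy language:
# strings over {a, b} containing no "aa" factor.
from itertools import product

ALPHABET = "ab"

def factor_features(s: str) -> list[int]:
    padded = f"#{s}#"  # word-boundary markers
    factors = sorted("".join(p) for p in product(ALPHABET + "#", repeat=2))
    return [int(f in padded) for f in factors]

def perceptron(X, y, epochs=50):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            if pred != yi:
                w = [wj + (yi - pred) * xj for wj, xj in zip(w, xi)]
                b += yi - pred
    return w, b

data = ["ab", "ba", "bab", "abab", "aa", "baab", "aab"]
labels = [0 if "aa" in s else 1 for s in data]
X = [factor_features(s) for s in data]
w, b = perceptron(X, labels)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
print(preds == labels)  # expected True: the single "aa" feature decides it
```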
[105] Capturing Opinion Shifts in Deliberative Discourse through Frequency-based Quantum deep learning methods
Rakesh Thakur, Harsh Chaturvedi, Ruqayya Shah, Janvi Chauhan, Ayush Sharma
Main category: cs.CL
TL;DR: Comparative analysis of NLP techniques for modeling deliberation, showing that Frequency-Based Discourse Modulation and Quantum-Deliberation Framework outperform existing models in interpreting deliberative discourse and predicting opinion shifts.
Details
Motivation: To computationally model deliberation by analyzing opinion shifts and predicting outcomes under varying scenarios, leveraging recent NLP advancements to understand how models interpret deliberative discourse.
Method: Collected opinions from diverse individuals to create a self-sourced dataset, simulated deliberation using product presentations with striking facts, and compared multiple NLP techniques including Frequency-Based Discourse Modulation and Quantum-Deliberation Framework.
Result: The Frequency-Based Discourse Modulation and Quantum-Deliberation Framework models outperformed existing state-of-the-art models in interpreting deliberative discourse and producing meaningful insights about opinion shifts.
Conclusion: The study demonstrates practical applications in public policy-making, debate evaluation, decision-support frameworks, and large-scale social media opinion mining, highlighting the effectiveness of advanced NLP techniques for deliberation modeling.
Abstract: Deliberation plays a crucial role in shaping outcomes by weighing diverse perspectives before reaching decisions. With recent advancements in Natural Language Processing, it has become possible to computationally model deliberation by analyzing opinion shifts and predicting potential outcomes under varying scenarios. In this study, we present a comparative analysis of multiple NLP techniques to evaluate how effectively models interpret deliberative discourse and produce meaningful insights. Opinions from individuals of varied backgrounds were collected to construct a self-sourced dataset that reflects diverse viewpoints. Deliberation was simulated using product presentations enriched with striking facts, which often prompted measurable shifts in audience opinions. We compare two models, Frequency-Based Discourse Modulation and the Quantum-Deliberation Framework, both of which outperform existing state-of-the-art models. The findings highlight practical applications in public policy-making, debate evaluation, decision-support frameworks, and large-scale social media opinion mining.
[106] From tests to effect sizes: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation benchmarks
Jonne Sälevä, Duygu Ataman, Constantine Lignos
Main category: cs.CL
TL;DR: Resampling methods for quantifying uncertainty in multilingual/multitask NLP benchmarks, accounting for both model- and data-related variation to avoid underestimating variability.
Details
Motivation: To properly quantify uncertainty and statistical precision of evaluation metrics in multilingual/multitask NLP benchmarks, as experimental variation arises from both model- and data-related sources.
Method: Resampling-based methods that account for both model- and data-related sources of variation in performance scores.
Result: Shows that accounting for both sources of variation is necessary to avoid substantially underestimating overall variability; demonstrates utility for computing sampling distributions for leaderboard quantities like averages, pairwise differences, and rankings.
Conclusion: Resampling methods are effective for comprehensive uncertainty quantification in NLP benchmarks, providing better statistical precision estimates for common leaderboard metrics.
Abstract: In this paper, we introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model- and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for computing sampling distributions for various quantities used in leaderboards such as the average/median, pairwise differences between models, and rankings.
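A minimal version of the idea, assuming a (runs x items) score matrix for one model: bootstrap jointly over runs (model-side variation, e.g., random seeds) and items (data-side variation), then read off a percentile interval for the leaderboard average. The paper develops more careful variants for pairwise differences and rankings:

```python
# Double bootstrap over runs and items for one model's average score.
import numpy as np

def bootstrap_avg(scores: np.ndarray, n_boot: int = 2000, seed: int = 0):
    """scores: (runs, items) matrix; returns a 95% percentile interval."""
    rng = np.random.default_rng(seed)
    runs, items = scores.shape
    means = np.empty(n_boot)
    for b in range(n_boot):
        r = rng.integers(0, runs, size=runs)     # resample runs (model side)
        i = rng.integers(0, items, size=items)   # resample items (data side)
        means[b] = scores[np.ix_(r, i)].mean()
    return np.percentile(means, [2.5, 97.5])

scores = np.random.default_rng(4).binomial(1, 0.7, size=(3, 500)).astype(float)
print(bootstrap_avg(scores))
```

Resampling only one axis (runs or items) would understate the variability, which is the paper's central warning.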
[107] StateX: Enhancing RNN Recall via Post-training State Expansion
Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
Main category: cs.CL
TL;DR: StateX is a post-training pipeline that efficiently expands recurrent states in pre-trained RNNs (linear attention and state space models) to improve recall and in-context learning abilities without significantly increasing parameters or training costs.
Details
Motivation: Transformer models are expensive for long contexts, while RNNs with constant per-token complexity struggle with accurate recall from long contexts due to limited state size. Directly training RNNs with larger states is costly.
Method: Design post-training architectural modifications for linear attention and state space models to scale up state size with minimal parameter increase, using a training pipeline called StateX.
Result: Experiments on models up to 1.3B parameters show StateX efficiently enhances recall and in-context learning abilities without high post-training costs or compromising other capabilities.
Conclusion: StateX provides an effective solution for improving RNNs’ long-context recall abilities through efficient post-training state expansion, bridging the gap between RNN efficiency and Transformer-like recall performance.
Abstract: While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
[108] Variational Reasoning for Language Models
Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
Main category: cs.CL
TL;DR: A variational reasoning framework that treats thinking traces as latent variables and optimizes them through variational inference, unifying variational methods with RL-style approaches for language model reasoning.
Details
Motivation: To provide a principled probabilistic perspective that unifies variational inference with RL-style methods for improving language model reasoning abilities.
Method: Proposes a variational reasoning framework using thinking traces as latent variables, extends ELBO to multi-trace objectives, introduces forward-KL formulation for training stability, and shows connections to rejection sampling and binary-reward RL.
Result: Empirically validated on Qwen 2.5 and Qwen 3 model families across various reasoning tasks, demonstrating stable objectives and improved reasoning capabilities.
Conclusion: The work provides a unified probabilistic framework that connects variational inference with RL methods, revealing previously unnoticed biases and offering stable training objectives for enhancing language model reasoning.
Abstract: We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
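For orientation, the starting point is the standard evidence lower bound with the thinking trace z treated as a latent variable; the notation below is my own, and the paper then tightens this with a multi-trace objective and stabilizes training with a forward-KL variant:

```latex
\log p_\theta(y \mid x) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x, y)}\!\big[\log p_\theta(y \mid x, z)\big]
\;-\; \mathrm{KL}\!\big(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\big)
```

Here q_\phi is the variational posterior over traces given the question x and answer y, and p_\theta(z | x) is the model's own trace distribution.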
[109] Language Models Can Learn from Verbal Feedback Without Scalar Rewards
Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang
Main category: cs.CL
TL;DR: FCP treats verbal feedback as a conditioning signal for LLMs instead of compressing it into scalar rewards, enabling more expressive learning from feedback through conditional generation.
Details
Motivation: Current RL methods compress nuanced verbal feedback into scalar rewards, losing richness and causing scale imbalance in LLM training.
Method: Feedback-conditional policy (FCP) learns from response-feedback pairs via maximum likelihood training on offline data, plus online bootstrapping where policy generates under positive conditions and refines with fresh feedback.
Result: FCP reframes feedback-driven learning as conditional generation rather than reward optimization, providing a more expressive approach for LLMs to learn directly from verbal feedback.
Conclusion: The proposed method offers a novel paradigm for LLM training that preserves the richness of verbal feedback through conditional generation techniques.
Abstract: LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
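The offline stage amounts to maximum-likelihood training on feedback-conditioned inputs. A minimal sketch of the data construction, with the template string as an assumption:

```python
# Build one (input, target) pair for feedback-conditional training:
# the feedback is prepended as a conditioning signal rather than
# collapsed into a scalar reward.
def fcp_example(prompt: str, feedback: str, response: str) -> tuple[str, str]:
    conditioned = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse:"
    return conditioned, response

# At inference time, condition on a positive feedback string so the
# policy generates as if praised, then refine with fresh feedback online.
inp, tgt = fcp_example(
    prompt="Summarize the meeting notes.",
    feedback="Clear and covers all action items.",
    response="The team agreed on three action items: ...",
)
print(inp)
```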
[110] Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty
Main category: cs.CL
TL;DR: N-gram novelty alone is insufficient for measuring creativity as it ignores appropriateness. Expert annotations show most high n-gram novelty expressions aren’t creative, and LLMs struggle with pragmaticality at high novelty.
Details
Motivation: To investigate whether n-gram novelty adequately captures creativity's dual nature (novelty + appropriateness) and examine how LLMs perform on creative expression identification.
Method: Used 7542 expert writer annotations of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. Tested zero-shot, few-shot, and finetuned models on creative expression identification.
Result: 91% of top-quartile n-gram novelty expressions weren’t judged creative. Higher n-gram novelty in LLMs correlates with lower pragmaticality. Frontier LLMs perform better than random but struggle with non-pragmatic expressions.
Conclusion: N-gram novelty alone is inadequate for measuring creativity. LLMs need improvement in identifying creative and non-pragmatic expressions, though LLM-as-a-Judge novelty scores show promise for predicting expert preferences.
Abstract: N-gram novelty is widely used to evaluate language models’ ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity’s dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier close-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
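For reference, the metric under critique is straightforward: the fraction of a text's n-grams absent from a reference corpus. Whitespace tokenization and n=4 below are assumptions:

```python
# N-gram novelty: share of the text's n-grams unseen in the corpus.
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(text: str, corpus: list[str], n: int = 4) -> float:
    seen = set()
    for doc in corpus:
        seen |= ngrams(doc.split(), n)
    grams = ngrams(text.split(), n)
    return sum(g not in seen for g in grams) / len(grams) if grams else 0.0

corpus = ["the cat sat on the mat", "a dog barked at the moon"]
print(ngram_novelty("the cat sat on the moon and sang", corpus))  # 0.6
```

The paper's point is that a high value of this score says nothing about whether the expression is also sensical and pragmatic.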
[111] WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Junting Pan, Mingjie Zhan, Hongsheng Li
Main category: cs.CL
TL;DR: WebGen-Agent is a website-generation agent that uses visual feedback from screenshots and GUI-agent testing to iteratively refine code generation, outperforming previous state-of-the-art systems.
Details
Motivation: Current code agents rely only on simple code execution feedback, which fails to capture the actual quality of generated websites that depend heavily on visual effects and user-interaction feedback.
Method: Uses visual language model (VLM) to generate detailed descriptions and suggestions from screenshots and GUI-agent testing, with backtracking and select-best mechanism. Also introduces Step-GRPO training using screenshot and GUI-agent scores as dense process supervision.
Result: On WebGen-Bench dataset, increased Claude-3.5-Sonnet accuracy from 26.4% to 51.9% and appearance score from 3.0 to 3.9. Step-GRPO training increased Qwen2.5-Coder-7B-Instruct accuracy from 38.9% to 45.4% and appearance score from 3.4 to 3.7.
Conclusion: Comprehensive visual feedback and Step-GRPO training significantly improve website-generation performance, demonstrating the importance of visual evaluation for code quality assessment in web development tasks.
Abstract: Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual language model (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce \textit{Step-GRPO with Screenshot and GUI-agent Feedback} to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model’s website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises the appearance score from 3.4 to 3.7.
[112] Constituency Parsing using LLMs
Xuefeng Bai, Jialong Wu, Yulong Chen, Zhongqing Wang, Kehai Chen, Min Zhang, Yue Zhang
Main category: cs.CL
TL;DR: LLMs reformat constituency parsing as sequence-to-sequence generation but struggle with valid tree generation. Proposed methods use error learning and multi-agent collaboration to improve parsing accuracy.
Details
Motivation: Constituency parsing remains unsolved in NLP, and LLMs show potential but lack mechanisms to guarantee valid and faithful constituent trees.
Method: Reformat constituency parsing as seq2seq generation; evaluate LLMs under zero-shot, few-shot, and fine-tuning paradigms; propose two strategies: learning from errors and multi-agent output refinement.
Result: LLMs achieve acceptable improvements but have substantial limitations; proposed methods effectively reduce invalid/unfaithful trees and enhance parsing performance across learning paradigms.
Conclusion: The proposed error-learning and multi-agent collaboration strategies successfully improve constituency parsing by ensuring more valid and faithful constituent tree generation.
Abstract: Constituency parsing is a fundamental yet unsolved challenge in natural language processing. In this paper, we examine the potential of recent large language models (LLMs) to address this challenge. We reformat constituency parsing as a sequence-to-sequence generation problem and evaluate the performance of a diverse range of LLMs under zero-shot, few-shot, and supervised fine-tuning learning paradigms. We observe that while LLMs achieve acceptable improvements, they still encounter substantial limitations, due to the absence of mechanisms to guarantee the validity and faithfulness of the generated constituent trees. Motivated by this observation, we propose two strategies to guide LLMs to generate more accurate constituent trees by learning from erroneous samples and refining outputs in a multi-agent collaboration way, respectively. The experimental results demonstrate that our methods effectively reduce the occurrence of invalid and unfaithful trees, thereby enhancing overall parsing performance and achieving promising results across different learning paradigms.
[113] TEXT2AFFORD: Probing Object Affordance Prediction abilities of Language Models solely from Text
Sayantan Adak, Daivik Agrawal, Animesh Mukherjee, Somak Aditya
Main category: cs.CL
TL;DR: This paper investigates object affordance knowledge in pre-trained language models and vision-language models, introduces a new dataset called Text2Afford with 15 affordance classes, and finds that current models have limited reasoning abilities for uncommon affordances.
Details
Motivation: To quantify the effect of grounding in pre-trained models by examining their understanding of object affordances, as current models show inconsistent failures and lack of reasoning capabilities.
Method: Curated a novel Text2Afford dataset with in-the-wild sentences annotated with objects and affordances, then evaluated PTLMs and VLMs on affordance reasoning tasks and conducted few-shot fine-tuning experiments.
Result: PTLMs show limited reasoning for uncommon object affordances, pre-trained VLMs don’t effectively capture object affordances, but few-shot fine-tuning improves affordance knowledge in both model types.
Conclusion: The research contributes a novel dataset for language grounding tasks and provides insights into LM capabilities regarding object affordances, advancing understanding of this important aspect of grounded reasoning.
Abstract: We investigate the knowledge of object affordances in pre-trained language models (LMs) and pre-trained Vision-Language models (VLMs). A growing body of literature shows that PTLMs fail inconsistently and non-intuitively, demonstrating a lack of reasoning and grounding. To take a first step toward quantifying the effect of grounding (or lack thereof), we curate a novel and comprehensive dataset of object affordances – Text2Afford, characterized by 15 affordance classes. Unlike affordance datasets collected in vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improvement in affordance knowledge in PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks, and presents insights into LM capabilities, advancing the understanding of object affordances. Codes and data are available at https://github.com/sayantan11995/Text2Afford
[114] Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs
Weixuan Wang, Barry Haddow, Minghao Wu, Wei Peng, Alexandra Birch
Main category: cs.CL
TL;DR: This study examines neuron activation sharing across tasks and languages in multilingual LLMs, classifying neurons into four categories and finding that all-shared neurons are crucial for performance.
Details
Motivation: Most LLM research focuses on monolingual (English) settings, creating a gap in understanding how LLMs work in multilingual contexts. This study aims to explore neuron activation patterns across different languages.
Method: Classified neurons into four categories (all-shared, partial-shared, specific, non-activated) based on their responses across languages. Conducted experiments on three tasks across nine languages using several LLMs.
Result: Deactivating all-shared neurons significantly decreases performance; shared neurons play vital role in response generation; neuron activation patterns are highly sensitive and vary across tasks, LLMs, and languages.
Conclusion: The findings provide insights into internal workings of multilingual LLMs and pave way for future research. Code is released to foster further investigation.
Abstract: Large language models (LLMs) have revolutionized the field of natural language processing (NLP), and recent studies have aimed to understand their underlying mechanisms. However, most of this research is conducted within a monolingual setting, primarily focusing on English. Few studies have attempted to explore the internal workings of LLMs in multilingual settings. In this study, we aim to fill this research gap by examining how neuron activation is shared across tasks and languages. We classify neurons into four distinct categories based on their responses to a specific input across different languages: all-shared, partial-shared, specific, and non-activated. Building upon this categorisation, we conduct extensive experiments on three tasks across nine languages using several LLMs and present an in-depth analysis in this work. Our findings reveal that: (i) deactivating the all-shared neurons significantly decreases performance; (ii) the shared neurons play a vital role in generating responses, especially for the all-shared neurons; (iii) neuron activation patterns are highly sensitive and vary across tasks, LLMs, and languages. These findings shed light on the internal workings of multilingual LLMs and pave the way for future research. We release the code to foster research in this area.
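To make the four-way categorisation concrete, here is a minimal sketch (not the authors' code; the activation threshold `eps` is an assumed choice) of how a neuron's per-language activations could be bucketed:

```python
# Hedged sketch of the four-way neuron categorisation: for one neuron
# and one input, compare its activation across languages. The eps
# threshold deciding "activated" is a hypothetical choice.
def categorize(activations: dict[str, float], eps: float = 0.0) -> str:
    """activations maps language code -> activation value."""
    active = [lang for lang, a in activations.items() if a > eps]
    if len(active) == len(activations):
        return "all-shared"
    if not active:
        return "non-activated"
    if len(active) == 1:
        return "specific"
    return "partial-shared"

print(categorize({"en": 0.9, "de": 0.4, "zh": 0.7}))  # all-shared
print(categorize({"en": 0.9, "de": 0.0, "zh": 0.0}))  # specific
```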
[115] LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking
Amy Xin, Yunjia Qi, Zijun Yao, Fangwei Zhu, Kaisheng Zeng, Xu Bin, Lei Hou, Juanzi Li
Main category: cs.CL
TL;DR: LLMAEL enhances specialized entity linking models by using LLMs to generate entity descriptions as additional context, achieving state-of-the-art results across 6 benchmarks with 8.9% accuracy improvement.
Details
Motivation: Specialized EL models struggle with long-tail entities due to limited training data, while LLMs have broader knowledge but fail at accurate entity name generation. LLMs are better at context generation than EL execution.
Method: LLMAEL uses off-the-shelf, tuning-free LLMs as context augmenters to generate entity descriptions that serve as additional input for specialized entity linking models.
Result: Sets new state-of-the-art results across 6 EL benchmarks, achieving an absolute 8.9% gain in EL accuracy compared to prior LLM integration methods.
Conclusion: The framework successfully leverages LLMs’ context generation capabilities to enhance specialized EL models, demonstrating significant performance improvements without requiring LLM fine-tuning.
Abstract: Specialized entity linking (EL) models are well-trained at mapping mentions to unique knowledge base (KB) entities according to a given context. However, specialized EL models struggle to disambiguate long-tail entities due to their limited training data. Meanwhile, extensively pre-trained large language models (LLMs) possess broader knowledge of uncommon entities. Yet, with a lack of specialized EL training, LLMs frequently fail to generate accurate KB entity names, limiting their standalone effectiveness in EL. With the observation that LLMs are more adept at context generation instead of EL execution, we introduce LLM-Augmented Entity Linking (LLMAEL), the first framework to enhance specialized EL models with LLM data augmentation. LLMAEL leverages off-the-shelf, tuning-free LLMs as context augmenters, generating entity descriptions to serve as additional input for specialized EL models. Experiments show that LLMAEL sets new state-of-the-art results across 6 widely adopted EL benchmarks: compared to prior methods that integrate tuning-free LLMs into EL, LLMAEL achieves an absolute 8.9% gain in EL accuracy. We release our code and datasets.
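The pipeline itself is simple to state; the following sketch (assumed interfaces: `llm` and `el_model` are hypothetical callables, and fusion by concatenation is our illustration) shows the flow of LLM-generated context into the EL model:

```python
# Hedged sketch of LLM-augmented entity linking: an off-the-shelf LLM
# describes the mention's entity, and the description is appended to
# the original context before the specialized EL model runs. The
# prompt wording is illustrative, not the paper's.
def llm_augmented_el(llm, el_model, mention: str, context: str) -> str:
    description = llm(
        f"Briefly describe the entity referred to by '{mention}' "
        f"in the following text: {context}")
    return el_model(mention=mention, context=context + " " + description)
```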
[116] Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models
Runsong Zhao, Xin Liu, Xinyu Liu, Pengcheng Huang, Chunyang Xiao, Tong Xiao, Jingbo Zhu
Main category: cs.CL
TL;DR: EPL is a method that improves context compression in LLMs by adjusting position IDs to minimize distance between context tokens and special tokens while maintaining sequence order.
Details
Motivation: Existing context compression approaches neglect that position encodings induce local biases, causing them to ignore holistic contextual dependencies.
Method: Enhanced Position Layout (EPL) adjusts position IDs to minimize distance between context tokens and special tokens while maintaining sequence order between all tokens.
Result: EPL improves ROUGE-1 F1 by 1.9 points on out-of-domain QA datasets and boosts vision compression LLM accuracy by 2.6 points in multimodal scenarios.
Conclusion: Simple adjustments to position IDs can significantly enhance context compression capabilities in LLMs across both text and multimodal domains.
Abstract: Using special tokens (e.g., gist, memory, or compressed tokens) to compress context information is a common practice for large language models (LLMs). However, existing approaches often neglect that position encodings inherently induce local inductive biases in models, causing the compression process to ignore holistic contextual dependencies. We propose Enhanced Position Layout (EPL), a simple yet effective method that improves the context compression capability of LLMs by only adjusting position IDs, the numerical identifiers that specify token positions. EPL minimizes the distance between context tokens and their corresponding special tokens and at the same time maintains the sequence order in position IDs between context tokens, special tokens, and the subsequent tokens. Integrating EPL into our best performing context compression model results in a 1.9 ROUGE-1 F1 improvement on out-of-domain question answering datasets on average. When extended to multimodal scenarios, EPL leads to an average accuracy gain of 2.6 points for vision compression LLMs.
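As a concrete illustration, here is one plausible reading of the EPL layout as code (our sketch under assumptions, not the authors' implementation): context is split into chunks, each followed by a special compression token whose position ID sits immediately after its own chunk, keeping global order monotone.

```python
# Hedged sketch of an EPL-style position layout: each chunk's special
# token gets a position ID adjacent to that chunk (minimal distance),
# and IDs stay strictly increasing (order preserved).
def epl_position_ids(chunk_lens: list[int]) -> tuple[list[int], list[int]]:
    """Return (context_positions, special_positions) for a chunked context."""
    context_pos, special_pos = [], []
    p = 0
    for n in chunk_lens:
        context_pos.extend(range(p, p + n))  # chunk tokens: p .. p+n-1
        special_pos.append(p + n)            # special token right after
        p += n + 1                           # next chunk continues in order
    return context_pos, special_pos

# Example: three chunks of 4 tokens each.
ctx, spc = epl_position_ids([4, 4, 4])
print(spc)  # [4, 9, 14] -- each special token adjacent to its chunk
```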
[117] Stuffed Mamba: Oversized States Lead to the Inability to Forget
Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
Main category: cs.CL
TL;DR: Mamba-based RNN models struggle to effectively forget earlier tokens despite built-in forgetting mechanisms, due to training on contexts shorter than state size. Minimum training length scales linearly with state size, while maximum context length for accurate retrieval scales exponentially.
Details
Motivation: To address information interference in recurrent architectures like Mamba and RWKV, where fixed-size states cause performance degradation and incoherent outputs beyond certain context lengths due to inability to effectively forget earlier tokens.
Method: Analyzed Mamba-based models’ forgetting capabilities, investigated the relationship between training context length and state size, and measured how minimum training length and maximum context length for accurate retrieval scale with state size.
Result: Models fail to learn effective forgetting when trained on contexts shorter than state size. Minimum training length required for learning forgetting scales linearly with state size, while maximum context length for accurate passkey retrieval scales exponentially with state size.
Conclusion: Current RNN architectures have critical limitations in long-context modeling. Future designs must account for interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
Abstract: Recent advancements in recurrent architectures, such as Mamba and RWKV, have showcased strong language capabilities. Unlike transformer-based models, these architectures encode all contextual information into a fixed-size state, leading to great inference efficiency. However, this approach can cause information interference, where different token data conflicts, resulting in performance degradation and incoherent outputs beyond a certain context length. To prevent this, most RNNs incorporate mechanisms designed to “forget” earlier tokens. In this paper, we reveal that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We demonstrate that this issue stems from training on contexts that are too short for the state size, enabling the model to perform well without needing to learn how to forget. Then, we show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size, indicating that the model retains some information beyond the point where forgetting begins. These findings highlight a critical limitation in current RNN architectures and provide valuable insights for improving long-context modeling. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
[118] Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning
Bokai Hu, Sai Ashish Somayajula, Xin Pan, Pengtao Xie
Main category: cs.CL
TL;DR: Using Proximal Policy Optimization (PPO) to fine-tune LLMs for NLU tasks outperforms supervised fine-tuning and prompting methods, achieving significant improvements on benchmarks like GLUE and surpassing GPT-4o on specific tasks.
Details
Motivation: Instruction-fine-tuned LLMs under 14B parameters underperform on NLU tasks, often trailing smaller models like BERT-base, motivating the exploration of reinforcement learning methods to improve their NLU capabilities.
Method: Frame NLU as a reinforcement learning environment where token generation is treated as a sequence of actions, and use Proximal Policy Optimization (PPO) to optimize for reward signals based on alignment with ground-truth labels.
Result: PPO consistently outperforms supervised fine-tuning with 6.3 point average improvement on GLUE, surpasses zero-shot and few-shot prompting by 38.7 and 26.1 points respectively, and beats GPT-4o by over 4% on sentiment and NLI tasks.
Conclusion: Reframing NLU tasks as reinforcement learning problems enables effective adaptation of LLMs using simple end-task rewards rather than extensive data curation, showing promising direction for task adaptation.
Abstract: Instruction-fine-tuned large language models (LLMs) under 14B parameters continue to underperform on natural language understanding (NLU) tasks, often trailing smaller models like BERT-base on benchmarks such as GLUE and SuperGLUE. Motivated by the success of reinforcement learning in reasoning tasks (e.g., DeepSeek), we explore Proximal Policy Optimization (PPO) as a framework to improve the NLU capabilities of LLMs. We frame NLU as a reinforcement learning environment, treating token generation as a sequence of actions and optimizing for reward signals based on alignment with ground-truth labels. PPO consistently outperforms supervised fine-tuning, yielding an average improvement of 6.3 points on GLUE, and surpasses zero-shot and few-shot prompting by 38.7 and 26.1 points, respectively. Notably, PPO-tuned models outperform GPT-4o by over 4% on average across sentiment and natural language inference tasks, including gains of 7.3% on the Mental Health dataset and 10.9% on SIGA-nli. This work highlights a promising direction for adapting LLMs to new tasks by reframing them as reinforcement learning problems, enabling learning through simple end-task rewards rather than extensive data curation.
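As a concrete illustration of the reward signal, here is a minimal sketch (an assumption consistent with the abstract, not the paper's released code) of an end-task reward that scores a generated label against the ground truth:

```python
# Hedged sketch: token generation is the action sequence, and the
# episode reward is based on alignment with the gold label. An exact
# string match after normalization is one simple choice of reward.
def nlu_reward(generated_text: str, gold_label: str) -> float:
    pred = generated_text.strip().lower()
    return 1.0 if pred == gold_label.strip().lower() else 0.0

print(nlu_reward(" Entailment", "entailment"))  # 1.0
print(nlu_reward("neutral", "entailment"))      # 0.0
```

A standard PPO trainer would then maximize the expected reward of sampled completions for each NLU prompt.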
[119] Vulnerability of LLMs to Vertically Aligned Text Manipulations
Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Zhen Xiong, Nanyun Peng, Kai-wei Chang
Main category: cs.CL
TL;DR: Vertical text input significantly degrades LLM performance in text classification tasks, and CoT reasoning doesn’t help mitigate this vulnerability.
Details
Motivation: To investigate if decoder-based LLMs are vulnerable to vertically formatted text input, similar to encoder-based models, and understand the underlying causes.
Method: Analyzed impact of vertical text input on various LLMs across multiple text classification datasets, examined tokenization and attention matrices, and tested mitigation strategies like CoT and few-shot learning.
Result: Vertical text input significantly reduces LLM accuracy in text classification; CoT doesn’t help but few-shot learning with careful analysis does; vulnerability stems from tokenization and attention matrix issues.
Conclusion: LLMs are vulnerable to vertical text formatting, posing risks in real-world applications, and require specific mitigation strategies beyond standard reasoning approaches.
Abstract: Vertical text input is commonly encountered in various real-world applications, such as mathematical computations and word-based Sudoku puzzles. While current large language models (LLMs) have excelled in natural language tasks, they remain vulnerable to variations in text formatting. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks. While easily understood by humans, these inputs can significantly mislead models, posing a potential risk of bypassing detection in real-world scenarios involving harmful or sensitive information. With the expanding application of LLMs, a crucial question arises: Do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input? In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. Our findings are as follows: (i) Vertical text input significantly degrades the accuracy of LLMs in text classification tasks. (ii) Chain-of-Thought (CoT) reasoning does not help LLMs recognize vertical input or mitigate its vulnerability, but few-shot learning with careful analysis does. (iii) We explore the underlying cause of the vulnerability by analyzing the inherent issues in tokenization and attention matrices.
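For readers unfamiliar with the input format, here is a tiny sketch (our illustration, not the paper's exact preprocessing) of vertically aligning a word:

```python
# One character per line: trivial for humans to read, but it fragments
# the model's token stream, which the paper links to tokenization and
# attention-matrix issues.
def verticalize(word: str) -> str:
    return "\n".join(word)

print(verticalize("sudoku"))
# s
# u
# d
# o
# k
# u
```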
[120] Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling
Florian Eichin, Carolin M. Schuster, Georg Groh, Michael A. Hedderich
Main category: cs.CL
TL;DR: Semantic Component Analysis (SCA) is a new topic modeling technique that efficiently scales to large datasets and discovers multiple topics per document, outperforming existing methods like BERTopic and TopicGPT.
Details
Motivation: Existing topic modeling approaches either fail to scale efficiently to large datasets or are limited by assuming only one topic per document, which restricts their practical application.
Method: SCA introduces a decomposition step to the clustering-based topic modeling framework, allowing it to discover multiple topics per sample by breaking down documents into semantic components.
Result: SCA achieves competitive coherence and diversity compared to BERTopic while uncovering at least double the topics and maintaining near-zero noise rate. It also outperforms LLM-based TopicGPT with similar compute budgets.
Conclusion: SCA provides an effective and efficient approach for topic modeling of large datasets, overcoming key limitations of existing methods.
Abstract: Topic modeling is a key method in text analysis, but existing approaches fail to efficiently scale to large datasets or are limited by assuming one topic per document. Overcoming these limitations, we introduce Semantic Component Analysis (SCA), a topic modeling technique that discovers multiple topics per sample by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. There, it achieves competitive coherence and diversity compared to BERTopic, while uncovering at least double the topics and maintaining a noise rate close to zero. We also find that SCA outperforms the LLM-based TopicGPT in scenarios with similar compute budgets. SCA thus provides an effective and efficient approach for topic modeling of large datasets.
[121] $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
Apoorv Khandelwal, Tian Yun, Nihal V. Nayak, Jack Merullo, Stephen H. Bach, Chen Sun, Ellie Pavlick
Main category: cs.CL
TL;DR: Academic researchers can pre-train models despite limited compute resources by optimizing training configurations and using fewer GPUs over longer periods.
Details
Motivation: To challenge the assumption that academic researchers cannot pre-train models due to compute limitations and provide practical guidance for efficient pre-training on academic resources.
Method: Surveyed academic compute resources, created a benchmark to measure pre-training time on various GPUs, identified optimal training settings, and conducted experiments using 2,000 GPU-hours across different models and academic GPUs.
Result: Found that models like Pythia-1B can be replicated with 3x fewer GPU-days (4 GPUs in 18 days vs original 64 GPUs in 3 days) using optimized configurations.
Conclusion: Academic pre-training is feasible with proper optimization, and the benchmark provides guidance for researchers to conduct larger-scale training experiments despite resource constraints.
Abstract: Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can’t pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: https://github.com/apoorvkh/academic-pretraining.
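The headline numbers imply a simple GPU-day calculation, reproduced here for clarity (figures taken directly from the abstract):

```latex
\[
\underbrace{64 \times 3}_{\text{original}} = 192\ \text{GPU-days},
\qquad
\underbrace{4 \times 18}_{\text{replication}} = 72\ \text{GPU-days},
\qquad
\frac{192}{72} \approx 2.7,
\]
```

i.e. roughly the reported 3x fewer GPU-days.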
[122] AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning
Amy Xin, Jinxin Liu, Zijun Yao, Zhicheng Lee, Shulin Cao, Lei Hou, Juanzi Li
Main category: cs.CL
TL;DR: AtomR is a framework that enables LLMs to perform accurate heterogeneous knowledge reasoning using atomic-level operators, outperforming state-of-the-art methods on complex QA tasks.
Details
Motivation: LLMs struggle with compositional reasoning and hallucination in knowledge-intensive tasks. Existing CoT+RAG approaches have inadequate reasoning planning and poor integration of heterogeneous knowledge sources.
Method: Proposes three atomic knowledge operators for retrieving and manipulating heterogeneous knowledge. First decomposes questions into fine-grained reasoning trees with atomic operators, then executes each operator to flexibly retrieve and operate atomic-level knowledge from multiple sources.
Result: Significant improvements over state-of-the-art baselines: 9.4% F1 improvement on 2WikiMultihop and 9.5% on BlendQA benchmark. Also introduces BlendQA, a challenging benchmark for heterogeneous knowledge reasoning.
Conclusion: AtomR effectively addresses limitations in compositional reasoning and knowledge integration through atomic-level operators, demonstrating superior performance on complex knowledge reasoning tasks across multiple datasets.
Abstract: Despite the outstanding capabilities of large language models (LLMs), knowledge-intensive reasoning still remains a challenging task due to LLMs’ limitations in compositional reasoning and the hallucination problem. A prevalent solution is to employ chain-of-thought (CoT) with retrieval-augmented generation (RAG), which first formulates a reasoning plan by decomposing complex questions into simpler sub-questions, and then applies iterative RAG at each sub-question. However, prior works exhibit two crucial problems: inadequate reasoning planning and poor incorporation of heterogeneous knowledge. In this paper, we introduce AtomR, a framework for LLMs to conduct accurate heterogeneous knowledge reasoning at the atomic level. Inspired by how knowledge graph query languages model compositional reasoning through combining predefined operations, we propose three atomic knowledge operators, a unified set of operators for LLMs to retrieve and manipulate knowledge from heterogeneous sources. First, in the reasoning planning stage, AtomR decomposes a complex question into a reasoning tree where each leaf node corresponds to an atomic knowledge operator, achieving question decomposition that is highly fine-grained and orthogonal. Subsequently, in the reasoning execution stage, AtomR executes each atomic knowledge operator, which flexibly selects, retrieves, and operates atomic level knowledge from heterogeneous sources. We also introduce BlendQA, a challenging benchmark specially tailored for heterogeneous knowledge reasoning. Experiments on three single-source and two multi-source datasets show that AtomR outperforms state-of-the-art baselines by a large margin, with F1 score improvements of 9.4% on 2WikiMultihop and 9.5% on BlendQA. We release our code and datasets.
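The reasoning tree is easy to picture as a recursive structure; this sketch (operator names are placeholders, since the abstract does not enumerate the three operators) shows post-order execution over leaves:

```python
# Hedged sketch of AtomR-style execution: each leaf of the reasoning
# tree is one atomic knowledge operator; children are resolved first so
# a parent operator sees its sub-answers. run_operator would dispatch
# to heterogeneous sources (KB, web, text) -- an assumed callable.
from dataclasses import dataclass, field

@dataclass
class Node:
    operator: str                    # placeholder name, e.g. "retrieve"
    argument: str                    # the sub-question for this node
    children: list["Node"] = field(default_factory=list)

    def execute(self, run_operator) -> str:
        sub_answers = [c.execute(run_operator) for c in self.children]
        return run_operator(self.operator, self.argument, sub_answers)
```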
[123] Can LLMs be Good Graph Judge for Knowledge Graph Construction?
Haoyu Huang, Chong Chen, Zeang Sheng, Yang Li, Wentao Zhang
Main category: cs.CL
TL;DR: GraphJudge is a framework that addresses noise, domain-specific inaccuracies, and hallucinations in knowledge graph construction from unstructured data by using entity-centric noise elimination and a fine-tuned LLM as a graph judge.
Details
Motivation: Real-world IR systems produce unstructured data with noise, domain-specific inaccuracies, and LLM hallucinations that hinder accurate knowledge graph construction from documents.
Method: Proposes GraphJudge framework with entity-centric strategy to eliminate noise and fine-tunes an LLM as a graph judge to enhance KG quality.
Result: Achieves state-of-the-art performance on two general and one domain-specific text-graph pair datasets with strong generalization abilities.
Conclusion: GraphJudge effectively addresses key challenges in KG construction from unstructured data through noise elimination and quality enhancement via a specialized graph judge LLM.
Abstract: In real-world scenarios, most of the data obtained from the information retrieval (IR) system is unstructured. Converting natural language sentences into structured Knowledge Graphs (KGs) remains a critical challenge. We identified three limitations with respect to existing KG construction methods: (1) There could be a large amount of noise in real-world documents, which could result in extracting messy information. (2) Naive LLMs usually extract inaccurate knowledge from some domain-specific documents. (3) The hallucination phenomenon cannot be overlooked when directly using LLMs to construct KGs. In this paper, we propose GraphJudge, a KG construction framework to address the aforementioned challenges. In this framework, we designed an entity-centric strategy to eliminate the noise information in the documents. We then fine-tuned an LLM as a graph judge to enhance the quality of the generated KGs. Experiments conducted on two general and one domain-specific text-graph pair datasets demonstrate state-of-the-art performance against various baseline methods with strong generalization abilities. Our code is available at https://github.com/hhy-huang/GraphJudge.
[124] Demystifying Domain-adaptive Post-training for Financial LLMs
Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Main category: cs.CL
TL;DR: FINDAP is a systematic framework for domain-adaptive post-training of LLMs in finance, featuring four components: FinCap (capability definition), FinRec (training recipe), FinTrain (datasets), and FinEval (evaluation). The resulting Llama-Fin model achieves SOTA performance.
Details
Motivation: Domain-adaptive post-training shows promise for specialized domains like finance, but challenges remain in identifying optimal adaptation criteria and training strategies across different data and model configurations.
Method: Four-component framework: FinCap defines required capabilities, FinRec provides training recipe with continual pre-training and instruction-following plus preference data distillation using generative reward model signals, FinTrain offers curated datasets, and FinEval provides comprehensive evaluation suite.
Result: Llama-Fin achieves state-of-the-art performance across a wide range of financial tasks. Analysis shows how each post-training stage contributes to distinct capabilities and reveals specific challenges and effective solutions.
Conclusion: The FINDAP framework provides valuable insights for domain adaptation of LLMs, demonstrating systematic approaches to address challenges in specialized domain post-training.
Abstract: Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation into domain-adaptive post-training of LLMs for the finance domain. Our approach consists of four key components: FinCap, which defines the core capabilities required for the target domain; FinRec, an effective training recipe that jointly optimizes continual pre-training and instruction-following, along with a novel preference data distillation method leveraging process signals from a generative reward model; FinTrain, a curated set of training datasets supporting FinRec; and FinEval, a comprehensive evaluation suite aligned with FinCap. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks. Our analysis also highlights how each post-training stage contributes to distinct capabilities, uncovering specific challenges and effective solutions, providing valuable insights for domain adaptation of LLMs.
[125] Labeling Free-text Data using Language Model Ensembles
Jiaxing Qiu, Dongliang Guo, Natalie Papini, Noelle Peace, Hannah F. Fitterman-Harris, Cheri A. Levinson, Tom Hartvigsen, Teague R. Henry
Main category: cs.CL
TL;DR: Proposes an ensemble framework using locally-deployable LLMs for labeling free-text data under privacy constraints, achieving better accuracy than individual models.
Details
Motivation: Free-text data in psychological studies provides rich insights but manual labeling is labor-intensive. Closed-source LLMs cannot be used due to privacy concerns about external data sharing.
Method: Assembles diverse open-source LLMs in an ensemble approach that balances agreement and disagreement, using relevancy scoring based on embedding distances between topic descriptions and LLM reasoning.
Result: Ensemble approach achieved highest accuracy and optimal precision-sensitivity trade-off compared to individual LLMs. Relevancy scoring effectively mitigated LLM heterogeneity.
Conclusion: The ensemble framework with locally-deployable LLMs provides an effective solution for privacy-preserving labeling of free-text data in psychological research.
Abstract: Free-text responses are commonly collected in psychological studies, providing rich qualitative insights that quantitative measures may not capture. Labeling curated topics of research interest in free-text data by multiple trained human coders is typically labor-intensive and time-consuming. Though large language models (LLMs) excel in language processing, LLM-assisted labeling techniques relying on closed-source LLMs cannot be directly applied to free-text data, without explicit consent for external use. In this study, we propose a framework of assembling locally-deployable LLMs to enhance the labeling of predetermined topics in free-text data under privacy constraints. Analogous to annotation by multiple human raters, this framework leverages the heterogeneity of diverse open-source LLMs. The ensemble approach seeks a balance between the agreement and disagreement across LLMs, guided by a relevancy scoring methodology that utilizes embedding distances between topic descriptions and LLMs’ reasoning. We evaluated the ensemble approach using both publicly accessible Reddit data from eating disorder related forums, and free-text responses from eating disorder patients, both complemented by human annotations. We found that: (1) there is heterogeneity in the performance of labeling among same-sized LLMs, with some showing low sensitivity but high precision, while others exhibit high sensitivity but low precision. (2) Compared to individual LLMs, the ensemble of LLMs achieved the highest accuracy and optimal precision-sensitivity trade-off in predicting human annotations. (3) The relevancy scores across LLMs showed greater agreement than dichotomous labels, indicating that the relevancy scoring method effectively mitigates the heterogeneity in LLMs’ labeling.
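One way to read the relevancy-scoring idea as code (a simplified sketch under assumptions: the embeddings come from an external model, and weighting votes by normalized relevancy is our illustration, not the paper's exact aggregation):

```python
# Hedged sketch: each LLM emits a 0/1 label plus free-text reasoning;
# the embedding similarity between that reasoning and the topic
# description scores how relevant the vote is, and votes are then
# aggregated with relevancy weights.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ensemble_label(topic_emb: np.ndarray,
                   reasoning_embs: list[np.ndarray],
                   votes: list[int], threshold: float = 0.5) -> int:
    scores = np.array([cosine(topic_emb, r) for r in reasoning_embs])
    weights = scores / scores.sum()          # relevancy-normalized weights
    return int(float(weights @ np.array(votes)) >= threshold)
```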
[126] Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Main category: cs.CL
TL;DR: This paper extends process reward models (PRMs) to multilingual settings by training on translated datasets across 7 languages, showing improved reasoning accuracy and reduced early-stage errors across 11 languages.
Details
Motivation: Current process reward modeling for complex reasoning tasks primarily focuses on English, creating a gap in multilingual applications. The authors aim to address this limitation by developing multilingual PRMs.
Method: Trained multilingual PRMs on a dataset spanning 7 languages (translated from English) and evaluated on two reasoning benchmarks across 11 languages. Analyzed sensitivity to training languages, English data volume, candidate responses, and model parameters.
Result: Multilingual PRMs improved average accuracy and reduced early-stage reasoning errors across multiple languages. Performance was sensitive to training language diversity and English data volume, with benefits from more candidate responses and trainable parameters.
Conclusion: The work successfully extends process reward modeling to multilingual settings, opening avenues for robust multilingual applications in complex reasoning tasks. The authors release code to support further research.
Abstract: Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits arising from more candidate responses and trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi-step reasoning tasks. In addition, we release the code to foster research along this line.
[127] Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
Hiba Ahsan, Arnab Sen Sharma, Silvio Amir, David Bau, Byron C. Wallace
Main category: cs.CL
TL;DR: Using mechanistic interpretability to identify and manipulate sociodemographic biases in LLMs for healthcare applications.
Details
Motivation: LLMs encode social biases that manifest in clinical tasks, and there's a need to understand and address these biases in healthcare contexts.
Method: Applied mechanistic interpretability tools to identify activations encoding sociodemographic information (gender, race) in LLMs, using MLP layer analysis and inference-time patching interventions.
Result: Gender information is highly localized in MLP layers and can be reliably manipulated via patching, altering clinical vignettes and downstream predictions like depression risk. Race representation is more distributed but also intervenable.
Conclusion: This is the first application of mechanistic interpretability to LLMs in healthcare, demonstrating the ability to identify and surgically alter sociodemographic biases in clinical contexts.
Abstract: We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
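For orientation, inference-time patching of an MLP activation looks roughly like this in PyTorch (a generic sketch, not the authors' code; the layer index, tensor shapes, and cached counterfactual run are assumptions):

```python
# Hedged sketch: a forward hook replaces an MLP layer's output at
# selected token positions with an activation cached from a
# counterfactual prompt (e.g. the same vignette with gender swapped).
import torch

def make_patch_hook(cached: torch.Tensor, positions: list[int]):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, positions, :] = cached[:, positions, :]
        return patched  # returning a tensor overrides the layer output
    return hook

# handle = model.model.layers[15].mlp.register_forward_hook(
#     make_patch_hook(cached_activation, positions=[3, 4]))
# ... generate text, then: handle.remove()
```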
[128] LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization
Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu
Main category: cs.CL
TL;DR: LoRA-MGPO introduces momentum-guided perturbation optimization to stabilize LoRA fine-tuning by mitigating the double descent phenomenon and avoiding sharp local minima, achieving better performance than standard LoRA.
Details
Motivation: Standard LoRA exhibits unstable double descent phenomenon with increasing rank, causing transient divergence in training loss, delayed convergence, and impaired generalization due to attraction to sharp local minima.
Method: LoRA-MGPO incorporates Momentum-Guided Perturbation Optimization (MGPO) that uses momentum vectors from the optimizer’s state to guide weight perturbations without dual gradient computations, plus adaptive normalization using an EMA of gradient norms to scale perturbation magnitudes.
Result: Experiments on natural language understanding and generation benchmarks show LoRA-MGPO consistently achieves superior performance over LoRA and other PEFT methods, with smoother loss curves, faster convergence, and improved generalization.
Conclusion: LoRA-MGPO effectively stabilizes training dynamics by mitigating double descent phenomenon and avoiding sharp minima, leading to more stable optimization and better fine-tuning performance.
Abstract: Parameter-efficient fine-tuning (PEFT), particularly Low-Rank Adaptation (LoRA), adapts large language models (LLMs) by training only a small fraction of parameters. However, as the rank of the low-rank matrices used for adaptation increases, LoRA often exhibits an unstable “double descent” phenomenon, characterized by transient divergence in the training loss, which delays convergence and impairs generalization by causing instability due to the attraction to sharp local minima. To address this, we introduce LoRA-MGPO, a framework that incorporates Momentum-Guided Perturbation Optimization (MGPO). MGPO stabilizes training dynamics by mitigating the double descent phenomenon and guiding weight perturbations using momentum vectors from the optimizer’s state, thus avoiding dual gradient computations. Additionally, an adaptive normalization scheme scales the magnitude of perturbations based on an exponential moving average (EMA) of gradient norms, further enhancing stability. While EMA controls the magnitude of the perturbations, MGPO guides their direction, ensuring a more stable optimization trajectory. Experiments on a suite of natural language understanding and generation benchmarks show that LoRA-MGPO consistently achieves superior performance over LoRA and other PEFT methods. The analysis indicates that LoRA-MGPO leads to smoother loss curves, faster convergence, and improved generalization by stabilizing the training process and mitigating the attraction to sharp minima.
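Read as pseudocode, the perturbation step might look like the following sketch (our reading of the abstract; the base scale `rho` and the exact normalization are assumptions):

```python
# Hedged sketch of momentum-guided perturbation: the direction comes
# from the optimizer's momentum buffer (so no second gradient pass,
# unlike SAM-style methods), and the magnitude is scaled by an EMA of
# gradient norms.
import torch

def mgpo_perturbation(momentum: torch.Tensor, ema_grad_norm: float,
                      rho: float = 0.05) -> torch.Tensor:
    direction = momentum / (momentum.norm() + 1e-12)
    return rho * ema_grad_norm * direction

# Schematic loop: update ema as beta * ema + (1 - beta) * grad_norm,
# add the perturbation to the LoRA weights, compute the loss at the
# perturbed point, then apply the usual optimizer update.
```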
[129] RuCCoD: Towards Automated ICD Coding in Russian
Aleksandr Nesterov, Andrey Sakhovskiy, Ivan Sviridov, Airat Valiev, Vladimir Makharev, Petr Anokhin, Galina Zubkova, Elena Tutubalina
Main category: cs.CL
TL;DR: This paper presents a new Russian ICD coding dataset and demonstrates that automated clinical coding using state-of-the-art models can outperform manual physician annotations in accuracy.
Details
Motivation: To address the challenge of automating clinical coding in Russian, a language with limited biomedical resources, and improve clinical efficiency and data accuracy.
Method: Created a new Russian ICD coding dataset with EHR diagnosis fields, benchmarked BERT, LLaMA with LoRA, and RAG models, conducted transfer learning experiments across domains and terminologies, and applied the best model to label in-house EHR data.
Result: Training with automated predicted codes significantly improved accuracy compared to manually annotated data from physicians on a curated test set.
Conclusion: Automated clinical coding shows strong potential for resource-limited languages like Russian, offering enhanced clinical efficiency and data accuracy.
Abstract: This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, which includes diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. This dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG, with additional experiments examining transfer learning across domains (from PubMed abstracts to medical diagnosis) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments, conducted on a carefully curated test set, demonstrate that training with the automated predicted codes leads to a significant improvement in accuracy compared to manually annotated data from physicians. We believe our findings offer valuable insights into the potential for automating clinical coding in resource-limited languages like Russian, which could enhance clinical efficiency and data accuracy in these contexts. Our code and dataset are available at https://github.com/auto-icd-coding/ruccod.
[130] How LLMs Fail to Support Fact-Checking
Adiba Mahbub Proma, Neeley Pate, James Druckman, Gourab Ghoshal, Hangfeng He, Ehsan Hoque
Main category: cs.CL
TL;DR: LLMs struggle to effectively counter political misinformation through prompt-engineering alone, showing biases in source selection and limited response diversity.
Details
Motivation: To empirically study LLMs' capabilities in countering political misinformation, as they can both amplify and potentially tackle misinformation.
Method: Two-step chain-of-thought prompting approach where models first identify credible sources for claims, then generate persuasive responses using three LLMs (ChatGPT, Gemini, Claude).
Result: Models struggle to ground responses in real news sources, prefer citing left-leaning sources, and show varying response diversity.
Conclusion: Prompt-engineering alone is insufficient for LLM-based fact-checking; more robust guardrails are needed for both researchers and non-technical users.
Abstract: While Large Language Models (LLMs) can amplify online misinformation, they also show promise in tackling misinformation. In this paper, we empirically study the capabilities of three LLMs – ChatGPT, Gemini, and Claude – in countering political misinformation. We implement a two-step, chain-of-thought prompting approach, where models first identify credible sources for a given claim and then generate persuasive responses. Our findings suggest that models struggle to ground their responses in real news sources, and tend to prefer citing left-leaning sources. We also observe varying degrees of response diversity among models. Our findings highlight concerns about using LLMs for fact-checking through only prompt-engineering, emphasizing the need for more robust guardrails. Our results have implications for both researchers and non-technical users.
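The two-step pipeline is straightforward to reproduce; this sketch mirrors the setup described (prompt wording is ours, and `llm` is an assumed text-in/text-out callable):

```python
# Hedged sketch of the two-step chain-of-thought pipeline: first elicit
# credible sources for the claim, then generate a persuasive response
# conditioned on those sources.
def counter_misinformation(llm, claim: str) -> str:
    sources = llm(
        f"List credible news sources with evidence relevant to this "
        f"claim: {claim}")
    return llm(
        f"Claim: {claim}\nSources: {sources}\n"
        "Write a persuasive, factual response grounded in the sources "
        "above.")
```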
[131] Adaptively profiling models with task elicitation
Davis Brown, Prithvi Balehannina, Helen Jin, Shreya Havaldar, Hamed Hassani, Eric Wong
Main category: cs.CL
TL;DR: Task elicitation automatically builds evaluations to profile model behavior, finding hundreds of natural-language tasks where frontier models exhibit systematic failures.
Details
Motivation: Language model evaluations often fail to characterize consequential failure modes, forcing experts to manually inspect outputs and build new benchmarks.
Method: Task elicitation, an automated method that builds new evaluations to profile model behavior by finding natural-language tasks where models exhibit systematic failures.
Result: The method found hundreds of natural-language tasks (order of magnitude more than prior work) where frontier models show systematic failures in domains like forecasting and online harassment. Specific examples include Sonnet 3.5 over-associating quantum computing with AGI, and o3-mini hallucinating when fabrications are repeated in-context.
Conclusion: Task elicitation is an effective automated approach for discovering systematic model failures across diverse domains, significantly expanding evaluation coverage beyond manual benchmark creation.
Abstract: Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks – an order of magnitude more than prior work – where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.
[132] Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent
Xingzuo Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yong Xu, Min Zhang
Main category: cs.CL
TL;DR: GA-Rollback is a novel framework that addresses the one-pass error propagation issue in LLM agents by using a generator-assistant architecture with rollback capability for incorrect actions.
Details
Motivation: Current LLM agents suffer from irreversible error propagation in step-by-step reasoning, where incorrect intermediate thoughts are permanently incorporated into the trajectory.
Method: Uses a generator to interact with the environment and an assistant to examine each action, triggering rollback operations when incorrect actions are detected, with additional strategies for rollback scenarios.
Result: Achieves significant improvements over strong baselines on three widely used benchmarks and functions as a robust plug-and-play module.
Conclusion: GA-Rollback effectively mitigates error propagation in LLM agents through its generator-assistant rollback mechanism.
Abstract: Large language model (LLM) agents typically adopt a step-by-step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep-rooted one-pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address the issue, this paper proposes a novel framework called Generator-Assistant Stepwise Rollback (GA-Rollback) to induce better decision-making for LLM agents. Particularly, GA-Rollback utilizes a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detection of incorrect actions. Moreover, we introduce two additional strategies tailored for the rollback scenario to further improve its effectiveness. Extensive experiments show that GA-Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA-Rollback can function as a robust plug-and-play module, integrating seamlessly with other methods.
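Schematically, the generator-assistant loop can be written as below (a sketch under assumptions: `generator`, `assistant`, and `env` are hypothetical interfaces, and rollback is shown in its simplest form, discarding a rejected action before it enters the trajectory):

```python
# Hedged sketch: the assistant vets each proposed action; a rejected
# action is rolled back (never committed) and the generator retries,
# preventing one-pass error propagation. In this simplified version,
# if all retries are rejected the last proposal is executed anyway.
def ga_rollback(generator, assistant, env, max_retries: int = 3):
    trajectory, obs, done = [], env.reset(), False
    while not done:
        for _ in range(max_retries):
            action = generator(obs, trajectory)
            if assistant(obs, trajectory, action):
                break              # approved: commit this action
            # otherwise roll back the proposal and regenerate
        obs, done = env.step(action)
        trajectory.append(action)
    return trajectory
```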
[133] Improving LLM-as-a-Judge Inference with the Judgment Distribution
Victor Wang, Michael J. Q. Zhang, Eunsol Choi
Main category: cs.CL
TL;DR: Using distributional outputs from LLM judges (taking the mean) outperforms greedy decoding, and risk-averse methods further improve performance, while chain-of-thought prompting can harm performance by collapsing judgment distributions.
Details
Motivation: Current LLM-as-a-judge approaches typically use greedy decoding from textual outputs, but LLMs naturally provide probability distributions over judgment tokens, suggesting better inference methods could extract more fine-grained preferences.
Method: Compare different inference methods for extracting preferences from LLM judgment distributions, including taking the mean vs mode, risk-averse approaches, and analyzing chain-of-thought prompting effects on distribution spread.
Result: Taking the mean of judgment distributions consistently outperforms greedy decoding across all evaluation settings (pointwise, pairwise, listwise). Risk-averse methods often improve performance, while chain-of-thought prompting collapses distribution spread and harms performance.
Conclusion: Leveraging distributional outputs from LLM judges provides better performance than using the text interface alone, with mean-based inference and risk-averse methods being particularly effective.
Abstract: Using language models to scalably approximate human preferences on text quality (LLM-as-a-judge) has become a standard practice applicable to many tasks. A judgment is often extracted from the judge’s textual output alone, typically with greedy decoding. However, LLM judges naturally provide distributions over judgment tokens, inviting a breadth of inference methods for extracting fine-grained preferences. We find that taking the mean of the judgment distribution consistently outperforms taking the mode (i.e. greedy decoding) in all evaluation settings (i.e. pointwise, pairwise, and listwise). We further explore novel methods of deriving preferences from judgment distributions, and find that methods incorporating risk aversion often improve performance. Lastly, we analyze LLM-as-a-judge paired with chain-of-thought (CoT) prompting, showing that CoT can collapse the spread of the judgment distribution, often harming performance. Our findings show that leveraging distributional output improves LLM-as-a-judge, as opposed to using the text interface alone.
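The core contrast between mode and mean is a one-liner; this worked sketch shows how the two can disagree (the score-token probabilities are made up for illustration):

```python
# Greedy decoding takes the argmax score token (the mode); the paper's
# preferred estimator takes the probability-weighted mean over the
# judge's distribution on score tokens.
def mean_judgment(token_probs: dict[str, float]) -> float:
    total = sum(token_probs.values())          # renormalize over score tokens
    return sum(int(t) * p for t, p in token_probs.items()) / total

probs = {"1": 0.05, "2": 0.10, "3": 0.20, "4": 0.40, "5": 0.25}
print(max(probs, key=probs.get))               # mode: "4"
print(round(mean_judgment(probs), 2))          # mean: 3.7 -- finer-grained
```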
[134] InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang
Main category: cs.CL
TL;DR: InftyThink introduces an iterative reasoning paradigm with intermediate summarization to overcome limitations of long-context reasoning, enabling unbounded reasoning depth with bounded computational costs.
Details
Motivation: Current long-context reasoning faces quadratic computational scaling, context boundary constraints, and performance degradation beyond pre-training windows. Existing compression methods don't solve the fundamental scaling problem.
Method: Transform monolithic reasoning into iterative process with short reasoning segments interleaved with concise progress summaries, creating a sawtooth memory pattern. Reconstruct long-context datasets into iterative format.
Result: Reduces computational costs while improving performance - Qwen2.5-Math-7B shows 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Created 333K training instances from OpenR1-Math.
Conclusion: Challenges the assumed trade-off between reasoning depth and computational efficiency, providing scalable approach to complex reasoning without architectural modifications.
Abstract: Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
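The iterative paradigm is essentially a loop over (reason briefly, summarize, repeat); here is a minimal sketch (the prompt format, stop markers, and `llm` callable are assumptions, not the paper's implementation):

```python
# Hedged sketch: each call sees only the question plus the latest
# summary, so context stays bounded while reasoning depth is unbounded
# -- the "sawtooth" memory pattern the abstract describes.
def inftythink(llm, question: str, max_rounds: int = 8) -> str:
    summary = ""
    for _ in range(max_rounds):
        out = llm(
            f"Question: {question}\nProgress so far: {summary}\n"
            "Continue reasoning briefly. End with either "
            "'SUMMARY: <updated summary>' or 'ANSWER: <final answer>'.")
        if "ANSWER:" in out:
            return out.split("ANSWER:")[-1].strip()
        summary = out.split("SUMMARY:")[-1].strip()
    return summary  # fall back to the last summary if no answer emerged
```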
[135] Cost-Optimal Grouped-Query Attention for Long-Context Modeling
Yingfa Chen, Yutong Wu, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, Maosong Sun
Main category: cs.CL
TL;DR: Current GQA configurations are suboptimal for long-context scenarios. The paper proposes decoupling head size from hidden size and jointly optimizing model size with GQA configuration, achieving >50% reduction in memory usage and FLOPs without performance loss.
Details
Motivation: Existing GQA configurations overlook how context length affects inference cost, leading to suboptimal performance in long-context scenarios where inference cost grows with context length.
Method: (1) Decouple total head size from hidden size for flexible attention FLOPs control; (2) jointly optimize model size and GQA configuration to better allocate inference resources between attention layers and other components.
Result: Common GQA configurations are highly suboptimal for long contexts. The proposed recipe shows that for long-context scenarios, fewer attention heads with larger model size is optimal, achieving >50% reduction in memory usage and FLOPs compared to Llama-3’s GQA with no performance degradation.
Conclusion: The findings provide valuable insights for designing efficient long-context LLMs, with a recipe for deriving cost-optimal GQA configurations that significantly improve efficiency without compromising model capabilities.
Abstract: Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models (LLMs). However, current GQA configurations are often suboptimal because they overlook how context length influences inference cost. Since inference cost grows with context length, the most cost-efficient GQA configuration should also vary accordingly. In this work, we analyze the relationship among context length, model size, GQA configuration, and model loss, and introduce two innovations: (1) we decouple the total head size from the hidden size, enabling more flexible control over attention FLOPs; and (2) we jointly optimize the model size and the GQA configuration to arrive at a better allocation of inference resources between attention layers and other components. Our analysis reveals that commonly used GQA configurations are highly suboptimal for long-context scenarios. More importantly, we propose a recipe for deriving cost-optimal GQA configurations. Our results show that for long-context scenarios, one should use fewer attention heads while scaling up model size. Configurations selected by our recipe can reduce both memory usage and FLOPs by more than 50% compared to Llama-3’s GQA, with no degradation in model capabilities. Our findings offer valuable insights for designing efficient long-context LLMs. The code is available at https://www.github.com/THUNLP/cost-optimal-gqa.
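A back-of-the-envelope calculation shows why the KV-head count dominates long-context cost (an illustrative sketch; the Llama-3-8B-like numbers are public configuration values, not figures from this paper):

```python
# KV-cache memory scales with context length x layers x KV heads x
# head size, so halving KV heads (or shrinking the total head size,
# which the paper decouples from the hidden size) halves long-context
# memory in direct proportion.
def kv_cache_bytes(ctx_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # leading 2 accounts for keys and values
    return 2 * ctx_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(131072, 32, 8, 128) / 2**30)  # 16.0 GiB at 128K context
print(kv_cache_bytes(131072, 32, 4, 128) / 2**30)  # 8.0 GiB with half the KV heads
```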
[136] Retrieval-Augmented Generation with Hierarchical Knowledge
Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
Main category: cs.CL
TL;DR: HiRAG is a new RAG approach that leverages hierarchical knowledge to improve semantic understanding and structure capturing in indexing and retrieval, outperforming state-of-the-art methods.
Details
Motivation: Existing RAG methods don't adequately utilize the naturally inherent hierarchical knowledge in human cognition, limiting RAG system capabilities.
Method: HiRAG utilizes hierarchical knowledge to enhance semantic understanding and structure capturing capabilities in the indexing and retrieval processes of RAG systems.
Result: Extensive experiments demonstrate that HiRAG achieves significant performance improvements over state-of-the-art baseline methods.
Conclusion: Hierarchical knowledge utilization in RAG systems leads to enhanced performance in domain-specific tasks.
Abstract: Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately utilize the naturally inherent hierarchical knowledge in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over the state-of-the-art baseline methods.
[137] Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang
Main category: cs.CL
TL;DR: VLMs have limited font recognition capabilities, struggle with Stroop effects, and show minimal improvement from few-shot learning or CoT prompting.
Details
Motivation: To investigate if modern VLMs can effectively recognize fonts in fine-grained tasks, given their multimodal capabilities and potential use in real-world scenarios like design materials.
Method: Created Font Recognition Benchmark (FRB) with 15 common fonts in easy (10 sentences) and hard (font names as text, creating a Stroop effect) versions. Evaluated various VLMs on font recognition tasks.
Result: Current VLMs show poor font recognition performance, are easily affected by the Stroop effect, and few-shot learning/CoT prompting provide minimal accuracy improvements.
Conclusion: VLMs have inherent limitations in capturing semantic features for font recognition, highlighting their current shortcomings in fine-grained visual tasks.
Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the Stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
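The hard version's text/appearance mismatch is straightforward to reproduce; a rendering sketch with a hypothetical font path:

```python
from PIL import Image, ImageDraw, ImageFont

def render_stroop_sample(font_name_text: str, render_font_path: str,
                         out_path: str, size: int = 48) -> None:
    """Draw the *name* of one font using the glyphs of a *different* font,
    creating the Stroop-style conflict between what the text says and how
    it looks."""
    font = ImageFont.truetype(render_font_path, size)
    img = Image.new("RGB", (800, 120), "white")
    ImageDraw.Draw(img).text((20, 30), font_name_text, font=font, fill="black")
    img.save(out_path)

# e.g. the string "Times New Roman" rendered in Helvetica glyphs:
# render_stroop_sample("Times New Roman", "/fonts/Helvetica.ttf", "sample.png")
```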
[138] CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives
Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang
Main category: cs.CL
TL;DR: CLASH dataset enables study of value-based decision-making in high-stakes dilemmas, revealing LLMs’ limitations in handling ambivalent decisions, value shifts, and showing new failure patterns in value reasoning.
Details
Motivation: To address the gap in AI's ability to navigate conflicting values in high-stakes domains, which is challenging even for humans, by creating a specialized dataset for studying value-based decision-making.
Method: Created CLASH dataset with 345 high-impact dilemmas and 3,795 individual perspectives of diverse values, then benchmarked 14 non-thinking and thinking models on understanding decision ambivalence, psychological discomfort, and value shifts.
Result: Key findings: (1) Strong models struggle with ambivalent decisions (GPT-5: 24.06%, Claude-4-Sonnet: 51.01% accuracy), (2) LLMs predict discomfort but not value shifts, (3) Cognitive behaviors don’t transfer to value reasoning, (4) Steerability correlates with value preferences, (5) Third-party reasoning increases steerability.
Conclusion: LLMs face significant challenges in high-stakes value reasoning, exhibiting specific failure patterns and limited understanding of value dynamics, highlighting the need for specialized approaches to value-based decision-making in AI.
Abstract: Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes, including understanding of decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in the perspectives of characters. By benchmarking 14 non-thinking and thinking models, we uncover several key findings. (1) Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions, achieving only 24.06% and 51.01% accuracy. (2) Although LLMs reasonably predict psychological discomfort, they do not adequately comprehend perspectives involving value shifts. (3) Cognitive behaviors that are effective in the math-solving and game strategy domains do not transfer to value reasoning. Instead, new failure patterns emerge, including early commitment and overcommitment. (4) The steerability of LLMs towards a given value is significantly correlated with their value preferences. (5) Finally, LLMs exhibit greater steerability when reasoning from a third-party perspective, although certain values (e.g., safety) benefit uniquely from first-person framing.
[139] SOLAR: Towards Characterizing Subjectivity of Individuals through Modeling Value Conflicts and Trade-offs
Younghun Lee, Dan Goldwasser
Main category: cs.CL
TL;DR: SOLAR framework uses LLMs to infer individual moral judgments from social media by analyzing value conflicts and trade-offs in user-generated texts.
Details
Motivation: To explore whether LLMs can account for individual-level subjectivity in moral judgments, which has not been sufficiently studied despite LLMs' strong performance in subjective decision making.
Method: Propose SOLAR framework that observes value conflicts and trade-offs in user-generated texts to better represent subjective ground of individuals on social media.
Result: Empirical results show SOLAR improves overall inference results and performance on controversial situations, and provides explanations about individuals’ value preferences.
Conclusion: SOLAR framework effectively infers individual moral judgments and provides value preference explanations, demonstrating LLMs’ capability to account for individual-level subjectivity.
Abstract: Large Language Models (LLMs) not only have solved complex reasoning problems but also exhibit remarkable performance in tasks that require subjective decision making. Existing studies suggest that LLM generations can be subjectively grounded to some extent, yet exploring whether LLMs can account for individual-level subjectivity has not been sufficiently studied. In this paper, we characterize subjectivity of individuals on social media and infer their moral judgments using LLMs. We propose a framework, SOLAR (Subjective Ground with Value Abstraction), that observes value conflicts and trade-offs in the user-generated texts to better represent subjective ground of individuals. Empirical results show that our framework improves overall inference results as well as performance on controversial situations. Additionally, we qualitatively show that SOLAR provides explanations about individuals’ value preferences, which can further account for their judgments.
[140] MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety
Yahan Yang, Soham Dan, Shuo Li, Dan Roth, Insup Lee
Main category: cs.CL
TL;DR: MrGuard is a multilingual guardrail system that detects unsafe content in LLMs across diverse languages using synthetic data generation, supervised fine-tuning, and curriculum-based optimization, achieving over 15% improvement over baselines.
Details
Motivation: LLMs are vulnerable to adversarial attacks like jailbreaking, especially in multilingual settings where safety-aligned data is limited, creating a need for robust multilingual content filtering systems.
Method: Three-stage approach: (1) synthetic multilingual data generation with cultural/linguistic nuances, (2) supervised fine-tuning, and (3) curriculum-based Group Relative Policy Optimization (GRPO) framework.
Result: MrGuard consistently outperforms recent baselines by more than 15% across both in-domain and out-of-domain languages, and maintains robustness to multilingual variations like code-switching and low-resource language distractors.
Conclusion: The multilingual guardrail with reasoning capability effectively addresses LLM safety vulnerabilities in diverse linguistic contexts and provides explanations for language-specific risks in content moderation.
Abstract: Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we introduce a multilingual guardrail with reasoning for prompt classification. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-based Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail, MrGuard, consistently outperforms recent baselines across both in-domain and out-of-domain languages by more than 15%. We also evaluate MrGuard’s robustness to multilingual variations, such as code-switching and low-resource language distractors in the prompt, and demonstrate that it preserves safety judgments under these challenging conditions. The multilingual reasoning capability of our guardrail enables it to generate explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.
[141] Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali, Yukai Miao, Dan Li, Qingsong Wei
Main category: cs.CL
TL;DR: This paper provides the first systematic survey and benchmark of uncertainty quantification and calibration methods for large language models to address hallucination issues.
Details
Motivation: LLMs suffer from hallucination (outputting incorrect information confidently), but there's no comprehensive analysis of uncertainty quantification methods' effectiveness or benchmark for comparing solutions.
Method: Conducted systematic survey of prior UQ and calibration methods for LLMs, created rigorous benchmark, and empirically evaluated six methods using two reliability datasets.
Result: The empirical evaluation justified significant findings from the review, providing insights into the effectiveness of different UQ and calibration approaches.
Conclusion: This is the first dedicated study reviewing calibration methods and relevant metrics for LLMs, with identified future directions and open challenges in the field.
Abstract: Large Language Models (LLMs) have been transformative across many domains. However, hallucination – confidently outputting incorrect information – remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify the uncertainty of LLMs. Extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy. While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and does not offer a comprehensive benchmark to enable insightful comparison among existing solutions. In this work, we fill this gap via a systematic survey of representative prior works on UQ and calibration for LLMs and introduce a rigorous benchmark. Using two widely used reliability datasets, we empirically evaluate six related methods, which justify the significant findings of our review. Finally, we provide outlooks for key future directions and outline open challenges. To the best of our knowledge, this survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.
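As a concrete anchor for the calibration side of the survey, here is a minimal expected calibration error (ECE) implementation, the standard equal-width-bin variant; the benchmark's actual metric suite is broader:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: the weighted average gap between mean confidence and accuracy,
    computed per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the bin's share of samples
    return ece

# e.g. a model answering with 0.9 confidence but only 70% accuracy in that
# bin contributes a 0.2 gap, weighted by how many samples fall in the bin.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.6], [1, 1, 0, 1]))
```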
[142] LLM-OptiRA: LLM-Driven Optimization of Resource Allocation for Non-Convex Problems in Wireless Communications
Xinyue Peng, Yanming Liu, Yihan Cang, Chaoqun Cao, Ming Chen
Main category: cs.CL
TL;DR: LLM-OptiRA is a framework that uses large language models to automatically detect and transform non-convex components in wireless resource allocation problems into solvable forms, achieving high execution and success rates.
Details
Motivation: Traditional optimization techniques struggle with non-convex resource allocation problems in wireless communication systems, requiring expert knowledge and manual intervention.
Method: Leverages LLMs to automatically detect non-convex components and transform them into solvable forms, with integrated error correction and feasibility validation mechanisms.
Result: Achieves 96% execution rate and 80% success rate on GPT-4, significantly outperforming baseline approaches in complex optimization tasks across diverse scenarios.
Conclusion: LLM-OptiRA enables fully automated resolution of non-convex resource allocation problems, reducing reliance on expert knowledge while ensuring robustness through validation mechanisms.
Abstract: Solving non-convex resource allocation problems poses significant challenges in wireless communication systems, often beyond the capability of traditional optimization techniques. To address this issue, we propose LLM-OptiRA, the first framework that leverages large language models (LLMs) to automatically detect and transform non-convex components into solvable forms, enabling fully automated resolution of non-convex resource allocation problems in wireless communication systems. LLM-OptiRA not only simplifies problem-solving by reducing reliance on expert knowledge, but also integrates error correction and feasibility validation mechanisms to ensure robustness. Experimental results show that LLM-OptiRA achieves an execution rate of 96% and a success rate of 80% on GPT-4, significantly outperforming baseline approaches in complex optimization tasks across diverse scenarios.
[143] Follow the Path: Reasoning over Knowledge Graph Paths to Improve LLM Factuality
Mike Zhang, Johannes Bjerva, Russa Biswas
Main category: cs.CL
TL;DR: fs1 improves reasoning factuality by sourcing traces from large reasoning models and grounding them on knowledge graph paths, achieving significant performance gains on complex QA tasks.
Details
Motivation: To improve the factuality of reasoning traces in LLMs by grounding them in factual knowledge graph paths, especially for complex knowledge-intensive tasks.
Method: Fine-tune instruction-tuned LLMs on 3.9K factually grounded reasoning traces sourced from large reasoning models and conditioned on KG paths.
Result: fs1-tuned model (32B) outperforms instruction-tuned counterparts by 6-14 absolute points (pass@16), with most improvements on complex questions requiring 3+ hops and numerical answers.
Conclusion: Anchoring reasoning to factual KG paths is critical for transforming LLMs into reliable systems for knowledge-intensive tasks, showing effectiveness beyond STEM domains.
Abstract: We introduce fs1, a simple yet effective method that improves the factuality of reasoning traces by sourcing them from large reasoning models (e.g., DeepSeek-R1) and grounding them by conditioning on knowledge graph (KG) paths. We fine-tune eight instruction-tuned Large Language Models (LLMs) on 3.9K factually grounded reasoning traces and rigorously evaluate them on six complex open-domain question-answering (QA) benchmarks encompassing 23.9K questions. Our results demonstrate that our fs1-tuned model (32B parameters) consistently outperforms instruction-tuned counterparts with parallel sampling by 6-14 absolute points (pass@16). Our detailed analysis shows that fs1 considerably improves model performance over more complex questions (requiring 3 or more hops on KG paths) and numerical answer types compared to the baselines. Furthermore, in single-pass inference, we notice that smaller LLMs show the most improvements. While prior works demonstrate the effectiveness of reasoning traces primarily in the STEM domains, our work shows strong evidence that anchoring reasoning to factual KG paths is a critical step in transforming LLMs for reliable knowledge-intensive tasks.
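The paper's exact prompt format is not given; a small sketch of what conditioning a reasoning trace on a KG path could look like, with an invented triple rendering:

```python
def kg_grounded_prompt(question: str, kg_path: list[tuple[str, str, str]]) -> str:
    """Format a KG path as (head, relation, tail) facts that the reasoning
    model must stay consistent with while producing its trace."""
    facts = "\n".join(f"- {h} --[{r}]--> {t}" for h, r, t in kg_path)
    return (
        "Answer the question with a step-by-step reasoning trace that is "
        "consistent with every fact below.\n"
        f"Facts:\n{facts}\n\nQuestion: {question}\nReasoning:"
    )

print(kg_grounded_prompt(
    "Which country is the author of 'Hamlet' from?",
    [("Hamlet", "author", "William Shakespeare"),
     ("William Shakespeare", "country_of_citizenship", "England")],
))
```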
[144] SuperCoder: Assembly Program Superoptimization with Large Language Models
Anjiang Wei, Tarun Suresh, Huanmi Tan, Yinglun Xu, Gagandeep Singh, Ke Wang, Alex Aiken
Main category: cs.CL
TL;DR: LLMs can serve as superoptimizers for assembly programs, achieving 95% correctness and 1.46x speedup over gcc -O3 through reinforcement learning fine-tuning.
Details
Motivation: To investigate whether large language models can outperform industry-standard compilers in optimizing assembly programs, moving beyond traditional compiler heuristics.
Method: Fine-tuned Qwen2.5-Coder-7B-Instruct with reinforcement learning using a reward function that combines correctness and performance speedup, evaluated on a large-scale benchmark of 8,072 real-world assembly programs.
Result: SuperCoder achieved 95.0% correctness and 1.46x average speedup over gcc -O3, significantly outperforming Claude-opus-4 baseline (51.5% correctness, 1.43x speedup).
Conclusion: LLMs can effectively serve as superoptimizers for assembly programs, establishing a new foundation for program performance optimization beyond traditional compiler approaches.
Abstract: Superoptimization is the task of transforming a program into a faster one while preserving its input-output behavior. In this work, we investigate whether large language models (LLMs) can serve as superoptimizers, generating assembly programs that outperform code already optimized by industry-standard compilers. We construct the first large-scale benchmark for this problem, consisting of 8,072 real-world assembly programs averaging 130 lines, in contrast to prior datasets restricted to 2-15 straight-line, loop-free programs. We evaluate 23 LLMs on this benchmark and find that the strongest baseline, Claude-opus-4, achieves a 51.5% test-passing rate and a 1.43x average speedup over gcc -O3. To further enhance performance, we fine-tune models with reinforcement learning, optimizing a reward function that integrates correctness and performance speedup. Starting from Qwen2.5-Coder-7B-Instruct (61.4% correctness, 1.10x speedup), the fine-tuned model SuperCoder attains 95.0% correctness and 1.46x average speedup. Our results demonstrate, for the first time, that LLMs can be applied as superoptimizers for assembly programs, establishing a foundation for future research in program performance optimization beyond compiler heuristics.
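The abstract says the RL reward integrates correctness and speedup but not the formula; one plausible shaping, with an arbitrary clipping threshold, purely as a sketch:

```python
def superopt_reward(passes_tests: bool, baseline_time: float, new_time: float) -> float:
    """One plausible shaping (an assumption, not the paper's exact formula):
    zero reward for incorrect programs, otherwise reward grows with speedup
    over the compiler-optimized baseline."""
    if not passes_tests or new_time <= 0:
        return 0.0
    speedup = baseline_time / new_time
    return min(speedup, 10.0)  # clip to keep the RL update stable

# e.g. gcc -O3 runs in 1.2s, the model's assembly in 0.8s -> reward ~1.5
assert abs(superopt_reward(True, 1.2, 0.8) - 1.5) < 1e-9
```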
[145] ZeroTuning: Unlocking the Initial Token’s Power to Enhance Large Language Models Without Training
Feijiang Han, Xiaodong Yu, Jianheng Tang, Delip Rao, Weihua Du, Lyle Ungar
Main category: cs.CL
TL;DR: ZeroTuning is a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token (e.g., the beginning-of-sequence token).
Details
Motivation: Existing token-level attention tuning methods depend on auxiliary heuristics to identify important tokens, which can introduce bias and limit applicability when token importance is unclear or when attention maps are inaccessible in optimized kernels.
Method: ZeroTuning adds lightweight biases to the initial token's attention logits, which monotonically controls the entropy of downstream attention distribution. It comes in two variants: supervised mode calibrated on validation examples, and unsupervised mode that directly minimizes model output entropy.
Result: Achieves broad gains across 15 datasets with Llama-3.1-8B: 19.9% improvement on classification, 4.5% on question answering, and 2.1% on dialogue. Works with quantized inference and maintains performance with increasing context lengths.
Conclusion: ZeroTuning provides a simpler, more elegant alternative to previous attention tuning methods that is lightweight, kernel-agnostic, and outperforms more complex approaches while requiring minimal code changes.
Abstract: Token-level attention tuning, a class of training-free methods including Post-hoc Attention Steering (PASTA) and Attention Calibration (ACT), has emerged as a promising way to improve frozen LLMs with interpretable interventions. However, these methods depend on auxiliary heuristics to identify "important" task-specific tokens, which can introduce bias and limit applicability when token importance is unclear or when using optimized kernels where attention maps are inaccessible. We propose a simpler and more elegant alternative: acting only on the initial token (e.g., the beginning-of-sequence token).
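The core operation, biasing the initial token's attention logits, can be illustrated with a standalone single-head toy; the real method applies per-head biases inside the model, which this sketch does not attempt:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_initial_bias(scores: np.ndarray, bias: float) -> np.ndarray:
    """scores: (num_queries, num_keys) pre-softmax attention logits for one
    head. Adding `bias` to the initial token's column shifts attention mass
    to or away from that token, which per the paper steers the entropy of
    the remaining attention distribution."""
    adjusted = scores.copy()
    adjusted[:, 0] += bias
    return softmax(adjusted, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 8))
for b in (-2.0, 0.0, 2.0):
    attn = attend_with_initial_bias(scores, b)
    print(f"bias={b:+.1f}  mean attention on initial token: {attn[:, 0].mean():.3f}")
```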
[146] HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models
Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Main category: cs.CL
TL;DR: HBO is a hierarchical balancing optimization method that addresses data imbalance and heterogeneity in LLM fine-tuning through global and local data allocation adjustments.
Details
Motivation: Existing methods only handle data imbalance across datasets globally but ignore local imbalances within individual datasets, limiting their effectiveness in fine-tuning LLMs on diverse data mixtures.
Method: HBO uses bilevel optimization with Global Actors that balance data sampling across datasets and Local Actors that optimize data usage within each dataset based on difficulty levels, guided by reward functions from the LLM's training state.
Result: HBO consistently outperforms existing baselines across three LLM backbones and nine diverse tasks in multilingual and multitask setups, achieving significant accuracy gains.
Conclusion: HBO provides a comprehensive solution to data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets through its hierarchical balancing approach.
Abstract: Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.
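A toy stand-in for the Global Actor: softmax sampling weights over datasets, nudged toward subsets whose reward (e.g., recent loss drop) is above average. The exponentiated-gradient update below is a simplification, not HBO's bilevel procedure:

```python
import numpy as np

class GlobalActor:
    """Toy stand-in for HBO's global balancing: maintain sampling logits
    over datasets and shift probability mass toward datasets showing more
    learning progress."""
    def __init__(self, n_datasets: int, lr: float = 0.1):
        self.logits = np.zeros(n_datasets)
        self.lr = lr

    def probs(self) -> np.ndarray:
        e = np.exp(self.logits - self.logits.max())
        return e / e.sum()

    def update(self, rewards: np.ndarray) -> None:
        # reward proxy: e.g. per-dataset loss drop since the last round
        self.logits += self.lr * (rewards - rewards.mean())

actor = GlobalActor(n_datasets=3)
actor.update(np.array([0.05, 0.20, 0.01]))  # dataset 1 improved most
print(actor.probs())  # sampling shifts toward dataset 1
```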
[147] ExpertSteer: Intervening in LLMs through Expert Knowledge
Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Main category: cs.CL
TL;DR: ExpertSteer enables cross-model activation steering by using external expert models to generate steering vectors for controlling LLM behavior during inference, outperforming existing methods across diverse tasks.
Details
Motivation: Existing activation steering methods are limited to using steering vectors generated by the model itself, which restricts effectiveness to that specific model and prevents leveraging powerful external expert models for improved guidance.
Method: A four-step process: 1) align representation dimensions with auto-encoders, 2) identify intervention layer pairs via mutual information analysis, 3) generate steering vectors from expert model using Recursive Feature Machines, 4) apply vectors during inference without updating parameters.
Result: Significantly outperforms established baselines across 15 benchmarks in four domains using three different LLMs, achieving improved performance at minimal cost.
Conclusion: ExpertSteer successfully enables cross-model knowledge transfer for activation steering, allowing arbitrary expert models to guide target LLMs without parameter updates, demonstrating superior effectiveness over model-specific steering approaches.
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities across various tasks, yet guiding them to follow desired behaviours during inference remains a significant challenge. Activation steering offers a promising method to control the generation process of LLMs by modifying their internal activations. However, existing methods commonly intervene in the model’s behaviour using steering vectors generated by the model itself, which constrains their effectiveness to that specific model and excludes the possibility of leveraging powerful external expert models for steering. To address these limitations, we propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors, enabling intervention in any LLMs. ExpertSteer transfers the knowledge from an expert model to a target LLM through a cohesive four-step process: first aligning representation dimensions with auto-encoders to enable cross-model transfer, then identifying intervention layer pairs based on mutual information analysis, next generating steering vectors from the expert model using Recursive Feature Machines, and finally applying these vectors on the identified layers during inference to selectively guide the target LLM without updating model parameters. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains. Experiments demonstrate that ExpertSteer significantly outperforms established baselines across diverse tasks at minimal cost.
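The final step, adding steering vectors at chosen layers with no parameter updates, is commonly implemented with forward hooks; a sketch under that assumption (the dimension-alignment and RFM steps are omitted, and the layer index is hypothetical):

```python
import torch

def make_steering_hook(vector: torch.Tensor, alpha: float = 1.0):
    """Forward hook that adds a steering vector to a layer's hidden states
    at inference time; no model weights are updated."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# hypothetical usage with a Hugging Face decoder layer:
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(v, 0.8))
# ... generate ...
# handle.remove()
```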
[148] Shadow-FT: Tuning Instruct Model via Training on Paired Base Model
Taiqiang Wu, Runming Yang, Jiayi Li, Pengfei Hu, Yik-Chung Wu, Ngai Wong, Yujiu Yang
Main category: cs.CL
TL;DR: Shadow-FT is a novel fine-tuning framework that leverages Base models to improve Instruct models by grafting weight updates from tuned Base models to Instruct models, achieving better performance than conventional approaches.
Details
Motivation: Directly fine-tuning Instruct models often leads to marginal improvements or performance degeneration, while Base models are good learners but weak backbones without post-training.
Method: Fine-tune the Base model and directly graft the learned weight updates to the Instruct model, introducing no additional parameters.
Result: Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches across 19 benchmarks covering coding, reasoning, and mathematical tasks.
Conclusion: Shadow-FT is effective for mainstream LLMs, can be applied to multimodal LLMs, and combined with DPO, providing a simple yet powerful fine-tuning approach.
Abstract: Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the Instruct (i.e., instruction-tuned) models often leads to marginal improvements and even performance degeneration. Notably, paired Base models, the foundation for these Instruct variants, contain highly similar weight values (i.e., less than 2% on average for Llama 3.1 8B). The Base model tends to be a good learner yet a weak backbone without post-training. Therefore, we propose a novel Shadow-FT framework to tune the Instruct models by leveraging the corresponding Base models. The key insight is to fine-tune the Base model, and then directly graft the learned weight updates to the Instruct model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Codes and weights are available at https://github.com/wutaiqiang/Shadow-FT.
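The grafting rule stated in the abstract reduces to a few lines; a minimal sketch assuming the three checkpoints share parameter names and shapes:

```python
import torch

@torch.no_grad()
def shadow_ft_graft(instruct_sd: dict, base_sd: dict, tuned_base_sd: dict) -> dict:
    """Graft the Base model's weight updates onto the Instruct model:
    W_instruct' = W_instruct + (W_base_tuned - W_base).
    Introduces no new parameters."""
    grafted = {}
    for name, w_instruct in instruct_sd.items():
        delta = tuned_base_sd[name] - base_sd[name]
        grafted[name] = w_instruct + delta
    return grafted

# usage sketch:
# new_sd = shadow_ft_graft(instruct.state_dict(), base.state_dict(),
#                          tuned_base.state_dict())
# instruct.load_state_dict(new_sd)
```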
[149] Language-Specific Latent Process Hinders Cross-Lingual Performance
Zheng Wei Lim, Alham Fikri Aji, Trevor Cohn
Main category: cs.CL
TL;DR: LLMs show cross-lingual transfer but produce inconsistent outputs across languages. Analysis reveals they rely on language-specific representations rather than shared semantic spaces, with larger models having more dissociated hidden states but better knowledge retrieval.
Details
Motivation: To understand how language models generalize knowledge across languages and why they produce inconsistent outputs when prompted with the same queries in different languages.
Method: Measured representation similarity between languages and applied logit lens to interpret implicit steps in multilingual multi-choice reasoning questions.
Result: LLMs predict inconsistently due to dissimilar representations across languages. Larger models are more multilingual but have more dissociated hidden states, though better at retrieving cross-lingual knowledge. Small models can be improved by steering towards shared semantic space.
Conclusion: Knowledge sharing in small models can be enhanced by steering processing towards shared semantic space, improving multilingual reasoning performance through better knowledge transfer and output consistency with English.
Abstract: Large language models (LLMs) are demonstrably capable of cross-lingual transfer, but can produce inconsistent output when prompted with the same queries written in different languages. To understand how language models are able to generalize knowledge from one language to the others, we measure representation similarity between languages, and apply the logit lens to interpret the implicit steps taken by LLMs to solve multilingual multi-choice reasoning questions. Our analyses reveal LLMs predict inconsistently and are less accurate because they rely on representations that are dissimilar across languages, rather than working in a shared semantic space. While larger models are more multilingual, we show their hidden states are more likely to dissociate from the shared representation compared to smaller models, but are nevertheless more capable of retrieving knowledge embedded across different languages. Finally, we demonstrate that knowledge sharing in small models can be facilitated by steering their latent processing towards the shared semantic space. This improves the models’ multilingual reasoning performance, as a result of more knowledge transfer from, and better output consistency with English.
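The logit-lens analysis used here decodes intermediate hidden states through the model's output head; a sketch assuming Llama-style Hugging Face attribute names (model.model.norm, lm_head):

```python
import torch

@torch.no_grad()
def logit_lens(model, hidden_states: torch.Tensor, top_k: int = 5):
    """Project an intermediate layer's hidden states through the model's
    final norm and unembedding matrix, revealing the 'implicit' token
    predictions at that depth."""
    h = model.model.norm(hidden_states)  # final RMSNorm (Llama-style naming)
    logits = model.lm_head(h)            # unembedding to vocabulary space
    return logits.topk(top_k, dim=-1).indices

# hypothetical usage:
# out = model(input_ids, output_hidden_states=True)
# mid = out.hidden_states[16][:, -1]     # layer 16, last position
# print(tokenizer.convert_ids_to_tokens(logit_lens(model, mid)[0].tolist()))
```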
[150] UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models
Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang
Main category: cs.CL
TL;DR: UltraEdit is a training-free, subject-free, and memory-free approach for lifelong model editing that achieves 7x faster editing speed and uses less than 1/4 VRAM compared to previous methods, enabling editing of 7B LLMs on consumer-grade GPUs.
Details
Motivation: Current model editing methods struggle to meet practical lifelong adaptation demands at scale, requiring more efficient solutions for real-world deployment.
Method: Computes parameter shifts in one step using only a hidden state and its gradient, employs lifelong normalization strategy to continuously update feature statistics across turns for adapting to distributional shifts.
Result: Achieves over 7x faster editing speed than previous SOTA, uses less than 1/4 VRAM, supports up to 2M edits while maintaining high accuracy, and can edit 7B LLMs on 24GB consumer GPUs.
Conclusion: UltraEdit represents significant progress toward safe and scalable lifelong learning, demonstrating superior performance across diverse model editing scenarios.
Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model’s internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds over 7x faster than the previous state-of-the-art method, which was also the fastest known approach, while using less than 1/4 the VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. Our code is available at: https://github.com/XiaojieGu/UltraEdit
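The abstract names the ingredients (one hidden state and its gradient, plus running feature statistics) but not the closed form; a toy rank-one update for a single linear layer, purely illustrative and not the paper's formula:

```python
import torch

def one_step_shift(h: torch.Tensor, grad_out: torch.Tensor, lr: float = 1.0):
    """Toy rank-one edit for a linear layer y = W h: nudge W against the
    gradient of the edit loss w.r.t. the layer output, using only one
    hidden state h (not UltraEdit's exact formula)."""
    # normalizing by ||h||^2 makes the step scale-invariant in h
    return -lr * torch.outer(grad_out, h) / h.dot(h).clamp_min(1e-8)

W = torch.randn(8, 4)
h = torch.randn(4)
delta = one_step_shift(h, grad_out=torch.randn(8))
W_edited = W + delta  # applied in a single step, no optimizer state
```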
[151] UniErase: Towards Balanced and Precise Unlearning in Language Models
Miao Yu, Liang Lin, Guibin Zhang, Xinfeng Li, Junfeng Fang, Xingrui Yu, Ivor Tsang, Ningyu Zhang, Kun Wang, Yang Wang
Main category: cs.CL
TL;DR: UniErase is a novel LLM unlearning framework that uses Unlearning Tokens and lightweight edits to achieve precise knowledge removal while maintaining model ability, outperforming existing methods on the TOFU benchmark.
Details
Motivation: Address the limitations of current fine-tuning-based unlearning methods that lack precision in targeted unlearning and struggle to balance unlearning efficacy with general ability preservation, especially in massive and sequential settings.
Method: Introduces Unlearning Token optimized to steer LLMs toward forgetting space, and lightweight Unlearning Edit to efficiently associate unlearning targets with this meta-token, modifying only ~3.66% of LLM parameters.
Result: Outperforms 8 baselines on TOFU benchmark: ~4.01× better model ability than best-forgetting baseline with higher unlearning efficacy, and 35.96% better unlearning efficacy than best-retaining method with better ability retention.
Conclusion: UniErase demonstrates balanced and dual top-tier performances in knowledge unlearning and ability retention, establishing a new paradigm for precise LLM unlearning.
Abstract: Large language models (LLMs) require iterative updates to address the outdated information problem, where LLM unlearning offers an approach for selective removal. However, mainstream unlearning methods primarily rely on fine-tuning techniques, which often lack precision in targeted unlearning and struggle to balance unlearning efficacy with general ability under massive and sequential settings. To bridge this gap, in this work, we introduce UniErase, a novel unlearning framework that demonstrates precision and balanced performances between knowledge unlearning and ability retaining. We first propose the Unlearning Token, which is optimized to steer LLMs toward a forgetting space. To achieve concrete unlearning behaviors, we further introduce the lightweight Unlearning Edit to efficiently associate the unlearning targets with this meta-token. Serving as a new unlearning paradigm via editing, UniErase achieves outstanding performances across batch, sequential, and precise unlearning tasks under fictitious and real-world knowledge scenarios. On the TOFU benchmark, compared with 8 baselines, UniErase, modifying only ~3.66% of the LLM parameters, outperforms the previous best-forgetting baseline by ~4.01× for model ability with even higher unlearning efficacy. Similarly, UniErase, with better ability retention, also surpasses the previous best-retaining method by 35.96% for unlearning efficacy, showing balanced and dual top-tier performances in the current unlearning community.
[152] Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs
Mikhail Menschikov, Alexander Kharitonov, Maiia Kotyga, Vadim Porvatov, Anna Zhukovskaya, David Kagramanyan, Egor Shvetsov, Evgeny Burnaev
Main category: cs.CL
TL;DR: This paper investigates position bias in Large Language Models across multiple languages and model architectures, finding that bias patterns are model-driven with language-specific variations, challenging assumptions about universal early-token bias.
Details
Motivation: To explore how position bias patterns vary across different languages and model architectures, as previous research has shown LLMs exhibit systematic tendencies to neglect information at specific context positions but the patterns remain unexplored across languages.
Method: Conducted a multilingual study across five typologically distinct languages (English, Russian, German, Hindi, Vietnamese) and five model architectures, examining how position bias interacts with prompt strategies and affects output entropy.
Result: Key findings: (1) Position bias is primarily model-driven with language-specific variations; (2) Explicit instructions about context relevance unexpectedly reduce accuracy; (3) Largest accuracy drop occurs with middle-position information but without corresponding entropy peak.
Conclusion: Position bias patterns in LLMs are more complex than previously assumed, being model-dependent with language variations, and common prompt engineering practices may be counterproductive.
Abstract: Large Language Models (LLMs) exhibit position bias - a systematic tendency to neglect information at specific context positions. However, the patterns of position bias behavior, depending on the language or model, remain unexplored. We present a multilingual study across five typologically distinct languages (English, Russian, German, Hindi, and Vietnamese) and five model architectures, examining how position bias interacts with prompt strategies and affects output entropy. Our key findings are: (1) Position bias is primarily model-driven, yet exhibits language-specific variations. For instance, Qwen2.5-7B-Instruct and DeepSeek 7B Chat consistently favor late positions, challenging established assumptions of a universal early-token bias in LLMs. (2) Explicitly instructing the model that “the context is relevant to the query” unexpectedly reduces accuracy across languages, undermining common prompt-engineering practices. (3) While the largest accuracy drop occurs when relevant information is placed in the middle of the context, this is not explicitly reflected by a corresponding peak in output entropy.
[153] Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Main category: cs.CL
TL;DR: ARC-JSD is a novel Jensen-Shannon Divergence method for efficient context attribution in RAG systems without fine-tuning, achieving superior accuracy and computational efficiency.
Details
Motivation: Current context attribution methods in RAG systems are computationally intensive, requiring extensive fine-tuning or human annotation, making reliable attribution challenging.
Method: Proposes ARC-JSD using Jensen-Shannon Divergence to identify essential context sentences without additional fine-tuning, gradient calculation, or surrogate modeling.
Result: Superior accuracy and significant computational efficiency improvements on RAG benchmarks (TyDi QA, Hotpot QA, Musique) compared to previous surrogate-based methods. Mechanistic analysis reveals specific attention heads and MLP layers responsible for context attribution.
Conclusion: ARC-JSD provides an efficient and accurate solution for context attribution in RAG systems, with insights into model internals that affect RAG behaviors.
Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning, gradient-calculation or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models and how they affect RAG behaviours. Our code is available at https://github.com/ruizheliUOA/ARC_JSD.
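The attribution idea follows directly from the abstract: compare the model's answer distribution under the full context against ablated contexts and rank sentences by JSD. The leave-one-sentence-out protocol in this sketch is an assumption about the exact setup:

```python
import numpy as np

def jsd(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence between two probability distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attribute_sentences(answer_dist_full, answer_dists_ablated):
    """Score each context sentence by the JSD between the answer
    distribution with the full context and with that sentence removed;
    larger divergence means the sentence mattered more."""
    return [jsd(answer_dist_full, d) for d in answer_dists_ablated]
```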
[154] Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems
Song Jin, Juntian Zhang, Yuhan Liu, Xun Zhang, Yufei Zhang, Guojun Yin, Fei Jiang, Wei Lin, Rui Yan
Main category: cs.CL
TL;DR: RecInter is an agent-based simulation platform for recommender systems that features dynamic interaction mechanisms where user actions update item attributes in real-time, enabling more realistic ecosystem evolution.
Details
Motivation: Traditional A/B testing is resource-intensive and offline methods struggle with dynamic user-platform interactions. Existing simulation platforms lack mechanisms for user actions to dynamically reshape the environment.
Method: RecInter uses agent-based simulation with robust interaction mechanisms, multidimensional user profiling, advanced agent architecture, and LLMs fine-tuned on Chain-of-Thought enriched interaction data. It includes Merchant Agents that can reply to user actions.
Result: The platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments show the interaction mechanism is crucial for simulating realistic system evolution.
Conclusion: RecInter establishes itself as a credible testbed for recommender systems research by providing a more realistic and evolving simulation environment through dynamic interaction mechanisms.
Abstract: Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems featuring a robust interaction mechanism. In the RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real-time, and the introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through a Multidimensional User Profiling module, an Advanced Agent Architecture, and an LLM fine-tuned on Chain-of-Thought (CoT) enriched interaction data. Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments demonstrate that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research. Our codes are available at https://github.com/jinsong8/RecInter.
[155] Unlearning Isn’t Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu
Main category: cs.CL
TL;DR: Current task-level metrics for LLM unlearning are misleading as models can easily restore forgotten behavior through minimal fine-tuning. The paper introduces a representation-level analysis framework to properly evaluate unlearning effectiveness.
Details
Motivation: Existing unlearning evaluation metrics (accuracy, perplexity) are inadequate because they don't capture whether information is genuinely erased or just temporarily suppressed, leading to misleading assessments of unlearning success.
Method: Developed a representation-level analysis framework using PCA-based similarity/shift, centered kernel alignment (CKA), Fisher information, and mean PCA distance to measure representational drift across six unlearning methods, three data domains, and two LLMs.
Result: Identified four distinct forgetting regimes based on reversibility and catastrophicity. Found that achieving ideal irreversible, non-catastrophic forgetting is extremely challenging. Discovered one case of seemingly irreversible, targeted forgetting.
Conclusion: Current unlearning evaluation practices have fundamental gaps. The proposed representation-level framework provides a foundation for trustworthy unlearning assessment and reveals the difficulty of achieving genuine, irreversible forgetting in LLMs.
Abstract: Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We demonstrate that these metrics are often misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This phenomenon of reversibility suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a representation-level analysis framework. Our toolkit comprises PCA-based similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across six unlearning methods, three data domains, and two LLMs, we identify four distinct forgetting regimes based on their reversibility and catastrophicity. Our analysis reveals that achieving the ideal state (irreversible, non-catastrophic forgetting) is exceptionally challenging. By probing the limits of unlearning, we identify a case of seemingly irreversible, targeted forgetting, offering new insights for designing more robust erasure algorithms. Our findings expose a fundamental gap in current evaluation practices and establish a representation-level foundation for trustworthy unlearning.
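Of the toolkit's metrics, linear CKA is standard and easy to reproduce; a minimal implementation for comparing representations before and after unlearning:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment between two representation matrices
    (n_samples x dim); 1 = identical structure, 0 = unrelated. Comparing
    pre- and post-unlearning activations with CKA exposes whether the
    representation actually drifted or was merely suppressed."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 64))
print(linear_cka(A, A))                           # ~1.0: no drift
print(linear_cka(A, rng.normal(size=(100, 64))))  # near 0: large drift
```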
[156] BP-Seg: A graphical model approach to unsupervised and non-contiguous text segmentation using belief propagation
Fengyi Li, Kayhan Behdin, Natesh Pillai, Xiaofeng Wang, Zhipeng Wang, Ercan Yildiz
Main category: cs.CL
TL;DR: BP-Seg: An unsupervised graphical model-based approach for text segmentation that considers both local coherence and long-range semantic similarity using belief propagation.
Details
Motivation: Text segmentation based on semantic meaning is fundamental for many downstream applications, requiring methods that can handle both local relationships and distant semantic connections.
Method: Uses belief propagation on carefully constructed graphical models to capture local coherence between adjacent sentences while also grouping distant but semantically similar sentences.
Result: Experimental results on illustrative examples and long-form document datasets show favorable performance compared to competing approaches.
Conclusion: BP-Seg effectively addresses text segmentation by balancing local coherence and long-range semantic similarity through graphical modeling and belief propagation.
Abstract: Text segmentation based on the semantic meaning of sentences is a fundamental task with broad utility in many downstream applications. In this paper, we propose a graphical model-based unsupervised learning approach, named BP-Seg for efficient text segmentation. Our method not only considers local coherence, capturing the intuition that adjacent sentences are often more related, but also effectively groups sentences that are distant in the text yet semantically similar. This is achieved through belief propagation on the carefully constructed graphical models. Experimental results on both an illustrative example and a dataset with long-form documents demonstrate that our method performs favorably compared to competing approaches.
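A toy version of the construction: sentences as nodes, edges for adjacent or semantically similar pairs, and loopy belief propagation on a Potts-style model that rewards linked sentences sharing a segment label. This is a simplified stand-in for the paper's model, with random unaries used only for symmetry breaking:

```python
import numpy as np

def bp_segment(sim: np.ndarray, n_labels: int, coupling: float = 1.0,
               iters: int = 20) -> np.ndarray:
    """Sum-product loopy BP on a Potts model: the pairwise potential rewards
    equal labels on an edge with strength exp(coupling * sim[i, j])."""
    n = sim.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if i != j and sim[i, j] > 0]
    msg = {e: np.full(n_labels, 1.0 / n_labels) for e in edges}
    unary = np.random.default_rng(0).random((n, n_labels))  # tie-breaking only
    for _ in range(iters):
        for (i, j) in edges:
            # product of unary and all messages into i, except the one from j
            incoming = unary[i].copy()
            for (k, i2) in edges:
                if i2 == i and k != j:
                    incoming *= msg[(k, i)]
            psi = np.ones((n_labels, n_labels))
            np.fill_diagonal(psi, np.exp(coupling * sim[i, j]))
            m = psi.T @ incoming
            msg[(i, j)] = m / m.sum()
    beliefs = unary.copy()
    for (i, j) in edges:
        beliefs[j] *= msg[(i, j)]
    return beliefs.argmax(axis=1)

# six sentences: two coherent groups with no link across the topic boundary
S = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (3, 4), (4, 5), (0, 2)]:
    S[a, b] = S[b, a] = 1.0
print(bp_segment(S, n_labels=2))  # each linked group tends to share one label
```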
[157] From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
Chen Shani, Liron Soffer, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv
Main category: cs.CL
TL;DR: LLMs and humans both organize knowledge but differ fundamentally: LLMs optimize for statistical compression efficiency while humans prioritize semantic richness and adaptive flexibility, revealing distinct strategies in artificial vs biological intelligence.
Details
Motivation: To understand whether LLMs achieve the same balance between compression and semantic meaning preservation that humans do when organizing knowledge into categories.
Method: Applied Information Bottleneck principle to quantitatively compare LLMs and humans, analyzing embeddings from 40+ LLMs against classic human categorization benchmarks and examining training dynamics.
Result: Three key findings: 1) LLMs align with human categories but miss fine-grained semantic distinctions; 2) LLMs achieve optimal information-theoretic compression while humans prioritize contextual richness; 3) Encoder models outperform decoder models in human alignment, suggesting distinct mechanisms for generation vs understanding.
Conclusion: LLMs and humans employ divergent strategies - LLMs optimize for compression efficiency while humans for adaptive utility, revealing fundamental differences between artificial and biological intelligence that can guide development of more human-aligned AI.
Abstract: Humans organize knowledge into compact categories that balance compression with semantic meaning preservation. Large Language Models (LLMs) demonstrate striking linguistic abilities, yet whether they achieve this same balance remains unclear. We apply the Information Bottleneck principle to quantitatively compare how LLMs and humans navigate this compression-meaning trade-off. Analyzing embeddings from 40+ LLMs against classic human categorization benchmarks, we uncover three key findings. First, LLMs broadly align with human categories but miss fine-grained semantic distinctions crucial for human understanding. Second, LLMs demonstrate aggressive statistical compression, achieving "optimal" information-theoretic efficiency, while humans prioritize contextual richness and adaptive flexibility. Third, encoder models surprisingly outperform decoder models in human alignment, suggesting that generation and understanding rely on distinct mechanisms in current architectures. In addition, training dynamics analysis reveals that conceptual structure develops in distinct phases: rapid initial formation followed by architectural reorganization, with semantic processing migrating from deeper to mid-network layers as models discover more efficient encoding. These divergent strategies, where LLMs optimize for compression and humans for adaptive utility, reveal fundamental differences between artificial and biological intelligence, guiding development toward more human-aligned AI.
[158] Prompting is not Enough: Exploring Knowledge Integration and Controllable Generation on Large Language Models
Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
Main category: cs.CL
TL;DR: GenKI is a novel framework that improves OpenQA performance by simultaneously addressing knowledge integration and controllable generation in LLMs through dense passage retrieval, instruction-based fine-tuning, and ensemble-based generation.
Details
Motivation: To address two critical challenges in LLM-based OpenQA: effective knowledge integration into LLMs and adaptive generation with specific answer formats for various task situations.
Method: Uses dense passage retrieval to get associated knowledge, introduces knowledge integration model that incorporates retrieval knowledge into instructions during fine-tuning, and employs ensemble-based generation for controllable output with text consistency (coherence, fluency, answer format assurance).
Result: Extensive experiments on TriviaQA, MSMARCO, and CMRC2018 datasets demonstrate GenKI’s effectiveness compared to state-of-the-art baselines. Ablation studies show linear relationship between retrieved knowledge frequency and accurate knowledge recall.
Conclusion: GenKI successfully addresses knowledge integration and controllable generation challenges in OpenQA, showing improved performance across diverse datasets and revealing insights about knowledge recall patterns.
Abstract: Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieval knowledge into instructions during fine-tuning to strengthen the model. Furthermore, to enable controllable generation in LLMs, we leverage a certain fine-tuned LLM and an ensemble based on text consistency incorporating all coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI
[159] BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
Main category: cs.CL
TL;DR: BiomedSQL is the first benchmark for evaluating scientific reasoning in text-to-SQL generation over biomedical knowledge bases, revealing significant performance gaps between current LLMs and expert baselines.
Details
Motivation: Current text-to-SQL systems struggle with mapping qualitative scientific questions requiring implicit domain reasoning into executable SQL queries, particularly in biomedical contexts.
Method: Created BiomedSQL benchmark with 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base integrating gene-disease associations, omics data causal inference, and drug approval records. Evaluated various LLMs across prompting strategies and interaction paradigms.
Result: GPT-o3-mini achieved 59.0% execution accuracy, while the custom multi-step agent BMSQL reached 62.6%, both significantly below the expert baseline of 90.0%.
Conclusion: BiomedSQL provides a foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases, with current systems showing substantial room for improvement.
Abstract: Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
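For readers unfamiliar with the metric, execution accuracy counts a prediction as correct only if running it returns the same rows as the gold query. Below is a toy sketch using SQLite as a stand-in for the benchmark's BigQuery backend; the real harness is in the linked repository, and everything here is an assumption for illustration.

```python
# Sketch of an execution-accuracy metric for text-to-SQL (toy version).
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """A prediction counts as correct iff it returns the same row set as gold."""
    con = sqlite3.connect(db_path)
    try:
        gold = set(map(tuple, con.execute(gold_sql).fetchall()))
        try:
            pred = set(map(tuple, con.execute(pred_sql).fetchall()))
        except sqlite3.Error:          # malformed or invalid SQL scores zero
            return False
        return gold == pred            # order-insensitive comparison
    finally:
        con.close()

def execution_accuracy(db_path: str, pairs) -> float:
    """pairs: iterable of (gold_sql, pred_sql) strings."""
    pairs = list(pairs)
    hits = sum(execution_match(db_path, g, p) for g, p in pairs)
    return hits / len(pairs)
```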
[160] EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian
Daryna Dementieva, Nikolay Babakov, Alexander Fraser
Main category: cs.CL
TL;DR: EmoBench-UA is the first annotated dataset for emotion detection in Ukrainian texts, created through crowdsourcing and evaluated with various approaches including linguistic baselines, synthetic data, and LLMs.
Details
Motivation: Ukrainian NLP has made progress in many text processing tasks, but emotion classification remains underexplored with no publicly available benchmark to date.
Method: Created EmoBench-UA dataset through crowdsourcing using Toloka.ai platform, adapted annotation schema from English-centric works, and evaluated linguistic-based baselines, synthetic data translated from English, and large language models.
Result: The findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian.
Conclusion: There is a need for further development of Ukrainian-specific models and training resources for emotion classification.
Abstract: While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric work on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing using the Toloka.ai platform, ensuring the high quality of the annotation process. Then, we evaluate a range of approaches on the collected dataset, ranging from linguistic baselines and synthetic data translated from English to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.
[161] Table-R1: Inference-Time Scaling for Table Reasoning
Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao
Main category: cs.CL
TL;DR: First study on inference-time scaling for table reasoning tasks using two post-training strategies: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR).
Details
Motivation: To explore inference-time scaling on table reasoning tasks and develop efficient methods that can match or exceed frontier model performance using smaller models.
Method: Two approaches: 1) Distillation using reasoning traces from DeepSeek-R1 to create Table-R1-SFT model, 2) RLVR with task-specific verifiable rewards using GRPO algorithm to create Table-R1-Zero model.
Result: Table-R1-Zero model matches or exceeds GPT-4.1 and DeepSeek-R1 performance using only 7B parameters, with strong out-of-domain generalization. Shows benefits of instruction tuning, architecture choices, and emergent table reasoning skills.
Conclusion: Inference-time scaling through post-training strategies enables smaller models to achieve frontier-level performance on table reasoning tasks, with RLVR showing particularly strong results and generalization capabilities.
Abstract: In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
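RLVR hinges on a reward that can be checked programmatically rather than learned. A minimal sketch of what a short-form table-QA reward could look like (the normalization rules below are our assumptions; the paper defines its own task-specific reward functions):

```python
# Sketch of a verifiable reward for short-form table QA.
import re

def normalize(ans: str) -> str:
    """Lowercase, strip trailing punctuation, canonicalize numbers."""
    ans = ans.strip().lower().rstrip(".")
    ans = re.sub(r"[,\s]+", " ", ans)
    try:
        return f"{float(ans.replace(' ', '')):g}"   # '1,200' -> '1200'
    except ValueError:
        return ans

def short_form_qa_reward(prediction: str, gold: str) -> float:
    """1.0 for an exact normalized match, else 0.0: checkable, not learned."""
    return float(normalize(prediction) == normalize(gold))

assert short_form_qa_reward("1,200", "1200") == 1.0
assert short_form_qa_reward("Paris.", "paris") == 1.0
```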
[162] InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning
Zeyu Liu, Zhitian Hou, Guanghao Zhu, Zhijie Sang, Congkai Xie, Hongxia Yang
Main category: cs.CL
TL;DR: The paper introduces InfiMed-Series models that address challenges in applying MLLMs to medical domains through strategic data augmentation and reflective reasoning enhancement, achieving state-of-the-art performance on medical benchmarks.
Details
Motivation: MLLMs face two key challenges in medical applications: scarcity of multimodal medical datasets with sparse information, and ineffectiveness of RLVR in reliably improving performance in the medical domain.
Method: During SFT stage, incorporated high-quality textual reasoning data and general multimodal data alongside medical data. Synthesized reflective-pattern-injected CoT samples to equip models with initial reflective reasoning capabilities. Developed InfiMed-SFT-3B and InfiMed-RL-3B models.
Result: InfiMed-RL-3B achieved 59.2% average accuracy across seven multimodal medical benchmarks, outperforming larger models like InternVL3-8B (57.3%). Used 188K samples in SFT phase and 36K in RLVR phase.
Conclusion: The proposed training strategies effectively advance MLLM performance in medical scenarios, with both SFT and RLVR phases contributing to superior results.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in domains such as visual understanding and mathematical reasoning. However, their application in the medical domain is constrained by two key challenges: (1) multimodal medical datasets are scarce and often contain sparse information, limiting reasoning depth; and (2) Reinforcement Learning with Verifiable Rewards (RLVR), though effective in general domains, cannot reliably improve model performance in the medical domain. To overcome these challenges, during the supervised fine-tuning (SFT) stage, we incorporate high-quality textual reasoning data and general multimodal data alongside multimodal medical data to efficiently enhance foundational medical capabilities and restore the base model’s reasoning ability. Moreover, considering that there are some multimodal medical datasets with sparse information, we further synthesize reflective-pattern-injected chain-of-thought (CoT) in addition to general CoT samples, equipping the model with initial reflective reasoning capabilities that provide a structured foundation for subsequent RLVR training. Finally, we introduce our InfiMed-Series models, InfiMed-SFT-3B and InfiMed-RL-3B, both of which deliver state-of-the-art performance across seven multimodal medical benchmarks. Notably, InfiMed-RL-3B achieves an average accuracy of 59.2%, outperforming even larger models like InternVL3-8B, which achieves 57.3%. Specifically, during the SFT phase, we utilized 188K samples, while the RLVR phase incorporated 36K samples, demonstrating the efficacy of both training strategies in achieving superior performance. We also conducted a series of extensive experiments, which provide valuable insights that contribute to advancing the performance of MLLMs in medical scenarios.
[163] RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
Main category: cs.CL
TL;DR: RARE is a unified framework for evaluating RAG systems’ robustness against real-world noise, conflicting contexts, and fast-changing facts through systematic query and document perturbations.
Details
Motivation: Existing RAG evaluations rarely test how systems handle real-world noise, conflicting internal/external contexts, or rapidly changing factual information, creating a gap in robustness assessment.
Method: Developed RARE framework with knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts relations from time-sensitive corpora and generates multi-level question sets without manual intervention.
Result: Created RARE-Set dataset with 527 finance/economics/policy documents and 48,295 evolving questions; found RAG systems unexpectedly sensitive to perturbations and consistently less robust on multi-hop queries.
Conclusion: RAG systems show significant vulnerability to perturbations, particularly on complex multi-hop queries, highlighting the need for more robust retrieval-augmented generation approaches.
Abstract: Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48,295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model’s ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they consistently demonstrate lower robustness on multi-hop queries compared to single-hop queries across all domains.
[164] Probing Neural Topology of Large Language Models
Yu Zheng, Yuan Yuan, Yue Zhuo, Yong Li, Paolo Santi
Main category: cs.CL
TL;DR: Graph probing reveals that neural topology in LLMs contains richer information about language performance than neural activations, enabling prediction of next-token performance with just 1% of connections and applications in model efficiency and safety.
Details
Motivation: To understand the complex relationship between neurons' functional co-activation and emergent model capabilities, which remains largely unknown and hinders deeper understanding and safer development of LLMs.
Method: Graph probing method for uncovering functional connectivity of LLM neurons and relating it to language generation performance, tested across diverse LLM families and scales with interventional experiments.
Result: Discovered universal predictability of next-token prediction performance using only neural topology, even with just 1% of connections. Topology probing outperforms activation probing by up to 130.4%. Identified default networks and hub neurons, with causal evidence showing LLMs exploit topological information.
Conclusion: Neural topology contains orders of magnitude richer information about LLM performance than neural activation and can be leveraged to improve efficiency, reliability, and safety through applications in model pruning, hallucination detection, and LLM fingerprinting.
Abstract: Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural activations to interpretable semantics. However, the complex mechanisms that link neurons’ functional co-activation with emergent model capabilities remain largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity of LLM neurons and relating it to language generation performance. By probing models across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology, which persists even when retaining just 1% of neuron connections. Strikingly, probing on topology outperforms probing on activation by up to 130.4%, suggesting that neural topology contains orders of magnitude richer information about LLM performance than neural activation, which can be easily extracted with simple linear or MLP probes. To explain the dependence between neural topology and language performance, we identify default networks and hub neurons in LLMs and provide causal evidence by interventional experiments on multiple benchmarks, showing that LLMs actually exploit this topological information. Further analyses suggest that neural topology can be effectively leveraged to improve the efficiency, reliability, and safety of LLMs through proof-of-concept applications in model pruning, hallucination detection, and LLM fingerprinting. Codes and data for the graph probing toolbox are available at https://github.com/DavyMorgan/llm-graph-probing.
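A rough sketch of the probing recipe as described above. The connectivity estimator (correlation), the per-input 1% edge threshold, and the degree-profile features are our assumptions for illustration; the released toolbox linked in the abstract is the authoritative implementation.

```python
# Sketch: functional-connectivity graph from neuron co-activations,
# thresholded to ~1% of edges, then a linear probe onto performance.
import numpy as np
from sklearn.linear_model import Ridge

def topology_features(acts: np.ndarray, keep_frac: float = 0.01) -> np.ndarray:
    """acts: (tokens, neurons) activations for one input. Returns the
    per-neuron degree profile of the thresholded co-activation graph."""
    corr = np.corrcoef(acts.T)                      # neuron-neuron connectivity
    np.fill_diagonal(corr, 0.0)
    thresh = np.quantile(np.abs(corr), 1 - keep_frac)
    adj = (np.abs(corr) >= thresh).astype(float)    # keep ~1% of connections
    return adj.sum(axis=1)

rng = np.random.default_rng(0)
X = np.stack([topology_features(rng.standard_normal((128, 64))) for _ in range(200)])
y = rng.standard_normal(200)            # stand-in for per-input next-token loss
probe = Ridge(alpha=1.0).fit(X[:160], y[:160])
print("probe R^2 on held-out inputs:", probe.score(X[160:], y[160:]))
```

On random stand-in data the probe learns nothing, of course; the paper's finding is that on real activations such topology features predict next-token performance remarkably well.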
[165] Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lyu
Main category: cs.CL
TL;DR: Knowledgeable-R1 is a reinforcement learning framework that trains LLMs to resist misleading retrieved context by leveraging parametric knowledge, improving robustness in RAG systems.
Details
Motivation: RAG systems can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.
Method: Uses joint sampling to generate paired responses with/without retrieval, learning local and global advantages to quantify when to ignore misleading context versus adopt it. Employs asymmetric advantage transformation to amplify exploratory behaviors toward parametric knowledge.
Result: Significantly improves robustness and reasoning accuracy in knowledge conflict scenarios and general RAG scenarios, outperforming SOTA baselines by 23% in counterfactual scenarios, with no degradation when retrieved context is accurate.
Conclusion: The framework effectively trains LLMs to use parametric knowledge to resist contextual interference while still exploiting external context when reliably helpful, enhancing RAG system reliability.
Abstract: Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors. We propose Knowledgeable-R1, a reinforcement-learning framework that explicitly trains large language models to use parametric knowledge (PK) to resist contextual interference while still exploiting external context when it is reliably helpful. Knowledgeable-R1 introduces a joint sampling scheme that generates paired responses with and without retrieval, and learns both local advantages (within each decoding regime) and global advantages under the same input to quantify when to ignore misleading context versus adopt it. We employ an asymmetric advantage transformation that amplifies exploratory behaviors toward parametric knowledge. Experiments show that Knowledgeable-R1 significantly improves robustness and reasoning accuracy in knowledge conflict scenarios and general RAG scenarios, outperforming SOTA baselines by 23% in counterfactual scenarios, with no degradation when the retrieved context is fully accurate. Our code is available at https://github.com/lcy80366872/knowledgeable-R1.
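A sketch of the paired-advantage computation as we read the summary. The normalization scheme and the boost transform below are illustrative assumptions, not the released implementation.

```python
# Sketch: local + global advantages over paired rollouts, with an
# asymmetric boost for parametric-knowledge (retrieval-free) rollouts.
import numpy as np

def paired_advantages(r_ctx, r_noctx, boost: float = 1.5):
    """r_ctx / r_noctx: rewards for rollouts sampled with / without retrieval."""
    def norm(x, mu, sd):
        return (np.asarray(x, float) - mu) / (sd + 1e-8)

    local_ctx = norm(r_ctx, np.mean(r_ctx), np.std(r_ctx))        # within regime
    local_noctx = norm(r_noctx, np.mean(r_noctx), np.std(r_noctx))
    both = np.concatenate([r_ctx, r_noctx]).astype(float)
    g_mu, g_sd = both.mean(), both.std()                           # across regimes
    adv_ctx = local_ctx + norm(r_ctx, g_mu, g_sd)
    adv_noctx = local_noctx + norm(r_noctx, g_mu, g_sd)

    # Asymmetric transformation: amplify only the *positive* advantages of
    # retrieval-free rollouts, encouraging exploration toward parametric
    # knowledge when it beats a misleading context (exact form may differ).
    adv_noctx = np.where(adv_noctx > 0, boost * adv_noctx, adv_noctx)
    return adv_ctx, adv_noctx

print(paired_advantages([0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 1.0]))
```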
[166] QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou
Main category: cs.CL
TL;DR: QA-LIGN decomposes monolithic LLM rewards into interpretable principle-specific evaluations using natural language programs, improving alignment effectiveness through transparent feedback.
Details
Motivation: Traditional LLM alignment relies on scalar rewards that obscure which objectives drive training, making it difficult to understand and improve alignment processes.
Method: Uses QA-LIGN to decompose rewards into interpretable principle-specific evaluations through structured natural language programs, employing a draft-critic-revise pipeline with symbolic evaluation against rubrics during GRPO training.
Result: Applied to Llama-3.1-8B-Instruct, reduces attack success rates by up to 68.7% while maintaining 0.67% false refusal rate, achieving Pareto optimal safety-helpfulness performance and outperforming DPO and GRPO with state-of-the-art reward models.
Conclusion: Making reward signals interpretable and modular improves alignment effectiveness, suggesting transparency enhances LLM safety.
Abstract: Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting transparency enhances LLM safety.
[167] Personalized LLM Decoding via Contrasting Personal Preference
Hyungjune Bu, Chanjoo Jung, Minjae Kang, Jaehyung Kim
Main category: cs.CL
TL;DR: CoPe is a decoding-time approach for LLM personalization that uses contrastive decoding with implicit reward signals after parameter-efficient fine-tuning, improving personalization by 10.57% in ROUGE-L without external reward models.
Details
Motivation: Personalization of LLMs is important for real-world applications, but decoding-time algorithms for personalization remain underdeveloped despite their potential.
Method: CoPe uses contrastive decoding to maximize users’ implicit reward signals after parameter-efficient fine-tuning on user-specific data, without requiring external reward models or additional training.
Result: CoPe improves personalization by an average of 10.57% in ROUGE-L across five open-ended personalized text generation tasks.
Conclusion: CoPe demonstrates that effective decoding-time algorithms can significantly enhance LLM personalization without external resources or additional training procedures.
Abstract: As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user’s implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.
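Contrastive decoding of this kind typically scores each candidate token by the gap between the fine-tuned and base models' log-probabilities, which corresponds to the DPO-style implicit reward. A one-step sketch (the alpha-weighted form is a common choice we assume here; CoPe's exact scoring rule is in the paper):

```python
# Sketch: one greedy decoding step that contrasts the PEFT-personalized
# model against the frozen base model.
import torch

@torch.no_grad()
def cope_next_token(logits_personal: torch.Tensor,
                    logits_base: torch.Tensor,
                    alpha: float = 1.0) -> int:
    """Pick the token maximizing log p_personal - alpha * log p_base,
    i.e. the user's implicit reward signal at this step."""
    logp_personal = torch.log_softmax(logits_personal, dim=-1)
    logp_base = torch.log_softmax(logits_base, dim=-1)
    score = logp_personal - alpha * logp_base
    return int(score.argmax(dim=-1))

vocab = 32000
tok = cope_next_token(torch.randn(vocab), torch.randn(vocab))
print("chosen token id:", tok)
```

With alpha = 1 the score reduces to log(p_personal / p_base), so tokens the PEFT model prefers specifically because of the user's data are promoted over generic continuations.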
[168] MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models
Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu
Main category: cs.CL
TL;DR: MUCAR is a new benchmark for evaluating multimodal ambiguity resolution in multilingual and cross-modal scenarios, revealing significant performance gaps between current MLLMs and human-level understanding.
Details
Motivation: Existing multimodal benchmarks overlook linguistic and visual ambiguities, failing to exploit the mutual clarification potential between modalities. Real-world language and visual contexts often contain inherent ambiguities that current models struggle to resolve.
Method: Created MUCAR benchmark with two components: 1) multilingual dataset where ambiguous text is resolved by visual context, and 2) dual-ambiguity dataset pairing ambiguous images with ambiguous text to yield clear interpretations through mutual disambiguation.
Result: Evaluation of 19 state-of-the-art multimodal models (open-source and proprietary) shows substantial performance gaps compared to human-level performance in ambiguity resolution.
Conclusion: Current multimodal models have significant limitations in cross-modal ambiguity comprehension, highlighting the need for more sophisticated methods to advance multimodal reasoning capabilities.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. MLLMs have shown promising capability in aligning visual and textual modalities, allowing them to process image-text pairs with clear and explicit meanings. However, resolving the inherent ambiguities present in real-world language and visual contexts remains a challenge. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes first a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and second a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models, encompassing both open-source and proprietary architectures, reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
[169] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
Main category: cs.CL
TL;DR: KaLM-Embedding-V2 is a series of compact 0.5B parameter embedding models that achieve state-of-the-art performance through superior training techniques and high-quality data curation, outperforming models of comparable size and rivaling much larger models.
Details
Motivation: Current LLM-based text embedding models focus mainly on data scaling or synthesis, with limited exploration of training techniques and data quality, which constrains performance.
Method: Uses 0.5B parameter architecture with mean-pooling and bidirectional attention; implements progressive multi-stage training (pre-training, fine-tuning, contrastive distillation) with focal-style reweighting and hard-negative mining; curates over 20 pre-training and 100 fine-tuning categories with task-specific instructions and multi-class labeling.
Result: Achieves state-of-the-art performance on Massive Text Embedding Benchmark, outperforming comparable-sized models and rivaling models 3-26x larger.
Conclusion: Sets new standard for versatile compact embedding models under 1B parameters, demonstrating that superior training techniques and data quality can overcome size limitations.
Abstract: Recent advancements in Large Language Model (LLM)-based text embedding models primarily focus on data scaling or synthesis, with limited exploration of training techniques and data quality, thereby constraining performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs through superior training techniques and high-quality data. For model architecture, we implement the models on a 0.5B compact size with simple mean-pooling to produce fixed-length embeddings and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with fine-grained soft signals, integrated with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation, to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters. The code, data, and models will be publicly available to facilitate academic research.
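The pooling step described above is simple enough to show directly. A minimal sketch follows; the bidirectional-attention change happens inside the model configuration and is not shown, and the final unit-normalization is our assumption for retrieval use.

```python
# Sketch: masked mean-pooling of transformer hidden states into
# fixed-length sentence embeddings.
import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """last_hidden: (batch, seq, dim); attention_mask: (batch, seq) in {0, 1}.
    Averages over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
    summed = (last_hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-6)
    emb = summed / counts
    return torch.nn.functional.normalize(emb, dim=-1)   # unit norm for cosine retrieval

h = torch.randn(2, 10, 896)                    # e.g. a 0.5B model's hidden size
m = torch.tensor([[1] * 10, [1] * 6 + [0] * 4])
print(mean_pool(h, m).shape)                   # torch.Size([2, 896])
```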
[170] VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation
Hyeongcheol Park, Jiyoung Seo, MinHyuk Jang, Hogun Park, Ha Dam Baek, Gyusam Chang, Hyeonsoo Im, Sangpil Kim
Main category: cs.CL
TL;DR: VAT-KG is the first concept-centric multimodal knowledge graph covering visual, audio, and text modalities, enabling better multimodal reasoning for MLLMs through a novel RAG framework.
Details
Motivation: Existing MMKGs are limited in scope, outdated, and support only narrow modalities like text and visual, restricting their applicability to modern MLLMs that use richer modalities like video and audio.
Method: Proposed VAT-KG construction pipeline with stringent filtering and alignment steps for cross-modal knowledge alignment, enabling automatic generation from any multimodal dataset. Also introduced a novel multimodal RAG framework for concept-level knowledge retrieval.
Result: Experiments on question answering tasks across various modalities demonstrated VAT-KG’s effectiveness in supporting MLLMs and its practical value in unifying multimodal knowledge.
Conclusion: VAT-KG successfully addresses limitations of existing MMKGs by providing comprehensive multimodal coverage and enabling more grounded reasoning through concept-level knowledge retrieval.
Abstract: Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations restrict applicability to multimodal tasks, particularly as recent MLLMs adopt richer modalities like video and audio. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.
[171] WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild
Linhao Zhang, Jian Zhang, Bokai Lei, Chuhan Wu, Aiwei Liu, Wei Jia, Xiao Zhou
Main category: cs.CL
TL;DR: Introduces the first comprehensive benchmark for evaluating end-to-end speech LLMs, addressing gaps in existing text-based evaluations by incorporating speech-specific characteristics and real-world conversational scenarios.
Details
Motivation: Existing benchmarks for speech LLMs often adapt text-based methods, overlooking unique speech challenges like prosody, homophones, stuttering, and user expectations, which hinders optimization for real-world applications.
Method: Systematically curates real-world chat data with diverse speaker attributes and acoustic conditions, augments with speech-specific phenomena, and designs query-aware evaluation using customized checklists and prompts for accurate automatic assessment.
Result: Comprehensive testing reveals significant performance differences across speech scenarios, with query-aware evaluation enabling finer-grained assessment of various speech-specific challenges.
Conclusion: The benchmark provides valuable insights for speech model development and evaluation, addressing critical gaps in current assessment methods for end-to-end speech LLMs.
Abstract: Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities of direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech’s unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we introduce the first comprehensive benchmark designed to systematically evaluate end-to-end speech LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method to use customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.
[172] Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective
Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen
Main category: cs.CL
TL;DR: SFT enables rapid task acquisition but causes catastrophic forgetting, while RFT learns more slowly but maintains prior knowledge in multimodal LLMs.
Details
Motivation: To understand the impact of post-training algorithms (SFT and RFT) on prior knowledge in multimodal large language models, as their effects on knowledge retention remain unclear.
Method: Used jigsaw puzzles as a novel task absent from pretraining corpora, systematically studied SFT and RFT on Qwen2.5-VL models, analyzed learning dynamics through magnitude and direction of training data influence.
Result: RFT mainly reinforces correct samples aligned with base model’s probability landscape, causing weaker interference with prior knowledge. Training on RFT-simulated rollouts allows SFT to preserve knowledge while learning new tasks.
Conclusion: Distribution of training data, not algorithmic differences, plays central role in forgetting; RFT shows potential for stable continual learning in multimodal LLMs.
Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on open-source multimodal model, Qwen2.5-VL series. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but maintains prior knowledge. We study this phenomenon through learning dynamics by examining both the magnitude and direction of how training data influence prior knowledge. Our analysis shows that RFT mainly reinforces correct samples naturally aligned with the base model’s probability landscape, leading to weaker interference with prior knowledge. Moreover, training on RFT-simulated rollouts, which exert a small magnitude of influence and are well aligned in direction to prior knowledge, allows SFT to preserve prior knowledge better while rapidly learning new tasks. These findings suggest that distribution of training data, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT’s potential for stable continual learning in multimodal large language models.
[173] Unveiling the Potential of Diffusion Large Language Model in Controllable Generation
Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Main category: cs.CL
TL;DR: The paper proposes Self-adaptive Schema Scaffolding (S³), a framework that enables diffusion-based LLMs to generate reliable structured outputs like JSON by leveraging their reverse reasoning capabilities and global context awareness.
Details
Motivation: Current autoregressive LLMs exhibit unreliability in generating structured output, while diffusion-based LLMs offer architectural advantages like global information-sharing that could enable better controllable generation.
Method: S³ initiates a schematic template directly in the output context as a starting state for diffusion LLMs, providing a more robust alternative to prompt optimization by leveraging the model’s innate reverse reasoning capability.
Result: Experiments show S³ substantially improves dLLMs’ performance in controllable generation across structure adherence, content fidelity, and faithfulness metrics.
Conclusion: The method establishes new perspectives and practical pathways for deploying language models in controllable generation tasks, unlocking dLLMs’ potential in this domain.
Abstract: Controllable generation is a fundamental task in NLP with many applications, providing a basis for everything from function calling to agentic communication. However, even state-of-the-art autoregressive Large Language Models (LLMs) today exhibit unreliability when required to generate structured output. Inspired by the recent diffusion-based large language models (dLLMs), we observe that the architectural difference, especially the global information-sharing mechanism for language modeling, may be the key to unlocking next-level controllable generation. To explore this possibility, we propose Self-adaptive Schema Scaffolding (S³), a novel framework that enables dLLMs to stably generate reliable structured outputs (e.g., JSON) by utilizing their innate reverse reasoning capability and global context awareness. S³ initiates a schematic template directly in the output context as a starting state for the dLLM, offering a more robust and general method than intricate prompt optimization. Experiments demonstrate that our method substantially unlocks the dLLM’s potential in controllable generation in terms of structure adherence, content fidelity, and faithfulness. These results establish new perspectives and practical pathways for deploying language models in controllable generation tasks.
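A sketch of what scaffolding the output context could look like. This is our construction for illustration; real mask handling is tokenizer- and model-specific.

```python
# Sketch: pre-fill the dLLM's output canvas with a fixed JSON skeleton,
# leaving [MASK] slots only where values go, so denoising starts from a
# structurally valid state.
MASK = "[MASK]"

def schema_scaffold(schema: dict, slot_len: int = 4) -> str:
    """Render a JSON skeleton whose values are runs of mask tokens."""
    parts = []
    for key, typ in schema.items():
        slot = " ".join([MASK] * slot_len)
        quoted = f'"{slot}"' if typ is str else slot
        parts.append(f'  "{key}": {quoted}')
    return "{\n" + ",\n".join(parts) + "\n}"

print(schema_scaffold({"name": str, "age": int, "city": str}))
```

Because the diffusion model only ever fills the masked spans, keys, braces, and commas cannot be corrupted during generation, which is the structural guarantee the paper exploits.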
[174] Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Yuanzhe Hu, Yu Wang, Julian McAuley
Main category: cs.CL
TL;DR: MemoryAgentBench is a new benchmark for evaluating memory capabilities in LLM agents, covering four core competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting.
Details
Motivation: Existing benchmarks focus on reasoning and planning but neglect memory capabilities in LLM agents. Current benchmarks either use limited context or static long-context settings, failing to capture the interactive, multi-turn nature of memory agents that incrementally accumulate information.
Method: The authors transform existing long-context datasets and incorporate newly constructed datasets into a multi-turn format to simulate incremental information processing. The benchmark systematically covers all four memory competencies through careful dataset selection and curation.
Result: Evaluation of diverse memory agents (context-based, RAG systems, agents with external memory, and tool-integrated agents) shows that current methods fail to master all four memory competencies, highlighting limitations in existing approaches.
Conclusion: There is a significant need for further research into comprehensive memory mechanisms for LLM agents, as current methods are insufficient for handling the full spectrum of memory capabilities required for effective agent performance.
Abstract: Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory, encompassing how agents memorize, update, and retrieve long-term information, is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmarks cover all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
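The static-to-incremental transformation can be illustrated in a few lines. The chunking policy and prompt wording below are our assumptions, not the benchmark's exact format.

```python
# Sketch: convert a static long-context QA example into a multi-turn
# session in which the agent receives the document piecewise and is
# only asked the question at the end.
def to_multi_turn(document: str, question: str, chunk_words: int = 512):
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    turns = [{"role": "user", "content": f"Memorize this:\n{c}"} for c in chunks]
    turns.append({"role": "user", "content": question})
    return turns

session = to_multi_turn("some very long source text " * 300, "What was said about X?")
print(len(session), "turns")   # the agent must accumulate state across turns
```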
[175] Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning
Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, Paul Pu Liang
Main category: cs.CL
TL;DR: M2A method improves multilingual reasoning by aligning models across languages and using language-consistency rewards, while GeoFact-X benchmark evaluates reasoning in 5 languages.
Details
Motivation: LLMs perform poorly in low-resource languages, often defaulting to English reasoning, which undermines accuracy and trust in multilingual applications.
Method: Proposed M2A method combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions to train models to reason directly in target languages.
Result: M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks across multiple languages.
Conclusion: Reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization in LLMs.
Abstract: Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual question answering, and code generation, yet their ability to reason on these tasks in different languages remains underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. We propose M2A, a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions, training models to reason directly and accurately in the target language. Furthermore, existing multilingual benchmarks only evaluate on final answers, overlooking whether reasoning occurs in the intended language. To close this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark together with reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. Our results show that M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks, highlighting that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/M2A_GeoFact-X
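A language-consistency reward of the kind M2A combines with task correctness might look like the sketch below. The langdetect dependency, the lambda weighting, and the exact-match task reward are all our assumptions, not the paper's exact recipe.

```python
# Sketch: reward = task correctness + lambda * reasoning-language consistency.
from langdetect import detect  # pip install langdetect

def m2a_reward(reasoning: str, answer: str, gold: str,
               target_lang: str, lam: float = 0.5) -> float:
    task = float(answer.strip() == gold.strip())
    try:
        consistent = float(detect(reasoning) == target_lang)
    except Exception:          # empty or undetectable reasoning gets no bonus
        consistent = 0.0
    return task + lam * consistent

# Reasoning carried out in English about a Swahili question should score
# lower than reasoning carried out in Swahili itself.
print(m2a_reward("Jibu ni dhahiri kwa sababu ...", "Nairobi", "Nairobi", "sw"))
```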
[176] What Factors Affect LLMs and RLLMs in Financial Question Answering?
Peng Wang, Xuesi Hu, Jiageng Wu, Yuntao Zou, Qiancheng Zhang, Dagang Li
Main category: cs.CL
TL;DR: This paper systematically evaluates how prompting methods, agent frameworks, and multilingual alignment affect LLMs and RLLMs in financial question-answering, finding that conventional methods enhance LLMs by simulating Long CoT but have limited impact on RLLMs due to their inherent reasoning capabilities.
Details
Motivation: To systematically explore methods that can fully unlock the performance of LLMs and RLLMs in the financial domain, as few works have comprehensively investigated this area despite growing interest in reasoning-enhanced language models.
Method: Used five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks through systematic evaluation.
Result: (1) Prompting methods and agent frameworks enhance LLM performance by simulating Long CoT; (2) RLLMs’ inherent Long CoT capabilities limit conventional methods’ effectiveness; (3) Multilingual alignment methods mainly improve LLM multilingual performance by extending reasoning length, with minimal benefits for RLLMs.
Conclusion: The study provides important references for enhancing LLM and RLLM performance in financial question-answering and discusses strategies that may inspire future improvements in this domain.
Abstract: Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, there are few works that systematically explore what methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. Additionally, we discuss strategies for enhancing the performance of LLMs and RLLMs in financial question answering, which may serve as an inspiration for future improvements. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.
[177] KV Cache Steering for Controlling Frozen LLMs
Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, James R. Glass, Cees G. M. Snoek, Yuki M. Asano
Main category: cs.CL
TL;DR: Cache steering is a lightweight method that uses one-shot interventions on key-value cache to steer language models toward chain-of-thought reasoning without fine-tuning or prompt changes.
Details
Motivation: To enable implicit steering of language models toward more explicit, multi-step reasoning without the computational overhead of continuous interventions or the need for fine-tuning.
Method: Constructs steering vectors from reasoning traces (from teacher models or human annotations) and applies them directly to the key-value cache in a one-shot manner to shift model behavior.
Result: Improves both qualitative reasoning structure and quantitative task performance on diverse reasoning benchmarks, scales to larger models, and provides gains on challenging datasets like GPQA and MATH.
Conclusion: Cache steering offers substantial advantages over prior activation steering techniques in inference latency, hyperparameter stability, and API integration, while enabling controllable transfer of reasoning styles for practical behavior-level guidance.
Abstract: We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach constructs steering vectors from reasoning traces, obtained either from teacher models (e.g., GPT-4o) or existing human annotations, that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Additional experiments show that the method also scales to larger models and yields further gains on challenging datasets such as GPQA and MATH. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of inference latency, hyperparameter stability, and ease of integration with existing inference APIs. Beyond mere reasoning induction, we show that cache steering enables controllable transfer of reasoning styles (e.g., stepwise, causal, analogical), making it a practical tool for behavior-level guidance of language models.
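A simplified sketch of the one-shot intervention. Tensor shapes follow the common Hugging Face past_key_values layout of (batch, heads, seq, head_dim); the extraction point (last prompt position) and the coefficients ck/cv are illustrative assumptions.

```python
# Sketch: build K/V steering vectors from paired CoT vs. plain traces,
# then apply them to the cache once before decoding.
import torch

def build_steering_vectors(kv_cot, kv_plain):
    """Per layer: mean difference of key/value states at the last prompt
    position, averaged over a batch of paired (CoT, plain) traces."""
    steer = []
    for (kc, vc), (kp, vp) in zip(kv_cot, kv_plain):
        dk = kc[..., -1:, :].mean(0, keepdim=True) - kp[..., -1:, :].mean(0, keepdim=True)
        dv = vc[..., -1:, :].mean(0, keepdim=True) - vp[..., -1:, :].mean(0, keepdim=True)
        steer.append((dk, dv))
    return steer

def apply_cache_steering(past_key_values, steer, ck=1.0, cv=1.0):
    """One-shot: shift the cached K/V of the final prompt token, then
    decode normally; no per-step intervention is needed afterwards."""
    steered = []
    for (k, v), (dk, dv) in zip(past_key_values, steer):
        k, v = k.clone(), v.clone()
        k[..., -1:, :] += ck * dk
        v[..., -1:, :] += cv * dv
        steered.append((k, v))
    return steered
```

Because the shift is applied to the cache exactly once, every later decoding step attends to the steered keys and values at no extra per-step cost, which is where the latency advantage over continuous activation steering comes from.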
[178] The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Dahua Lin, Kai Chen
Main category: cs.CL
TL;DR: TAIL improves LLM length generalization by synthesizing CoT data that imitates Turing Machine execution, using atomic states and explicit memory fetch to handle longer sequences.
Details
Motivation: To address the core challenge of length generalization in Transformers by focusing on computable reasoning problems that algorithms can solve, rather than task-specific approaches.
Method: TAIL synthesizes chain-of-thought data that imitate Turing Machine execution through computer programs, using linear expansion into atomic states and explicit memory fetch mechanisms.
Result: TAIL significantly improves length generalization and performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1.
Conclusion: Turing Machine concepts (not thinking styles) are key for length generalization, enabling read-write behaviors in attention layers, providing a promising direction for LLM reasoning from synthetic data.
Abstract: Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and thus can be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL synthesizes chain-of-thought (CoT) data that imitate the execution process of a Turing Machine by computer programs, which linearly expand the reasoning steps into atomic states to alleviate shortcut learning, and adopts an explicit memory-fetch mechanism to reduce the difficulty of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, rather than the thinking styles, are indispensable for TAIL’s length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in its attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
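In the spirit of TAIL, a trace generator for a toy computable task, unary addition, might emit one atomic state per line with explicit reads and writes. This is our toy construction, not the paper's synthesis pipeline.

```python
# Sketch: emit a Turing-Machine-style CoT trace for unary addition,
# one atomic state per step, with explicit memory access instead of
# leaps of reasoning.
def tm_trace_add(a: int, b: int):
    tape = ["1"] * a + ["+"] + ["1"] * b
    steps, head = [], 0
    while head < len(tape):
        sym = tape[head]
        steps.append(f"READ pos={head} sym={sym}")
        if sym == "+":                      # overwrite '+' to merge the blocks
            tape[head] = "1"
            steps.append(f"WRITE pos={head} sym=1")
        head += 1
        steps.append(f"MOVE right head={head}")
    tape.pop()                              # drop one '1' to correct the count
    steps.append(f"HALT result={len(tape)}")
    return steps

for line in tm_trace_add(2, 3):   # ends with HALT result=5
    print(line)
```

Because each step touches one tape cell and moves the head one position, a trace for a longer input is just more of the same atomic states, which is exactly the property that lets the trained model generalize past its training lengths.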
[179] LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues
Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan
Main category: cs.CL
TL;DR: LoopServe is an adaptive dual-phase inference acceleration framework for LLMs in multi-turn dialogues that dynamically sparsifies attention during prefilling and compresses KV cache during decoding, achieving superior performance and speed.
Details
Motivation: Existing LLMs face computational and memory challenges with long conversation histories, and current acceleration methods use fixed heuristics that don’t adapt well to dynamic multi-turn conversation patterns, leading to degraded response quality.
Method: Two-phase approach: 1) Online sparsification during prefilling by dynamically selecting important attention matrix parts, 2) Progressive KV compression during decoding by adaptively maintaining relevant cache based on recent output tokens. Also introduces new benchmark with 11 multi-turn datasets.
Result: Extensive experiments show LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across wide range of long-context dialogue tasks.
Conclusion: LoopServe provides an effective adaptive acceleration framework that outperforms existing methods in multi-turn dialogue scenarios while maintaining response quality.
Abstract: Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. As a result, these models cannot accurately identify and prioritize the most relevant context, leading to degraded response quality. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two main innovations. First, it performs online sparsification during the prefilling phase by dynamically selecting the most important parts of the attention matrix for each new input. Second, it uses progressive key value compression during decoding by adaptively maintaining a relevant and efficient cache based on the most recently generated output tokens. We also propose a new benchmark with eleven multi-turn datasets that reflect realistic query positions and conversational dependencies. Extensive experiments demonstrate that LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across a wide range of long-context dialogue tasks.
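A toy version of the decoding-phase compression follows. The scoring rule (attention mass from recent output tokens) and the fixed budget are our guesses at the general shape; LoopServe's adaptive policy is more involved.

```python
# Sketch: keep only the cached positions most attended by the most
# recently generated tokens.
import torch

def compress_kv(keys, values, attn, budget: int, recent: int = 8):
    """keys/values: (heads, seq, dim); attn: (heads, q_len, seq) attention
    weights from the last decoding steps. Returns pruned keys/values."""
    # Importance of each cached position = attention mass it received from
    # the last `recent` generated tokens, summed over heads.
    importance = attn[:, -recent:, :].sum(dim=(0, 1))           # (seq,)
    keep = torch.topk(importance, k=min(budget, importance.numel())).indices
    keep = keep.sort().values                                   # preserve order
    return keys[:, keep, :], values[:, keep, :], keep

h, s, d = 8, 1024, 64
k, v = torch.randn(h, s, d), torch.randn(h, s, d)
a = torch.softmax(torch.randn(h, 16, s), dim=-1)
k2, v2, kept = compress_kv(k, v, a, budget=256)
print(k2.shape)   # torch.Size([8, 256, 64])
```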
[180] Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs
Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao
Main category: cs.CL
TL;DR: WINO is a training-free decoding algorithm that improves the quality-speed trade-off in Diffusion Large Language Models by enabling revokable decoding through parallel draft-and-verify mechanism.
Details
Motivation: Existing DLLMs suffer from severe quality-speed trade-off where faster parallel decoding leads to significant performance degradation due to irreversibility of standard decoding and early error accumulation.Method: WINO employs parallel draft-and-verify mechanism that aggressively drafts multiple tokens while simultaneously using bidirectional context to verify and re-mask suspicious tokens for refinement.
Result: WINO accelerates inference by 6× while improving accuracy by 2.58% on GSM8K math benchmark, and achieves 10× speedup with higher performance on Flickr30K captioning.
Conclusion: WINO decisively improves the quality-speed trade-off in DLLMs and provides superior performance compared to existing methods.
Abstract: Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized toward a wrong decoding direction as early errors accumulate in the context. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model’s bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority of WINO and provide an in-depth understanding of it.
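A toy sketch of the wide-in, narrow-out loop: aggressively commit many high-confidence drafts, then let a verification pass with the now-richer bidirectional context revoke the doubtful ones. Thresholds and the stand-in confidence scores are invented; the real algorithm operates on a DLLM's token distributions.

```python
# Toy wide-in, narrow-out step (illustrative): draft masked positions whose
# confidence clears a threshold, then re-mask drafted positions that the
# verify pass no longer trusts.
import numpy as np

MASK = -1

def wino_step(tokens, draft_conf, verify_conf, draft_th=0.6, verify_th=0.5):
    tokens = tokens.copy()
    drafted = [i for i, t in enumerate(tokens) if t == MASK and draft_conf[i] >= draft_th]
    for i in drafted:
        tokens[i] = i                 # stand-in for the argmax token id at position i
    for i in drafted:                 # verify with richer bidirectional context
        if verify_conf[i] < verify_th:
            tokens[i] = MASK          # revoke: re-mask the suspicious token
    return tokens

rng = np.random.default_rng(1)
tokens = np.full(8, MASK)
out = wino_step(tokens, rng.uniform(size=8), rng.uniform(size=8))
print(out)                            # mix of committed positions and re-masked ones
```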
[181] Geometric-Mean Policy Optimization
Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
Main category: cs.CL
TL;DR: GMPO improves GRPO by using geometric mean instead of arithmetic mean for token-level rewards, reducing sensitivity to outliers and stabilizing policy updates.
Details
Motivation: GRPO suffers from unstable policy updates due to outlier token rewards causing extreme importance sampling ratios during training.Method: Replace GRPO’s arithmetic mean with geometric mean of token-level rewards, which is less sensitive to outliers and maintains more stable importance sampling ratios.
Result: GMPO-7B improves average Pass@1 by up to 4.1% over GRPO on multiple mathematical reasoning benchmarks, outperforming state-of-the-art approaches.
Conclusion: GMPO provides a plug-and-play improvement to GRPO that enhances stability and performance through geometric mean optimization of token rewards.
Abstract: Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), with the aim of improving the stability of GRPO by suppressing token reward outliers. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. GMPO is plug-and-play: simply replace GRPO’s arithmetic mean with the geometric mean of token-level rewards, as the latter is inherently less sensitive to outliers. GMPO is also theoretically plausible: analysis reveals that both GMPO and GRPO are weighted forms of the policy gradient, while the former enjoys more stable weights, which consequently benefits policy optimization and performance. Experiments on multiple mathematical reasoning benchmarks show that GMPO-7B improves the average Pass@1 of GRPO by up to 4.1%, outperforming many state-of-the-art approaches. Code is available at https://github.com/callsys/GMPO.
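The core substitution is easy to check numerically: computed stably in log space, the geometric mean is far less sensitive to one outlier token reward than the arithmetic mean. (The full GMPO objective with importance-sampling ratios is in the paper; this only illustrates the mean swap.)

```python
# Arithmetic vs. geometric mean of token-level rewards with one outlier.
import math

rewards = [1.0, 1.1, 0.9, 1.0, 25.0]   # one outlier importance-weighted reward
arith = sum(rewards) / len(rewards)
geo = math.exp(sum(math.log(r) for r in rewards) / len(rewards))  # log-space mean
print(f"arithmetic mean = {arith:.2f}")  # 5.80, dominated by the outlier
print(f"geometric mean  = {geo:.2f}")    # ~1.90, far less sensitive
```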
[182] Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu
Main category: cs.CL
TL;DR: LLM performance varies significantly with writing style variations, even when semantic content remains identical, revealing brittleness in current benchmarks that lack style diversity.
Details
Motivation: Current LLM benchmarks lack writing style diversity and don't capture human communication variety, potentially leading to brittle performance on non-standard inputs.Method: Used persona-based LLM prompting to rewrite evaluation prompts with diverse writing styles while keeping semantic content identical.
Result: Writing style variations significantly impact LLM performance estimates, with certain styles consistently triggering either low or high performance across models and tasks.
Conclusion: Persona-based prompting offers a scalable method to augment benchmarks and improve external validity of LLM performance assessments across linguistic variations.
Abstract: Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with “non-standard” input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.
[183] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
Keer Lu, Chong Chen, Bin Cui, Huang Leng, Wentao Zhang
Main category: cs.CL
TL;DR: AdaPlan is a new agent paradigm that combines global planning with execution, and PilotRL is a reinforcement learning framework that trains LLM agents to follow explicit guidance, optimize plan quality, and coordinate planning with execution for better long-horizon decision-making.
Details
Motivation: Current LLM agents like ReAct have limitations in complex tasks requiring long-term strategic planning, suffer from poor planner-executor coordination, and rely on supervised fine-tuning that restricts generalization to novel problems.Method: Proposed AdaPlan paradigm for global planning-guided agents, and PilotRL framework using progressive reinforcement learning to train agents in three stages: following explicit guidance, optimizing plan quality, and joint optimization of planning-execution coordination.
Result: PilotRL achieves state-of-the-art performance with LLaMA3.1-8B-Instruct + PilotRL surpassing GPT-4o by 3.60% and showing 55.78% improvement over GPT-4o-mini at comparable parameter scale.
Conclusion: The adaptive global plan-based paradigm and progressive reinforcement learning framework effectively address limitations of current LLM agents, enabling better long-horizon decision-making and generalization capabilities.
Abstract: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm, AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model’s ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model’s planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
[184] DAMR: Efficient and Adaptive Context-Aware Knowledge Graph Question Answering with LLM-Guided MCTS
Yingxu Wang, Shiqi Fan, Mengzhu Wang, Siyang Gao, Chao Wang, Nan Yin
Main category: cs.CL
TL;DR: DAMR is a novel KGQA framework that combines LLM-guided MCTS with adaptive path evaluation, using an LLM-based planner to reduce search space and a Transformer-based scorer for context-aware plausibility estimation, achieving SOTA performance.
Details
Motivation: Existing KGQA methods either lack adaptability due to static path extraction or suffer from high computational costs and limited accuracy from repeated LLM calls and fixed scoring functions.Method: Integrates LLM-guided MCTS with adaptive path evaluation, using LLM-based planner for relation selection, Transformer-based scorer for context-aware estimation, and dynamic pseudo-path refinement for continuous adaptation.
Result: Extensive experiments on multiple KGQA benchmarks show DAMR significantly outperforms SOTA methods.
Conclusion: DAMR enables efficient and context-aware KGQA through the integration of LLM-guided MCTS with adaptive evaluation mechanisms, addressing limitations of existing approaches.
Abstract: Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Existing methods primarily follow either the retrieve-then-reason paradigm, which relies on Graph Neural Networks or heuristic rules to extract static candidate paths, or dynamic path generation strategies that employ LLMs with prompting to jointly perform retrieval and reasoning. However, the former lacks adaptability due to static path extraction and the absence of contextual refinement, while the latter suffers from high computational costs and limited evaluation accuracy because of their dependence on fixed scoring functions and repeated LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates LLM-guided Monte Carlo Tree Search (MCTS) with adaptive path evaluation to enable efficient and context-aware KGQA. DAMR leverages MCTS as a backbone, where an LLM-based planner selects the top-$k$ semantically relevant relations at each expansion step to effectively reduce the search space. To enhance evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, thereby capturing fine-grained semantic shifts during multi-hop reasoning. Furthermore, to mitigate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, enabling the scorer to continually adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms SOTA methods.
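A compact sketch of DAMR's expansion rule: the planner restricts each MCTS expansion to the top-k relations, and the scorer supplies plausibility values for the selection policy. `planner_topk` and `scorer` are stubs standing in for the LLM planner and the Transformer scorer; the UCB constant is a conventional default, not the paper's.

```python
# Hedged sketch of top-k, planner-guided MCTS expansion (stubs, not DAMR's code).
import math, random

def planner_topk(question, path, relations, k=3):
    """Stub for the LLM planner: pick k semantically relevant relations."""
    return random.sample(relations, k)

def scorer(question, path):
    """Stub for the Transformer-based plausibility scorer."""
    return random.random()

question = "Where was the director of Film X born?"
relations = [f"rel_{i}" for i in range(10)]
children = []
for rel in planner_topk(question, [], relations):   # expand only top-k relations
    path = [rel]
    children.append({"path": path, "value": scorer(question, path), "visits": 1})

total = sum(c["visits"] for c in children)
def ucb(c, const=1.4):
    return c["value"] / c["visits"] + const * math.sqrt(math.log(total) / c["visits"])

best = max(children, key=ucb)
print("expand next:", best["path"])
```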
[185] MLP Memory: A Retriever-Pretrained Memory for Large Language Models
Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin
Main category: cs.CL
TL;DR: MLP Memory is a parametric module that learns retrieval patterns from kNN retrievers, achieving better performance than RAG while being faster, without the drawbacks of fine-tuning.
Details
Motivation: Address the trade-off between RAG's flexible knowledge access (but high latency) and parametric fine-tuning's risks (catastrophic forgetting, degraded capabilities).Method: Pretrain an MLP to imitate kNN retriever behavior on pretraining data, then integrate it with Transformer decoders through probability interpolation.
Result: 12.3% relative improvement on QA benchmarks, 5.2 points gain on general NLP tasks, 10 points reduction in hallucinations, and 2.5x faster inference than RAG.
Conclusion: Learning retrieval patterns parametrically bridges efficient inference and effective knowledge access, offering a practical alternative to RAG and fine-tuning.
Abstract: Modern approaches to enhancing Large Language Models’ factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a $k$NN retriever’s behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability interpolation, achieving 12.3% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval. Moreover, MLP Memory delivers 2.5$\times$ faster inference than RAG with superior accuracy. Our findings show that learning retrieval patterns parametrically bridges the gap between efficient inference and effective knowledge access, offering a practical alternative to both RAG and fine-tuning approaches.
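The integration step described above is plain probability interpolation, the same functional form kNN-LM uses. A minimal sketch, with an illustrative mixing weight:

```python
# Probability interpolation between the decoder and the pretrained MLP memory.
import numpy as np

def interpolate(p_lm: np.ndarray, p_mem: np.ndarray, lam: float = 0.25) -> np.ndarray:
    """Mix decoder and memory next-token distributions; lam is tuned in practice."""
    p = (1 - lam) * p_lm + lam * p_mem
    return p / p.sum()               # guard against numerical drift

p_lm = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # decoder's next-token distribution
p_mem = np.array([0.10, 0.60, 0.10, 0.10, 0.10])  # memory's next-token distribution
print(interpolate(p_lm, p_mem))
```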
[186] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models
Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam
Main category: cs.CL
TL;DR: This review analyzes how LLM-generated content is evaluated for factual accuracy, highlighting challenges like hallucinations and dataset limitations, and proposes frameworks for robust fact-checking using advanced prompting, fine-tuning, and RAG methods.
Details
Motivation: LLMs are trained on vast internet corpora containing inaccurate content, leading to potential misinformation generation, making robust fact-checking essential for trustworthy AI systems.Method: Systematic review of literature from 2020-2025, analyzing evaluation methods and mitigation techniques including instruction tuning, multi-agent reasoning, and RAG frameworks for external knowledge access.
Result: Key findings show limitations of current metrics, importance of validated external evidence, and improved factual consistency through domain-specific customization.
Conclusion: The review emphasizes the need for more accurate, understandable, and context-aware fact-checking frameworks to advance research toward more trustworthy LLM models.
Abstract: Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. Instruction tuning, multi-agent reasoning, and RAG frameworks for external knowledge access are also reviewed. The key findings demonstrate the limitations of current metrics, the importance of validated external evidence, and the improvement of factual consistency through domain-specific customization. The review underscores the importance of building more accurate, understandable, and context-aware fact-checking systems. These insights contribute to the advancement of research toward more trustworthy models.
[187] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Hongze Tan, Jianfei Pan, Jinghao Lin, Tao Chen, Zhihang Zheng, Zhihao Tang, Haihua Yang
Main category: cs.CL
TL;DR: Proposes Dynamic Entropy Weighting for fine-grained RL in LLM reasoning, using policy entropy as reward signal to assign credit per token rather than uniform sequence rewards.
Details
Motivation: Conventional RL algorithms use coarse-grained credit assignment with uniform rewards for all tokens, which is problematic for long-chain reasoning tasks where individual token contributions vary.Method: Introduces Dynamic Entropy Weighting with two algorithms: Group Token Policy Optimization (GTPO) assigns entropy-weighted rewards to each token, and Sequence-Level GRPO (GRPO-S). Uses policy entropy as heuristic for cognitive effort at pivotal reasoning junctures.
Result: Experimental results across challenging reasoning benchmarks show significant performance improvements over DAPO baseline, validating the effectiveness of entropy-weighting mechanism.
Conclusion: Dynamic Entropy Weighting enables true per-token credit assignment in RL for LLM reasoning, with policy entropy serving as a powerful learning signal that drives performance improvements in complex reasoning tasks.
Abstract: Reinforcement learning (RL) is a pivotal task for enhancing Large Language Model (LLM) reasoning. Conventional algorithms, however, typically adhere to a coarse-grained credit assignment paradigm, applying a uniform reward to all tokens in a sequence, a critical flaw in long-chain reasoning tasks. In this paper, we address this challenge and propose Dynamic Entropy Weighting, a novel mechanism that facilitates fine-grained rewards through two new algorithms: Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token, and the analogous algorithm Sequence-Level GRPO (GRPO-S). Our approach is founded on the hypothesis that high policy entropy within a reasoning path is a powerful heuristic for cognitive effort at pivotal junctures, which can be repurposed into a learning signal. By repurposing policy entropy for reward shaping, we achieve true per-token credit assignment. Experimental results across challenging reasoning benchmarks validate the superiority of our approach, showing our methods significantly outperform a strong DAPO baseline and confirming our entropy-weighting mechanism as the key driver of this performance boost.
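A hedged sketch of entropy-weighted credit assignment: one sequence-level reward is redistributed over tokens in proportion to the policy's entropy at each step, so high-entropy "pivotal" steps receive more credit. The normalization below is an assumption, not the paper's exact scheme.

```python
# Entropy-weighted token rewards (illustrative normalization).
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def token_rewards(seq_reward, stepwise_probs):
    ents = np.array([entropy(p) for p in stepwise_probs])
    weights = ents / ents.sum()              # high-entropy steps get more credit
    return seq_reward * weights * len(ents)  # mean token reward equals seq_reward

probs = [np.array([0.90, 0.05, 0.05]),       # confident step: low entropy
         np.array([0.40, 0.35, 0.25]),       # pivotal step: high entropy
         np.array([0.80, 0.10, 0.10])]
print(token_rewards(1.0, probs))
```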
[188] ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs
Dongxu Zhang, Ning Yang, Jihua Zhu, Jinnan Yang, Miao Xin, Baoliang Tian
Main category: cs.CL
TL;DR: This paper challenges the ‘cascading failure’ hypothesis in Chain-of-Thought reasoning by discovering ‘Late-Stage Fragility’ - errors in later reasoning stages are more harmful than early errors. It introduces ASCoT, an adaptive self-correction method that prioritizes late-stage error correction.
Details
Motivation: To address the reliability challenges in Chain-of-Thought reasoning and challenge the widely held assumption that early errors are most detrimental, by systematically investigating when errors actually cause the most damage in reasoning chains.Method: Introduced Adaptive Self-Correction Chain-of-Thought (ASCoT) with two main components: Adaptive Verification Manager (AVM) that uses Positional Impact Score function I(k) to prioritize late-stage steps, and Multi-Perspective Self-Correction Engine (MSCE) that applies dual-path correction to identified failure parts.
Result: Extensive experiments on GSM8K and MATH benchmarks show ASCoT achieves outstanding accuracy, outperforming strong baselines including standard CoT, demonstrating the effectiveness of adaptive, vulnerability-aware correction.
Conclusion: The work emphasizes the importance of diagnosing specific failure modes in LLM reasoning and advocates for shifting from uniform verification to adaptive, vulnerability-aware correction mechanisms, particularly addressing the newly discovered Late-Stage Fragility phenomenon.
Abstract: Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs), yet the reliability of these reasoning chains remains a critical challenge. A widely held “cascading failure” hypothesis suggests that errors are most detrimental when they occur early in the reasoning process. This paper challenges that assumption through systematic error-injection experiments, revealing a counter-intuitive phenomenon we term “Late-Stage Fragility”: errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. To address this specific vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive Verification Manager (AVM) operates first, followed by the Multi-Perspective Self-Correction Engine (MSCE). The AVM leverages a Positional Impact Score function I(k) that assigns different weights based on the position within the reasoning chains, addressing the Late-Stage Fragility issue by identifying and prioritizing high-risk, late-stage steps. Once these critical steps are identified, the MSCE applies robust, dual-path correction specifically to the failure parts. Extensive experiments on benchmarks such as GSM8K and MATH demonstrate that ASCoT achieves outstanding accuracy, outperforming strong baselines, including standard CoT. Our work underscores the importance of diagnosing specific failure modes in LLM reasoning and advocates for a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms.
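An illustrative positional weighting in the spirit of I(k): later steps of an n-step chain receive higher verification priority, reflecting Late-Stage Fragility. The power-law form and the verification budget are assumptions; the paper defines its own score function.

```python
# Illustrative positional impact weighting: verify the highest-I(k) steps first.
def positional_impact(k: int, n: int, gamma: float = 2.0) -> float:
    return (k / n) ** gamma          # grows toward 1.0 for the final steps

n, budget = 6, 2
scores = {k: round(positional_impact(k, n), 3) for k in range(1, n + 1)}
print(scores)
to_verify = sorted(scores, key=scores.get, reverse=True)[:budget]
print("verify first:", to_verify)    # the last two steps of the chain
```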
[189] Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI’s Latest Open Source Models
Ziqian Bi, Keyu Chen, Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen, Junming Huang, Jibin Guan, Junfeng Hao, Junhao Song
Main category: cs.CL
TL;DR: OpenAI’s GPT-OSS models (20B and 120B parameters) were evaluated against contemporary open source LLMs. The smaller 20B model outperformed the larger 120B model on several benchmarks despite lower resource requirements, suggesting diminishing returns from scaling sparse architectures.
Details
Motivation: To empirically evaluate OpenAI's first open weight LLMs since GPT-2 and assess how scaling in sparse mixture-of-experts architectures affects performance relative to resource requirements.Method: Evaluated GPT-OSS models against six contemporary open source LLMs (14.7B-235B parameters) across ten benchmarks covering general knowledge, math reasoning, code generation, multilingual understanding, and conversational ability. Used standardized inference settings with statistical validation via McNemar’s test and effect size analysis.
Result: GPT-OSS-20B consistently outperformed GPT-OSS-120B on benchmarks like HumanEval and MMLU despite requiring substantially less memory and energy. Both models showed mid-tier overall performance with strengths in code generation and weaknesses in multilingual tasks.
Conclusion: Scaling in sparse architectures may not yield proportional performance gains, highlighting the need for better optimization strategies and more efficient model selection for open source deployments.
Abstract: In August 2025, OpenAI released GPT-OSS models, its first open weight large language models since GPT-2 in 2019, comprising two mixture of experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar’s test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments. More details and evaluation scripts are available at the \href{https://ai-agent-lab.github.io/gpt-oss}{Project Webpage}.
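McNemar's test itself is standard: it compares two models on the same items using only the discordant pairs (items exactly one model gets right). A self-contained exact version, with made-up counts for illustration:

```python
# Exact two-sided McNemar test from discordant-pair counts.
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """b: items model A got right and model B got wrong; c: the reverse."""
    n, k = b + c, min(b, c)
    # two-sided p-value: probability of a split at least this extreme under p = 0.5
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

print(f"p = {mcnemar_exact_p(b=30, c=14):.4f}")   # 30 vs 14 discordant items
```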
[190] Conflict-Aware Soft Prompting for Retrieval-Augmented Generation
Eunseong Choi, June Park, Hyeri Lee, Jongwuk Lee
Main category: cs.CL
TL;DR: CARE addresses context-memory conflicts in RAG systems by using a context assessor with memory token embeddings and soft prompting to identify unreliable external context and guide reasoning toward more reliable knowledge sources.
Details
Motivation: RAG systems often fail when retrieved external context contradicts the LLM's correct parametric knowledge, creating context-memory conflicts that degrade performance.Method: CARE uses a context assessor that encodes compact memory token embeddings and employs grounded/adversarial soft prompting to detect unreliable context and provide guidance signals for reasoning.
Result: CARE achieves an average 5.0% performance gain on QA and fact-checking benchmarks by effectively mitigating context-memory conflicts.
Conclusion: CARE establishes a promising direction for developing trustworthy and adaptive RAG systems that can handle conflicts between external context and parametric knowledge.
Abstract: Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM’s parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.
[191] Influence-driven Curriculum Learning for Pre-training on Limited Data
Loris Schoenegger, Lukas Thoma, Terra Blevins, Benjamin Roth
Main category: cs.CL
TL;DR: Curriculum learning using model-centric difficulty metrics (training data influence) outperforms random training by over 10 percentage points in language model pre-training.
Details
Motivation: Traditional curriculum learning with human-centered difficulty metrics has shown limited success for pre-training language models, suggesting the need for better difficulty measures that align with actual model training dynamics.Method: Used training data influence scores to sort training examples by difficulty, creating a curriculum that presents data from simpler to more complex examples based on how much each example affects the model’s output.
Result: Models trained with this curriculum learning approach outperformed randomly trained models by over 10 percentage points in benchmark evaluations.
Conclusion: Curriculum learning is effective for language model pre-training when using model-centric difficulty metrics like training data influence, rather than conventional human-centered difficulty measures.
Abstract: Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their \textit{training data influence}, a score which estimates the effect of individual training examples on the model’s output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
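Once influence scores are available, the curriculum construction reduces to a sort; ascending order is one plausible reading of "simpler first". The scores below are invented, and estimating them is the substantive part of the method.

```python
# Curriculum construction as a sort over precomputed influence scores.
examples = [
    {"text": "the cat sat on the mat", "influence": 0.02},
    {"text": "colorless green ideas sleep furiously", "influence": 0.41},
    {"text": "dogs bark", "influence": 0.01},
]
curriculum = sorted(examples, key=lambda ex: ex["influence"])  # low influence first
for step, ex in enumerate(curriculum):
    print(step, ex["text"])          # feed batches to the trainer in this order
```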
[192] Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling
Yue Zhao, Xiaoyu Wang, Dan Wang, Zhonglin Jiang, Qingqing Gu, Teng Chen, Ningyuan Xi, Jinxian Qu, Yong Chen, Luo Ji
Main category: cs.CL
TL;DR: DreamCUB is a dialogue world model that predicts user emotions, sentiments, intentions, and future utterances using POMDP and information bottleneck, achieving SOTA performance in emotion classification and sentiment identification while improving dialogue quality through model-based reinforcement learning.
Details
Motivation: World models are widely used in robotics and gaming but have limited applications in natural language tasks. The paper aims to extend world models to dialogue systems to better understand and predict user states.Method: Constructed a dialogue world model using POMDP to model emotion, sentiment, and intention as user beliefs, solved by maximizing information bottleneck. Applied model-based reinforcement learning framework with joint training of policy, critic, and dialogue world model.
Result: Achieved state-of-the-art performance on emotion classification and sentiment identification. Dialogue quality was enhanced through joint training. The approach maintains good exploration-exploitation balance and transfers well to out-of-domain scenarios like empathetic dialogues.
Conclusion: The dialogue world model framework DreamCUB successfully extends world modeling to natural language tasks, demonstrating strong performance in user state prediction and dialogue quality improvement with good generalization capabilities.
Abstract: World models have been widely utilized in robotics, gaming, and autonomous driving. However, their applications to natural language tasks are relatively limited. In this paper, we construct a dialogue world model, which predicts the user’s emotion, sentiment, and intention, as well as future utterances. By defining a POMDP, we argue that emotion, sentiment, and intention can be modeled as the user belief and solved by maximizing the information bottleneck. Building on this user belief modeling, we apply the model-based reinforcement learning framework to the dialogue system and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model achieves state-of-the-art performance on emotion classification and sentiment identification, while dialogue quality is also enhanced by joint training of the policy, critic, and dialogue world model. Further analysis shows that this approach maintains a reasonable exploration-exploitation balance and also transfers well to out-of-domain scenarios such as empathetic dialogues.
[193] JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer
Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Zhenxin Huang, Shengjie Ma, Yinghan Shen, Jian Guo, Yuanzhuo Wang
Main category: cs.CL
TL;DR: Agent-as-Interviewer is a dynamic evaluation paradigm using LLM agents for multi-turn interactions to better assess knowledge boundaries and capabilities of target LLMs, addressing limitations of current evaluation methods.
Details
Motivation: Current LLM evaluation methods suffer from overestimated/biased evaluations, mismatched question difficulty, and incomplete knowledge boundary assessment, hindering effective application and optimization.Method: Uses LLM agents to conduct multi-turn interactions, invoke knowledge tools for deeper knowledge in question generation, and plan query strategies for difficulty adjustment. Implemented as JudgeAgent framework with knowledge-driven synthesis and difficulty scoring.
Result: Extensive experiments validate JudgeAgent’s effectiveness in providing valuable suggestions and accurately identifying knowledge/capability boundaries of target models.
Conclusion: Agent-as-Interviewer paradigm enables more comprehensive evaluation of LLMs’ knowledge boundaries and capabilities through dynamic multi-turn interactions.
Abstract: Current evaluation paradigms for large language models (LLMs) suffer from overestimated or biased evaluations and mismatched question difficulty, leading to incomplete evaluations of knowledge and capability boundaries, which hinder their effective application and optimization. To address these challenges, we propose Agent-as-Interviewer, a dynamic evaluation paradigm that employs LLM agents to conduct multi-turn interactions for evaluation. Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to invoke knowledge tools for wider and deeper knowledge in dynamic multi-turn question generation, achieving more comprehensive evaluations of LLMs’ knowledge boundaries. It also leverages agents to plan query strategies to adjust question difficulty levels, enhancing the difficulty control to match the actual capabilities of target LLMs. Based on this paradigm, we develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent’s tool and uses difficulty scoring as strategy guidance, thereby providing valuable suggestions that help target models optimize themselves. Extensive experiments validate the effectiveness of JudgeAgent’s suggestions, demonstrating that Agent-as-Interviewer can accurately identify the knowledge and capability boundaries of target models. The source code is available at https://github.com/DataArcTech/JudgeAgent.
[194] CMRAG: Co-modality-based visual document retrieval and question answering
Wang Chen, Wenhan Yu, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, Jizhou Huang
Main category: cs.CL
TL;DR: CMRAG framework integrates text and images for multimodal document QA, using unified encoding and retrieval methods to outperform single-modality RAG approaches.
Details
Motivation: Existing RAG methods struggle with multimodal documents - text-only methods miss visual content, while vision-only approaches ignore semantic text advantages, leading to suboptimal performance.Method: Proposes Co-Modality-based RAG (CMRAG) with Unified Encoding Model (UEM) for shared embedding space via triplet training, and Unified Co-Modality-informed Retrieval (UCMR) for cross-modal similarity fusion.
Result: CMRAG consistently outperforms single-modality RAG methods across multiple visual document question-answering benchmarks.
Conclusion: Unified integration of co-modality information effectively improves performance in complex visual document QA systems.
Abstract: Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. However, existing methods have limitations when dealing with multimodal documents: one category of methods relies on layout analysis and text extraction, which can only utilize explicit text information and struggle to capture images or unstructured content; the other category treats document segmentation as visual input and directly passes it to visual language models (VLMs) for processing, yet it ignores the semantic advantages of text, leading to suboptimal retrieval and generation results. To address these research gaps, we propose the Co-Modality-based RAG (CMRAG) framework, which can simultaneously leverage texts and images for more accurate retrieval and generation. Our framework includes two key components: (1) a Unified Encoding Model (UEM) that projects queries, parsed text, and images into a shared embedding space via triplet-based training, and (2) a Unified Co-Modality-informed Retrieval (UCMR) method that statistically normalizes similarity scores to effectively fuse cross-modal signals. To support research in this direction, we further construct and release a large-scale triplet dataset of (query, text, image) examples. Experiments demonstrate that our proposed framework consistently outperforms single-modality–based RAG in multiple visual document question-answering (VDQA) benchmarks. The findings of this paper show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex VDQA systems.
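One plausible reading of UCMR's statistical normalization: z-score each modality's similarity scores so text and image similarities live on a common scale, then sum. The exact statistic in the paper may differ.

```python
# Cross-modal score fusion via per-modality z-score normalization (assumed form).
import numpy as np

def fuse(text_scores: np.ndarray, image_scores: np.ndarray) -> np.ndarray:
    z = lambda s: (s - s.mean()) / (s.std() + 1e-8)
    return z(text_scores) + z(image_scores)

text_sim = np.array([0.71, 0.55, 0.64])   # query vs. parsed text, per document
image_sim = np.array([0.32, 0.45, 0.30])  # query vs. page image, per document
print("ranked docs:", np.argsort(-fuse(text_sim, image_sim)))
```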
[195] Chain or tree? Re-evaluating complex reasoning from the perspective of a matrix of thought
Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
Main category: cs.CL
TL;DR: Matrix of Thought (MoT) is a novel reasoning structure for LLMs that addresses limitations in Chain of Thought and Tree of Thought approaches by enabling multi-dimensional thinking through column-cell communication and fact-correction mechanisms.
Details
Motivation: Existing thought structures like CoT and ToT suffer from redundancy and singularity issues, while RAG methods still face problems with fragmented and erroneous verification knowledge that misleads LLM reasoning.Method: MoT explores problems horizontally and vertically using column-cell communication, reduces redundancy in thought nodes, and employs fact-correction with knowledge graph triples and original text to construct knowledge units.
Result: Extensive experiments in 24-point game, question answering, and proposition writing show MoT outperforms state-of-the-art methods with reasoning time only 14.4% of baseline.
Conclusion: MoT provides an efficient and accurate reasoning framework that enhances LLM capabilities through multi-strategy deep thinking and knowledge correction mechanisms.
Abstract: Large Language Models (LLMs) face significant accuracy degradation due to insufficient reasoning ability when dealing with complex and abstract tasks. Thought structures such as Chain of Thought (CoT) and Tree of Thought (ToT) focus on enhancing the reasoning capability of LLMs. However, they suffer from inherent drawbacks such as redundancy within the same layer of the tree structure and the singularity of the paths in the chain structure. Some studies have utilized Retrieval-Augmented Generation (RAG) methods to enhance CoT and ToT in mitigating hallucinations in LLMs, yet the fundamental shortcomings of the thought structures still persist. Furthermore, when dealing with multi-entity and multi-hop information, the retrieved verification knowledge often contains large amounts of fragmented, superficial, or even erroneous data, misleading the reasoning process of LLMs. To address these issues, we propose the Matrix of Thought (MoT), a novel and efficient thought structure for LLMs. MoT explores problems in both horizontal and vertical dimensions through a “column-cell communication” mechanism, enabling LLMs to actively engage in multi-strategy and deep thinking while reducing redundancy in the thought nodes within the column cells, thereby enhancing the reasoning capability of LLMs. Additionally, through a fact-correction mechanism, it leverages the knowledge graph triples retrieved by RAG and the original text to construct knowledge units and correct erroneous answers. To validate the effectiveness of this method, we conducted extensive experiments on three tasks: the 24-point game, question answering evaluation, and proposition writing. The results demonstrate that our framework outperforms state-of-the-art methods, with reasoning time only 14.4% of that of the baseline method, proving its efficiency and accuracy. The code for the framework is available at https://github.com/lyfiter/mtqa.
[196] Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning
Zhilin Wang, Zhe Yang, Yun Luo, Yafu Li, Xiaoye Qu, Ziqian Qiao, Haoran Zhang, Runzhe Zhan, Derek F. Wong, Jizhe Zhou, Yu Cheng
Main category: cs.CL
TL;DR: A framework for synthesizing sheet music reasoning problems using music theory rules as programmatic functions, creating SSMR-Bench evaluation benchmark and training data to improve LLMs/MLLMs’ sheet music interpretation and composition capabilities.
Details
Motivation: Current research lacks both evaluation benchmarks and training data for sheet music reasoning, which is crucial for building AI musicians with sheet music interpretation abilities.Method: Treats core music theory rules (beats, intervals) as programmatic functions to systematically synthesize a vast corpus of verifiable sheet music reasoning problems in both textual and visual modalities.
Result: Models show significant improvements on SSMR-Bench and established human-crafted benchmarks (MusicTheoryBench, MMMU music subset) when trained with synthetic data. Enhanced reasoning ability also facilitates music composition.
Conclusion: The approach successfully addresses the data scarcity problem in sheet music reasoning, demonstrating that synthetic data generation and reasoning training significantly improve LLMs/MLLMs’ sheet music interpretation and composition capabilities.
Abstract: Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. Inspired by mathematics, where simple operations yield infinite verifiable problems, we introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions to systematically synthesize a vast and diverse corpus of sheet music reasoning problems. This approach allows us to introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music, while also pointing out the ongoing challenges in understanding sheet music in a visual format. By leveraging synthetic data for RLVR, all models show significant improvements on the SSMR-Bench. Additionally, they also demonstrate considerable advancements on previously established human-crafted benchmarks, such as MusicTheoryBench and the music subset of MMMU. Finally, our results show that the enhanced reasoning ability can also facilitate music composition.
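The synthesis idea is concrete enough to sketch: one music-theory rule (intervals measured in semitones) implemented as a function yields unlimited verifiable question-answer pairs. The question template is invented for illustration.

```python
# One music-theory rule as a programmatic function generating verifiable Q/A.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose(note: str, semitones: int) -> str:
    return NOTES[(NOTES.index(note) + semitones) % 12]

def make_qa(note: str, semitones: int) -> tuple[str, str]:
    question = f"Which note is {semitones} semitones above {note}?"
    return question, transpose(note, semitones)

for note, st in [("C", 4), ("G", 7), ("B", 3)]:
    q, a = make_qa(note, st)
    print(q, "->", a)                # C+4 -> E, G+7 -> D, B+3 -> D
```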
[197] WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He
Main category: cs.CL
TL;DR: WebExplorer introduces a systematic data generation approach using model-based exploration and query evolution to create challenging web navigation tasks, enabling the development of WebExplorer-8B - an advanced web agent that achieves state-of-the-art performance on information-seeking benchmarks.
Details
Motivation: Existing open-source web agents have limited information-seeking abilities on complex tasks or lack transparent implementations, with the key challenge being scarcity of challenging data for information seeking.Method: Systematic data generation using model-based exploration and iterative, long-to-short query evolution to create challenging query-answer pairs requiring multi-step reasoning and complex web navigation. The model is developed through supervised fine-tuning followed by reinforcement learning, supporting 128K context length and up to 100 tool calling turns.
Result: WebExplorer-8B achieves state-of-the-art performance at its scale across diverse information-seeking benchmarks. As an 8B model, it effectively searches over an average of 16 turns after RL training, outperforming WebSailor-72B on BrowseComp-en/zh and achieving best performance among models up to 100B parameters on WebWalkerQA and FRAMES. It also shows strong generalization on HLE benchmark.
Conclusion: The approach provides a practical path toward long-horizon web agents, demonstrating that systematic data generation can enable smaller models to achieve competitive performance against much larger models in complex web navigation tasks.
Abstract: The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
[198] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research
Molly R Petersen, Claire E Stevenson, Lonneke van der Plas
Main category: cs.CL
TL;DR: This paper connects cognitive science theories of analogical reasoning to NLP research, showing how cognitive processes can improve relational understanding in language models.
Details
Motivation: To bridge cognitive science theories about analogical reasoning with current NLP research, as these cognitive processes are generally not viewed through a cognitive lens in NLP despite being relevant for major challenges.Method: Summarizing key cognitive science theories about analogical reasoning processes and relating them to concepts in natural language processing, showing their relevance beyond just analogy solving.
Result: Demonstrates that cognitive processes underlying analogical reasoning are relevant for several major challenges in NLP research and can guide better optimization of relational understanding in text.
Conclusion: Cognitive perspectives on analogical reasoning can help NLP researchers move beyond entity-level similarity and improve relational understanding in language models for various NLP challenges.
Abstract: Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.
[199] Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs
Mobina Pournemat, Keivan Rezaei, Gaurang Sriramanan, Arman Zarei, Jiaxiang Fu, Yang Wang, Hamid Eghbalzadeh, Soheil Feizi
Main category: cs.CL
TL;DR: Comprehensive study shows LLMs have probabilistic reasoning capabilities with clear performance gaps between model sizes, but suffer from notation sensitivity and context length degradation.
Details
Motivation: Despite LLMs' success in language tasks, their probabilistic reasoning behavior remains unclear and inconsistent, requiring systematic evaluation.Method: Evaluated LLMs on three probabilistic tasks (mode identification, maximum likelihood estimation, sample generation) using explicit discrete probability distributions and prompting for joint/conditional distribution queries.
Result: Larger models show stronger inference and surprising sample generation capabilities, but performance degrades over 60% with longer context and models are sensitive to probability notation variations.
Conclusion: LLMs possess probabilistic reasoning abilities with size-dependent performance, but need improvements in notation robustness and context handling for reliable probabilistic inference.
Abstract: Despite widespread success in language understanding and generation, large language models (LLMs) exhibit unclear and often inconsistent behavior when faced with tasks that require probabilistic reasoning. In this work, we present the first comprehensive study of the reasoning capabilities of LLMs over explicit discrete probability distributions. Given observations from a probability distribution, we evaluate models on three carefully designed tasks (mode identification, maximum likelihood estimation, and sample generation) by prompting them to provide responses to queries about either the joint distribution or its conditionals. These tasks thus probe a range of probabilistic skills, including frequency analysis, marginalization, and generative behavior. Through comprehensive empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models, with the latter demonstrating stronger inference and surprising capabilities in sample generation. Furthermore, our investigations reveal notable limitations, including sensitivity to variations in the notation utilized to represent probabilistic outcomes and performance degradation of over 60% as context length increases. Together, our results provide a detailed understanding of the probabilistic reasoning abilities of LLMs and identify key directions for future improvement.
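Ground truth for two of the three tasks can be computed directly from the observed samples, which is what makes the evaluation verifiable. A sketch with a toy joint distribution (the prompting protocol itself is the paper's contribution):

```python
# Ground truth from samples of a toy joint over (weather, item):
# mode identification and the MLE of a conditional from counts.
from collections import Counter

samples = [("rain", "umbrella"), ("rain", "umbrella"), ("rain", "umbrella"),
           ("sun", "hat"), ("sun", "hat"), ("rain", "hat"), ("sun", "umbrella")]

mode = Counter(samples).most_common(1)[0][0]   # mode identification (joint)
print("mode of joint:", mode)

rain = [s for s in samples if s[0] == "rain"]  # MLE of a conditional
mle = sum(1 for s in rain if s[1] == "umbrella") / len(rain)
print("P(umbrella | rain) =", round(mle, 3))   # 3/4 = 0.75
```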
[200] Positional Encoding via Token-Aware Phase Attention
Yu Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian
Main category: cs.CL
TL;DR: RoPE has distance-dependent bias limiting long-context modeling. TAPA introduces learnable phase attention that preserves long-range token interactions, extends to longer contexts with light fine-tuning, and achieves better perplexity than RoPE methods.
Details
Motivation: RoPE positional embeddings have intrinsic bias that limits long-context modeling, and existing extension methods require heavy post-hoc adjustments after pretraining.Method: Token-Aware Phase Attention (TAPA) incorporates a learnable phase function into the attention mechanism to preserve token interactions over long distances.
Result: TAPA extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and achieves significantly lower perplexity on long-context than RoPE families.
Conclusion: TAPA provides an effective positional encoding method that overcomes RoPE’s limitations for long-context modeling with minimal fine-tuning requirements.
Abstract: We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE’s ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context tasks than RoPE families.
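The paper's phase function is not reproduced here; the sketch below only shows the general shape of the idea as we read it: a per-position phase enters the attention logits through a smooth periodic term, rather than rotating queries and keys as RoPE does. Every specific choice below (the cosine bias, random phases, the scale alpha) is an assumption.

```python
# Speculative sketch of phase-based positional attention (not TAPA's formula).
import numpy as np

def phase_attention(Q, K, phi, alpha=1.0):
    """Q, K: [n, d]; phi: [n] per-position phases (random stand-ins here)."""
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    logits += alpha * np.cos(phi[:, None] - phi[None, :])  # phase-difference bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 8
A = phase_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                    phi=rng.uniform(0.0, np.pi, size=n))
print(A.round(2))
```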
[201] DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models
Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung
Main category: cs.CL
TL;DR: DivLogicEval is a new classical logic benchmark using natural sentences with diverse statements arranged counterintuitively, addressing limitations in existing benchmarks and introducing a bias-mitigating evaluation metric.
Details
Motivation: Existing logic reasoning benchmarks have issues: they entangle multiple reasoning skills, lack language diversity, and have distributions deviated from ideal logic reasoning benchmarks, leading to unfaithful and biased evaluations of LLMs' logic reasoning capabilities.Method: Proposed DivLogicEval benchmark with natural sentences composed of diverse statements in counterintuitive ways, and introduced a new evaluation metric that mitigates bias and randomness in LLMs.
Result: Experiments demonstrated the extent of logical reasoning required for DivLogicEval questions and compared performance of different popular LLMs in logical reasoning tasks.
Conclusion: DivLogicEval provides a more reliable benchmark for evaluating logical reasoning in LLMs by addressing distributional biases and introducing better evaluation metrics.
Abstract: Logic reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations on the logic reasoning skill. Meanwhile, existing logic reasoning benchmarks are limited in language diversity and their distributions are deviated from the distribution of an ideal logic reasoning benchmark, which may lead to biased evaluation results. This paper thereby proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.
[202] Distribution-Aligned Decoding for Efficient LLM Task Adaptation
Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Tak Wu Kwong, Yuguang Fang
Main category: cs.CL
TL;DR: Steering Vector Decoding (SVD) is a lightweight method that adapts language models to tasks by steering output distributions during decoding rather than updating weights, achieving performance gains without additional trainable parameters.
Details
Motivation: To reduce the cost of adapting billion-parameter language models to downstream tasks, even with parameter-efficient fine-tuning, by re-framing adaptation as output-distribution alignment during decoding rather than weight updates.Method: Use a short warm-start fine-tune, extract a task-aware steering vector from KL divergence gradient between warm-started and pre-trained models, then use this vector to guide decoding to steer output distribution toward task distribution.
Result: Across three tasks and nine benchmarks, SVD improved multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with 1-2 point gains on commonsense datasets, without adding trainable parameters beyond PEFT adapters.
Conclusion: SVD offers a lightweight, theoretically grounded path to stronger task adaptation for large language models by directly aligning output distributions during decoding rather than indirectly through weight updates.
Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVD), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model’s output distribution towards the task distribution. We theoretically prove that SVD is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with similar gains (1-2 points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVD thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.
[203] RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qi Chen, Scarlett Li, Mao Yang
Main category: cs.CL
TL;DR: ZeroRepo introduces Repository Planning Graph (RPG) for structured repository generation, achieving 3.9x larger code output than baselines with 81.5% coverage and 69.7% test accuracy.
Details
Motivation: Current natural language planning for repository generation produces unclear specifications, misaligned components, and brittle designs due to ambiguity and lack of structure.Method: ZeroRepo uses RPG - a structured graph representation of capabilities, file structures, data flows, and functions - in three stages: proposal planning, implementation construction, and graph-guided code generation with test validation.
Result: On RepoCraft benchmark (6 projects, 1,052 tasks), ZeroRepo generated nearly 36K code lines and 445K tokens (3.9x larger than Claude Code), with 81.5% coverage and 69.7% test accuracy, improving over Claude Code by 27.3 and 35.8 points.
Conclusion: RPG effectively models complex dependencies, enables sophisticated planning through near-linear scaling, and improves agent understanding of repositories, accelerating localization.
Abstract: Large language models excel at generating individual functions or single files of code, yet generating complete repositories from scratch remains a fundamental challenge. This capability is key to building coherent software systems from high-level specifications and realizing the full potential of automated code generation. The process requires planning at two levels: deciding what features and modules to build (proposal stage) and defining their implementation details (implementation stage). Current approaches rely on natural language planning, which often produces unclear specifications, misaligned components, and brittle designs due to its inherent ambiguity and lack of structure. To address these limitations, we introduce the Repository Planning Graph (RPG), a structured representation that encodes capabilities, file structures, data flows, and functions in a unified graph. By replacing free-form natural language with an explicit blueprint, RPG enables consistent long-horizon planning for repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework that operates in three stages: proposal-level planning, implementation-level construction, and graph-guided code generation with test validation. To evaluate, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces nearly 36K Code Lines and 445K Code Tokens, on average 3.9$\times$ larger than the strongest baseline (Claude Code), and 68$\times$ larger than other baselines. It achieves 81.5% coverage and 69.7% test accuracy, improving over Claude Code by 27.3 and 35.8 points. Further analysis shows that RPG models complex dependencies, enables more sophisticated planning through near-linear scaling, and improves agent understanding of repositories, thus accelerating localization.
[204] QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
Main category: cs.CL
TL;DR: QWHA is a novel quantization-aware PEFT method that uses Walsh-Hadamard Transform-based adapters with adaptive parameter selection to reduce quantization errors and computational costs in LLM deployment.
Details
Motivation: Existing quantization-aware PEFT methods suffer from limited representational capacity with low-rank adapters, while Fourier-transform based adapters have computational overhead and ineffective error reduction in quantized models.Method: Proposes QWHA method that integrates FT-based adapters using Walsh-Hadamard Transform as the transform kernel, with novel adapter initialization scheme including adaptive parameter selection and value refinement.
Result: QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters.
Conclusion: QWHA effectively mitigates quantization errors while facilitating fine-tuning, with substantially reduced computational cost compared to existing approaches.
Abstract: The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
[205] Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
Chaodong Tong, Qi Zhang, Lei Jiang, Yanbing Liu, Nannan Sun, Wei Li
Main category: cs.CL
TL;DR: SRE improves LLM uncertainty estimation using semantic reformulations and hybrid clustering for better hallucination detection.
Details
Motivation: Existing entropy-based uncertainty methods suffer from sampling noise and unstable clustering of variable-length answers, leading to unreliable hallucination detection.Method: Proposes Semantic Reformulation Entropy (SRE) with input-side semantic reformulations for faithful paraphrases and progressive energy-based hybrid clustering for stable semantic grouping.
Result: Experiments on SQuAD and TriviaQA show SRE outperforms strong baselines, providing more robust and generalizable hallucination detection.
Conclusion: Combining input diversification with multi-signal clustering substantially enhances semantic-level uncertainty estimation in LLMs.
Abstract: Reliable question answering with large language models (LLMs) is challenged by hallucinations, fluent but factually incorrect outputs arising from epistemic uncertainty. Existing entropy-based semantic-level uncertainty estimation methods are limited by sampling noise and unstable clustering of variable-length answers. We propose Semantic Reformulation Entropy (SRE), which improves uncertainty estimation in two ways. First, input-side semantic reformulations produce faithful paraphrases, expand the estimation space, and reduce biases from superficial decoder tendencies. Second, progressive, energy-based hybrid clustering stabilizes semantic grouping. Experiments on SQuAD and TriviaQA show that SRE outperforms strong baselines, providing more robust and generalizable hallucination detection. These results demonstrate that combining input diversification with multi-signal clustering substantially enhances semantic-level uncertainty estimation.
[206] HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST
Shuyu Zhang, Yifan Wei, Xinru Wang, Yanmin Zhu, Yangfan He, Yixuan Weng, Bin Li
Main category: cs.CL
TL;DR: HiCoLoRA is a hierarchical LoRA framework that enhances zero-shot dialog state tracking through dynamic layer-specific processing, spectral clustering for transferable associations, and semantic-enhanced initialization to overcome semantic misalignment between dialog contexts and prompts.
Details
Motivation: Address semantic misalignment between dynamic dialog contexts and static prompts in zero-shot dialog state tracking, which causes inflexible cross-layer coordination, domain interference, and catastrophic forgetting when generalizing to new domains without data annotation.Method: Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA) with: 1) hierarchical LoRA architecture for dynamic layer-specific processing, 2) Spectral Joint Domain-Slot Clustering to identify transferable associations, 3) Adaptive Linear Fusion Mechanism, and 4) Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge.
Result: Outperforms baselines on multi-domain datasets MultiWOZ and SGD, achieving state-of-the-art performance in zero-shot dialog state tracking.
Conclusion: HiCoLoRA effectively addresses semantic misalignment challenges in zero-shot DST through hierarchical adaptation, spectral clustering, and knowledge-preserving initialization, demonstrating superior generalization to new domains without costly data annotation.
Abstract: Zero-shot Dialog State Tracking (zs-DST) is essential for enabling Task-Oriented Dialog Systems (TODs) to generalize to new domains without costly data annotation. A central challenge lies in the semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting. To tackle this, we propose Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA), a framework that enhances zero-shot slot inference through robust prompt alignment. It features a hierarchical LoRA architecture for dynamic layer-specific processing (combining lower-layer heuristic grouping and higher-layer full interaction), integrates Spectral Joint Domain-Slot Clustering to identify transferable associations (feeding an Adaptive Linear Fusion Mechanism), and employs Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge. Experiments on multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines, achieving SOTA in zs-DST. Code is available at https://github.com/carsonz/HiCoLoRA.
[207] Thinking Augmented Pre-training
Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
Main category: cs.CL
TL;DR: TPT improves LLM training data efficiency by augmenting text with automatically generated thinking trajectories, making complex tokens more learnable through step-by-step reasoning.
Details
Motivation: The compute for pre-training LLMs is growing rapidly while high-quality data remains limited, and complex tokens are difficult to learn due to their deep underlying rationales.Method: Thinking augmented Pre-Training (TPT) - a universal methodology that augments text data with automatically generated thinking trajectories to increase training volume and make high-quality tokens more learnable.
Result: TPT enhances data efficiency by 3x, improves post-training performance by over 10% on reasoning benchmarks for a 3B parameter model, and works across diverse training configurations up to 100B tokens.
Conclusion: TPT is an effective approach that substantially improves LLM performance across various model sizes and families by making complex tokens more learnable through reasoning augmentation.
Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10%$ on several challenging reasoning benchmarks.
[208] Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization
Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Main category: cs.CL
TL;DR: SummQ is a novel adversarial multi-agent framework for long document summarization that uses collaborative intelligence between summarization and quizzing agents to address information loss, factual inconsistencies, and coherence issues.
Details
Motivation: Current LLMs struggle with long document summarization due to information loss, factual inconsistencies, and coherence problems when processing excessively long documents.Method: Uses collaborative intelligence between specialized agents: summary generators and reviewers for creating/evaluating summaries, and quiz generators and reviewers for creating comprehension questions as quality checks. Includes an examinee agent to validate if summaries contain information needed to answer quiz questions, enabling iterative refinement through adversarial feedback.
Result: Significantly outperforms state-of-the-art methods on three benchmarks across ROUGE, BERTScore, LLM-as-a-Judge, and human evaluations. Comprehensive analyses show effectiveness of multi-agent collaboration, agent configurations, and quizzing mechanism.
Conclusion: Establishes a new approach for long document summarization using adversarial agentic collaboration to improve summarization quality through multifaceted feedback mechanisms.
Abstract: Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.
[209] Cross-Linguistic Analysis of Memory Load in Sentence Comprehension: Linear Distance and Structural Density
Krishna Aggarwal
Main category: cs.CL
TL;DR: This study examines memory load in sentence comprehension, finding that Intervener Complexity (number of intervening heads between syntactically related words) provides explanatory power beyond linear distance measures.
Details
Motivation: To reconcile linear and hierarchical perspectives on locality in sentence processing by examining whether memory load is better explained by linear proximity or structural density of intervening material.Method: Used harmonized dependency treebanks and mixed-effects modeling across multiple languages to evaluate sentence length, dependency length, and Intervener Complexity as predictors of memory load.
Result: All three factors (sentence length, dependency length, Intervener Complexity) are positively associated with memory load, with sentence length having the broadest influence and Intervener Complexity offering explanatory power beyond linear distance.
Conclusion: The findings reconcile linear and hierarchical perspectives by treating dependency length as a surface signature while identifying intervening heads as a more proximate indicator of integration demands, providing a principled path for evaluating memory load theories.
Abstract: This study examines whether sentence-level memory load in comprehension is better explained by linear proximity between syntactically related words or by the structural density of the intervening material. Building on locality-based accounts and cross-linguistic evidence for dependency length minimization, the work advances Intervener Complexity-the number of intervening heads between a head and its dependent-as a structurally grounded lens that refines linear distance measures. Using harmonized dependency treebanks and a mixed-effects framework across multiple languages, the analysis jointly evaluates sentence length, dependency length, and Intervener Complexity as predictors of the Memory-load measure. Studies in Psycholinguistics have reported the contributions of feature interference and misbinding to memory load during processing. For this study, I operationalized sentence-level memory load as the linear sum of feature misbinding and feature interference for tractability; current evidence does not establish that their cognitive contributions combine additively. All three factors are positively associated with memory load, with sentence length exerting the broadest influence and Intervener Complexity offering explanatory power beyond linear distance. Conceptually, the findings reconcile linear and hierarchical perspectives on locality by treating dependency length as an important surface signature while identifying intervening heads as a more proximate indicator of integration and maintenance demands. Methodologically, the study illustrates how UD-based graph measures and cross-linguistic mixed-effects modelling can disentangle linear and structural contributions to processing efficiency, providing a principled path for evaluating competing theories of memory load in sentence comprehension.
[210] Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models
Chantal Shaib, Vinith M. Suriyakumar, Levent Sagun, Byron C. Wallace, Marzyeh Ghassemi
Main category: cs.CL
TL;DR: LLMs can develop spurious correlations between syntactic templates and domains during training, which can override prompt semantics, lower performance on entity knowledge tasks, and be exploited to bypass safety refusals.
Details
Motivation: To understand how LLMs associate syntactic patterns with domains during training, and how these spurious correlations can negatively impact model performance and safety.Method: Used synthetic training datasets to analyze syntactic-domain correlations, developed an evaluation framework to detect this phenomenon in trained models, and conducted case studies on safety finetuning implications.
Result: Found that syntactic-domain correlations lower performance on entity knowledge tasks (mean 0.51 +/- 0.06) and can be used to bypass refusals in both open (OLMo-2-7B, Llama-4-Maverick) and closed (GPT-4o) models.
Conclusion: There is a need to explicitly test for syntactic-domain correlations and ensure syntactic diversity in training data within domains to prevent spurious correlations that can compromise model performance and safety.
Abstract: For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information Recent work shows that syntactic templates – frequent sequences of Part-of-Speech (PoS) tags – are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick), and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
[211] GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models
Jieli Zhu, Vi Ngoc-Nha Tran
Main category: cs.CL
TL;DR: This paper investigates PII leakage in small language models (SLMs) for medical chatbots, proposing a new attack method called GEP that significantly outperforms previous template-based approaches in extracting personal information.
Details
Motivation: While SLMs offer comparable performance to LLMs with lower computational costs, their vulnerability to PII leakage in downstream tasks remains unexplored, particularly in sensitive domains like healthcare.Method: The authors fine-tuned ChatBioGPT from BioGPT on medical datasets, then developed GEP - a greedy coordinate gradient-based method specifically designed for PII extraction from SLMs, testing it against template-based approaches.
Result: GEP demonstrated up to 60x more PII leakage compared to template-based methods, and maintained a 4.53% leakage rate even in free-style insertion scenarios with varied syntactic expressions.
Conclusion: SLMs are vulnerable to sophisticated PII extraction attacks, and the proposed GEP method effectively reveals these vulnerabilities, highlighting the need for better privacy protection in SLM deployments.
Abstract: Small language models (SLMs) become unprecedentedly appealing due to their approximately equivalent performance compared to large language models (LLMs) in certain fields with less energy and time consumption during training and inference. However, the personally identifiable information (PII) leakage of SLMs for downstream tasks has yet to be explored. In this study, we investigate the PII leakage of the chatbot based on SLM. We first finetune a new chatbot, i.e., ChatBioGPT based on the backbone of BioGPT using medical datasets Alpaca and HealthCareMagic. It shows a matchable performance in BERTscore compared with previous studies of ChatDoctor and ChatGPT. Based on this model, we prove that the previous template-based PII attacking methods cannot effectively extract the PII in the dataset for leakage detection under the SLM condition. We then propose GEP, which is a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. We conduct experimental studies of GEP and the results show an increment of up to 60$\times$ more leakage compared with the previous template-based methods. We further expand the capability of GEP in the case of a more complicated and realistic situation by conducting free-style insertion where the inserted PII in the dataset is in the form of various syntactic expressions instead of fixed templates, and GEP is still able to reveal a PII leakage rate of up to 4.53%.
[212] Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs
Daniel Vennemeyer, Phan Anh Duong, Tiffany Zhan, Tianyu Jiang
Main category: cs.CL
TL;DR: Sycophantic behaviors in LLMs can be decomposed into distinct types (agreement and praise) that are encoded along separate linear directions in latent space and can be independently controlled.
Details
Motivation: To understand whether sycophantic behaviors in LLMs arise from a single mechanism or multiple distinct processes, and to decompose sycophancy into specific types.Method: Used difference-in-means directions, activation additions, and subspace geometry analysis across multiple models and datasets to examine sycophantic behaviors.
Result: Found that sycophantic agreement, sycophantic praise, and genuine agreement are encoded along distinct linear directions; each behavior can be independently amplified or suppressed; and the representational structure is consistent across model families and scales.
Conclusion: Sycophantic behaviors correspond to distinct, independently steerable representations rather than a single unified mechanism.
Abstract: Large language models (LLMs) often exhibit sycophantic behaviors – such as excessive agreement with or flattery of the user – but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
cs.CV
[213] StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing
Liyang Chen, Tianze Zhou, Xu He, Boshi Tang, Zhiyong Wu, Yang Huang, Yang Wu, Zhongqian Sun, Wei Yang, Helen Meng
Main category: cs.CV
TL;DR: StableDub is a novel visual dubbing framework that addresses two key limitations: inadequate speaker-specific lip habit modeling and poor occlusion handling. It integrates lip-habit-aware modeling with occlusion-robust synthesis using Stable-Diffusion backbone.
Details
Motivation: Current visual dubbing methods have two critical deficiencies: audio-only driving fails to capture speaker-specific lip habits, and conventional blind-inpainting produces visual artifacts when handling obstructions like microphones or hands.Method: Built on Stable-Diffusion backbone with lip-habit-modulated mechanism that jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics. Uses occlusion-aware training strategy by explicitly exposing occlusion objects to inpainting process. Employs hybrid Mamba-Transformer architecture for training efficiency.
Result: Achieves superior performance in lip habit resemblance and occlusion robustness. Surpasses other methods in audio-lip sync, video quality, and resolution consistency. Eliminates need for cost-intensive priors and exhibits superior training efficiency.
Conclusion: StableDub expands the applicability of visual dubbing methods by addressing key limitations in lip habit modeling and occlusion handling, making it more practical for real-world deployment.
Abstract: The visual dubbing task aims to generate mouth movements synchronized with the driving audio, which has seen significant progress in recent years. However, two critical deficiencies hinder their wide application: (1) Audio-only driving paradigms inadequately capture speaker-specific lip habits, which fail to generate lip movements similar to the target avatar; (2) Conventional blind-inpainting approaches frequently produce visual artifacts when handling obstructions (e.g., microphones, hands), limiting practical deployment. In this paper, we propose StableDub, a novel and concise framework integrating lip-habit-aware modeling with occlusion-robust synthesis. Specifically, building upon the Stable-Diffusion backbone, we develop a lip-habit-modulated mechanism that jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics. To achieve plausible lip geometries and object appearances under occlusion, we introduce the occlusion-aware training strategy by explicitly exposing the occlusion objects to the inpainting process. By incorporating the proposed designs, the model eliminates the necessity for cost-intensive priors in previous methods, thereby exhibiting superior training efficiency on the computationally intensive diffusion-based backbone. To further optimize training efficiency from the perspective of model architecture, we introduce a hybrid Mamba-Transformer architecture, which demonstrates the enhanced applicability in low-resource research scenarios. Extensive experimental results demonstrate that StableDub achieves superior performance in lip habit resemblance and occlusion robustness. Our method also surpasses other methods in audio-lip sync, video quality, and resolution consistency. We expand the applicability of visual dubbing methods from comprehensive aspects, and demo videos can be found at https://stabledub.github.io.
[214] Random Direct Preference Optimization for Radiography Report Generation
Valentin Samokhin, Boris Shirokikh, Mikhail Goncharov, Dmitriy Umerenkov, Maksim Bobrin, Ivan Oseledets, Dmitry Dylov, Mikhail Belyaev
Main category: cs.CV
TL;DR: This paper introduces a model-agnostic framework using Direct Preference Optimization (DPO) with random contrastive sampling to improve radiography report generation accuracy without requiring additional training data or human annotations.
Details
Motivation: Existing radiography report generation methods lack the quality needed for real-world clinical deployment, while large visual language models have shown success using LLM training strategies like alignment techniques.Method: Proposes Random DPO framework that uses random contrastive sampling to construct training pairs for Direct Preference Optimization, eliminating the need for reward models or human preference annotations.
Result: Experiments show the method improves clinical performance metrics by up to 5% when supplementing three state-of-the-art models, without requiring additional training data.
Conclusion: The Random DPO framework effectively enhances radiography report generation accuracy in a model-agnostic way, making it suitable for clinical deployment without extra data requirements.
Abstract: Radiography Report Generation (RRG) has gained significant attention in medical image analysis as a promising tool for alleviating the growing workload of radiologists. However, despite numerous advancements, existing methods have yet to achieve the quality required for deployment in real-world clinical settings. Meanwhile, large Visual Language Models (VLMs) have demonstrated remarkable progress in the general domain by adopting training strategies originally designed for Large Language Models (LLMs), such as alignment techniques. In this paper, we introduce a model-agnostic framework to enhance RRG accuracy using Direct Preference Optimization (DPO). Our approach leverages random contrastive sampling to construct training pairs, eliminating the need for reward models or human preference annotations. Experiments on supplementing three state-of-the-art models with our Random DPO show that our method improves clinical performance metrics by up to 5%, without requiring any additional training data.
[215] Taming Flow-based I2V Models for Creative Video Editing
Xianghao Kong, Hansheng Chen, Yuwei Guo, Lvmin Zhang, Gordon Wetzstein, Maneesh Agrawala, Anyi Rao
Main category: cs.CV
TL;DR: IF-V2V is an inversion-free method that adapts flow-matching-based image-to-video models for video editing without requiring inversion or extensive optimization, achieving high-quality video editing with good consistency.
Details
Motivation: Existing video editing methods require inversion with model-specific design or extensive optimization, limiting their ability to leverage modern image-to-video models for transferring image editing capabilities to videos.Method: Proposes Vector Field Rectification with Sample Deviation to incorporate source video information into denoising, Structure-and-Motion-Preserving Initialization for motion-aware noise generation, and Deviation Caching to minimize computational overhead.
Result: The method achieves superior editing quality and consistency compared to existing approaches, offering a lightweight plug-and-play solution for video editing.
Conclusion: IF-V2V provides an efficient inversion-free approach for video editing that effectively transfers image editing capabilities to videos while maintaining consistency and quality.
Abstract: Although image editing techniques have advanced significantly, video editing, which aims to manipulate videos according to user intent, remains an emerging challenge. Most existing image-conditioned video editing methods either require inversion with model-specific design or need extensive optimization, limiting their capability of leveraging up-to-date image-to-video (I2V) models to transfer the editing capability of image editing models to the video domain. To this end, we propose IF-V2V, an Inversion-Free method that can adapt off-the-shelf flow-matching-based I2V models for video editing without significant computational overhead. To circumvent inversion, we devise Vector Field Rectification with Sample Deviation to incorporate information from the source video into the denoising process by introducing a deviation term into the denoising vector field. To further ensure consistency with the source video in a model-agnostic way, we introduce Structure-and-Motion-Preserving Initialization to generate motion-aware temporally correlated noise with structural information embedded. We also present a Deviation Caching mechanism to minimize the additional computational cost for denoising vector rectification without significantly impacting editing quality. Evaluations demonstrate that our method achieves superior editing quality and consistency over existing approaches, offering a lightweight plug-and-play solution to realize visual creativity.
[216] Improving Autism Detection with Multimodal Behavioral Analysis
William Saakyan, Matthias Norden, Lola Eversmann, Simon Kirsch, Muyu Lin, Simon Guendelman, Isabel Dziobek, Hanna Drimalla
Main category: cs.CV
TL;DR: This paper proposes a multimodal approach for autism detection using video data, addressing limitations in gaze feature performance and generalizability by analyzing facial expressions, voice, head motion, heart rate, and improved gaze descriptors.
Details
Motivation: Current computer-aided autism diagnostic methods struggle with poor gaze feature performance and lack real-world generalizability, despite promising results on some datasets.Method: Analyzed a large balanced dataset (168 ASC, 157 non-autistic participants) using multimodal analysis of facial expressions, voice prosody, head motion, HRV, and gaze behavior. Introduced novel statistical descriptors for gaze variability and used late fusion for classification.
Result: Improved gaze-based classification from 64% to 69% using novel gaze descriptors. Achieved 74% accuracy with multimodal late fusion, demonstrating effective integration of behavioral markers across modalities.
Conclusion: The findings highlight the potential for scalable, video-based screening tools to support autism assessment through improved multimodal analysis.
Abstract: Due to the complex and resource-intensive nature of diagnosing Autism Spectrum Condition (ASC), several computer-aided diagnostic support methods have been proposed to detect autism by analyzing behavioral cues in patient video data. While these models show promising results on some datasets, they struggle with poor gaze feature performance and lack of real-world generalizability. To tackle these challenges, we analyze a standardized video dataset comprising 168 participants with ASC (46% female) and 157 non-autistic participants (46% female), making it, to our knowledge, the largest and most balanced dataset available. We conduct a multimodal analysis of facial expressions, voice prosody, head motion, heart rate variability (HRV), and gaze behavior. To address the limitations of prior gaze models, we introduce novel statistical descriptors that quantify variability in eye gaze angles, improving gaze-based classification accuracy from 64% to 69% and aligning computational findings with clinical research on gaze aversion in ASC. Using late fusion, we achieve a classification accuracy of 74%, demonstrating the effectiveness of integrating behavioral markers across multiple modalities. Our findings highlight the potential for scalable, video-based screening tools to support autism assessment.
[217] KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache
Wanshun Xu, Long Zhuang
Main category: cs.CV
TL;DR: KV-Efficient VLA is a memory compression framework that reduces KV cache size and speeds up inference for Vision-Language-Action models by selectively retaining high-utility context through chunking and gating mechanisms.
Details
Motivation: Current VLA models face scalability issues due to quadratic attention costs and unbounded KV memory growth during long-horizon inference, which hinders real-time deployment despite improved generalization through scaling.Method: Partitions KV cache into fixed-size chunks and uses a recurrent gating module to summarize and filter historical context based on learned utility scores, preserving recent details while pruning stale memory while maintaining causality.
Result: Achieves up to 1.21x inference speedup and 36% KV memory reduction with minimal impact on task success, seamlessly integrating into existing autoregressive and hybrid VLA stacks.
Conclusion: KV-Efficient VLA enables scalable inference for VLA models without requiring modifications to training pipelines or downstream control logic, addressing critical inference inefficiencies for real-world deployment.
Abstract: Vision-Language-Action (VLA) models promise unified robotic perception and control, yet their scalability is constrained by the quadratic cost of attention and the unbounded growth of key-value (KV) memory during long-horizon inference. While recent methods improve generalization through scaling backbone architectures, they often neglect the inference inefficiencies critical to real-time deployment. In this work, we present KV-Efficient VLA, a model-agnostic memory compression framework that addresses these limitations by introducing a lightweight, training-friendly mechanism to selectively retain high-utility context. Our method partitions the KV cache into fixed size chunks and employs a recurrent gating module to summarize and filter historical context according to learned utility scores. This design preserves recent fine-grained detail while aggressively pruning stale, low-relevance memory, all while maintaining causality. Theoretically, KV-Efficient VLA yields up to 1.21x inference speedup and 36% KV memory reduction, with minimal impact on task success. Our method integrates seamlessly into existing autoregressive and hybrid VLA stacks, enabling scalable inference without modifying training pipelines or downstream control logic.
[218] Phrase-grounded Fact-checking for Automatically Generated Chest X-ray Reports
Razi Mahmood, Diego Machado-Reyes, Joy Wu, Parisa Kaviani, Ken C. L. Wong, Niharika D’Souza, Mannudeep Kalra, Ge Wang, Pingkun Yan, Tanveer Syeda-Mahmood
Main category: cs.CV
TL;DR: A phrase-grounded fact-checking model that detects errors in findings and their locations in automatically generated chest radiology reports, using synthetic data and cross-modal contrastive learning.
Details
Motivation: Large-scale vision language models can generate realistic radiology reports but suffer from factual errors and hallucinations, limiting clinical translation.Method: Created synthetic dataset by perturbing findings and locations in ground truth reports, then trained multi-label cross-modal contrastive regression network on real/fake findings-location pairs with images.
Result: Achieved high accuracy in finding veracity prediction and localization on multiple X-ray datasets, with 0.997 concordance correlation coefficient with ground truth-based verification for SOTA report generators.
Conclusion: The model shows robustness and effectiveness for error detection in radiology reports, pointing to its utility in clinical inference workflows.
Abstract: With the emergence of large-scale vision language models (VLM), it is now possible to produce realistic-looking radiology reports for chest X-ray images. However, their clinical translation has been hampered by the factual errors and hallucinations in the produced descriptions during inference. In this paper, we present a novel phrase-grounded fact-checking model (FC model) that detects errors in findings and their indicated locations in automatically generated chest radiology reports. Specifically, we simulate the errors in reports through a large synthetic dataset derived by perturbing findings and their locations in ground truth reports to form real and fake findings-location pairs with images. A new multi-label cross-modal contrastive regression network is then trained on this dataset. We present results demonstrating the robustness of our method in terms of accuracy of finding veracity prediction and localization on multiple X-ray datasets. We also show its effectiveness for error detection in reports of SOTA report generators on multiple datasets achieving a concordance correlation coefficient of 0.997 with ground truth-based verification, thus pointing to its utility during clinical inference in radiology workflows.
[219] MDF-MLLM: Deep Fusion Through Cross-Modal Feature Alignment for Contextually Aware Fundoscopic Image Classification
Jason Jordan, Mohammadreza Akbari Lor, Peter Koulen, Mei-Ling Shyu, Shu-Ching Chen
Main category: cs.CV
TL;DR: A novel multimodal deep learning architecture (MDF-MLLM) that integrates fine-grained retinal image features with global textual context significantly improves disease classification accuracy from fundus images, achieving 94% accuracy compared to 60% baseline.
Details
Motivation: Existing multimodal large language models struggle to capture low-level spatial details critical for diagnosing retinal diseases like glaucoma, diabetic retinopathy, and retinitis pigmentosa, limiting their clinical utility.Method: The MDF-MLLM integrates skip features from four U-Net encoder layers into cross-attention blocks within a LLaMA 3.2 11B MLLM, using patch-wise projection, scaled cross-attention, and FiLM-based U-Net modulation for feature fusion.
Result: MDF-MLLM achieved 94% accuracy on dual-type disease classification, representing a 56% improvement over baseline (60%). Recall and F1-scores improved by up to 67% and 35% respectively, with particular gains for inherited diseases with rich clinical text.
Conclusion: MDF-MLLM provides a generalizable, interpretable framework for fundus image classification that outperforms traditional MLLMs through multi-scale feature fusion, showing promise for clinical decision support systems.
Abstract: This study aimed to enhance disease classification accuracy from retinal fundus images by integrating fine-grained image features and global textual context using a novel multimodal deep learning architecture. Existing multimodal large language models (MLLMs) often struggle to capture low-level spatial details critical for diagnosing retinal diseases such as glaucoma, diabetic retinopathy, and retinitis pigmentosa. This model development and validation study was conducted on 1,305 fundus image-text pairs compiled from three public datasets (FIVES, HRF, and StoneRounds), covering acquired and inherited retinal diseases, and evaluated using classification accuracy and F1-score. The MDF-MLLM integrates skip features from four U-Net encoder layers into cross-attention blocks within a LLaMA 3.2 11B MLLM. Vision features are patch-wise projected and fused using scaled cross-attention and FiLM-based U-Net modulation. Baseline MLLM achieved 60% accuracy on the dual-type disease classification task. MDF-MLLM, with both U-Net and MLLM components fully fine-tuned during training, achieved a significantly higher accuracy of 94%, representing a 56% improvement. Recall and F1-scores improved by as much as 67% and 35% over baseline, respectively. Ablation studies confirmed that the multi-depth fusion approach contributed to substantial gains in spatial reasoning and classification, particularly for inherited diseases with rich clinical text. MDF-MLLM presents a generalizable, interpretable, and modular framework for fundus image classification, outperforming traditional MLLM baselines through multi-scale feature fusion. The architecture holds promise for real-world deployment in clinical decision support systems. Future work will explore synchronized training techniques, a larger pool of diseases for more generalizability, and extending the model for segmentation tasks.
[220] Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models
Xingkai Peng, Jun Jiang, Meng Tong, Shuai Li, Weiming Zhang, Nenghai Yu, Kejiang Chen
Main category: cs.CV
TL;DR: MPDA is a multimodal jailbreak attack that decouples unsafe text prompts into pseudo-safe and harmful components, uses adversarial rewriting to bypass safety filters, and employs iterative refinement with visual feedback to generate NSFW content from T2I models.
Details
Motivation: Existing jailbreak methods focus on text manipulation and struggle to bypass T2I model safety filters. Image-based vulnerabilities remain unexplored, and current approaches face limitations in evading detection.Method: Three-step approach: 1) LLM decouples unsafe prompts into pseudo-safe and harmful components, 2) LLM rewrites harmful prompts into adversarial prompts to bypass filters, 3) Visual language model generates captions for iterative refinement to maintain semantic consistency.
Result: The method successfully bypasses T2I model safety filters to generate NSFW content while maintaining semantic alignment with original unsafe prompts through multimodal feedback loops.
Conclusion: MPDA demonstrates that multimodal approaches combining text and image modalities can effectively bypass T2I safety filters, revealing new vulnerabilities in current content moderation systems that require enhanced multimodal defense mechanisms.
Abstract: Text-to-image (T2I) models have been widely applied in generating high-fidelity images across various domains. However, these models may also be abused to produce Not-Safe-for-Work (NSFW) content via jailbreak attacks. Existing jailbreak methods primarily manipulate the textual prompt, leaving potential vulnerabilities in image-based inputs largely unexplored. Moreover, text-based methods face challenges in bypassing the model’s safety filters. In response to these limitations, we propose the Multimodal Prompt Decoupling Attack (MPDA), which utilizes image modality to separate the harmful semantic components of the original unsafe prompt. MPDA follows three core steps: firstly, a large language model (LLM) decouples unsafe prompts into pseudo-safe prompts and harmful prompts. The former are seemingly harmless sub-prompts that can bypass filters, while the latter are sub-prompts with unsafe semantics that trigger filters. Subsequently, the LLM rewrites the harmful prompts into natural adversarial prompts to bypass safety filters, which guide the T2I model to modify the base image into an NSFW output. Finally, to ensure semantic consistency between the generated NSFW images and the original unsafe prompts, the visual language model generates image captions, providing a new pathway to guide the LLM in iterative rewriting and refining the generated content.
[221] VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
Main category: cs.CV
TL;DR: VidCRAFT3 is a unified image-to-video framework that enables simultaneous control over camera motion, object motion, and lighting direction through 3D reconstruction, optical flow encoding, and spatial attention mechanisms.
Details
Motivation: Existing image-to-video methods treat control signals separately due to dataset limitations and mismatched control spaces, lacking precise joint control over camera, object motion, and lighting.Method: Integrates three components: Image2Cloud for 3D point cloud reconstruction from images, ObjMotionNet for encoding object trajectories into optical flow features, and Spatial Triple-Attention Transformer for lighting control. Uses a three-stage training strategy with the VLD dataset.
Result: Outperforms existing methods in control precision and visual coherence, demonstrating superior performance in joint control scenarios.
Conclusion: VidCRAFT3 provides a unified framework for flexible and precise joint control in image-to-video generation, addressing key limitations of previous approaches through integrated 3D reconstruction and attention mechanisms.
Abstract: Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. In content creation workflows, precise and simultaneous control over camera motion, object motion, and lighting direction enhances both accuracy and flexibility. However, existing approaches typically treat these control signals separately, largely due to the scarcity of datasets with high-quality joint annotations and mismatched control spaces across modalities. We present VidCRAFT3, a unified and flexible I2V framework that supports both independent and joint control over camera motion, object motion, and lighting direction by integrating three core components. Image2Cloud reconstructs a 3D point cloud from the reference image to enable precise camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale optical flow features to guide object motion. The Spatial Triple-Attention Transformer integrates lighting direction embeddings via parallel cross-attention. To address the scarcity of jointly annotated data, we curate the VideoLightingDirection (VLD) dataset of synthetic static-scene video clips with per-frame lighting-direction labels, and adopt a three-stage training strategy that enables robust learning without fully joint annotations. Extensive experiments show that VidCRAFT3 outperforms existing methods in control precision and visual coherence. Code and data will be released. Project page: https://sixiaozheng.github.io/VidCRAFT3/.
[222] A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision–Revised
Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu, Errui Ding
Main category: cs.CV
TL;DR: The paper proposes a multi-task learning approach for salient object detection that combines salient object detection, foreground contour detection, and edge detection to address incomplete predictions and inaccurate boundaries.
Details
Motivation: To solve the problems of incomplete predictions due to object complexity and inaccurate boundaries caused by convolution/pooling operations in deep learning-based salient object detection.Method: Proposes a mutual learning framework with three intertwined tasks: salient object detection, foreground contour detection, and edge detection. Uses a novel Mutual Learning Module (MLM) with multiple network branches trained in mutual learning manner.
Result: Extensive experiments on seven challenging datasets demonstrate state-of-the-art results in both salient object detection and edge detection.
Conclusion: The proposed multi-task learning approach with mutual learning effectively improves salient object detection performance by addressing incomplete predictions and boundary accuracy issues.
Abstract: Though deep learning techniques have made great progress in salient object detection recently, the predicted saliency maps still suffer from incomplete predictions due to the internal complexity of objects and inaccurate boundaries caused by strides in convolution and pooling operations. To alleviate these issues, we propose to train saliency detection networks by exploiting the supervision from not only salient object detection, but also foreground contour detection and edge detection. First, we leverage salient object detection and foreground contour detection tasks in an intertwined manner to generate saliency maps with uniform highlight. Second, the foreground contour and edge detection tasks guide each other simultaneously, thereby leading to precise foreground contour prediction and reducing the local noises for edge prediction. In addition, we develop a novel mutual learning module (MLM) which serves as the building block of our method. Each MLM consists of multiple network branches trained in a mutual learning manner, which improves the performance by a large margin. Extensive experiments on seven challenging datasets demonstrate that the proposed method has delivered state-of-the-art results in both salient object detection and edge detection.
[223] ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data
Anja Sheppard, Tyler Smithline, Andrew Scheffer, David Smith, Advaith V. Sethuraman, Ryan Bird, Sabrina Lin, Katherine A. Skinner
Main category: cs.CV
TL;DR: ShipwreckFinder is an open-source QGIS plugin that automatically detects shipwrecks from multibeam sonar data using deep learning, outperforming existing tools like ArcGIS toolkit and traditional sinkhole detection methods.
Details
Motivation: Manual inspection of bathymetric data for shipwreck detection is time-consuming and requires expert analysis. There's a need for automated tools to streamline this process for maritime archaeology and historical research.
Method: Developed a deep learning model trained on shipwreck data from the Great Lakes and the coasts of Ireland, enhanced with synthetic data generation. The tool automatically preprocesses bathymetry data, performs inference, thresholds outputs, and produces segmentation masks or bounding boxes.
Result: Demonstrated superior segmentation performance compared to deep learning-based ArcGIS toolkit and classical inverse sinkhole detection methods.
Conclusion: ShipwreckFinder provides an effective open-source solution for automated shipwreck detection, making maritime archaeology more accessible and efficient.
Abstract: In this paper, we introduce ShipwreckFinder, an open-source QGIS plugin that detects shipwrecks from multibeam sonar data. Shipwrecks are an important historical marker of maritime history, and can be discovered through manual inspection of bathymetric data. However, this is a time-consuming process and often requires expert analysis. Our proposed tool allows users to automatically preprocess bathymetry data, perform deep learning inference, threshold model outputs, and produce either pixel-wise segmentation masks or bounding boxes of predicted shipwrecks. The backbone of this open-source tool is a deep learning model, which is trained on a variety of shipwreck data from the Great Lakes and the coasts of Ireland. Additionally, we employ synthetic data generation in order to increase the size and diversity of our dataset. We demonstrate superior segmentation performance with our open-source tool and training pipeline as compared to a deep learning-based ArcGIS toolkit and a more classical inverse sinkhole detection method. The open-source tool can be found at https://github.com/umfieldrobotics/ShipwreckFinderQGISPlugin.
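The plugin's threshold-and-box post-processing stage can be pictured with a short SciPy sketch; the 0.5 threshold and connected-component grouping are assumptions for illustration, not the plugin's actual code:

```python
from scipy import ndimage

def masks_to_boxes(prob_map, thresh=0.5):
    """Threshold per-pixel shipwreck probabilities, then return the binary
    mask plus one bounding box per connected component."""
    mask = prob_map > thresh
    labeled, _ = ndimage.label(mask)           # group touching pixels
    slices = ndimage.find_objects(labeled)     # one (row, col) slice pair each
    boxes = [(s[0].start, s[1].start, s[0].stop, s[1].stop) for s in slices]
    return mask, boxes
```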
[224] MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation
Zhicheng Du, Qingyang Shi, Jiasheng Lu, Yingshan Liang, Xinyu Zhang, Yiran Wang, Peiwu Qin
Main category: cs.CV
TL;DR: MAJORScore is a new evaluation metric for multimodal relevance that supports N modalities (N≥3) using multimodal joint representation, overcoming limitations of existing bimodal metrics like CLIP.
Details
Motivation: Existing multimodal relevance metrics are limited to bimodal analysis (e.g., CLIP), which restricts evaluation of similarity across multiple modalities.
Method: Uses multimodal joint representation to integrate multiple modalities into the same latent space, enabling fair relevance scoring across different modalities at one scale.
Result: MAJORScore increases relevance scoring by 26.03%-64.29% for consistent modalities and decreases by 13.28%-20.54% for inconsistent modalities compared to existing methods.
Conclusion: MAJORScore provides a more reliable metric for evaluating similarity on large-scale multimodal datasets and multimodal model performance evaluation.
Abstract: The multimodal relevance metric is usually borrowed from the embedding ability of pretrained contrastive learning models for bimodal data, which is used to evaluate the correlation between cross-modal data (e.g., CLIP). However, the commonly used evaluation metrics are only suitable for the associated analysis between two modalities, which greatly limits the evaluation of multimodal similarity. Herein, we propose MAJORScore, a brand-new evaluation metric for the relevance of multiple modalities ($N$ modalities, $N\ge3$) via multimodal joint representation for the first time. The ability of multimodal joint representation to integrate multiple modalities into the same latent space can accurately represent different modalities at one scale, providing support for fair relevance scoring. Extensive experiments have shown that MAJORScore increases by 26.03%-64.29% for consistent modalities and decreases by 13.28%-20.54% for inconsistent ones compared to existing methods. MAJORScore serves as a more reliable metric for evaluating similarity on large-scale multimodal datasets and multimodal model performance evaluation.
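As a rough picture of relevance scoring in a joint latent space, one could average pairwise cosine similarities over N modality embeddings; this PyTorch sketch is a simplified stand-in, not the published MAJORScore formula:

```python
import torch
import torch.nn.functional as F

def joint_relevance_score(embeddings):
    """Mean pairwise cosine similarity of N modality embeddings (N >= 3)
    that already live in one joint latent space."""
    z = F.normalize(torch.stack(embeddings), dim=-1)   # (N, D), unit norm
    sim = z @ z.T                                      # cosine matrix
    n = z.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()        # drop self-similarity
    return off_diag / (n * (n - 1))                    # mean over ordered pairs
```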
[225] TUN3D: Towards Real-World Scene Understanding from Unposed Images
Anton Konushin, Nikita Drozdov, Bulat Gabdullin, Alexey Zakharov, Anna Vorontsova, Danila Rukhovich, Maksim Kolodiazhnyi
Main category: cs.CV
TL;DR: TUN3D is the first method for joint layout estimation and 3D object detection from multi-view images without requiring ground-truth camera poses or depth supervision, achieving state-of-the-art performance across multiple benchmarks.
Details
Motivation: Most existing approaches rely on point cloud input, which is limiting since consumer cameras typically lack depth sensors and visual-only data is more common. There's a need for methods that work with multi-view images without requiring camera poses or depth supervision.
Method: Builds on a lightweight sparse-convolutional backbone with two dedicated heads: one for 3D object detection and one for layout estimation using a novel parametric wall representation. Works with multi-view images without ground-truth camera poses or depth supervision.
Result: Achieves state-of-the-art performance across three challenging benchmarks: using ground-truth point clouds, posed images, and unposed images. Performs on par with specialized 3D object detection methods while significantly advancing layout estimation.
Conclusion: TUN3D sets a new benchmark in holistic indoor scene understanding by enabling joint layout estimation and 3D object detection from multi-view images without requiring camera poses or depth supervision.
Abstract: Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d.
[226] Safety Assessment of Scaffolding on Construction Site using AI
Sameer Prabhu, Amit Patwardhan, Ramin Karim
Main category: cs.CV
TL;DR: This paper proposes an AI-powered cloud platform that automates scaffolding inspection by comparing point cloud data with certified reference designs to detect structural modifications, aiming to reduce manual inspection time and improve construction site safety.
Details
Motivation: Current scaffolding inspections are primarily visual, time-intensive, and prone to human errors, which can lead to unsafe conditions on construction sites. There's a need for more accurate and efficient inspection methods.
Method: Developed a cloud-based AI platform that processes and analyzes point cloud data of scaffolding structures. The system compares recent point cloud data with certified reference data to detect structural modifications and deviations from design rules.
Result: The proposed system enables automated monitoring of scaffolding structures, detecting alterations that may compromise integrity and stability.
Conclusion: AI and digitization can enhance scaffolding inspection accuracy, reduce manual inspection time and effort, and contribute to improved safety on construction sites through automated structural monitoring.
Abstract: In the construction industry, safety assessment is vital to ensure both the reliability of assets and the safety of workers. Scaffolding, a key structural support asset, requires regular inspection to detect and identify alterations from the design rules that may compromise the integrity and stability. At present, inspections are primarily visual and are conducted by a site manager or accredited personnel to identify deviations. However, visual inspection is time-intensive and can be susceptible to human errors, which can lead to unsafe conditions. This paper explores the use of Artificial Intelligence (AI) and digitization to enhance the accuracy of scaffolding inspection and contribute to safety improvement. A cloud-based AI platform is developed to process and analyse the point cloud data of the scaffolding structure. The proposed system detects structural modifications through comparison and evaluation of certified reference data with the recent point cloud data. This approach may enable automated monitoring of scaffolding, reducing the time and effort required for manual inspections while enhancing safety on a construction site.
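The comparison of a fresh scan against certified reference data can be sketched as a nearest-neighbour deviation check; the 5 cm tolerance and the return values below are illustrative assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def detect_modifications(reference_xyz, scan_xyz, tol=0.05):
    """Flag scan points farther than `tol` (metres) from any point of the
    certified reference scaffold; large clusters suggest a modification."""
    tree = cKDTree(reference_xyz)            # index the reference cloud
    dist, _ = tree.query(scan_xyz, k=1)      # nearest reference neighbour
    flagged = scan_xyz[dist > tol]           # points with no close match
    return flagged, float(dist.mean())
```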
[227] Skeleton Sparsification and Densification Scale-Spaces
Julia Gierke, Pascal Peter
Main category: cs.CV
TL;DR: The paper introduces skeletonisation scale-spaces, a hierarchical framework that combines medial axis sparsification with scale-space theory to address noise sensitivity in skeletonisation while maintaining key scale-space properties.
Details
Motivation: The medial axis (Hamilton-Jacobi skeleton) is sensitive to noise, where minor boundary variations cause disproportionate skeletal expansions. Classical pruning methods help but lack systematic hierarchical properties.
Method: Proposes skeletonisation scale-spaces that leverage sparsification of the medial axis for hierarchical shape simplification. The framework satisfies scale-space properties including hierarchical architecture, controllable simplification, and geometric equivariance. Also introduces densification for inverse progression from coarse to fine scales.
Result: Proof-of-concept experiments demonstrate effectiveness for robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing. The framework produces hierarchical simplifications and can create overcomplete shape representations.
Conclusion: The skeletonisation scale-space framework provides a theoretically grounded approach that overcomes noise sensitivity in medial axis computation while enabling hierarchical shape analysis with practical applications in multiple domains.
Abstract: The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: They leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. This allows inverse progression from coarse to fine scales and can even reach beyond the original skeleton to produce overcomplete shape representations with relevancy for practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.
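A crude proxy for medial-axis sparsification keeps only skeleton points whose inscribed-disc radius exceeds a growing threshold; this scikit-image sketch illustrates the idea of a hierarchical skeleton family, not the paper's full framework:

```python
from skimage.morphology import medial_axis

def skeleton_scale_space(shape_mask, thresholds=(1, 2, 4, 8)):
    """Return an increasingly sparse family of skeletons by dropping
    points whose maximal inscribed disc is small (fine-scale detail)."""
    skel, dist = medial_axis(shape_mask, return_distance=True)
    return [skel & (dist >= t) for t in thresholds]   # fine to coarse
```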
[228] Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis
Aleksa Jelaca, Ying Jiao, Chang Tian, Marie-Francine Moens
Main category: cs.CV
TL;DR: Proposes an automatic prompt engineering framework for counterfactual size control in text-to-image generation, achieving better performance than state-of-the-art methods.
Details
Motivation: Address the challenge of counterfactual controllability in text-to-image generation, particularly for generating images that contradict common-sense patterns like size relationships.
Method: Three-component framework: image evaluator for dataset construction, supervised prompt rewriter for revised prompts, and DPO-trained ranker for optimal prompt selection. Uses extended Grounded SAM for improved evaluation.
Result: Created first counterfactual size text-image dataset. Achieved 114% improvement over backbone model. Outperformed state-of-the-art baselines and ChatGPT-4o.
Conclusion: Establishes a foundation for future research on counterfactual controllability in text-to-image generation.
Abstract: Text-to-image generation has advanced rapidly with large-scale multimodal training, yet fine-grained controllability remains a critical challenge. Counterfactual controllability, defined as the capacity to deliberately generate images that contradict common-sense patterns, remains a major challenge but plays a crucial role in enabling creativity and exploratory applications. In this work, we address this gap with a focus on counterfactual size (e.g., generating a tiny walrus beside a giant button) and propose an automatic prompt engineering framework that adapts base prompts into revised prompts for counterfactual images. The framework comprises three components: an image evaluator that guides dataset construction by identifying successful image generations, a supervised prompt rewriter that produces revised prompts, and a DPO-trained ranker that selects the optimal revised prompt. We construct the first counterfactual size text-image dataset and enhance the image evaluator by extending Grounded SAM with refinements, achieving a 114 percent improvement over its backbone. Experiments demonstrate that our method outperforms state-of-the-art baselines and ChatGPT-4o, establishing a foundation for future research on counterfactual controllability.
[229] On the Status of Foundation Models for SAR Imagery
Nathan Inkawhich
Main category: cs.CV
TL;DR: The paper investigates foundational AI/ML models for SAR object recognition, showing that while current visual models perform poorly on SAR data off-the-shelf, self-supervised finetuning with SAR data creates state-of-the-art SAR foundation models that outperform existing SAR-domain models.
Details
Motivation: To apply the transformative power of foundational AI models (trained with SSL on web-scale data) to the SAR domain, leveraging their benefits like limited labeled data adaptation, robustness to distribution shift, and transferable features.
Method: Tested powerful visual foundational models (DINOv2, DINOv3, PE-Core) on SAR data, then performed self-supervised finetuning of SSL models with SAR data to create AFRL-DINOv2 models, analyzing different backbones and downstream adaptation recipes.
Result: Self-supervised finetuning with SAR data produced state-of-the-art SAR foundation models (AFRL-DINOv2) that significantly outperformed the best existing SAR-domain model SARATR-X, with analysis of performance trade-offs across different backbones and adaptation methods.
Conclusion: While self-supervised finetuning is a viable path forward for SAR foundation models and achieves new state-of-the-art results, there is still significant room for improvement in SAR foundation model development.
Abstract: In this work we investigate the viability of foundational AI/ML models for Synthetic Aperture Radar (SAR) object recognition tasks. We are inspired by the tremendous progress being made in the wider community, particularly in the natural image domain where frontier labs are training huge models on web-scale datasets with unprecedented computing budgets. It has become clear that these models, often trained with Self-Supervised Learning (SSL), will transform how we develop AI/ML solutions for object recognition tasks - they can be adapted downstream with very limited labeled data, they are more robust to many forms of distribution shift, and their features are highly transferable out-of-the-box. For these reasons and more, we are motivated to apply this technology to the SAR domain. In our experiments we first run tests with today’s most powerful visual foundational models, including DINOv2, DINOv3 and PE-Core and observe their shortcomings at extracting semantically-interesting discriminative SAR target features when used off-the-shelf. We then show that Self-Supervised finetuning of publicly available SSL models with SAR data is a viable path forward by training several AFRL-DINOv2s and setting a new state-of-the-art for SAR foundation models, significantly outperforming today’s best SAR-domain model SARATR-X. Our experiments further analyze the performance trade-off of using different backbones with different downstream task-adaptation recipes, and we monitor each model’s ability to overcome challenges within the downstream environments (e.g., extended operating conditions and low amounts of labeled data). We hope this work will inform and inspire future SAR foundation model builders, because despite our positive results, we still have a long way to go.
[230] In silico Deep Learning Protocols for Label-Free Super-Resolution Microscopy: A Comparative Study of Network Architectures and SNR Dependence
Shiraz S Kaderuppan, Jonathan Mar, Andrew Irvine, Anurag Sharma, Muhammad Ramadan Saifuddin, Wai Leong Eugene Wong, Wai Lok Woo
Main category: cs.CV
TL;DR: This study evaluates deep neural networks (O-Net and Theta-Net) for super-resolution optical microscopy using non-fluorescent phase-contrast techniques, showing they are complementary approaches that perform differently based on image signal-to-noise ratios.
Details
Motivation: To address the resolution limitations of conventional optical microscopy (~200nm) without requiring expensive specialized equipment or fluorescent techniques, making super-resolution microscopy more accessible to non-specialist users.
Method: Used two custom deep neural network architectures (O-Net and Theta-Net) to super-resolve images from non-fluorescent phase-modulated microscopy (Zernike PCM and DIC), tested on custom-fabricated nanoscale targets calibrated via atomic force microscopy.
Result: Both O-Net and Theta-Net performed well but were complementary: O-Net performed better with high signal-to-noise ratio images, while Theta-Net was preferred for low signal-to-noise ratio images.
Conclusion: Model architecture and source image signal-to-noise ratio significantly impact super-resolution performance, even when using the same training data and epochs, highlighting the importance of choosing appropriate DNN models for non-fluorescent optical nanoscopy.
Abstract: The field of optical microscopy spans numerous industries and research domains, ranging from education to healthcare, quality inspection and analysis. Nonetheless, a key limitation often cited by optical microscopists is its lateral resolution limit (typically defined as ~200nm), with potential circumventions involving costly external modules (e.g. confocal scan heads) and/or specialized techniques [e.g. super-resolution (SR) fluorescent microscopy]. Addressing these challenges in a normal (non-specialist) context thus remains an aspect outside the scope of most microscope users & facilities. This study thus seeks to evaluate an alternative & economical approach to achieving SR optical microscopy, involving non-fluorescent phase-modulated microscopical modalities such as Zernike phase contrast (PCM) and differential interference contrast (DIC) microscopy. Two in silico deep neural network (DNN) architectures which we developed previously (termed O-Net and Theta-Net) are assessed on their abilities to resolve a custom-fabricated test target containing nanoscale features calibrated via atomic force microscopy (AFM). The results of our study demonstrate that although both O-Net and Theta-Net seemingly performed well when super-resolving these images, they were complementary (rather than competing) approaches to be considered for image SR, particularly under different image signal-to-noise ratios (SNRs). High image SNRs favoured the application of O-Net models, while low SNRs favoured Theta-Net models. These findings demonstrate the influence of model architecture (in conjunction with the source image SNR) on model performance and the SR quality of the generated images where DNN models are utilized for non-fluorescent optical nanoscopy, even when the same training dataset and number of epochs are used.
[231] Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
Yinfeng Yu, Hailong Zhang, Meiling Zhu
Main category: cs.CV
TL;DR: DMTF-AVN is a novel audio-visual navigation method that uses multi-target architecture and refined Transformer to dynamically fuse visual and audio cues, achieving state-of-the-art performance on Replica and Matterport3D datasets.
Details
Motivation: Prior works overlook deeper perceptual context in audiovisual navigation and fail to effectively leverage multimodal cues for guiding navigation.
Method: Proposes Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN) with a multi-target architecture and a refined Transformer mechanism to filter and selectively fuse cross-modal information.
Result: Achieves state-of-the-art performance on Replica and Matterport3D datasets, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA).
Conclusion: The model exhibits strong scalability and generalizability, paving the way for advanced multimodal fusion strategies in robotic navigation.
Abstract: Audiovisual embodied navigation enables robots to locate audio sources by dynamically integrating visual observations from onboard sensors with the auditory signals emitted by the target. The core challenge lies in effectively leveraging multimodal cues to guide navigation. While prior works have explored basic fusion of visual and audio data, they often overlook deeper perceptual context. To address this, we propose the Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN). Our approach uses a multi-target architecture coupled with a refined Transformer mechanism to filter and selectively fuse cross-modal information. Extensive experiments on the Replica and Matterport3D datasets demonstrate that DMTF-AVN achieves state-of-the-art performance, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA). Furthermore, the model exhibits strong scalability and generalizability, paving the way for advanced multimodal fusion strategies in robotic navigation. The code and videos are available at https://github.com/zzzmmm-svg/DMTF.
[232] SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders
Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto
Main category: cs.CV
TL;DR: SAEmnesia is a supervised sparse autoencoder method that creates one-to-one concept-neuron mappings for efficient concept unlearning in diffusion models, reducing search complexity by 96.67% and improving state-of-the-art performance by 9.22%.
Details
Motivation: Current concept unlearning methods require extensive search procedures because concept representations are distributed across multiple latent features, even with sparse autoencoders that reduce neuron polysemanticity.
Method: Supervised sparse autoencoder training with systematic concept labeling to promote one-to-one concept-neuron mappings, mitigating feature splitting and promoting feature centralization through cross-entropy computation.
Result: Learns specialized neurons with stronger concept associations, reduces hyperparameter search by 96.67%, achieves 9.22% improvement on UnlearnCanvas benchmark, and shows 28.4% improvement in sequential unlearning for 9-object removal.
Conclusion: SAEmnesia provides an efficient and scalable approach to concept unlearning with minimal computational overhead and significant performance improvements over current methods.
Abstract: Effective concept unlearning in text-to-image diffusion models requires precise localization of concept representations within the model’s latent space. While sparse autoencoders successfully reduce neuron polysemanticity (i.e., multiple concepts per neuron) compared to the original network, individual concept representations can still be distributed across multiple latent features, requiring extensive search procedures for concept unlearning. We introduce SAEmnesia, a supervised sparse autoencoder training method that promotes one-to-one concept-neuron mappings through systematic concept labeling, mitigating feature splitting and promoting feature centralization. Our approach learns specialized neurons with significantly stronger concept associations compared to unsupervised baselines. The only computational overhead introduced by SAEmnesia is limited to cross-entropy computation during training. At inference time, this interpretable representation reduces hyperparameter search by 96.67% with respect to current approaches. On the UnlearnCanvas benchmark, SAEmnesia achieves a 9.22% improvement over the state-of-the-art. In sequential unlearning tasks, we demonstrate superior scalability with a 28.4% improvement in unlearning accuracy for 9-object removal.
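Since the abstract notes that the only training overhead is a cross-entropy term, a toy supervised sparse autoencoder might look as follows; the layer sizes, ReLU encoder, and L1 sparsity penalty are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedSAE(nn.Module):
    """Sparse autoencoder plus a concept-classification head that nudges
    each labeled concept toward a dedicated latent neuron."""
    def __init__(self, d_model, d_latent, n_concepts):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.cls = nn.Linear(d_latent, n_concepts)   # supervision head

    def forward(self, h, concept_label, l1=1e-3):
        z = torch.relu(self.enc(h))                  # sparse latent code
        loss = (F.mse_loss(self.dec(z), h)           # reconstruction
                + l1 * z.abs().mean()                # sparsity penalty
                + F.cross_entropy(self.cls(z), concept_label))
        return z, loss
```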
[233] Coreset selection based on Intra-class diversity
Imran Ashraf, Mukhtar Ullah, Muhammad Faisal Nadeem, Muhammad Nouman Noor
Main category: cs.CV
TL;DR: The paper proposes an intelligent coreset selection method that extracts intra-class diversity through clustering to create representative subsets for deep learning model training, outperforming random sampling on biomedical image classification tasks.
Details
Motivation: Deep learning models for biomedical image classification require substantial computational resources and time due to large datasets and hyperparameter search. Random sampling for coreset selection fails to capture intra-class diversity and can be biased towards dominant classes in imbalanced datasets.
Method: The paper introduces a lightweight mechanism that extracts intra-class diversity by forming per-class clusters, which are then used for final sampling to create representative coresets.
Result: Extensive classification experiments on a biomedical imaging dataset show the proposed scheme outperforms random sampling on several performance metrics under uniform conditions.
Conclusion: The intelligent coreset selection method effectively addresses the limitations of random sampling by capturing intra-class diversity, providing a more representative subset for efficient deep learning model training.
Abstract: Deep Learning models have transformed various domains, including the healthcare sector, particularly biomedical image classification by learning intricate features and enabling accurate diagnostics pertaining to complex diseases. Recent studies have adopted two different approaches to train DL models: training from scratch and transfer learning. Both approaches demand substantial computational time and resources due to the involvement of massive datasets in model training. These computational demands are further increased due to the design-space exploration required for selecting optimal hyperparameters, which typically necessitates several training rounds. With the growing sizes of datasets, exploring solutions to this problem has recently gained the research community's attention. A plausible solution is to select a subset of the dataset for training and hyperparameter search. This subset, referred to as the coreset, must be a representative set of the original dataset. A straightforward approach to selecting the coreset could be employing random sampling, albeit at the cost of compromising the representativeness of the original dataset. A critical limitation of random sampling is the bias towards the dominant classes in an imbalanced dataset. Even if the dataset has inter-class balance, this random sampling will not capture intra-class diversity. This study addresses this issue by introducing an intelligent, lightweight mechanism for coreset selection. Specifically, it proposes a method to extract intra-class diversity, forming per-class clusters that are utilized for the final sampling. We demonstrate the efficacy of the proposed methodology by conducting extensive classification experiments on a well-known biomedical imaging dataset. Results demonstrate that the proposed scheme outperforms the random sampling approach on several performance metrics for uniform conditions.
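A minimal sketch of per-class clustering followed by even sampling across clusters, assuming scikit-learn k-means; the cluster count and per-class quota are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_coreset(features, labels, per_class=100, n_clusters=10, seed=0):
    """Sample a coreset by clustering each class and drawing evenly from
    every cluster, so intra-class diversity survives the subsampling."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        k = min(n_clusters, len(idx))                  # guard tiny classes
        assign = KMeans(n_clusters=k, random_state=seed,
                        n_init=10).fit_predict(features[idx])
        quota = max(1, per_class // k)
        for j in range(k):
            members = idx[assign == j]
            take = min(quota, len(members))
            keep.extend(rng.choice(members, size=take, replace=False).tolist())
    return np.asarray(keep)
```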
[234] The LongiMam model for improved breast cancer risk prediction using longitudinal mammograms
Manel Rakez, Thomas Louis, Julien Guillaumin, Foucauld Chamming’s, Pierre Fillard, Brice Amadeo, Virginie Rondeau
Main category: cs.CV
TL;DR: LongiMam is a deep learning model that uses current and up to four prior mammograms to improve breast cancer prediction by capturing both spatial and temporal patterns, outperforming single-visit models.
Details
Motivation: Current deep learning models for breast cancer screening use limited prior mammograms and lack adaptation for real-world settings with imbalanced outcomes and heterogeneous follow-up.
Method: Developed LongiMam, an end-to-end deep learning model combining convolutional and recurrent neural networks to integrate current and up to four prior mammograms, capturing spatial and temporal patterns.
Result: LongiMam consistently improved prediction when prior mammograms were included, with current+prior visits outperforming single-visit models. Model performed best in women with observed mammographic density changes over time and was effective across key risk groups.
Conclusion: Longitudinal modeling enhances breast cancer prediction and supports using repeated mammograms to refine risk stratification in screening programs. LongiMam is available as open-source software.
Abstract: Risk-adapted breast cancer screening requires robust models that leverage longitudinal imaging data. Most current deep learning models use single or limited prior mammograms and lack adaptation for real-world settings marked by imbalanced outcome distribution and heterogeneous follow-up. We developed LongiMam, an end-to-end deep learning model that integrates both current and up to four prior mammograms. LongiMam combines a convolutional and a recurrent neural network to capture spatial and temporal patterns predictive of breast cancer. The model was trained and evaluated using a large, population-based screening dataset with disproportionate case-to-control ratio typical of clinical screening. Across several scenarios that varied in the number and composition of prior exams, LongiMam consistently improved prediction when prior mammograms were included. The addition of prior and current visits outperformed single-visit models, while priors alone performed less well, highlighting the importance of combining historical and recent information. Subgroup analyses confirmed the model’s efficacy across key risk groups, including women with dense breasts and those aged 55 years or older. Moreover, the model performed best in women with observed changes in mammographic density over time. These findings demonstrate that longitudinal modeling enhances breast cancer prediction and support the use of repeated mammograms to refine risk stratification in screening programs. LongiMam is publicly available as open-source software.
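The convolutional-plus-recurrent pairing can be sketched as a shared CNN encoder feeding a GRU over up to five visits; the layer sizes below are placeholders, not the published architecture:

```python
import torch
import torch.nn as nn

class LongitudinalRiskModel(nn.Module):
    """Shared CNN encoder per visit, GRU over time, risk logit at the end."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                       # toy spatial encoder
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)              # cancer-risk logit

    def forward(self, exams):                           # (B, T, 1, H, W), T <= 5
        b, t = exams.shape[:2]
        f = self.cnn(exams.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(f)                              # final hidden state
        return self.head(h[-1])
```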
[235] WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang
Main category: cs.CV
TL;DR: WAVE introduces unified audio-visual embeddings using multimodal LLMs, enabling any-to-any cross-modal retrieval and prompt-aware embeddings through hierarchical feature fusion and joint multi-task training.
Details
Motivation: Current multimodal LLM embeddings are underexplored for dynamic modalities like audio and video, creating a need for unified representations across text, audio, and video.
Method: Uses a hierarchical feature fusion strategy and joint multi-modal, multi-task training to create unified embeddings that support any-to-any cross-modal retrieval and prompt-aware generation.
Result: Sets new SOTA on MMEB-v2 video benchmark, achieves superior audio/video-to-audio retrieval, and significantly outperforms existing models in multimodal QA. Ablation studies confirm joint training benefits.
Conclusion: WAVE enables broad cross-modal applications and opens possibilities for versatile audio-visual learning with its unified representation space and prompt-aware capabilities.
Abstract: While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
[236] Assessing the Alignment of Popular CNNs to the Brain for Valence Appraisal
Laurent Mertens, Elahe’ Yargholi, Laura Van Hove, Hans Op de Beeck, Jan Van den Stock, Joost Vennekens
Main category: cs.CV
TL;DR: CNNs show limited correspondence with human social cognition processes like image valence appraisal, struggling to go beyond basic visual processing and not reflecting higher-order brain functions.
Details
Motivation: To explore whether CNN-brain correspondences extend beyond general visual perception to more complex social cognition processes, specifically image valence appraisal.
Method: Used correlation analysis between CNN architectures and human behavioral/fMRI data, and developed the Object2Brain framework combining GradCAM, object detection, and correlation analysis to study object class influences.
Result: CNNs struggle with higher-order social cognition tasks and show different object class sensitivities across architectures despite similar correlation trends.
Conclusion: CNN-brain correspondence is limited for complex social cognition tasks, suggesting CNNs may not adequately model higher-order human brain processing in social contexts.
Abstract: Convolutional Neural Networks (CNNs) are a popular type of computer model that have proven their worth in many computer vision tasks. Moreover, they form an interesting study object for the field of psychology, with shown correspondences between the workings of CNNs and the human brain. However, these correspondences have so far mostly been studied in the context of general visual perception. In contrast, this paper explores to what extent this correspondence also holds for a more complex brain process, namely social cognition. To this end, we assess the alignment between popular CNN architectures and both human behavioral and fMRI data for image valence appraisal through a correlation analysis. We show that for this task CNNs struggle to go beyond simple visual processing, and do not seem to reflect higher-order brain processing. Furthermore, we present Object2Brain, a novel framework that combines GradCAM and object detection at the CNN-filter level with the aforementioned correlation analysis to study the influence of different object classes on the CNN-to-human correlations. Despite similar correlation trends, different CNN architectures are shown to display different object class sensitivities.
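The core correlation analysis reduces to comparing per-image model scores against mean human ratings; a minimal SciPy version:

```python
from scipy.stats import pearsonr

def valence_alignment(cnn_scores, human_ratings):
    """Pearson correlation between a CNN's per-image valence predictions
    and mean human ratings; a weak r echoes the limited alignment above."""
    r, p = pearsonr(cnn_scores, human_ratings)
    return r, p
```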
[237] High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling
Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Main category: cs.CV
TL;DR: DAVIS is a generative audio-visual separation framework using diffusion models (DDPM and Flow Matching) that directly synthesizes separated sound spectrograms from noise, conditioned on mixed audio and visual inputs.
Details
Motivation: Existing mask-based regression methods for sound separation have limitations in capturing complex data distributions needed for high-quality separation across diverse sound categories.
Method: Uses Denoising Diffusion Probabilistic Models (DDPM) and Flow Matching integrated within a Separation U-Net to synthesize separated sound spectrograms from noise distribution, conditioned on mixed audio and visual information.
Result: Both DDPM and Flow Matching variants of DAVIS surpass existing methods on AVE and MUSIC datasets, demonstrating superior separation quality.
Conclusion: The generative framework proves effective for audio-visual source separation, with both diffusion-based approaches outperforming current state-of-the-art methods.
Abstract: We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS circumvents these issues by leveraging potent generative modeling paradigms, specifically Denoising Diffusion Probabilistic Models (DDPM) and the more recent Flow Matching (FM), integrated within a specialized Separation U-Net architecture. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information. The inherent nature of its generative objective makes DAVIS particularly adept at producing high-quality sound separations for diverse sound categories. We present comparative evaluations of DAVIS, encompassing both its DDPM and Flow Matching variants, against leading methods on the standard AVE and MUSIC datasets. The results affirm that both variants surpass existing approaches in separation quality, highlighting the efficacy of our generative framework for tackling the audio-visual source separation task.
[238] Debugging Concept Bottleneck Models through Removal and Retraining
Eric Enouen, Sainyam Galhotra
Main category: cs.CV
TL;DR: CBDebug is an interpretable debugging framework for Concept Bottleneck Models that removes undesired concepts and retrains models using concept-level feedback converted to sample-level labels for bias mitigation.
Details
Motivation: Address systemic misalignment between CBMs and expert reasoning, particularly when models learn shortcuts from biased data that concept interventions alone cannot fix.
Method: Two-step process: 1) Removal step where experts identify and remove undesired concepts using concept explanations; 2) Retraining step using CBDebug to convert concept-level feedback into sample-level auxiliary labels for supervised bias mitigation and targeted augmentation.
Result: CBDebug significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.
Conclusion: The framework effectively reduces model reliance on undesired concepts by leveraging CBM interpretability as a bridge for converting expert feedback into actionable training signals.
Abstract: Concept Bottleneck Models (CBMs) use a set of human-interpretable concepts to predict the final task label, enabling domain experts to not only validate the CBM’s predictions, but also intervene on incorrect concepts at test time. However, these interventions fail to address systemic misalignment between the CBM and the expert’s reasoning, such as when the model learns shortcuts from biased data. To address this, we present a general interpretable debugging framework for CBMs that follows a two-step process of Removal and Retraining. In the Removal step, experts use concept explanations to identify and remove any undesired concepts. In the Retraining step, we introduce CBDebug, a novel method that leverages the interpretability of CBMs as a bridge for converting concept-level user feedback into sample-level auxiliary labels. These labels are then used to apply supervised bias mitigation and targeted augmentation, reducing the model’s reliance on undesired concepts. We evaluate our framework with both real and automated expert feedback, and find that CBDebug significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.
[239] Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence
Sanish Suwal, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi
Main category: cs.CV
TL;DR: This paper investigates how magnitude-based pruning affects neural network interpretability, finding that light-to-moderate pruning improves saliency map focus and faithfulness while retaining meaningful concepts, but aggressive pruning reduces interpretability despite maintaining accuracy.
Details
Motivation: To understand the impact of neural network pruning on model interpretability, as prior works showed networks can be heavily pruned while preserving performance but the effect on interpretability remained unclear.
Method: Used ResNet-18 trained on ImageNette, applied magnitude-based pruning followed by fine-tuning, compared post-hoc explanations from Vanilla Gradients and Integrated Gradients across pruning levels, and applied CRAFT-based concept extraction to track semantic coherence changes.
Result: Light-to-moderate pruning improved saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. Aggressive pruning merged heterogeneous features, reducing saliency map sparsity and concept coherence despite maintaining accuracy.
Conclusion: Pruning can shape internal representations toward more human-aligned attention patterns, but excessive pruning undermines interpretability.
Abstract: Prior works have shown that neural networks can be heavily pruned while preserving performance, but the impact of pruning on model interpretability remains unclear. In this work, we investigate how magnitude-based pruning followed by fine-tuning affects both low-level saliency maps and high-level concept representations. Using a ResNet-18 trained on ImageNette, we compare post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) across pruning levels, evaluating sparsity and faithfulness. We further apply CRAFT-based concept extraction to track changes in semantic coherence of learned concepts. Our results show that light-to-moderate pruning improves saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. In contrast, aggressive pruning merges heterogeneous features, reducing saliency map sparsity and concept coherence despite maintaining accuracy. These findings suggest that while pruning can shape internal representations toward more human-aligned attention patterns, excessive pruning undermines interpretability.
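A compact PyTorch sketch of the experimental recipe, combining global magnitude pruning with a Vanilla Gradients saliency map; the pruning amount and the max-over-channels reduction are illustrative choices, not the paper's code:

```python
import torch
import torch.nn.utils.prune as prune

def prune_and_explain(model, image, target_class, amount=0.3):
    """Globally prune the smallest-magnitude conv weights, then compute a
    Vanilla Gradients saliency map for one (C, H, W) image."""
    convs = [(m, "weight") for m in model.modules()
             if isinstance(m, torch.nn.Conv2d)]
    prune.global_unstructured(convs, pruning_method=prune.L1Unstructured,
                              amount=amount)

    image = image.clone().requires_grad_(True)
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()                                  # d(score) / d(pixels)
    saliency = image.grad.abs().max(dim=0).values     # max over channels
    return saliency
```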
[240] Large AI Model-Enabled Generative Semantic Communications for Image Transmission
Qiyu Ma, Wanli Ni, Zhijin Qin
Main category: cs.CV
TL;DR: A generative semantic communication system that segments images into key/non-key regions, processes them differently, and uses lightweight deployment strategies to improve transmission efficiency and quality.
Details
Motivation: Existing methods neglect the varying importance of different image regions, compromising reconstruction quality of critical visual content.
Method: Segment images into key and non-key regions; key regions processed with image-oriented semantic encoder, non-key regions compressed via image-to-text modeling; uses model quantization and low-rank adaptation for lightweight deployment.
Result: Outperforms traditional methods in both semantic fidelity and visual quality for image transmission tasks.
Conclusion: The proposed system effectively enhances image transmission by addressing regional importance differences and optimizing resource utilization.
Abstract: The rapid development of generative artificial intelligence (AI) has introduced significant opportunities for enhancing the efficiency and accuracy of image transmission within semantic communication systems. Despite these advancements, existing methodologies often neglect the difference in importance of different regions of the image, potentially compromising the reconstruction quality of visually critical content. To address this issue, we introduce an innovative generative semantic communication system that refines semantic granularity by segmenting images into key and non-key regions. Key regions, which contain essential visual information, are processed using an image oriented semantic encoder, while non-key regions are efficiently compressed through an image-to-text modeling approach. Additionally, to mitigate the substantial storage and computational demands posed by large AI models, the proposed system employs a lightweight deployment strategy incorporating model quantization and low-rank adaptation fine-tuning techniques, significantly boosting resource utilization without sacrificing performance. Simulation results demonstrate that the proposed system outperforms traditional methods in terms of both semantic fidelity and visual quality, thereby affirming its effectiveness for image transmission tasks.
[241] Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution
Xingjian Wang, Li Chai, Jiming Chen
Main category: cs.CV
TL;DR: FDDiff is a novel frequency domain-guided multiscale diffusion model for super-resolution that progressively complements high-frequency details using wavelet packet decomposition to avoid hallucination artifacts.
Details
Motivation: Existing diffusion-based super-resolution models directly predict wide bandwidth high-frequency information using only high-resolution ground truth, which causes hallucination problems and mismatching artifacts.
Method: Proposes a wavelet packet-based frequency degradation pyramid to provide multiscale intermediate targets with increasing bandwidth, and guides reverse diffusion to progressively complement high-frequency details. Also designs a multiscale frequency refinement network to predict high-frequency components at multiple scales.
Result: Comprehensive evaluations on popular benchmarks show FDDiff outperforms prior generative methods with higher-fidelity super-resolution results.
Conclusion: The proposed FDDiff successfully addresses hallucination problems in diffusion-based super-resolution by decomposing the high-frequency complementing process into finer-grained steps using frequency domain guidance.
Abstract: The performance of single image super-resolution depends heavily on how to generate and complement high-frequency details to low-resolution images. Recently, diffusion-based DDPM models have exhibited great potential in generating high-quality details for super-resolution tasks. They tend to directly predict high-frequency information of wide bandwidth by solely utilizing the high-resolution ground truth as the target for all sampling timesteps. However, as a result, they encounter a hallucination problem and generate mismatching artifacts. To tackle this problem and achieve higher-quality super-resolution, we propose a novel Frequency Domain-guided multiscale Diffusion model (FDDiff), which decomposes the high-frequency information complementing process into finer-grained steps. In particular, a wavelet packet-based frequency degradation pyramid is developed to provide multiscale intermediate targets with increasing bandwidth. Based on these targets, FDDiff guides reverse diffusion process to progressively complement missing high-frequency details over timesteps. Moreover, a multiscale frequency refinement network is designed to predict the required high-frequency components at multiple scales within one unified network. Comprehensive evaluations on popular benchmarks are conducted, and demonstrate that FDDiff outperforms prior generative methods with higher-fidelity super-resolution results.
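The frequency degradation pyramid can be approximated with a plain discrete wavelet transform (the paper uses wavelet packets): zeroing progressively fewer detail bands yields intermediate targets of increasing bandwidth, as in this sketch:

```python
import numpy as np
import pywt

def frequency_pyramid(hr_image, levels=3, wavelet="db2"):
    """Reconstruction targets of increasing bandwidth: keep the `keep`
    coarsest detail levels and zero the rest before inverting."""
    coeffs = pywt.wavedec2(hr_image, wavelet, level=levels)
    targets = []
    for keep in range(1, levels + 1):
        c = [coeffs[0]] + [
            d if i < keep else tuple(np.zeros_like(b) for b in d)
            for i, d in enumerate(coeffs[1:])   # coarsest detail first
        ]
        targets.append(pywt.waverec2(c, wavelet))
    return targets   # coarse-to-fine intermediate targets
```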
[242] mmHSense: Multi-Modal and Distributed mmWave ISAC Datasets for Human Sensing
Nabeel Nisar Bhat, Maksim Karnaukh, Stein Vandenbroeke, Wouter Lemoine, Jakob Struye, Jesus Omar Lacruz, Siddhartha Kumar, Mohammad Hossein Moghaddam, Joerg Widmer, Rafael Berkvens, Jeroen Famaey
Main category: cs.CV
TL;DR: mmHSense is an open labeled mmWave dataset collection for human sensing research in ISAC systems, supporting applications like gesture recognition, person identification, pose estimation, and localization.
Details
Motivation: To provide comprehensive datasets that support research in mmWave Integrated Sensing and Communication (ISAC) systems for human sensing applications, addressing the need for standardized data in this emerging field.
Method: Created labeled mmWave datasets using a testbed with specific experimental settings and signal features, and demonstrated utility through validation on downstream tasks using parameter-efficient fine-tuning to adapt ISAC models.
Result: Successfully developed and released open datasets that enable research in various human sensing applications, and demonstrated effective model adaptation through fine-tuning that reduces computational complexity while maintaining performance.
Conclusion: mmHSense datasets provide valuable resources for advancing mmWave ISAC research, supporting both signal processing and deep learning approaches while enabling efficient model adaptation across different human sensing tasks.
Abstract: This article presents mmHSense, a set of open labeled mmWave datasets to support human sensing research within Integrated Sensing and Communication (ISAC) systems. The datasets can be used to explore mmWave ISAC for various end applications such as gesture recognition, person identification, pose estimation, and localization. Moreover, the datasets can be used to develop and advance signal processing and deep learning research on mmWave ISAC. This article describes the testbed, experimental settings, and signal features for each dataset. Furthermore, the utility of the datasets is demonstrated through validation on a specific downstream task. In addition, we demonstrate the use of parameter-efficient fine-tuning to adapt ISAC models to different tasks, significantly reducing computational complexity while maintaining performance on prior tasks.
[243] HiSin: A Sinogram-Aware Framework for Efficient High-Resolution Inpainting
Jiaze E, Srutarshi Banerjee, Tekin Bicer, Guannan Wang, Yanfu Zhang, Bin Ren
Main category: cs.CV
TL;DR: HiSin is a diffusion-based framework for efficient high-resolution sinogram inpainting that reduces memory usage by 30.81% and inference time by 17.58% while maintaining accuracy.
Details
Motivation: High-resolution sinogram inpainting is essential for CT reconstruction to avoid artifacts, but current diffusion models face excessive memory and computational demands for high-resolution inputs.
Method: HiSin exploits spectral sparsity and structural heterogeneity of projection data, progressively extracting global structure at low resolution and deferring high-resolution inference to small patches with frequency-aware patch skipping and structure-adaptive step allocation.
Result: HiSin reduces peak memory usage by up to 30.81% and inference time by up to 17.58% compared to state-of-the-art methods while maintaining inpainting accuracy.
Conclusion: The proposed HiSin framework enables efficient high-resolution sinogram inpainting by leveraging spectral properties and structural features, making diffusion models practical for CT reconstruction applications.
Abstract: High-resolution sinogram inpainting is essential for computed tomography reconstruction, as missing high-frequency projections can lead to visible artifacts and diagnostic errors. Diffusion models are well-suited for this task due to their robustness and detail-preserving capabilities, but their application to high-resolution inputs is limited by excessive memory and computational demands. To address this limitation, we propose HiSin, a novel diffusion-based framework for efficient sinogram inpainting that exploits spectral sparsity and structural heterogeneity of projection data. It progressively extracts global structure at low resolution and defers high-resolution inference to small patches, enabling memory-efficient inpainting. Considering the structural features of sinograms, we incorporate frequency-aware patch skipping and structure-adaptive step allocation to reduce redundant computation. Experimental results show that HiSin reduces peak memory usage by up to 30.81% and inference time by up to 17.58% compared with the state-of-the-art framework, while maintaining inpainting accuracy.
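Frequency-aware patch skipping can be pictured as ranking patches by high-frequency FFT energy and skipping the smooth ones; the patch size, low-pass mask, and threshold below are illustrative assumptions:

```python
import numpy as np

def patches_to_refine(sinogram, patch=64, ratio_thresh=0.05):
    """Select only patches whose high-frequency energy ratio is non-trivial;
    smooth patches are skipped to save diffusion passes."""
    h, w = sinogram.shape
    selected = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tile = sinogram[y:y + patch, x:x + patch]
            spec = np.abs(np.fft.fftshift(np.fft.fft2(tile)))
            total = spec.sum() + 1e-8
            c = patch // 2
            spec[c - 8:c + 8, c - 8:c + 8] = 0.0    # remove the low-pass core
            if spec.sum() / total > ratio_thresh:
                selected.append((y, x))             # worth a diffusion pass
    return selected
```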
[244] Downscaling climate projections to 1 km with single-image super resolution
Petr Košťál, Pavel Kordík, Ondřej Podsztavek
Main category: cs.CV
TL;DR: Using single-image super-resolution models to downscale low-resolution climate projections from 12.5 km to 1 km resolution, trained on observational data and evaluated using climate indicators.
Details
Motivation: High-resolution climate projections are needed for local decision-making but current projections have low spatial resolution (12.5 km), limiting their usability.
Method: Train single-image super-resolution models on high-resolution observational gridded data, then apply them to low-resolution climate projections. Use climate indicator-based assessment with observed climate indices from weather stations for evaluation.
Result: Experiments on daily mean temperature show that super-resolution models can downscale climate projections without increasing the error of climate indicators compared to original low-resolution projections.
Conclusion: Single-image super-resolution models are effective for statistically downscaling climate projections to higher resolution when high-resolution training data is unavailable.
Abstract: High-resolution climate projections are essential for local decision-making. However, available climate projections have low spatial resolution (e.g. 12.5 km), which limits their usability. We address this limitation by leveraging single-image super-resolution models to statistically downscale climate projections to 1-km resolution. Since high-resolution climate projections are unavailable for training, we train models on a high-resolution observational gridded data set and apply them to low-resolution climate projections. We propose a climate indicator-based assessment using observed climate indices computed at weather station locations to evaluate the downscaled climate projections without ground-truth high-resolution climate projections. Experiments on daily mean temperature demonstrate that single-image super-resolution models can downscale climate projections without increasing the error of climate indicators compared to low-resolution climate projections.
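The indicator-based assessment amounts to computing a climate index from the downscaled grid at station pixels and comparing it with the observed station index; the array shapes and the MAE choice here are assumptions:

```python
import numpy as np

def indicator_mae(downscaled, station_obs, station_idx):
    """MAE of an annual-mean-temperature indicator at station locations.
    downscaled: (days, H, W) grid; station_idx: (row, col) per station."""
    annual_mean = downscaled.mean(axis=0)
    pred = np.array([annual_mean[r, c] for r, c in station_idx])
    return float(np.abs(pred - station_obs).mean())
```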
[245] STQE: Spatial-Temporal Attribute Quality Enhancement for G-PCC Compressed Dynamic Point Clouds
Tian Guo, Hui Yuan, Xiaolong Mao, Shiqi Jiang, Raouf Hamzaoui, Sam Kwong
Main category: cs.CV
TL;DR: Proposed STQE network enhances compressed dynamic point cloud quality by exploiting spatial-temporal correlations through motion compensation, temporal attention, spatial feature aggregation, and joint loss optimization.
Details
Motivation: Very few studies address quality enhancement for compressed dynamic point clouds, with effective exploitation of spatial-temporal correlations between frames remaining largely unexplored.
Method: STQE network includes: recoloring-based motion compensation for inter-frame alignment, channel-aware temporal attention for dynamic region highlighting, Gaussian-guided neighborhood feature aggregation for spatial dependencies, and Pearson correlation-based joint loss to prevent over-smoothing.
Result: Applied to G-PCC test model, achieved improvements of 0.855 dB, 0.682 dB, and 0.828 dB delta PSNR with BD-rate reductions of -25.2%, -31.6%, and -32.5% for Luma, Cb, and Cr components respectively.
Conclusion: STQE network effectively enhances visual quality of compressed dynamic point clouds by exploiting spatial-temporal correlations, demonstrating significant performance improvements over existing methods.
Abstract: Very few studies have addressed quality enhancement for compressed dynamic point clouds. In particular, the effective exploitation of spatial-temporal correlations between point cloud frames remains largely unexplored. Addressing this gap, we propose a spatial-temporal attribute quality enhancement (STQE) network that exploits both spatial and temporal correlations to improve the visual quality of G-PCC compressed dynamic point clouds. Our contributions include a recoloring-based motion compensation module that remaps reference attribute information to the current frame geometry to achieve precise inter-frame geometric alignment, a channel-aware temporal attention module that dynamically highlights relevant regions across bidirectional reference frames, a Gaussian-guided neighborhood feature aggregation module that efficiently captures spatial dependencies between geometry and color attributes, and a joint loss function based on the Pearson correlation coefficient, designed to alleviate over-smoothing effects typical of point-wise mean squared error optimization. When applied to the latest G-PCC test model, STQE achieved improvements of 0.855 dB, 0.682 dB, and 0.828 dB in delta PSNR, with Bjøntegaard delta rate (BD-rate) reductions of -25.2%, -31.6%, and -32.5% for the Luma, Cb, and Cr components, respectively.
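To make the joint loss concrete, here is a minimal PyTorch sketch of a Pearson-correlation term combined with MSE; the weighting factor alpha and the tensor shapes are illustrative assumptions, not the paper’s exact formulation:

```python
import torch
import torch.nn.functional as F

def pearson_loss(pred, target, eps=1e-8):
    # 1 - Pearson correlation between predicted and ground-truth attributes.
    # Unlike point-wise MSE, this rewards preserving structural variation
    # instead of regressing toward the mean, which causes over-smoothing.
    pred_c = pred - pred.mean()
    tgt_c = target - target.mean()
    corr = (pred_c * tgt_c).sum() / (pred_c.norm() * tgt_c.norm() + eps)
    return 1.0 - corr

def joint_loss(pred, target, alpha=0.5):
    # Hypothetical combination; alpha is an assumed hyperparameter.
    return F.mse_loss(pred, target) + alpha * pearson_loss(pred, target)
```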
[246] JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation
Md Jueal Mia, M. Hadi Amini
Main category: cs.CV
TL;DR: JaiLIP is a jailbreaking attack method that uses loss-guided image perturbations to make Vision-Language Models generate harmful outputs while maintaining imperceptible modifications.
Details
Motivation: Vision-Language Models are vulnerable to image-based jailbreaking attacks that can bypass safety alignments, and existing methods have unstable performance and visible perturbations.
Method: Jailbreaking with Loss-guided Image Perturbation (JaiLIP) minimizes a joint objective combining MSE loss between clean/adversarial images and the model’s harmful-output loss.
Result: JaiLIP generates highly effective and imperceptible adversarial images that outperform existing methods in producing toxicity, as measured by Perspective API and Detoxify metrics.
Conclusion: Image-based jailbreak attacks pose practical challenges for VLMs, highlighting the need for efficient defense mechanisms to protect against such vulnerabilities.
Abstract: Vision-Language Models (VLMs) have remarkable abilities in multimodal reasoning tasks. However, potential misuse and safety alignment concerns of VLMs have increased significantly due to different categories of attack vectors. Among various attack vectors, recent studies have demonstrated that image-based perturbations are particularly effective in generating harmful outputs. Many techniques have been proposed to jailbreak VLMs, but they suffer from unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model’s harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods in producing toxicity. Moreover, we evaluate our method in the transportation domain to demonstrate the attack’s practicality beyond toxic text generation. Our findings emphasize the practical challenges of image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.
[247] Overview of ExpertLifeCLEF 2018: how far automated identification systems are from the best experts?
Herve Goeau, Pierre Bonnet, Alexis Joly
Main category: cs.CV
TL;DR: The LifeCLEF 2018 ExpertCLEF challenge compared automated plant identification systems with human botanical experts, finding that deep learning models now approach human-level performance.
Details
Motivation: To quantify the gap between automated species identification systems and human expertise, and compare their performance given the inherent uncertainty in visual observations.
Method: Evaluated 19 deep-learning systems from 4 research teams against 9 expert botanists specializing in French flora, using standardized resources and assessments.
Result: State-of-the-art deep learning models achieved performance close to the most advanced human expertise in plant identification.
Conclusion: Automated species identification using deep learning has reached near-human expert levels, demonstrating significant progress in the field.
Abstract: Automated identification of plants and animals has improved considerably in the last few years, in particular thanks to recent advances in deep learning. The next big question is how far such automated systems are from human expertise. Indeed, even the best experts are sometimes confused and/or disagree with each other when validating visual or audio observations of living organisms. A picture actually contains only partial information, which is usually not sufficient to determine the right species with certainty. Quantifying this uncertainty and comparing it to the performance of automated systems is of high interest for both computer scientists and expert naturalists. The LifeCLEF 2018 ExpertCLEF challenge presented in this paper was designed to allow this comparison between human experts and automated systems. In total, 19 deep-learning systems implemented by 4 different research teams were evaluated against 9 expert botanists of the French flora. The main outcome of this work is that the performance of state-of-the-art deep learning models is now close to the most advanced human expertise. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
[248] QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models
Jian Liu, Chunshi Wang, Song Guo, Haohan Weng, Zhen Zhou, Zhiqi Li, Jiaao Yu, Yiling Zhu, Jing Xu, Biwen Lei, Zhuo Chen, Chunchao Guo
Main category: cs.CV
TL;DR: QuadGPT is the first autoregressive framework for end-to-end quadrilateral mesh generation, using sequence prediction with unified tokenization for mixed topologies and RL fine-tuning to surpass previous triangle-to-quad conversion methods.
Details
Motivation: Existing methods generate quad meshes by first creating triangle meshes and then merging them, which produces poor topology quality in the resulting quadrilateral meshes.
Method: QuadGPT uses an autoregressive framework with sequence prediction, featuring: 1) unified tokenization to handle mixed triangle/quad topologies, and 2) specialized Reinforcement Learning fine-tuning (tDPO) for better generation quality.
Result: Extensive experiments show QuadGPT significantly outperforms previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality.
Conclusion: The work establishes a new benchmark for native quad-mesh generation and demonstrates the effectiveness of combining large-scale autoregressive models with topology-aware RL refinement for structured 3D asset creation.
Abstract: The generation of quadrilateral-dominant meshes is a cornerstone of professional 3D content creation. However, existing generative models generate quad meshes by first generating triangle meshes and then merging triangles into quadrilaterals with some specific rules, which typically produces quad meshes with poor topology. In this paper, we introduce QuadGPT, the first autoregressive framework for generating quadrilateral meshes in an end-to-end manner. QuadGPT formulates this as a sequence prediction paradigm, distinguished by two key innovations: a unified tokenization method to handle mixed topologies of triangles and quadrilaterals, and a specialized Reinforcement Learning fine-tuning method tDPO for better generation quality. Extensive experiments demonstrate that QuadGPT significantly surpasses previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality. Our work establishes a new benchmark for native quad-mesh generation and showcases the power of combining large-scale autoregressive models with topology-aware RL refinement for creating structured 3D assets.
[249] DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation
Jiaqi Liu, Lan Zhang, Xiaoyong Yuan
Main category: cs.CV
TL;DR: DyME is a dynamic multi-concept erasure framework that uses lightweight LoRA adapters with bi-level orthogonality constraints to enable on-demand removal of copyrighted styles and protected visual concepts from diffusion models, outperforming static erasure methods.
Details
Motivation: Existing concept erasure methods use static fine-tuning that doesn't scale to practical multi-concept scenarios, leading to degraded performance and fidelity issues when handling varying erasure requests.
Method: Trains concept-specific LoRA adapters and dynamically composes them at inference, with bi-level orthogonality constraints at feature and parameter levels to prevent interference between adapters.
Result: Outperforms state-of-the-art baselines on ErasureBench-H and standard datasets, achieving higher multi-concept erasure fidelity with minimal collateral degradation.
Conclusion: DyME provides a scalable, flexible solution for multi-concept erasure that matches real-world usage patterns while maintaining model quality.
Abstract: Text-to-image diffusion models (DMs) inadvertently reproduce copyrighted styles and protected visual concepts, raising legal and ethical concerns. Concept erasure has emerged as a safeguard, aiming to selectively suppress such concepts through fine-tuning. However, existing methods do not scale to practical settings where providers must erase multiple and possibly conflicting concepts. The core bottleneck is their reliance on static erasure: a single checkpoint is fine-tuned to remove all target concepts, regardless of the actual erasure needs at inference. This rigid design mismatches real-world usage, where requests vary per generation, leading to degraded erasure success and reduced fidelity for non-target content. We propose DyME, an on-demand erasure framework that trains lightweight, concept-specific LoRA adapters and dynamically composes only those needed at inference. This modular design enables flexible multi-concept erasure, but naive composition causes interference among adapters, especially when many or semantically related concepts are suppressed. To overcome this, we introduce bi-level orthogonality constraints at both the feature and parameter levels, disentangling representation shifts and enforcing orthogonal adapter subspaces. We further develop ErasureBench-H, a new hierarchical benchmark with brand-series-character structure, enabling principled evaluation across semantic granularities and erasure set sizes. Experiments on ErasureBench-H and standard datasets (e.g., CIFAR-100, Imagenette) demonstrate that DyME consistently outperforms state-of-the-art baselines, achieving higher multi-concept erasure fidelity with minimal collateral degradation.
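As a rough illustration of the parameter-level constraint, the sketch below penalizes overlap between the subspaces spanned by different concepts’ LoRA down-projection matrices; the exact penalty used by DyME, and its feature-level counterpart, may differ:

```python
import torch

def lora_orthogonality_penalty(adapter_As):
    # adapter_As: list of LoRA "A" matrices, one per concept, each (rank, dim).
    # Penalize the squared Frobenius norm of pairwise subspace overlaps so
    # that adapters composed at inference interfere as little as possible.
    loss = torch.tensor(0.0)
    for i in range(len(adapter_As)):
        for j in range(i + 1, len(adapter_As)):
            overlap = adapter_As[i] @ adapter_As[j].T   # (rank_i, rank_j)
            loss = loss + (overlap ** 2).sum()
    return loss

# Toy call: three concept adapters of rank 4 over a 768-dim feature space.
penalty = lora_orthogonality_penalty([torch.randn(4, 768) for _ in range(3)])
```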
[250] video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models
Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Main category: cs.CV
TL;DR: Video-SALMONN 2 introduces multi-round DPO (MrDPO) with caption-quality objective, achieving SOTA in video description and QA across multiple benchmarks with 3B, 7B, and 72B models.
Details
Motivation: To improve video understanding by addressing limitations of standard DPO through continual reference policy updates and joint optimization for completeness and factual accuracy.
Method: Multi-round direct preference optimisation (MrDPO) with periodic reference policy refresh via bootstrapping from re-initialized lightweight adapters, plus supervised fine-tuning using generated high-quality video captions.
Result: Achieves SOTA results on Video-MME, WorldSense, AVUT, Video-Holmes, DailyOmni, MLVU, and LVBench benchmarks; 72B model surpasses all open-source systems; produces more detailed and accurate captions than GPT-4o and Gemini-1.5 Pro.
Conclusion: MrDPO effectively addresses reference staleness in DPO, enabling continual improvement and strong performance transfer from captioning to complex video-QA tasks, with released open-source code, models, and data.
Abstract: We present video-SALMONN 2, a family of audio-visual large language models that set new state-of-the-art (SOTA) results in video description and question answering (QA). Our core contribution is multi-round direct preference optimisation (MrDPO), paired with a caption-quality objective that jointly rewards completeness and factual accuracy. Unlike standard DPO with a fixed reference policy, MrDPO periodically refreshes the reference by bootstrapping from a newly re-initialised lightweight adapter trained on the latest preferences, avoiding reference staleness and enabling continual improvement. This strategy produces captions that are consistently more detailed and accurate than those from proprietary systems such as GPT-4o and Gemini-1.5 Pro. We further distil these gains by using our model to generate a high-quality video-caption corpus for supervised fine-tuning of new models, transferring benefits beyond captioning to strong performance on complex video-QA tasks. Across widely used audio-visual and visual-only understanding benchmarks (including Video-MME, WorldSense, AVUT, Video-Holmes, DailyOmni, MLVU, and LVBench), our 3B and 7B models achieve SOTA results at comparable scales, while the 72B model surpasses all other open-source systems. Our source code, models, and data are released at https://github.com/bytedance/video-SALMONN-2.
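For readers unfamiliar with DPO, the sketch below shows the standard objective on per-sequence log-probabilities; MrDPO’s novelty is that the reference log-probs come from a periodically refreshed adapter rather than a fixed policy. Shapes and the beta value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Maximize the margin between policy and reference log-ratios of the
    # preferred vs. dispreferred caption. In MrDPO, the ref_* terms are
    # recomputed each round from a freshly re-initialised lightweight
    # adapter trained on the latest preferences, avoiding staleness.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Toy call on dummy log-probabilities for a batch of 4 preference pairs.
lp_c, lp_r = torch.randn(4), torch.randn(4)
print(dpo_loss(lp_c, lp_r, ref_chosen=lp_c - 0.1, ref_rejected=lp_r + 0.1))
```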
[251] VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
Abdul Waheed, Zhen Wu, Dareen Alharthi, Seungone Kim, Bhiksha Raj
Main category: cs.CV
TL;DR: VideoJudge is a specialized 3B/7B multimodal LLM for evaluating video understanding models, outperforming larger baselines and showing video inputs are crucial for accurate evaluation.
Details
Motivation: Existing metrics like BLEU, ROUGE, and BERTScore fail to capture human judgment nuances for video understanding, while manual evaluation is costly. LLM/MLLM evaluators for video tasks remain underexplored.
Method: Training uses generator-evaluator interplay: the generator produces responses conditioned on target ratings, and mismatched responses are discarded. Specialized MLLM architecture for video-text evaluation.
Result: VideoJudge-7B outperforms larger MLLM baselines (Qwen2.5-VL 32B/72B) on 3 out of 4 benchmarks. LLM judges perform worse than MLLM judges, and chain-of-thought reasoning doesn’t help, confirming video input importance.
Conclusion: VideoJudge demonstrates specialized MLLM judges can effectively evaluate video understanding tasks, with video inputs being essential for accurate assessment.
Abstract: Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the nuances of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a 3B- and 7B-sized MLLM judge specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator’s rating are discarded. Across three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL) and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for the evaluation of video understanding tasks.
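The training recipe reduces to a simple rejection-filtering loop. The sketch below uses stub functions in place of the generator and evaluator MLLMs; the function names and the 1-5 rating scale are assumptions:

```python
import random

def generate_response(video, prompt, target_rating):
    # Stub for the generator MLLM, prompted to aim for a target quality rating.
    return f"response written to deserve rating {target_rating}"

def evaluate_response(video, response):
    # Stub for the evaluator; the real judge scores the response on the scale.
    return random.randint(1, 5)

def build_training_example(video, prompt):
    # Keep a generation only if the evaluator's rating matches the target;
    # mismatches are discarded, yielding cleaner labels for judge training.
    target = random.randint(1, 5)
    response = generate_response(video, prompt, target)
    if evaluate_response(video, response) == target:
        return {"response": response, "rating": target}
    return None
```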
[252] Residual Vector Quantization For Communication-Efficient Multi-Agent Perception
Dereje Shenkut, B. V. K Vijaya Kumar
Main category: cs.CV
TL;DR: ReVQom is a learned feature compression method for multi-agent collaborative perception that achieves 273x to 1365x compression (6-30 bpp) while maintaining spatial identity and minimal accuracy loss.
Details
Motivation: Communication bandwidth constraints limit the scalability of multi-agent collaborative perception systems where agents need to share information for improved scene understanding.
Method: End-to-end compression using a simple bottleneck network followed by multi-stage residual vector quantization (RVQ), transmitting only per-pixel code indices instead of raw features.
Result: Achieves 273x compression at 30 bpp to 1365x compression at 6 bpp on DAIR-V2X dataset. At 18 bpp (455x), matches or outperforms raw-feature collaborative perception.
Conclusion: ReVQom enables efficient and accurate multi-agent collaborative perception with ultra-low-bandwidth operation, making practical V2X deployment more feasible.
Abstract: Multi-agent collaborative perception (CP) improves scene understanding by sharing information across connected agents such as autonomous vehicles, unmanned aerial vehicles, and robots. Communication bandwidth, however, constrains scalability. We present ReVQom, a learned feature codec that preserves spatial identity while compressing intermediate features. ReVQom is an end-to-end method that compresses feature dimensions via a simple bottleneck network followed by multi-stage residual vector quantization (RVQ). This allows only per-pixel code indices to be transmitted, reducing payloads from 8192 bits per pixel (bpp) for uncompressed 32-bit float features to 6-30 bpp per agent with minimal accuracy loss. On the DAIR-V2X real-world CP dataset, ReVQom achieves 273x compression at 30 bpp to 1365x compression at 6 bpp. At 18 bpp (455x), ReVQom matches or outperforms raw-feature CP, and at 6-12 bpp it enables ultra-low-bandwidth operation with graceful degradation. ReVQom thus enables efficient and accurate multi-agent collaborative perception, taking a step toward practical V2X deployment.
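Residual vector quantization itself is compact enough to sketch. Below, each stage quantizes the residual left by the previous one, so only small integer code indices are transmitted; the codebook sizes and feature dimension are chosen to land in the paper’s reported 6-30 bpp range, not taken from it:

```python
import torch

def rvq_encode(feats, codebooks):
    # feats: (N, D) per-pixel bottleneck features; codebooks: list of (K, D).
    indices, residual = [], feats
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest code per pixel
        indices.append(idx)
        residual = residual - cb[idx]                  # pass residual onward
    return indices                                     # integer codes only

def rvq_decode(indices, codebooks):
    # Reconstruction is the sum of the selected codewords across stages.
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))

# 3 stages x 256 codes = 3 * 8 = 24 bits per pixel, versus 8192 bits for raw
# 32-bit float features of dimension 256.
feats = torch.randn(1000, 64)
books = [torch.randn(256, 64) for _ in range(3)]
recon = rvq_decode(rvq_encode(feats, books), books)
```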
[253] Gender Stereotypes in Professional Roles Among Saudis: An Analytical Study of AI-Generated Images Using Language Models
Khaloud S. AlKhalifah, Malak Mashaabi, Hend Al-Khalifa
Main category: cs.CV
TL;DR: Text-to-Image AI models show strong gender bias and cultural inaccuracies when generating images of Saudi professionals, with DALL-E V3 exhibiting the most gender stereotyping (96% male outputs).
Details
Motivation: To investigate how contemporary Text-to-Image AI models perpetuate gender stereotypes and cultural inaccuracies when depicting Saudi professionals.
Method: Analyzed 1,006 images from ImageFX, DALL-E V3, and Grok for 56 Saudi professions using neutral prompts. Two Saudi annotators evaluated images on five dimensions (gender, clothing, background, activities, age), with a third researcher adjudicating disagreements.
Result: Strong gender imbalance found: ImageFX 85% male, Grok 86.6% male, DALL-E V3 96% male. Cultural inaccuracies in clothing, settings, and activities were frequent. Counter-stereotypical images often resulted from cultural misinterpretations rather than progressive portrayals.
Conclusion: Current models mirror societal biases from training data and offer limited reflection of Saudi labor market dynamics. Need for more diverse training data, fairer algorithms, and culturally sensitive evaluation frameworks.
Abstract: This study investigates the extent to which contemporary Text-to-Image artificial intelligence (AI) models perpetuate gender stereotypes and cultural inaccuracies when generating depictions of professionals in Saudi Arabia. We analyzed 1,006 images produced by ImageFX, DALL-E V3, and Grok for 56 diverse Saudi professions using neutral prompts. Two trained Saudi annotators evaluated each image on five dimensions: perceived gender, clothing and appearance, background and setting, activities and interactions, and age. A third senior researcher adjudicated whenever the two primary raters disagreed, yielding 10,100 individual judgements. The results reveal a strong gender imbalance, with ImageFX outputs being 85% male, Grok 86.6% male, and DALL-E V3 96% male, indicating that DALL-E V3 exhibited the strongest overall gender stereotyping. This imbalance was most evident in leadership and technical roles. Moreover, cultural inaccuracies in clothing, settings, and depicted activities were frequently observed across all three models. Counter-stereotypical images often arise from cultural misinterpretations rather than genuinely progressive portrayals. We conclude that current models mirror societal biases embedded in their training data, generated by humans, offering only a limited reflection of the Saudi labour market’s gender dynamics and cultural nuances. These findings underscore the urgent need for more diverse training data, fairer algorithms, and culturally sensitive evaluation frameworks to ensure equitable and authentic visual outputs.
[254] Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Moderation
Zixuan Wang, Yu Sun, Hongwei Wang, Baoyu Jing, Xiang Shen, Xin Dong, Zhuolin Hao, Hongyu Xiong, Yang Song
Main category: cs.CV
TL;DR: A reasoning-enhanced multimodal large language model pretraining paradigm for unified inappropriate content detection in short videos, addressing distribution gaps and complex issue definitions through three targeted tasks.
Details
Motivation: Existing approaches train separate small models for each content issue type, requiring extensive human-labeled data and lacking cross-issue generalization capabilities.
Method: Three targeted pretraining tasks: Caption (enhances video detail perception), Visual Question Answering (deepens understanding of issue definitions), and Chain-of-Thought (enhances reasoning capability).
Result: Significant performance improvements in both zero-shot and supervised fine-tuning settings, with strong generalization to emergent, previously unseen issues.
Conclusion: The proposed reasoning-enhanced MLLM pretraining paradigm effectively addresses distribution gaps and complex issue definitions, enabling unified inappropriate content detection with strong generalization capabilities.
Abstract: Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate, small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) Caption, to enhance the MLLM’s perception of video details; (2) Visual Question Answering (VQA), to deepen the MLLM’s understanding of issue definitions and annotation guidelines; and (3) Chain-of-Thought (CoT), to enhance the MLLM’s reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM’s performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.
[255] Learning GUI Grounding with Spatial Reasoning from Visual Feedback
Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim
Main category: cs.CV
TL;DR: GUI grounding is reframed from coordinate prediction to interactive search, where a VLM moves a cursor step-by-step to locate UI elements, improving accuracy on complex GUI layouts.
Details
Motivation: Traditional coordinate prediction for GUI grounding fails with high-resolution GUI images and complex layouts in Vision Language Models (VLMs).
Method: Proposed GUI-Cursor model uses interactive search: at each step, the model identifies target objects, evaluates spatial relations, and moves the cursor closer based on movement history, trained with multi-step online reinforcement learning.
Result: GUI-Cursor based on Qwen2.5-VL-7B improves GUI grounding accuracy, achieving state-of-the-art results on ScreenSpot-v2 (88.8% → 93.9%) and ScreenSpot-Pro (26.8% → 56.5%), solving problems within two steps for 95% of instances.
Conclusion: Interactive search approach with visual cursor feedback effectively addresses GUI grounding challenges, enabling adaptive step-wise problem solving and significantly outperforming coordinate prediction methods.
Abstract: Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task: given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing high-resolution GUI images with complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves the GUI grounding accuracy and achieves state-of-the-art results on ScreenSpot-v2 (88.8% → 93.9%) and ScreenSpot-Pro (26.8% → 56.5%). Moreover, we observe that GUI-Cursor learns to solve the problem within two steps for 95% of instances and can adaptively conduct more steps on more difficult examples.
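The inference loop is easy to picture in code. The sketch below stubs out the VLM policy and the cursor rendering; the action space, step budget, and stopping rule are all assumptions rather than the paper’s exact design:

```python
from dataclasses import dataclass
import random

@dataclass
class Action:
    kind: str        # "move" or "stop"
    dx: int = 0
    dy: int = 0

def render_cursor(screenshot, cursor):
    # Stub: the real system draws the cursor onto the screenshot so the VLM
    # gets visual feedback aligning its prediction with on-screen locations.
    return (screenshot, cursor)

def policy_step(view, instruction, history):
    # Stub policy; the trained VLM chooses the next move from the rendered
    # view and its movement history. Here: random moves, stop after 2 steps.
    if len(history) >= 2:
        return Action("stop")
    return Action("move", random.randint(-50, 50), random.randint(-50, 50))

def locate_element(screenshot, instruction, start=(640, 400), max_steps=10):
    cursor, history = start, []
    for _ in range(max_steps):
        action = policy_step(render_cursor(screenshot, cursor), instruction, history)
        if action.kind == "stop":
            break
        cursor = (cursor[0] + action.dx, cursor[1] + action.dy)
        history.append(action)
    return cursor   # final cursor position is the grounding prediction
```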
[256] X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning
Prasanna Reddy Pulakurthi, Jiamian Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Zhiqiang Tao
Main category: cs.CV
TL;DR: X-CoT is an explainable text-to-video retrieval framework that replaces traditional embedding models with LLM Chain-of-Thought reasoning to provide interpretable ranking results and improve retrieval performance.
Details
Motivation: Current text-to-video retrieval systems have two main limitations: they are vulnerable to low-quality text-video data pairs, which are hard to identify, and they lack interpretability since cosine similarity alone provides no explanation for ranking results.
Method: Proposes the X-CoT framework, which uses LLM Chain-of-Thought reasoning instead of embedding models. Expands benchmarks with additional video annotations for better semantic understanding and reduced data bias. Devises a retrieval CoT consisting of pairwise comparison steps for detailed reasoning and complete ranking.
Result: X-CoT empirically improves retrieval performance and produces detailed rationales for ranking decisions. It also facilitates model behavior analysis and data quality assessment.
Conclusion: X-CoT provides an explainable alternative to traditional text-to-video retrieval systems, addressing both performance and interpretability issues while enabling better analysis of model behavior and data quality.
Abstract: Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. Low-quality text-video data pairs could compromise the retrieval, yet are hard to identify and examine. Cosine similarity alone provides no explanation for the ranking results, limiting the interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning in place of the embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and a complete ranking. X-CoT empirically improves the retrieval performance and produces detailed rationales. It also facilitates analysis of model behavior and data quality. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
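A pairwise-comparison CoT can be assembled into a full ranking with the LLM acting as the comparator. The sketch below is a hypothetical reading: llm_compare stands in for a CoT-prompted LLM call and here uses a placeholder heuristic only so the snippet runs:

```python
from functools import cmp_to_key

def llm_compare(query, video_a, video_b):
    # Stand-in for an LLM chain-of-thought call that reasons about which
    # video better matches the query, returning -1 (a wins) or 1 (b wins)
    # along with a written rationale (omitted here) that makes the final
    # ranking interpretable. Placeholder heuristic for illustration only.
    return -1 if len(str(video_a)) <= len(str(video_b)) else 1

def rank_videos(query, videos):
    # Complete ranking built from pairwise comparison steps rather than
    # cosine similarities, so every ordering decision carries a rationale.
    return sorted(videos, key=cmp_to_key(lambda a, b: llm_compare(query, a, b)))
```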
[257] Unsupervised Defect Detection for Surgical Instruments
Joseph Huang, Yichi Zhang, Jingxi Yu, Wei Chen, Seunghyun Hwang, Qiang Qiu, Amy R. Reibman, Edward J. Delp, Fengqing Zhu
Main category: cs.CV
TL;DR: Proposes a method to adapt unsupervised defect detection for surgical instruments, addressing domain shift issues from natural/industrial images.
Details
Motivation: Manual inspection of surgical instruments is error-prone, and existing automated methods trained on natural/industrial images fail to transfer effectively to the surgical domain, causing false positives and poor sensitivity to small defects.
Method: Integrates background masking, a patch-based analysis strategy, and efficient domain adaptation to overcome limitations of existing approaches.
Result: The method enables reliable detection of fine-grained defects in surgical instrument imagery by addressing domain shift and improving sensitivity to small defects.
Conclusion: The proposed versatile method successfully adapts unsupervised defect detection for surgical instruments, overcoming the limitations of existing approaches and enabling reliable defect detection in this specialized domain.
Abstract: Ensuring the safety of surgical instruments requires reliable detection of visual defects. However, manual inspection is prone to error, and existing automated defect detection methods, typically trained on natural/industrial images, fail to transfer effectively to the surgical domain. We demonstrate that simply applying or fine-tuning these approaches leads to issues: false positive detections arising from textured backgrounds, poor sensitivity to small, subtle defects, and inadequate capture of instrument-specific features due to domain shift. To address these challenges, we propose a versatile method that adapts unsupervised defect detection methods specifically for surgical instruments. By integrating background masking, a patch-based analysis strategy, and efficient domain adaptation, our method overcomes these limitations, enabling the reliable detection of fine-grained defects in surgical instrument imagery.
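Background masking and patch-based analysis combine naturally; here is a minimal NumPy sketch, where the patch size, stride, and foreground threshold are assumed values rather than the paper’s settings:

```python
import numpy as np

def instrument_patches(image, mask, patch=64, stride=32, min_fg=0.5):
    # Score only patches dominated by instrument pixels (mask == 1). This
    # suppresses false positives from textured backgrounds and keeps small
    # defects from being diluted by whole-image statistics.
    patches, coords = [], []
    H, W = mask.shape
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            if mask[y:y + patch, x:x + patch].mean() >= min_fg:
                patches.append(image[y:y + patch, x:x + patch])
                coords.append((y, x))
    return patches, coords  # feed patches to the unsupervised detector
```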
[258] No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models
Junno Yun, Yaşar Utku Alçalar, Mehmet Akçakaya
Main category: cs.CV
TL;DR: Proposes LSEP regularization for diffusion models that promotes linear separability of intermediate representations, eliminating need for external encoders while improving training efficiency and generation quality.
Details
Motivation: Current alignment-based approaches require computationally expensive pretrained encoders for representation alignment. Need alternative regularization that improves discriminative features without external dependencies.
Method: Introduces Linear SEParability (LSEP) regularization that directly incorporates linear probing into network learning dynamics, promoting linear separability of intermediate layer representations without auxiliary encoders.
Result: Achieves substantial improvements in training efficiency and generation quality, with FID of 1.46 on 256×256 ImageNet dataset using flow-based transformer architectures like SiTs.
Conclusion: LSEP provides effective alternative to alignment-based methods, eliminating dependency on external encoders while maintaining or improving representation quality and generation performance.
Abstract: Efficient training strategies for large-scale diffusion models have recently emphasized the importance of improving discriminative feature representations in these models. A central line of work in this direction is representation alignment with features obtained from powerful external encoders, which improves the representation quality as assessed through linear probing. Alignment-based approaches show promise but depend on large pretrained encoders, which are computationally expensive to obtain. In this work, we propose an alternative regularization for training, based on promoting the Linear SEParability (LSEP) of intermediate layer representations. LSEP eliminates the need for an auxiliary encoder and representation alignment, while incorporating linear probing directly into the network’s learning dynamics rather than treating it as a simple post-hoc evaluation tool. Our results demonstrate substantial improvements in both training efficiency and generation quality on flow-based transformer architectures such as SiTs, achieving an FID of 1.46 on the 256×256 ImageNet dataset.
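One minimal way to realize such a regularizer is a linear probe on pooled intermediate tokens whose classification loss is added to the diffusion objective; the probe placement, pooling, and weighting below are assumptions, not the paper’s exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSEPRegularizer(nn.Module):
    # Linear probe trained jointly with the network: its cross-entropy loss
    # pushes intermediate representations toward linear separability without
    # any external pretrained encoder or alignment target.
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.probe = nn.Linear(feat_dim, num_classes)

    def forward(self, feats, labels):
        # feats: (B, T, D) intermediate tokens, mean-pooled to (B, D).
        return F.cross_entropy(self.probe(feats.mean(dim=1)), labels)

reg = LSEPRegularizer(feat_dim=768, num_classes=1000)
# total_loss = diffusion_loss + lam * reg(intermediate_feats, class_labels)
```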
[259] Enhancing Contrastive Learning for Geolocalization by Discovering Hard Negatives on Semivariograms
Boyi Chen, Zhangyu Wang, Fabian Deuser, Johann Maximilian Zollner, Martin Werner
Main category: cs.CV
TL;DR: Proposes a spatially regularized contrastive learning method using semivariograms to model spatial dependencies in image-based geo-localization, addressing false negatives and hard negatives by incorporating geographic distance relationships.
Details
Motivation: Current contrastive learning methods for geo-localization neglect spatial dependencies, leading to issues with false negatives (visually/geographically similar images labeled as negatives) and difficulty distinguishing hard negatives (visually similar but geographically distant images).
Method: Integrates a semivariogram, a geostatistical tool, into contrastive learning by relating feature-space distance to geographical distance, capturing spatial correlation patterns to identify hard negatives and false negatives based on expected visual dissimilarity at given spatial distances.
Result: Evaluation on OSV5M dataset shows improved geo-localization performance, especially at finer granularity, demonstrating that explicit spatial prior modeling enhances accuracy.
Conclusion: Modeling spatial priors through semivariogram-based regularization significantly improves contrastive learning for image-based geo-localization by better handling spatial dependencies and addressing false/hard negative issues.
Abstract: Accurate and robust image-based geo-localization at a global scale is challenging due to diverse environments, visually ambiguous scenes, and the lack of distinctive landmarks in many regions. While contrastive learning methods show promising performance by aligning features between street-view images and corresponding locations, they neglect the underlying spatial dependency in the geographic space. As a result, they fail to address the issue of false negatives – image pairs that are both visually and geographically similar but labeled as negatives – and struggle to effectively distinguish hard negatives, which are visually similar but geographically distant. To address this issue, we propose a novel spatially regularized contrastive learning strategy that integrates a semivariogram, a geostatistical tool for modeling how spatial correlation changes with distance. We fit the semivariogram by relating the distance of images in feature space to their geographical distance, capturing the expected visual dissimilarity as a function of spatial distance. With the fitted semivariogram, we use the expected visual dissimilarity at a given spatial distance as a reference to identify hard negatives and false negatives. We integrate this strategy into GeoCLIP and evaluate it on the OSV5M dataset, demonstrating that explicitly modeling spatial priors improves image-based geo-localization performance, particularly at finer granularity.
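The fitting step can be sketched with an empirical (binned) semivariogram relating feature distance to geographic distance; the binning scheme and the use of raw feature distances (the classical estimator uses half the mean squared difference) are simplifications:

```python
import numpy as np

def empirical_semivariogram(feat_dists, geo_dists, bin_edges):
    # For each geographic-distance bin, average the feature-space distances
    # of image pairs in that bin: gamma(h) approximates the expected visual
    # dissimilarity of pairs separated by geographic distance h.
    which = np.digitize(geo_dists, bin_edges)
    return np.array([feat_dists[which == b].mean() if np.any(which == b) else np.nan
                     for b in range(1, len(bin_edges))])

# Pairs whose feature distance sits far above gamma(h) at small h behave
# like hard negatives; pairs far below gamma(h) despite a negative label
# are false-negative candidates.
```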
[260] X-Streamer: Unified Human World Modeling with Audiovisual Interaction
You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guoxian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, Linjie Luo
Main category: cs.CV
TL;DR: X-Streamer is an end-to-end multimodal framework for creating digital human agents that can engage in infinite real-time interactions across text, speech, and video using a single unified architecture.
Details
Motivation: To build digital human agents capable of persistent and intelligent audiovisual interactions from a single portrait, enabling real-time open-ended video calls with streaming multimodal inputs.
Method: Uses a Thinker-Actor dual-transformer architecture: the Thinker module (pretrained large language-speech model) perceives and reasons over streaming inputs, while the Actor module (chunk-wise autoregressive diffusion model) cross-attends to the Thinker’s hidden states to generate synchronized multimodal responses with interleaved text/audio tokens and video latents.
Result: X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits with fine-grained cross-modality alignment and long-horizon stability.
Conclusion: The framework paves the way toward unified world modeling of interactive digital humans by enabling persistent, intelligent audiovisual interactions through a single unified architecture.
Abstract: We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker’s hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.
[261] What Happens Next? Anticipating Future Motion by Generating Point Trajectories
Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Main category: cs.CV
TL;DR: The paper proposes a method for forecasting object motion from single images by generating dense trajectory grids instead of pixels, achieving better accuracy and diversity than prior approaches.
Details
Motivation: Current video generators struggle with motion forecasting from single images even in simple physical scenarios, due to the overhead of pixel generation rather than directly modeling motion.
Method: Formulate motion forecasting as conditional generation of dense trajectory grids using an architecture similar to modern video generators but outputting motion trajectories instead of pixels.
Result: The approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators, and shows effectiveness in robotics applications and real-world physics datasets.
Conclusion: Directly modeling motion trajectories rather than generating pixels enables more effective motion forecasting from single images, overcoming limitations of current video generators even when fine-tuned on physical scenarios.
Abstract: We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.
[262] Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis
Sai Varun Kodathala, Rakesh Vunnam
Main category: cs.CV
TL;DR: Comparative analysis of DINOv3 (spatial processing) vs V-JEPA2 (temporal modeling) for video action recognition, showing DINOv3 excels at pose recognition while V-JEPA2 provides more consistent performance across action types.
Details
Motivation: To understand architectural trade-offs between spatial and temporal modeling approaches in self-supervised learning for video action recognition, and provide empirical guidance for method selection.
Method: Evaluated both architectures on the UCF Sports dataset using multiple metrics: classification accuracy, clustering performance (Silhouette score), intra-class consistency, and inter-class discrimination.
Result: DINOv3 achieved superior clustering (Silhouette: 0.31 vs 0.21) and discrimination (6.16x separation ratio) for pose-identifiable actions, while V-JEPA2 showed lower performance variance (0.094 vs 0.288) and balanced performance across all action types.
Conclusion: Spatial processing excels at static pose recognition but degrades on motion-dependent actions, while temporal modeling provides consistent reliability across diverse actions, informing architectural choices based on task requirements.
Abstract: This study presents a comprehensive comparative analysis of two prominent self-supervised learning architectures for video action recognition: DINOv3, which processes frames independently through spatial feature extraction, and V-JEPA2, which employs joint temporal modeling across video sequences. We evaluate both approaches on the UCF Sports dataset, examining feature quality through multiple dimensions including classification accuracy, clustering performance, intra-class consistency, and inter-class discrimination. Our analysis reveals fundamental architectural trade-offs: DINOv3 achieves superior clustering performance (Silhouette score: 0.31 vs 0.21) and demonstrates exceptional discrimination capability (6.16x separation ratio) particularly for pose-identifiable actions, while V-JEPA2 exhibits consistent reliability across all action types with significantly lower performance variance (0.094 vs 0.288). Through action-specific evaluation, we identify that DINOv3’s spatial processing architecture excels at static pose recognition but shows degraded performance on motion-dependent actions, whereas V-JEPA2’s temporal modeling provides balanced representation quality across diverse action categories. These findings contribute to the understanding of architectural design choices in video analysis systems and provide empirical guidance for selecting appropriate feature extraction methods based on task requirements and reliability constraints.
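Both headline metrics are reproducible in a few lines; the silhouette score is standard, while the "separation ratio" definition below (mean inter-class centroid distance over mean intra-class spread) is an assumption about the paper’s measure:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def feature_quality(features, labels):
    # Silhouette score: mean (b - a) / max(a, b), with a = mean intra-class
    # distance and b = distance to the nearest other class.
    sil = silhouette_score(features, labels)
    classes = np.unique(labels)
    cents = np.stack([features[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([np.linalg.norm(features[labels == c] - cents[i], axis=1).mean()
                     for i, c in enumerate(classes)])
    pair = np.linalg.norm(cents[:, None] - cents[None, :], axis=-1)
    inter = pair[np.triu_indices(len(classes), k=1)].mean()
    return sil, inter / intra   # clustering quality, separation ratio
```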
[263] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George
Main category: cs.CV
TL;DR: VLCE is a multimodal system that generates comprehensive disaster damage descriptions from satellite and UAV imagery, outperforming baseline models in informativeness while maintaining semantic alignment.
Details
Motivation: Traditional damage assessment methods are slow and dangerous, and current computer vision approaches only provide limited outputs like classification labels or segmentation masks, lacking comprehensive situational understanding.
Method: Uses a dual-architecture approach: a CNN-LSTM with ResNet50 backbone for satellite imagery (xBD dataset) and a Vision Transformer (ViT) for UAV pictures (RescueNet dataset), enhanced with external semantic knowledge from ConceptNet and WordNet.
Result: VLCE significantly outperforms baseline models (LLaVA and QwenVL), achieving up to 95.33% on InfoMetIC for caption informativeness while maintaining competitive semantic alignment measured by CLIPScore.
Conclusion: The dual-architecture system shows significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from aerial imagery.
Abstract: Immediate damage assessment is essential after natural catastrophes; yet conventional manual evaluation is slow and dangerous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, thus constraining their capacity to deliver a thorough situational understanding. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSAT satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.
[264] A Data-driven Typology of Vision Models from Integrated Representational Metrics
Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla
Main category: cs.CV
TL;DR: The paper develops a framework using representational similarity metrics and Similarity Network Fusion to analyze how vision model representations differ across architectural families and training paradigms.
Details
Motivation: To understand which aspects of vision model representations are shared across different architectures and training methods, and which reflect distinctive computational strategies.
Method: Used multiple representational similarity metrics (RSA, Soft Matching, Linear Predictivity) and integrated them using Similarity Network Fusion (SNF) to create composite signatures for clustering analysis.
Result: Geometry and tuning metrics strongly separate model families, while linear decodability shows weaker separation. SNF achieved substantially sharper family discrimination and revealed clustering patterns: supervised ResNets/ViTs form distinct clusters, self-supervised models group together, and hybrid architectures cluster with masked autoencoders.
Conclusion: Computational strategies shaped by both architecture and training objective define representational structure beyond surface design categories, with convergence between architectural modernization and reconstruction-based training.
Abstract: Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.
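A simplified version of the fusion step conveys the idea: each metric’s row-normalized similarity matrix is iteratively diffused toward the average of the others, so structure consistent across metrics is reinforced. The full SNF algorithm additionally uses kNN-sparsified local kernels, which this sketch omits:

```python
import numpy as np

def fuse_similarities(mats, iters=20):
    # mats: list of (n, n) nonnegative model-similarity matrices, one per
    # representational metric (e.g., RSA, Soft Matching, Linear Predictivity).
    P = [m / m.sum(axis=1, keepdims=True) for m in mats]
    for _ in range(iters):
        # Cross-diffusion: update each view against the average of the others.
        P = [p @ (sum(q for j, q in enumerate(P) if j != i) / (len(P) - 1)) @ p.T
             for i, p in enumerate(P)]
        P = [p / p.sum(axis=1, keepdims=True) for p in P]
    return sum(P) / len(P)   # fused matrix, input to clustering
```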
[265] FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
Yixiang Dai, Fan Jiang, Chiyu Wang, Mu Xu, Yonggang Qi
Main category: cs.CV
TL;DR: FantasyWorld is a geometry-enhanced framework that integrates frozen video foundation models with a trainable geometric branch to enable joint modeling of video latents and implicit 3D fields, improving 3D awareness and spatial consistency in video generation.
Details
Motivation: Current video foundation models lack explicit 3D grounding capabilities, limiting their spatial consistency and utility for downstream 3D reasoning tasks like AR/VR content creation and robotic navigation.
Method: Augments frozen video foundation models with a trainable geometric branch that enables joint modeling of video latents and implicit 3D fields in a single forward pass, using cross-branch supervision where geometry cues guide video generation and video priors regularize 3D prediction.
Result: Outperforms recent geometry-consistent baselines in multi-view coherence and style consistency, effectively bridging video imagination and 3D perception without requiring per-scene optimization or fine-tuning.
Conclusion: FantasyWorld successfully integrates video imagination with 3D perception through unified backbone and cross-branch information exchange, producing consistent and generalizable 3D-aware video representations suitable for downstream 3D tasks.
Abstract: High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite their strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities and are thus limited in both spatial consistency and utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
[266] MORPH: Shape-agnostic PDE Foundation Models
Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Ayan Biswas, Diane Oyen, Earl Lawrence
Main category: cs.CV
TL;DR: MORPH is a shape-agnostic autoregressive foundation model for PDEs that handles heterogeneous spatiotemporal datasets with varying dimensions, resolutions, and mixed scalar/vector fields through a convolutional vision transformer architecture.
Details
Motivation: To create a flexible backbone for learning from the heterogeneous and multimodal nature of scientific observations, enabling scalable and data-efficient scientific machine learning across diverse PDE datasets.
Method: Built on a convolutional vision transformer with: (i) component-wise convolution for joint scalar/vector processing, (ii) inter-field cross-attention for information propagation between fields, and (iii) axial attentions to reduce computational burden while maintaining expressivity. Pretrained on diverse PDE datasets with evaluation using full fine-tuning and LoRA adapters.
Result: MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization, matching or surpassing strong baselines and state-of-the-art models across extensive evaluations.
Conclusion: MORPH presents a powerful and flexible foundation model for scientific machine learning that effectively handles heterogeneous PDE data and enables efficient transfer learning across diverse prediction tasks.
Abstract: We introduce MORPH, a shape-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data dimensionality (1D–3D), at different resolutions, and with multiple fields of mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attention, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning.
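Among the three components, axial attention is simple enough to sketch generically. The snippet below factorizes full spatiotemporal self-attention into per-axis passes; the class, head count, and residual wiring are illustrative assumptions, not MORPH's backbone.

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Generic axial attention sketch: full spatiotemporal self-attention is
    factorized into separate passes along the time, height, and width axes,
    avoiding the quadratic cost of attending over all T*H*W tokens jointly."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def attend_along(self, x, axis):
        # x: (B, T, H, W, C); fold all axes except `axis` into the batch.
        others = [d for d in (1, 2, 3) if d != axis]
        perm = [0] + others + [axis, 4]
        xp = x.permute(*perm).contiguous()
        lead, L, C = xp.shape[:3], xp.shape[3], xp.shape[4]
        seq = xp.view(-1, L, C)
        out, _ = self.attn(seq, seq, seq)   # attention along one axis only
        out = out.view(*lead, L, C)
        inv = [0] * 5
        for i, p in enumerate(perm):        # invert the permutation
            inv[p] = i
        return out.permute(*inv).contiguous()

    def forward(self, x):
        for axis in (1, 2, 3):              # time, then the two spatial axes
            x = x + self.attend_along(x, axis)
        return x

y = AxialAttention(dim=32)(torch.randn(2, 4, 8, 8, 32))  # (B, T, H, W, C)
```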
[267] MS-YOLO: Infrared Object Detection for Edge Deployment via MobileNetV4 and SlideLoss
Jiali Zhang, Thomas S. White, Haoliang Zhang, Wenqing Hu, Donald C. Wunsch II, Jian Liu
Main category: cs.CV
TL;DR: MS-YOLO combines MobileNetV4 backbone with SlideLoss to improve infrared object detection efficiency and address class imbalance, achieving competitive performance with low computational cost.
Details
Motivation: To address challenges in infrared object detection including class imbalance, thermal noise, and computational constraints for real-time urban applications.
Method: Replaces YOLOv8’s CSPDarknet with MobileNetV4 backbone and introduces SlideLoss, a novel loss function that dynamically emphasizes under-represented and occluded samples.
Result: Achieves competitive mAP and superior precision at only 6.7 GFLOPs, reducing computational overhead by 1.5% while sustaining high accuracy on the FLIR ADAS V2 dataset.
Conclusion: MS-YOLO effectively balances high detection quality with minimal computational costs, making it suitable for real-time edge deployment in urban environments.
Abstract: Infrared imaging has emerged as a robust solution for urban object detection under low-light and adverse weather conditions, offering significant advantages over traditional visible-light cameras. However, challenges such as class imbalance, thermal noise, and computational constraints can significantly hinder model performance in practical settings. To address these issues, we evaluate multiple YOLO variants on the FLIR ADAS V2 dataset, ultimately selecting YOLOv8 as our baseline due to its balanced accuracy and efficiency. Building on this foundation, we present MS-YOLO (MobileNetV4 and SlideLoss based on YOLO), which replaces YOLOv8’s CSPDarknet backbone with the more efficient MobileNetV4, reducing computational overhead by 1.5% while sustaining high accuracy. In addition, we introduce SlideLoss, a novel loss function that dynamically emphasizes under-represented and occluded samples, boosting precision without sacrificing recall. Experiments on the FLIR ADAS V2 benchmark show that MS-YOLO attains competitive mAP and superior precision while operating at only 6.7 GFLOPs. These results demonstrate that MS-YOLO effectively addresses the dual challenge of maintaining high detection quality while minimizing computational costs, making it well-suited for real-time edge deployment in urban environments.
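The abstract does not spell out the SlideLoss formula. Slide-style re-weighting schemes elsewhere in the YOLO literature weight each sample by its IoU relative to a batch-mean threshold, boosting hard samples near the boundary; the sketch below follows that pattern as an assumption, with `mu` and the 0.1 band width also assumed rather than taken from the paper.

```python
import math

def slide_weight(iou, mu):
    # Easy samples far from the threshold get weight ~1; samples near the
    # batch-mean IoU `mu` get an exponential boost (e**(1 - mu) > 1).
    if iou < mu - 0.1:
        return 1.0
    if iou < mu:
        return math.exp(1.0 - mu)
    return math.exp(1.0 - iou)

def slide_loss(per_sample_loss, ious, mu):
    # Re-weight a per-sample classification loss so hard (occluded or
    # under-represented) samples contribute more to the gradient.
    weights = [slide_weight(i, mu) for i in ious]
    return sum(w * l for w, l in zip(weights, per_sample_loss)) / len(ious)

print(slide_loss([0.4, 0.9, 0.2], ious=[0.3, 0.55, 0.8], mu=0.5))
```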
[268] Motion-Aware Transformer for Multi-Object Tracking
Xu Yang, Gady Agam
Main category: cs.CV
TL;DR: MATR introduces motion-aware tracking by explicitly predicting object movements to update track queries, reducing conflicts in DETR-based multi-object tracking frameworks and achieving state-of-the-art performance.
Details
Motivation: Existing DETR-based MOT frameworks process detection and tracking queries jointly in a single Transformer layer, causing conflicts and degraded association accuracy due to complex object motions in crowded scenes.
Method: Motion-Aware Transformer (MATR) explicitly predicts object movements across frames to update track queries in advance, reducing query collisions and enabling more consistent training for both detection and association.
Result: MATR achieves significant improvements: on DanceTrack it gains more than 9 HOTA points over MOTR without additional data and reaches a state-of-the-art 71.3 HOTA with supplementary data; it also sets state-of-the-art results on SportsMOT (72.2 HOTA) and BDD100k (54.7 mTETA, 41.6 mHOTA) without external datasets.
Conclusion: Explicitly modeling motion within end-to-end Transformers provides a simple yet highly effective approach to advance multi-object tracking performance across diverse datasets.
Abstract: Multi-object tracking (MOT) in videos remains challenging due to complex object motions and crowded scenes. Recent DETR-based frameworks offer end-to-end solutions but typically process detection and tracking queries jointly within a single Transformer Decoder layer, leading to conflicts and degraded association accuracy. We introduce the Motion-Aware Transformer (MATR), which explicitly predicts object movements across frames to update track queries in advance. By reducing query collisions, MATR enables more consistent training and improves both detection and association. Extensive experiments on DanceTrack, SportsMOT, and BDD100k show that MATR delivers significant gains across standard metrics. On DanceTrack, MATR improves HOTA by more than 9 points over MOTR without additional data and reaches a new state-of-the-art score of 71.3 with supplementary data. MATR also achieves state-of-the-art results on SportsMOT (72.2 HOTA) and BDD100k (54.7 mTETA, 41.6 mHOTA) without relying on external datasets. These results demonstrate that explicitly modeling motion within end-to-end Transformers offers a simple yet highly effective approach to advancing multi-object tracking.
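The core mechanism, predicting each object's motion to update its track query before the next frame is decoded, can be sketched as follows; the head architecture and the (dx, dy, dw, dh) box parameterization are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MotionAwareQueryUpdate(nn.Module):
    """Sketch: predict each tracked object's motion and shift its track
    query's reference box in advance, so track and detection queries
    collide less during joint decoding."""
    def __init__(self, dim=256):
        super().__init__()
        self.motion_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4),                 # per-track box offset
        )

    def forward(self, track_queries, ref_boxes):
        # track_queries: (N, dim); ref_boxes: (N, 4) in cxcywh format.
        return ref_boxes + self.motion_head(track_queries)

next_boxes = MotionAwareQueryUpdate()(torch.randn(5, 256), torch.rand(5, 4))
```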
[269] DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining
Shuning Sun, Jialang Lu, Xiang Chen, Jichao Wang, Dianjie Lu, Guijuan Zhang, Guangwei Gao, Zhuoran Zheng
Main category: cs.CV
TL;DR: DeLiVR is an efficient video deraining method that uses Lie-group differential biases in attention scores to address rain streaks, blur, and noise while maintaining spatiotemporal consistency.
Details
Motivation: Existing video deraining methods rely on computationally expensive optical flow or heuristic alignment, which are less robust to camera pose changes and temporal artifacts.
Method: Proposes two components: 1) rotation-bounded Lie relative bias for geometry-consistent alignment using in-plane angle prediction, and 2) differential group displacement that computes angular differences between frames with temporal decay and attention masks.
Result: Extensive experiments show effectiveness on publicly available benchmarks.
Conclusion: Lie groups provide a principled way to enforce spatial and temporal consistency in video deraining, making DeLiVR an efficient and robust solution.
Abstract: Videos captured in the wild often suffer from rain streaks, blur, and noise. In addition, even slight changes in camera pose can amplify cross-frame mismatches and temporal artifacts. Existing methods rely on optical flow or heuristic alignment, which are computationally expensive and less robust. To address these challenges, Lie groups provide a principled way to represent continuous geometric transformations, making them well-suited for enforcing spatial and temporal consistency in video modeling. Building on this insight, we propose DeLiVR, an efficient video deraining method that injects spatiotemporal Lie-group differential biases directly into attention scores of the network. Specifically, the method introduces two complementary components. First, a rotation-bounded Lie relative bias predicts the in-plane angle of each frame using a compact prediction module, where normalized coordinates are rotated and compared with base coordinates to achieve geometry-consistent alignment before feature aggregation. Second, a differential group displacement computes angular differences between adjacent frames to estimate a velocity. This bias computation combines temporal decay and attention masks to focus on inter-frame relationships while precisely matching the direction of rain streaks. Extensive experimental results demonstrate the effectiveness of our method on publicly available benchmarks.
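One way to picture the proposed biases is as additive terms on the (T, T) frame-to-frame attention logits, built from predicted in-plane angles. The functional form and decay constant below are assumptions for illustration, not the paper's exact bias.

```python
import torch

def lie_rotation_bias(angles, decay=0.5):
    """Build a (T, T) additive bias for frame-to-frame attention logits from
    predicted per-frame in-plane rotation angles (radians): frame pairs that
    are rotationally misaligned or temporally distant are penalized."""
    T = angles.shape[0]
    dtheta = angles[None, :] - angles[:, None]   # pairwise angle differences
    idx = torch.arange(T, dtype=torch.float32)
    dt = (idx[None, :] - idx[:, None]).abs()     # temporal distances
    return -dtheta.abs() - decay * dt            # add to attention scores

bias = lie_rotation_bias(torch.tensor([0.00, 0.02, 0.05, 0.04]))
```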
[270] UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments
Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu, Gabriel Barcik, James Lyon, Srinivas Sunkara, Jindong Chen
Main category: cs.CV
TL;DR: UISim is an image-based UI simulator that predicts and synthesizes realistic mobile UI transitions from screen images, enabling scalable testing, prototyping, and AI agent training without physical devices.
Details
Motivation: Existing UI testing methods rely on physical devices or static screenshot analysis, which are cumbersome and limit scalable testing and development of intelligent UI agents for dynamic mobile environments.
Method: Two-stage approach: given initial screen image and user action, first predicts abstract layout of next UI state, then synthesizes visually consistent image based on predicted layout to simulate realistic UI transitions.
Result: UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states, demonstrating high fidelity and effectiveness for UI simulation.
Conclusion: UISim provides practical benefits for UI testing, rapid prototyping, and synthetic data generation, while enabling advanced applications like UI navigation task planning for AI agents, streamlining UI development and enhancing AI training.
Abstract: Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real-world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image-based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two-stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states, highlighting its fidelity and potential to streamline UI development and enhance AI agent training.
[271] LFA-Net: A Lightweight Network with LiteFusion Attention for Retinal Vessel Segmentation
Mehwish Mehmood, Ivor Spence, Muhammad Fahim
Main category: cs.CV
TL;DR: LFA-Net is a lightweight retinal vessel segmentation network that uses a novel LiteFusion-Attention module to efficiently capture local and global context with minimal computational resources.
Details
Motivation: To address challenges in retinal vessel segmentation including small vessel detection and high computational costs, especially for resource-constrained clinical environments.
Method: Proposed LFA-Net with LiteFusion-Attention module that incorporates residual learning connections, Vision Mamba-inspired dynamics, and modulation-based attention for efficient local and global context capture.
Result: Achieved outstanding performance on DRIVE, STARE, and CHASE_DB datasets with dice scores of 83.28%, 87.44%, 84.50% and Jaccard indices of 72.85%, 79.31%, 74.70% respectively, using only 0.11M parameters, 0.42MB memory, and 4.46 GFLOPs.
Conclusion: LFA-Net provides high-performance retinal vessel segmentation with minimal computational requirements, making it ideal for real-world clinical applications with limited resources.
Abstract: Lightweight retinal vessel segmentation is important for the early diagnosis of vision-threatening and systemic diseases, especially in a real-world clinical environment with limited computational resources. Although segmentation methods based on deep learning are improving, existing models still face challenges in small vessel segmentation and high computational costs. To address these challenges, we propose a new vascular segmentation network, LFA-Net, which incorporates a newly designed attention module, LiteFusion-Attention. This attention module incorporates residual learning connections, Vision Mamba-inspired dynamics, and modulation-based attention, enabling the model to capture local and global context efficiently and in a lightweight manner. LFA-Net offers high performance with 0.11 million parameters, 0.42 MB memory size, and 4.46 GFLOPs, which make it ideal for resource-constrained environments. We validated our proposed model on DRIVE, STARE, and CHASE_DB with outstanding performance in terms of Dice scores of 83.28%, 87.44%, and 84.50% and Jaccard indices of 72.85%, 79.31%, and 74.70%, respectively. The code of LFA-Net is available online at https://github.com/Mehwish4593/LFA-Net.
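A loose sketch of a residual, modulation-based attention block in the spirit of LiteFusion-Attention is shown below; the depthwise-convolution-plus-gate design is an assumption, and the Vision-Mamba-inspired dynamics are omitted.

```python
import torch
import torch.nn as nn

class LiteFusionAttentionSketch(nn.Module):
    """Sketch of a lightweight attention block: a cheap depthwise convolution
    supplies local context and a pooled sigmoid gate supplies global
    modulation, combined through a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # global statistics per channel
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Residual connection plus gated (modulated) local features.
        return x + self.local(x) * self.gate(x)

y = LiteFusionAttentionSketch(16)(torch.randn(1, 16, 32, 32))
```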
[272] Incorporating Scene Context and Semantic Labels for Enhanced Group-level Emotion Recognition
Qing Zhu, Wangdong Guo, Qirong Mao, Xiaohua Huang, Xiuyan Shao, Wenming Zheng
Main category: cs.CV
TL;DR: A novel framework for group-level emotion recognition that integrates visual scene context and label-guided semantic information to improve emotion understanding in multi-person scenes.
Details
Motivation: Current methods underestimate the importance of visual scene contextual information for modeling individual relationships and overlook the role of semantic information from emotional labels for complete emotion understanding.
Method: Proposes a framework with visual context encoding module using multi-scale scene information, emotion semantic encoding module using LLM-generated emotion lexicons refined through structured emotion tree, and similarity-aware interaction to align visual and semantic information.
Result: Achieves competitive performance compared to state-of-the-art methods on three widely adopted GER datasets.
Conclusion: The proposed integration of visual scene context and label-guided semantic information effectively enhances group-level emotion recognition performance.
Abstract: Group-level emotion recognition (GER) aims to identify holistic emotions within a scene involving multiple individuals. Existing methods underestimate the importance of visual scene contextual information in modeling individual relationships. Furthermore, they overlook the crucial role of semantic information from emotional labels for a complete understanding of emotions. To address these limitations, we propose a novel framework that incorporates visual scene context and label-guided semantic information to improve GER performance. It involves a visual context encoding module that leverages multi-scale scene information to diversely encode individual relationships. Complementarily, the emotion semantic encoding module utilizes group-level emotion labels to prompt a large language model to generate nuanced emotion lexicons. These lexicons, in conjunction with the emotion labels, are then refined into comprehensive semantic representations through a structured emotion tree. Finally, similarity-aware interaction is proposed to align and integrate visual and semantic information, thereby generating enhanced group-level emotion representations and improving the performance of GER. Experiments on three widely adopted GER datasets demonstrate that our proposed method achieves competitive performance compared to state-of-the-art methods.
[273] KG-SAM: Injecting Anatomical Knowledge into Segment Anything Models via Conditional Random Fields
Yu Li, Da Chang, Xi Xiao
Main category: cs.CV
TL;DR: KG-SAM enhances medical image segmentation by integrating anatomical knowledge graphs, CRF-based boundary refinement, and uncertainty estimation to overcome SAM’s limitations in medical imaging.
Details
Motivation: Direct application of Segment Anything Model (SAM) to medical imaging faces challenges including ambiguous boundaries, lack of anatomical relationship modeling, and absence of uncertainty quantification, which are critical for clinical reliability.
Method: KG-SAM integrates: (i) medical knowledge graph for anatomical relationships, (ii) energy-based Conditional Random Field for anatomically consistent predictions, and (iii) uncertainty-aware fusion module for enhanced reliability.
Result: Achieved 82.69% average Dice score on prostate segmentation, 78.05% on abdominal MRI segmentation, and 79.68% on abdominal CT segmentation across multi-center datasets.
Conclusion: KG-SAM establishes a robust and generalizable framework that significantly advances medical image segmentation by synergistically combining anatomical priors with uncertainty-aware boundary refinement.
Abstract: While the Segment Anything Model (SAM) has achieved remarkable success in image segmentation, its direct application to medical imaging remains hindered by fundamental challenges, including ambiguous boundaries, insufficient modeling of anatomical relationships, and the absence of uncertainty quantification. To address these limitations, we introduce KG-SAM, a knowledge-guided framework that synergistically integrates anatomical priors with boundary refinement and uncertainty estimation. Specifically, KG-SAM incorporates (i) a medical knowledge graph to encode fine-grained anatomical relationships, (ii) an energy-based Conditional Random Field (CRF) to enforce anatomically consistent predictions, and (iii) an uncertainty-aware fusion module to enhance reliability in high-stakes clinical scenarios. Extensive experiments across multi-center medical datasets demonstrate the effectiveness of our approach: KG-SAM achieves an average Dice score of 82.69% on prostate segmentation and delivers substantial gains in abdominal segmentation, reaching 78.05% on MRI and 79.68% on CT. These results establish KG-SAM as a robust and generalizable framework for advancing medical image segmentation.
[274] UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Lan Chen, Yuchao Gu, Qi Mao
Main category: cs.CV
TL;DR: UniVid is a framework that fine-tunes pre-trained video diffusion transformers to handle diverse vision tasks without task-specific modifications, using visual sentences to represent tasks and enabling cross-modal and cross-source generalization.
Details
Motivation: To explore whether pre-trained video generation models can adapt to diverse image and video tasks, avoiding costly task-specific pre-training across modalities and enabling scalability to unseen tasks.
Method: Fine-tunes a video diffusion transformer to handle various vision tasks by representing tasks as visual sentences, where context sequences define both the task and expected output modality. Tasks can switch between understanding and generation by reversing visual sentence order.
Result: UniVid generalizes well in cross-modal inference (images and videos) and cross-source tasks (natural to annotated data) despite being trained only on natural video data. It demonstrates the potential of pre-trained video generation models as scalable unified vision foundations.
Conclusion: Pre-trained video generation models can serve as scalable and unified foundations for vision modeling, enabling generalization across modalities and sources without multi-source pre-training.
Abstract: Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM’s uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can easily switch by simply reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
[275] CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones
Wenyi Gong, Mieszko Lis
Main category: cs.CV
TL;DR: A token merging method that maintains spatial structure for compatibility with modern ViT architectures using window attention and spatial designs.
Details
Motivation: Existing token reduction methods fail to preserve spatial structure required by modern ViT architectures with window attention, decomposed relative positional embeddings, and RoPE.
Method: 2D reduction strategy for structured token layouts, spatial-aware merging algorithm that maintains relative positions, and max-magnitude-per-dimension token representation to preserve salient features.
Result: 1.25x speedup on SAM-H with only 0.7% mIOU drop on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet after one epoch of fine-tuning.
Conclusion: The method achieves state-of-the-art performance on both spatial and non-spatial architectures across various vision tasks while maintaining compatibility with spatial designs.
Abstract: Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatial-aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIOU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
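The max-magnitude-per-dimension representation is easy to state precisely: when two tokens merge, keep whichever value has the larger magnitude in each dimension. The helper below shows just that step; the spatial-aware pairing of which tokens to merge is omitted.

```python
import torch

def max_magnitude_merge(a, b):
    # Keep, per dimension, whichever token value has the larger magnitude,
    # preserving the salient features of both merged tokens.
    return torch.where(a.abs() >= b.abs(), a, b)

a = torch.tensor([0.9, -0.1, 0.3])
b = torch.tensor([-0.2, 0.8, -0.7])
print(max_magnitude_merge(a, b))  # tensor([ 0.9000,  0.8000, -0.7000])
```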
[276] Training-Free Multimodal Deepfake Detection via Graph Reasoning
Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Yanyan Wei, Zhangling Duan, Zhaohong Jia
Main category: cs.CV
TL;DR: GASP-ICL is a training-free framework for multimodal deepfake detection that enhances LVLMs through guided in-context learning, adaptive scoring, and cross-sample relation propagation.
Details
Motivation: Current LVLMs struggle with multimodal deepfake detection due to difficulties in capturing subtle forgery cues, resolving cross-modal inconsistencies, and performing task-aligned retrieval.
Method: Uses MDD-adapted feature extractor for aligned image-text pair retrieval, Graph-Structured Taylor Adaptive Scorer (GSTAS) for cross-sample relations and query-aligned signal propagation, and in-context learning to inject task-aware knowledge into LVLMs.
Result: Outperforms strong baselines on four forgery types without requiring LVLM fine-tuning.
Conclusion: GASP-ICL effectively enhances LVLMs for robust multimodal deepfake detection through training-free adaptation and precise demonstration selection.
Abstract: Multimodal deepfake detection (MDD) aims to uncover manipulations across visual, textual, and auditory modalities, thereby reinforcing the reliability of modern information systems. Although large vision-language models (LVLMs) exhibit strong multimodal reasoning, their effectiveness in MDD is limited by challenges in capturing subtle forgery cues, resolving cross-modal inconsistencies, and performing task-aligned retrieval. To this end, we propose Guided Adaptive Scorer and Propagation In-Context Learning (GASP-ICL), a training-free framework for MDD. GASP-ICL employs a pipeline to preserve semantic relevance while injecting task-aware knowledge into LVLMs. We leverage an MDD-adapted feature extractor to retrieve aligned image-text pairs and build a candidate set. We further design the Graph-Structured Taylor Adaptive Scorer (GSTAS) to capture cross-sample relations and propagate query-aligned signals, producing discriminative exemplars. This enables precise selection of semantically aligned, task-relevant demonstrations, enhancing LVLMs for robust MDD. Experiments on four forgery types show that GASP-ICL surpasses strong baselines, delivering gains without LVLM fine-tuning.
[277] Prompt-guided Representation Disentanglement for Action Recognition
Tianci Wu, Guangming Zhu, Jiang Lu, Siyuan Wang, Ning Wang, Nuoye Xiong, Zhang Liang
Main category: cs.CV
TL;DR: ProDA is a novel framework for action recognition that disentangles specified actions from multi-action scenes using spatio-temporal scene graphs and dynamic prompts to generate action-specific representations.
Details
Motivation: Existing methods extract unified features for all actions in a video, making it challenging to model interactions between different objects in multi-action scenarios. Disentangling specified actions from complex scenes is needed.
Method: ProDA uses Spatio-temporal Scene Graphs (SSGs) and introduces Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. It features a video-adapted GPNN that aggregates information using dynamic weights.
Result: Experiments in video action recognition demonstrate the effectiveness of ProDA when compared with state-of-the-art methods.
Conclusion: The proposed ProDA framework successfully disentangles specified actions from multi-action scenes and achieves competitive performance in action recognition tasks.
Abstract: Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found in https://github.com/iamsnaping/ProDA.git
[278] DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images
Dwip Dalal, Gautam Vashishtha, Anku Ranui, Aishwarya Reganti, Parth Patwa, Mohd Sarique, Chandan Gupta, Keshav Nath, Viswanatha Reddy, Vinija Jain, Aman Chadha, Amitava Das, Amit Sheth, Asif Ekbal
Main category: cs.CV
TL;DR: A multimodal dataset and vision-language model called DeHater are introduced for detecting and removing hateful content in digital images using stable diffusion techniques and attention analysis.
Details
Motivation: To address the rise of harmful online content that distorts public discourse and challenges digital environment health by developing AI tools for hate detection and removal.
Method: Uses watermarked, stability-enhanced stable diffusion techniques combined with Digital Attention Analysis Module (DAAM) to identify hateful elements in images and generate hate attention maps for blurring hateful regions.
Result: Created a multimodal dataset for hate identification and developed DeHater model that sets new standards in AI-driven image hate detection with textual prompts.
Conclusion: The approach contributes to developing more ethical AI applications in social media by providing effective tools for multimodal hate detection and content moderation.
Abstract: The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response to this, we introduce a multimodal dataset uniquely crafted for identifying hate in digital content. Central to our methodology is the innovative application of watermarked, stability-enhanced, stable diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination is instrumental in pinpointing the hateful elements within images, thereby generating detailed hate attention maps, which are used to blur the flagged regions and thus remove the hateful sections of the image. We release this dataset as part of the dehate shared task. This paper also describes the details of the shared task. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications in social media.
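The final dehatification step, blurring regions flagged by the hate attention map, can be sketched as follows. The threshold and blur strength are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_hateful_regions(image, hate_map, threshold=0.5, sigma=8.0):
    # image: (H, W, 3) float array; hate_map: (H, W) attention in [0, 1].
    # Blur the whole image once, then keep blurred pixels only where the
    # hate attention exceeds the threshold.
    blurred = np.stack([gaussian_filter(image[..., c], sigma) for c in range(3)], axis=-1)
    mask = (hate_map > threshold)[..., None]
    return np.where(mask, blurred, image)

img = np.random.rand(64, 64, 3)
attn = np.zeros((64, 64)); attn[20:40, 20:40] = 1.0   # toy attention map
clean = blur_hateful_regions(img, attn)
```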
[279] MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning
Lihao Zheng, Jiawei Chen, Xintian Shen, Hao Ma, Tao Wei
Main category: cs.CV
TL;DR: MIRG-RL is a unified framework that enhances multi-image reasoning and grounding in LVLMs through a two-stage training paradigm combining supervised fine-tuning and image-aware reinforcement learning with dual reward functions.
Details
Motivation: Current LVLMs lack cross-image reasoning capabilities and sufficient cross-image reference reward modeling, limiting their ability to understand complex relationships across multiple images.
Method: Two-stage training: supervised fine-tuning with annotated trajectories and image-aware RL optimization. Uses a novel trajectory construction method integrating object-level and image-level annotations, and designs an image-aware RL policy with dual reward functions for objects and images.
Result: Achieves SOTA performance with 64.82% on cross-image reasoning tasks, exceeding previous best method by 1%.
Conclusion: MIRG-RL effectively addresses cross-image reasoning challenges and demonstrates superior performance in multi-image grounding benchmarks.
Abstract: Multi-image reasoning and grounding require understanding complex cross-image relationships at both object levels and image levels. Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. To address these issues, we propose a unified framework - Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, our two-stage training paradigm combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization, progressively developing multi-image reasoning capabilities. Furthermore, we innovatively propose a method for constructing the trajectory data, which integrates object-level and image-level annotation information, and use this method to generate a lightweight reasoning-enhanced dataset. To effectively resolve cross-image ambiguities, we design an image-aware RL policy with dual reward functions for objects and images. Experiments demonstrate that MIRG-RL achieves state-of-the-art (SOTA) performance in multi-image grounding benchmarks, attaining 64.82% on cross-image reasoning tasks - exceeding the previous best method by 1%. The code and dataset have been released at https://github.com/ZEUS2035/MIRG-RL.
[280] LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE
Yu Shang, Lei Jin, Yiding Ma, Xin Zhang, Chen Gao, Wei Wu, Yong Li
Main category: cs.CV
TL;DR: LongScape is a hybrid video generation framework that combines diffusion denoising with autoregressive generation using action-guided chunking and context-aware mixture-of-experts to achieve stable long-horizon video generation for embodied manipulation.
Details
Motivation: Current video generation methods struggle with stable long-horizon generation: diffusion-based approaches suffer from temporal inconsistency and visual drift, while autoregressive methods compromise visual detail.
Method: Hybrid framework combining intra-chunk diffusion denoising with inter-chunk autoregressive generation, featuring action-guided variable-length chunking and Context-aware Mixture-of-Experts (CMoE) that adaptively activates specialized experts for each chunk.
Result: Achieves stable and consistent long-horizon generation over extended rollouts, demonstrating improved temporal consistency and visual quality compared to existing methods.
Conclusion: LongScape successfully addresses the limitations of current video generation approaches by providing a flexible framework that maintains both visual detail and temporal consistency for long-horizon embodied manipulation scenarios.
Abstract: Video-based world models hold significant potential for generating high-quality embodied manipulation data. However, current video generation methods struggle to achieve stable long-horizon generation: classical diffusion-based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context-aware Mixture-of-Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long-horizon generation over extended rollouts. Our code is available at: https://github.com/tsinghua-fib-lab/Longscape.
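The action-guided, variable-length chunking can be approximated with a simple label-run partition, as sketched below. LongScape derives boundaries from the semantic context of robotic actions, so the per-frame action labels here are a stand-in assumption.

```python
def action_guided_chunks(frame_actions):
    # Partition a frame sequence into variable-length chunks so that each
    # chunk spans one run of the same action label.
    chunks, start = [], 0
    for i in range(1, len(frame_actions) + 1):
        if i == len(frame_actions) or frame_actions[i] != frame_actions[start]:
            chunks.append((start, i))   # half-open interval [start, i)
            start = i
    return chunks

print(action_guided_chunks(["reach", "reach", "grasp", "grasp", "grasp", "lift"]))
# [(0, 2), (2, 5), (5, 6)]
```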
[281] MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation
Yu Shang, Yangcheng Yu, Xin Zhang, Xin Jin, Haisheng Su, Wei Wu, Yong Li
Main category: cs.CV
TL;DR: MoWM is a mixture-of-world-model framework that fuses motion-aware latent representations with fine-grained visual features for embodied action planning, achieving state-of-the-art performance on the CALVIN benchmark.
Details
Motivation: Current video generation world models suffer from visual redundancies that hinder action decoding, while latent world models overlook fine-grained details needed for precise manipulation. There is a need to combine the strengths of both approaches.
Method: Proposes MoWM framework that uses motion-aware representations from latent models as high-level priors to guide extraction of fine-grained visual features from pixel space models, highlighting informative visual details for action decoding.
Result: Extensive evaluations on CALVIN benchmark demonstrate state-of-the-art task success rates and superior generalization compared to existing methods.
Conclusion: The hybrid approach effectively combines motion awareness with fine-grained visual details, providing valuable insights for future embodied planning research.
Abstract: Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel space model. This design allows MoWM to highlight the informative visual details needed for action decoding. Extensive evaluations on the CALVIN benchmark demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.
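One plausible reading of latent-to-pixel feature modulation is a FiLM-style gating in which motion-aware latents re-scale and shift pixel-space features. The sketch below follows that reading as an assumption, not MoWM's published design.

```python
import torch
import torch.nn as nn

class LatentToPixelModulation(nn.Module):
    """FiLM-style sketch: motion-aware latent features act as a high-level
    prior that re-scales and shifts fine-grained pixel-space features,
    highlighting the visual details useful for action decoding."""
    def __init__(self, latent_dim, pixel_dim):
        super().__init__()
        self.to_scale = nn.Linear(latent_dim, pixel_dim)
        self.to_shift = nn.Linear(latent_dim, pixel_dim)

    def forward(self, pixel_feats, latent_feats):
        # pixel_feats: (B, N, pixel_dim); latent_feats: (B, latent_dim)
        scale = self.to_scale(latent_feats).unsqueeze(1)
        shift = self.to_shift(latent_feats).unsqueeze(1)
        return pixel_feats * (1 + scale) + shift

out = LatentToPixelModulation(32, 64)(torch.randn(2, 10, 64), torch.randn(2, 32))
```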
[282] DiTraj: training-free trajectory control for video diffusion transformer
Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, Zhicheng Zhao
Main category: cs.CV
TL;DR: DiTraj is a training-free framework for trajectory control in text-to-video generation using Diffusion Transformers (DiT), featuring foreground-background separation guidance and inter-frame Spatial-Temporal Decoupled 3D-RoPE.
Details
Motivation: Existing trajectory control methods either require substantial training resources or are designed for U-Net, not leveraging DiT’s superior performance. DiTraj addresses this by providing a training-free solution specifically tailored for DiT models.
Method: 1) Foreground-background separation guidance using LLM to convert prompts into foreground/background components. 2) Inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE) that modifies foreground tokens’ position embedding to eliminate cross-frame spatial discrepancies and strengthen cross-frame attention. 3) 3D-aware trajectory control through position embedding density regulation.
Result: Extensive experiments show DiTraj outperforms previous methods in both video quality and trajectory controllability.
Conclusion: DiTraj provides an effective training-free framework for trajectory control in DiT-based video generation, achieving superior performance without requiring additional training resources.
Abstract: Diffusion Transformers (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control represents a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net, and thus do not take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, first, to inject the object’s trajectory, we propose foreground-background separation guidance: we use the Large Language Model (LLM) to convert user-provided prompts into foreground and background prompts, which respectively guide the generation of foreground and background regions in the video. Then, we analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only foreground tokens’ position embedding, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of position embedding. Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.
[283] A Comprehensive Evaluation of Transformer-Based Question Answering Models and RAG-Enhanced Design
Zichen Zhang, Kunlong Zhang, Hongwei Ruan, Yiming Luo
Main category: cs.CV
TL;DR: Hybrid retrieval method combining dense embeddings with lexical overlap and re-ranking significantly outperforms baseline methods in multi-hop QA, achieving 50% EM and 47% F1 improvement over cosine similarity on HotpotQA.
Details
Motivation: Transformer models struggle with multi-hop reasoning requiring evidence combination across multiple passages, necessitating better retrieval strategies for retrieval-augmented generation frameworks.
Method: Evaluated cosine similarity, maximal marginal relevance, and hybrid method integrating dense embeddings with lexical overlap and re-ranking. Adapted EfficientRAG pipeline with token labeling and iterative refinement for query optimization.
Result: Hybrid approach substantially outperformed baseline methods with 50% relative improvement in exact match and 47% in F1 score compared to cosine similarity. Improved entity recall and evidence complementarity.
Conclusion: Hybrid retrieval-augmented generation provides practical zero-shot solution for multi-hop QA, balancing accuracy, efficiency, and interpretability, though limited in handling distractors and temporal reasoning.
Abstract: Transformer-based models have advanced the field of question answering, but multi-hop reasoning, where answers require combining evidence across multiple passages, remains difficult. This paper presents a comprehensive evaluation of retrieval strategies for multi-hop question answering within a retrieval-augmented generation framework. We compare cosine similarity, maximal marginal relevance, and a hybrid method that integrates dense embeddings with lexical overlap and re-ranking. To further improve retrieval, we adapt the EfficientRAG pipeline for query optimization, introducing token labeling and iterative refinement while maintaining efficiency. Experiments on the HotpotQA dataset show that the hybrid approach substantially outperforms baseline methods, achieving a relative improvement of 50 percent in exact match and 47 percent in F1 score compared to cosine similarity. Error analysis reveals that hybrid retrieval improves entity recall and evidence complementarity, while remaining limited in handling distractors and temporal reasoning. Overall, the results suggest that hybrid retrieval-augmented generation provides a practical zero-shot solution for multi-hop question answering, balancing accuracy, efficiency, and interpretability.
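The hybrid retrieval score can be sketched as a convex combination of dense cosine similarity and lexical overlap, with the top candidates passed to a re-ranker. The mixing weight and the use of Jaccard overlap below are assumptions; the paper's exact lexical measure and re-ranker are not specified here.

```python
import numpy as np

def hybrid_scores(query_vec, doc_vecs, query_terms, doc_terms, alpha=0.7):
    # Dense component: cosine similarity between L2-normalized embeddings.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    dense = d @ q
    # Lexical component: Jaccard overlap between query and document terms.
    lexical = np.array([
        len(query_terms & t) / len(query_terms | t) if (query_terms | t) else 0.0
        for t in doc_terms
    ])
    return alpha * dense + (1 - alpha) * lexical

scores = hybrid_scores(
    np.random.rand(8), np.random.rand(4, 8),
    {"who", "founded"},
    [{"founded", "company"}, {"river"}, {"who", "founded", "city"}, {"music"}],
)
top_k = np.argsort(-scores)[:2]   # candidates handed to the re-ranker
```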
[284] Dynamic Novel View Synthesis in High Dynamic Range
Kaixuan Zhang, Zhipeng Xiong, Minxian Li, Mingwu Ren, Jiankang Deng, Xiatian Zhu
Main category: cs.CV
TL;DR: HDR-4DGS enables High Dynamic Range Dynamic Novel View Synthesis for scenes with moving objects and varying lighting by using Gaussian Splatting with dynamic tone-mapping to maintain temporal radiance coherence.
Details
Motivation: Current HDR NVS methods only handle static scenes, but real-world scenarios often contain dynamic elements like moving objects and changing lighting conditions, creating a more challenging problem.
Method: Proposed HDR-4DGS uses Gaussian Splatting with an innovative dynamic tone-mapping module that connects HDR and LDR domains while maintaining temporal radiance coherence by adapting tone-mapping functions according to evolving radiance distributions.
Result: HDR-4DGS achieves temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances, surpassing state-of-the-art methods in both quantitative performance and visual fidelity.
Conclusion: The proposed HDR-4DGS framework successfully addresses the challenging problem of HDR Dynamic Novel View Synthesis by jointly modeling temporal radiance variations and 3D translation between LDR and HDR domains.
Abstract: High Dynamic Range Novel View Synthesis (HDR NVS) seeks to learn an HDR 3D model from Low Dynamic Range (LDR) training images captured under conventional imaging conditions. Current methods primarily focus on static scenes, implicitly assuming all scene elements remain stationary and non-living. However, real-world scenarios frequently feature dynamic elements, such as moving objects, varying lighting conditions, and other temporal events, thereby presenting a significantly more challenging scenario. To address this gap, we propose a more realistic problem named HDR Dynamic Novel View Synthesis (HDR DNVS), where the additional dimension “Dynamic” emphasizes the necessity of jointly modeling temporal radiance variations alongside sophisticated 3D translation between LDR and HDR. To tackle this complex, intertwined challenge, we introduce HDR-4DGS, a Gaussian Splatting-based architecture featuring an innovative dynamic tone-mapping module that explicitly connects HDR and LDR domains, maintaining temporal radiance coherence by dynamically adapting tone-mapping functions according to the evolving radiance distributions across the temporal dimension. As a result, HDR-4DGS achieves both temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances. Extensive experiments demonstrate that HDR-4DGS surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity. Source code will be released.
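A minimal version of a dynamically adapting tone-mapping module conditions a small MLP on the time instant and emits per-channel curve parameters. The per-channel gain-and-gamma form below is an assumption, not the HDR-4DGS module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicToneMapper(nn.Module):
    """Time-conditioned tone mapping sketch: map the normalized time instant
    to per-channel gain and gamma, so the HDR-to-LDR curve can track
    evolving radiance distributions across the temporal dimension."""
    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 6))

    def forward(self, hdr, t):
        # hdr: (B, 3, H, W) linear radiance; t: (B, 1) time in [0, 1].
        params = self.mlp(t)
        gain = F.softplus(params[:, :3]).view(-1, 3, 1, 1)
        gamma = F.softplus(params[:, 3:]).view(-1, 3, 1, 1) + 0.1
        return ((gain * hdr).clamp(min=1e-6) ** gamma).clamp(0.0, 1.0)

ldr = DynamicToneMapper()(torch.rand(2, 3, 8, 8) * 4.0, torch.rand(2, 1))
```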
[285] SRHand: Super-Resolving Hand Images and 3D Shapes via View/Pose-aware Neural Image Representations and Explicit 3D Meshes
Minje Kim, Tae-Kyun Kim
Main category: cs.CV
TL;DR: SRHand reconstructs detailed 3D hand geometry and textures from low-resolution images using a geometric-aware implicit image function that jointly optimizes image upsampling and 3D hand shapes.
Details
Motivation: Prior methods require high-resolution multi-view images and fail on low-resolution inputs. Existing super-resolution methods work only for static objects, not articulated hands.
Method: Proposes SRHand with geometric-aware implicit image function (GIIF) that learns hand priors to upsample images while jointly optimizing implicit image function and explicit 3D hand meshes.
Result: Significantly outperforms state-of-the-art methods on InterHand2.6M and Goliath datasets, achieving fine details like wrinkles and nails while preserving multi-view and pose consistency.
Conclusion: SRHand successfully reconstructs detailed 3D hand avatars from low-resolution images by combining implicit image representation with explicit hand geometry optimization.
Abstract: Reconstructing detailed hand avatars plays a crucial role in various applications. While prior works have focused on capturing high-fidelity hand geometry, they heavily rely on high-resolution multi-view image inputs and struggle to generalize on low-resolution images. Multi-view image super-resolution methods have been proposed to enforce 3D view consistency. These methods, however, are limited to static objects/scenes with fixed resolutions and are not applicable to articulated deformable hands. In this paper, we propose SRHand (Super-Resolution Hand), the method for reconstructing detailed 3D geometry as well as textured images of hands from low-resolution images. SRHand leverages the advantages of implicit image representation with explicit hand meshes. Specifically, we introduce a geometric-aware implicit image function (GIIF) that learns detailed hand prior by upsampling the coarse input images. By jointly optimizing the implicit image function and explicit 3D hand shapes, our method preserves multi-view and pose consistency among upsampled hand images, and achieves fine-detailed 3D reconstruction (wrinkles, nails). In experiments using the InterHand2.6M and Goliath datasets, our method significantly outperforms state-of-the-art image upsampling methods adapted to hand datasets, and 3D hand reconstruction methods, quantitatively and qualitatively. Project page: https://yunminjin2.github.io/projects/srhand
[286] Deepfakes: we need to re-think the concept of “real” images
Janis Keuper, Margret Keuper
Main category: cs.CV
TL;DR: The paper argues that current fake image detection research focuses too much on generative models while neglecting proper definition and data collection of “real” images, and questions whether fake detection is a sound objective given modern smartphone photography uses neural networks similar to fake generators.
Details
Motivation: To address the shortcomings in current fake image detection research, which relies on outdated “real” image datasets and fails to account for modern smartphone photography that uses neural network-based image formation algorithms.
Method: Position paper analyzing current research practices, highlighting reliance on old low-resolution datasets like ImageNet, and examining how modern smartphone photography algorithms resemble fake image generators.
Result: The analysis reveals fundamental flaws in current fake detection approaches, showing that the distinction between “real” and “fake” is becoming increasingly blurred due to similar underlying technologies in both image generation and modern photography.
Conclusion: The paper calls for rethinking the concept of “real” images, developing clear technical definitions, creating new benchmark datasets, and questioning whether fake image detection remains a valid research objective given technological convergence.
Abstract: The wide availability and low usability barrier of modern image generation models has triggered the reasonable fear of criminal misconduct and negative social implications. The machine learning community has been engaging this problem with an extensive series of publications proposing algorithmic solutions for the detection of “fake”, e.g. entirely generated or partially manipulated images. While there is undoubtedly some progress towards technical solutions of the problem, we argue that current and prior work is focusing too much on generative algorithms and “fake” data-samples, neglecting a clear definition and data collection of “real” images. The fundamental question “what is a real image?” might appear to be quite philosophical, but our analysis shows that the development and evaluation of basically all current “fake”-detection methods is relying on only a few, quite old low-resolution datasets of “real” images like ImageNet. However, the technology for the acquisition of “real” images, aka taking photos, has drastically evolved over the last decade: Today, over 90% of all photographs are produced by smartphones which typically use algorithms to compute an image from multiple inputs (over time) from multiple sensors. Based on the fact that these image formation algorithms are typically neural network architectures which are closely related to “fake”-image generators, we state the position that today, we need to re-think the concept of “real” images. The purpose of this position paper is to raise the awareness of the current shortcomings in this active field of research and to trigger an open discussion whether the detection of “fake” images is a sound objective at all. At the very least, we need a clear technical definition of “real” images and new benchmark datasets.
[287] Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization
Boyang Liu, Yifan Hu, Senjie Jin, Shihan Dou, Gonglei Shi, Jie Shao, Tao Gui, Xuanjing Huang
Main category: cs.CV
TL;DR: Aes-R1 is a reinforcement learning framework that improves multimodal LLMs’ aesthetic assessment by generating interpretable rationales alongside accurate scores through chain-of-thought reasoning and joint optimization of absolute and relative judgments.
Details
Motivation: Multimodal LLMs struggle with aesthetic assessment due to scarce multimodal aesthetic reasoning data and the subjective nature of aesthetic judgment, making it difficult to generate accurate judgments with interpretable rationales.
Method: Proposes Aes-R1 framework with AesCoT pipeline for constructing chain-of-thought aesthetic reasoning data, and RAPO (Relative-Absolute Policy Optimization) RL algorithm that jointly optimizes absolute score regression and relative ranking order.
Result: Improves backbone’s average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size, with robust generalization under limited supervision and out-of-distribution scenarios.
Conclusion: Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, enhancing aesthetic scoring and reasoning in a unified framework.
Abstract: Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start. After teaching the model to generate structured explanations prior to scoring, we then employ the Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone’s average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. More ablation studies validate Aes-R1’s robust generalization under limited supervision and in out-of-distribution scenarios.
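The joint absolute-plus-relative objective can be sketched as a reward that mixes per-image score error with pairwise ranking agreement across a batch. The weights and exact terms below are assumptions; the paper's RL formulation is more involved.

```python
import torch

def rapo_style_reward(pred_scores, gt_scores, w_abs=0.5, w_rel=0.5):
    # Absolute term: negative per-image score error (regression reward).
    abs_term = -(pred_scores - gt_scores).abs().mean()
    # Relative term: fraction of image pairs whose predicted order agrees
    # with the ground-truth order (ranking reward over the batch).
    i, j = torch.triu_indices(len(pred_scores), len(pred_scores), offset=1)
    agree = torch.sign(pred_scores[i] - pred_scores[j]) == torch.sign(gt_scores[i] - gt_scores[j])
    rel_term = agree.float().mean()
    return w_abs * abs_term + w_rel * rel_term

r = rapo_style_reward(torch.tensor([3.1, 4.5, 2.0]), torch.tensor([3.0, 4.0, 2.5]))
```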
[288] Drag4D: Align Your Motion with Text-Driven 3D Scene Generation
Minjun Kang, Inkyu Shin, Taeyeop Lee, In So Kweon, Kuk-Jin Yoon
Main category: cs.CV
TL;DR: Drag4D is an interactive framework for text-driven 3D scene generation with object motion control, enabling users to define 3D trajectories for objects and integrate them into high-quality 3D backgrounds.
Details
Motivation: To create an interactive system that allows users to control object motion within generated 3D scenes, addressing the need for precise spatial and temporal alignment of animated objects in 3D environments.
Method: Three-stage pipeline: 1) Enhanced text-to-3D background generation using 2D Gaussian Splatting with panoramic images, 2) 3D copy-and-paste approach with physics-aware object positioning, 3) Temporal animation using a part-augmented, motion-conditioned video diffusion model for view-consistent motion.
Result: The framework successfully generates harmonized 3D scenes with user-controlled object motion, demonstrating effective spatial alignment and view-consistent temporal animation through comprehensive evaluations.
Conclusion: Drag4D provides a unified architecture for interactive 3D scene generation with motion control, achieving high-quality results with precise object positioning and consistent motion animation in 3D environments.
Abstract: We introduce Drag4D, an interactive framework that integrates object motion control within text-driven 3D scene generation. This framework enables users to define 3D trajectories for the 3D objects generated from a single image, seamlessly integrating them into a high-quality 3D background. Our Drag4D pipeline consists of three stages. First, we enhance text-to-3D background generation by applying 2D Gaussian Splatting with panoramic images and inpainted novel views, resulting in dense and visually complete 3D reconstructions. In the second stage, given a reference image of the target object, we introduce a 3D copy-and-paste approach: the target instance is extracted in a full 3D mesh using an off-the-shelf image-to-3D model and seamlessly composited into the generated 3D scene. The object mesh is then positioned within the 3D scene via our physics-aware object position learning, ensuring precise spatial alignment. Lastly, the spatially aligned object is temporally animated along a user-defined 3D trajectory. To mitigate motion hallucination and ensure view-consistent temporal alignment, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories. We demonstrate the effectiveness of our unified architecture through evaluations at each stage and in the final results, showcasing the harmonized alignment of user-controlled object motion within a high-quality 3D background.
[289] Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers
Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh
Main category: cs.CV
TL;DR: Syncphony is an audio-to-video generation model that creates 380x640 resolution, 24fps videos synchronized with audio inputs using motion-aware loss and audio sync guidance to improve temporal alignment.
Details
Motivation: Existing audio-to-video models struggle with fine-grained synchronization due to indirect conditioning or limited temporal modeling, while audio provides temporal cues aligned with video motion for better temporal control.Method: Builds on pre-trained video backbone with two key components: (1) Motion-aware Loss emphasizing learning at high-motion regions, (2) Audio Sync Guidance using visually aligned off-sync model without audio layers to better exploit audio cues at inference.
Result: Outperforms existing methods on AVSync15 and The Greatest Hits datasets in both synchronization accuracy and visual quality, generating 380x640 resolution, 24fps videos.
Conclusion: Syncphony effectively addresses the synchronization challenge in audio-to-video generation through motion-aware training and audio guidance, achieving superior temporal alignment while maintaining visual quality.
Abstract: Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions; (2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality. Project page is available at: https://jibin86.github.io/syncphony_project_page
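The Audio Sync Guidance component resembles classifier-free-style guidance: the audio-conditioned prediction is extrapolated away from that of the off-sync model. A minimal sketch under that reading, with `full_model` and `offsync_model` as hypothetical interfaces:

```python
import torch

@torch.no_grad()
def audio_sync_guidance(full_model, offsync_model, x_t, t, audio, guidance_scale=2.0):
    """Guidance-style denoising step combining a full A2V model with an
    off-sync counterpart that lacks audio layers (hypothetical interfaces).

    Extrapolating from the off-sync prediction toward the audio-conditioned
    one amplifies the audio-driven component of the denoising direction,
    analogous to classifier-free guidance.
    """
    eps_full = full_model(x_t, t, audio)   # audio-conditioned prediction
    eps_off = offsync_model(x_t, t)        # visually aligned, audio-free prediction
    return eps_off + guidance_scale * (eps_full - eps_off)
```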
[290] LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation
Yixiao Liu, Yizhou Yang, Jinwen Li, Jun Tao, Ruoyu Li, Xiangkun Wang, Min Zhu, Junlong Cheng
Main category: cs.CV
TL;DR: A novel Language-Guided Change Detection model (LG-CD) that uses natural language prompts to improve remote sensing change detection accuracy by integrating visual and textual information through a multimodal approach.
Details
Motivation: Most deep learning methods focus only on unimodal visual information and neglect the rich semantic information from multimodal data like text, limiting change detection performance.Method: Uses SAM2 visual foundation model for multi-scale feature extraction, multi-layer adapters for fine-tuning, Text Fusion Attention Module (TFAM) for visual-text alignment, and Vision-Semantic Fusion Decoder (V-SFD) with cross-attention for final change masks.
Result: LG-CD consistently outperforms state-of-the-art methods on three datasets (LEVIR-CD, WHU-CD, SYSU-CD), demonstrating improved accuracy and robustness in change detection.
Conclusion: The approach provides new insights for generalized change detection by leveraging multimodal information, showing that language guidance significantly enhances remote sensing change detection performance.
Abstract: Remote Sensing Change Detection (RSCD) typically identifies changes in land cover or surface conditions by analyzing multi-temporal images. Currently, most deep learning-based methods primarily focus on learning unimodal visual information, while neglecting the rich semantic information provided by multimodal data such as text. To address this limitation, we propose a novel Language-Guided Change Detection model (LG-CD). This model leverages natural language prompts to direct the network’s attention to regions of interest, significantly improving the accuracy and robustness of change detection. Specifically, LG-CD utilizes a visual foundation model (SAM2) as a feature extractor to capture multi-scale pyramid features from high-resolution to low-resolution across bi-temporal remote sensing images. Subsequently, multi-layer adapters are employed to fine-tune the model for downstream tasks, ensuring its effectiveness in remote sensing change detection. Additionally, we design a Text Fusion Attention Module (TFAM) to align visual and textual information, enabling the model to focus on target change regions using text prompts. Finally, a Vision-Semantic Fusion Decoder (V-SFD) is implemented, which deeply integrates visual and semantic information through a cross-attention mechanism to produce highly accurate change detection masks. Our experiments on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD) demonstrate that LG-CD consistently outperforms state-of-the-art change detection methods. Furthermore, our approach provides new insights into achieving generalized change detection by leveraging multimodal information.
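As a rough illustration of how a text-fusion attention block of the TFAM kind can align visual and textual features, here is a minimal PyTorch sketch; the dimensions, residual wiring, and use of `nn.MultiheadAttention` are assumptions for clarity, not the paper's exact module:

```python
import torch
import torch.nn as nn

class TextFusionAttention(nn.Module):
    """Minimal sketch of a text-fusion attention block in the spirit of TFAM:
    visual features attend to text-prompt embeddings so that language cues
    re-weight change-relevant regions. Dimensions are illustrative."""

    def __init__(self, vis_dim=256, txt_dim=512, n_heads=8):
        super().__init__()
        self.proj_txt = nn.Linear(txt_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, H*W, vis_dim) flattened bi-temporal difference features
        # txt_tokens: (B, L, txt_dim) encoded language prompt
        txt = self.proj_txt(txt_tokens)
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + fused)  # residual fusion

# Usage with dummy tensors:
tfam = TextFusionAttention()
out = tfam(torch.randn(2, 64 * 64, 256), torch.randn(2, 12, 512))
```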
[291] TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation
Qihang Wang, Yaxiong Wang, Lechao Cheng, Zhun Zhong
Main category: cs.CV
TL;DR: A unified diffusion framework for joint text-drag image editing that combines text-driven texture control with drag-driven spatial precision through point-cloud deterministic drag and dynamic denoising guidance.
Details
Motivation: Current text-driven methods lack precise spatial control while drag-driven approaches miss fine-grained texture guidance, creating complementary limitations that need to be addressed through joint control.Method: Proposes a diffusion-based framework with two innovations: Point-Cloud Deterministic Drag for enhanced latent-space layout control via 3D feature mapping, and Drag-Text Guided Denoising that dynamically balances drag and text conditions during denoising.
Result: Achieves high-fidelity joint editing while matching or surpassing specialized text-only or drag-only approaches, supporting flexible editing modes (text-only, drag-only, or combined).
Conclusion: Establishes a versatile and generalizable solution for controllable image manipulation that integrates the strengths of both text and drag editing paradigms.
Abstract: This paper explores image editing under the joint control of text and drag interactions. While recent advances in text-driven and drag-driven editing have achieved remarkable progress, they suffer from complementary limitations: text-driven methods excel in texture manipulation but lack precise spatial control, whereas drag-driven approaches primarily modify shape and structure without fine-grained texture guidance. To address these limitations, we propose a unified diffusion-based framework for joint drag-text image editing, integrating the strengths of both paradigms. Our framework introduces two key innovations: (1) Point-Cloud Deterministic Drag, which enhances latent-space layout control through 3D feature mapping, and (2) Drag-Text Guided Denoising, dynamically balancing the influence of drag and text conditions during denoising. Notably, our model supports flexible editing modes - operating with text-only, drag-only, or combined conditions - while maintaining strong performance in each setting. Extensive quantitative and qualitative experiments demonstrate that our method not only achieves high-fidelity joint editing but also matches or surpasses the performance of specialized text-only or drag-only approaches, establishing a versatile and generalizable solution for controllable image manipulation. Code will be made publicly available to reproduce all results presented in this work.
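One plausible reading of "dynamically balancing drag and text conditions during denoising" is a timestep-dependent mix of two guidance directions, with drag dominating early steps (layout) and text dominating later steps (texture). A hedged sketch with an illustrative linear schedule, not the paper's exact formulation:

```python
import torch

def drag_text_guided_step(model, x_t, t, T, text_cond, drag_cond,
                          w_text=5.0, w_drag=3.0):
    """Hedged sketch of balancing drag and text conditions over denoising.

    Early (noisy) steps favor the drag branch, which fixes layout; later
    steps favor text, which refines texture. The linear schedule and the
    three-branch model interface are illustrative assumptions."""
    lam = t / T                                # 1.0 at the noisiest step
    eps_uncond = model(x_t, t, None, None)     # unconditional prediction
    eps_text = model(x_t, t, text_cond, None)  # text-only prediction
    eps_drag = model(x_t, t, None, drag_cond)  # drag-only prediction
    return (eps_uncond
            + lam * w_drag * (eps_drag - eps_uncond)
            + (1 - lam) * w_text * (eps_text - eps_uncond))
```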
[292] Enhancing Vehicle Detection under Adverse Weather Conditions with Contrastive Learning
Boying Li, Chang Liu, Petter Kyösti, Mattias Öhman, Devashish Singha Roy, Sofia Plazzi, Hamam Mokayed, Olle Hagner
Main category: cs.CV
TL;DR: A sideload-CL-adaptation framework that uses contrastive learning on unannotated UAV data to improve vehicle detection in Nordic regions with snow coverage challenges, achieving 3.8-9.5% mAP50 improvement.
Details
Motivation: Vehicle detection from UAV images in Nordic regions faces visibility challenges and domain shifts due to snow coverage, while annotated data is expensive but unannotated data is cheap to obtain.Method: Train CNN-based representation extractor through contrastive learning on unannotated data, then sideload it to frozen YOLO11n backbone during fine-tuning. Extensive experiments compare fusion methods and granularity.
Result: Proposed sideload-CL-adaptation model improves detection performance by 3.8% to 9.5% in terms of mAP50 on the NVD dataset.
Conclusion: The framework effectively leverages unannotated data to enhance vehicle detection performance in challenging Nordic conditions with snow coverage.
Abstract: Aside from common challenges in remote sensing, such as small, sparse targets and computation cost limitations, detecting vehicles from UAV images in the Nordic regions faces strong visibility challenges and domain shifts caused by diverse levels of snow coverage. Although annotated data is expensive, unannotated data is cheap to obtain by simply flying the drones. In this work, we propose a sideload-CL-adaptation framework that enables the use of unannotated data to improve vehicle detection using lightweight models. Specifically, we train a CNN-based representation extractor through contrastive learning on the unannotated data in the pretraining stage, and then sideload it to a frozen YOLO11n backbone in the fine-tuning stage. To find a robust sideload-CL-adaptation, we conducted extensive experiments comparing various fusion methods and granularities. Our proposed sideload-CL-adaptation model improves detection performance by 3.8% to 9.5% in terms of mAP50 on the NVD dataset.
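To make the sideloading idea concrete, here is a minimal sketch of fusing a contrastively pretrained branch with a frozen detector backbone; the 1x1-conv fusion and the channel sizes are illustrative choices, since the paper compares several fusion methods and granularities:

```python
import torch
import torch.nn as nn

class SideloadFusion(nn.Module):
    """Minimal sketch of sideloading a contrastively pretrained extractor onto
    a frozen detector backbone. `yolo_backbone` and `cl_encoder` stand in for
    the frozen YOLO11n backbone and the CNN trained with contrastive learning
    on unannotated UAV imagery."""

    def __init__(self, yolo_backbone, cl_encoder, det_ch=256, cl_ch=128):
        super().__init__()
        self.backbone = yolo_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.cl_encoder = cl_encoder         # sideloaded CL branch
        self.fuse = nn.Conv2d(det_ch + cl_ch, det_ch, kernel_size=1)

    def forward(self, x):
        feat_det = self.backbone(x)          # (B, det_ch, H, W)
        feat_cl = self.cl_encoder(x)         # (B, cl_ch, H, W), same spatial size assumed
        return self.fuse(torch.cat([feat_det, feat_cl], dim=1))
```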
[293] Multi-View Crowd Counting With Self-Supervised Learning
Hong Mo, Xiong Zhang, Tengfei Shi, Zhongbo Wu
Main category: cs.CV
TL;DR: SSLCounter is a self-supervised learning framework for multi-view counting that uses neural volumetric rendering to reduce reliance on large annotated datasets, achieving state-of-the-art performance with only 70% of training data.
Details
Motivation: Current multi-view counting methods rely heavily on fully supervised learning with large annotated datasets, which is resource-intensive and limits scalability.Method: Proposes SSLCounter framework that learns implicit scene representations through neural volumetric rendering, enabling reconstruction of continuous geometry and view-dependent appearance via differential neural rendering.
Result: Achieves state-of-the-art performance and demonstrates competitive results using only 70% of training data, showing superior data efficiency across multiple benchmarks.
Conclusion: SSLCounter provides an effective self-supervised alternative to fully supervised approaches for multi-view counting, offering better data efficiency and seamless integration into existing frameworks.
Abstract: Multi-view counting (MVC) methods have attracted significant research attention and stimulated remarkable progress in recent years. Despite their success, most MVC methods have focused on improving performance by following the fully supervised learning (FSL) paradigm, which often requires large amounts of annotated data. In this work, we propose SSLCounter, a novel self-supervised learning (SSL) framework for MVC that leverages neural volumetric rendering to alleviate the reliance on large-scale annotated datasets. SSLCounter learns an implicit representation of the scene, enabling the reconstruction of continuous geometry and the complex, view-dependent appearance of its 2D projections via differentiable neural rendering. Owing to its inherent flexibility, the key idea of our method can be seamlessly integrated into existing frameworks. Notably, extensive experiments demonstrate that SSLCounter not only achieves state-of-the-art performance but also remains competitive using only 70% of the training data, showcasing its superior data efficiency across multiple MVC benchmarks.
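The differentiable volume rendering underlying this kind of neural-rendering supervision follows the standard alpha-compositing formulation; a minimal sketch for a single ray, independent of SSLCounter's specific architecture:

```python
import torch

def volume_render(sigmas, colors, deltas):
    """Standard differentiable volume rendering along one ray.

    sigmas: (N,) densities at ray samples; colors: (N, C) per-sample
    appearance; deltas: (N,) distances between consecutive samples."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)           # per-sample opacity
    ones = torch.ones(1, dtype=alphas.dtype, device=alphas.device)
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = alphas * trans                             # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)   # rendered pixel value
```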
[294] Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding
Vahid Mirjalili, Ramin Giahi, Sriram Kollipara, Akshay Kekuda, Kehui Yao, Kai Zhao, Jianpeng Xu, Kaushiki Nag, Sinduja Subramaniam, Topojoy Biswas, Evren Korpeoglu, Kannan Achan
Main category: cs.CV
TL;DR: This paper presents a systematic benchmark for evaluating object-centric spatial reasoning in vision foundation models, revealing a trade-off between precise localization and relational reasoning capabilities.
Details
Motivation: There's a gap in current benchmarks that focus on localization accuracy rather than true spatial understanding - how objects are arranged and related within scenes, which is crucial for effective scene understanding.Method: Used a controlled synthetic dataset to evaluate state-of-the-art vision models and VLMs across three tasks: spatial localization, spatial reasoning, and downstream retrieval tasks.
Result: Found a stable trade-off: detectors provide precise bounding boxes with limited relational reasoning, while VLMs offer coarse layout cues and fluent captions but struggle with fine-grained spatial context.
Conclusion: Highlights the gap between localization and true spatial understanding, pointing toward the need for spatially-aware foundation models in the community.
Abstract: Spatial understanding is a critical capability for vision foundation models. While recent advances in large vision models or vision-language models (VLMs) have expanded recognition capabilities, most benchmarks emphasize localization accuracy rather than whether models capture how objects are arranged and related within a scene. This gap is consequential; effective scene understanding requires not only identifying objects, but reasoning about their relative positions, groupings, and depth. In this paper, we present a systematic benchmark for object-centric spatial reasoning in foundation models. Using a controlled synthetic dataset, we evaluate state-of-the-art vision models (e.g., GroundingDINO, Florence-2, OWLv2) and large VLMs (e.g., InternVL, LLaVA, GPT-4o) across three tasks: spatial localization, spatial reasoning, and downstream retrieval. We find a stable trade-off: detectors such as GroundingDINO and OWLv2 deliver precise boxes with limited relational reasoning, while VLMs like SmolVLM and GPT-4o provide coarse layout cues and fluent captions but struggle with fine-grained spatial context. Our study highlights the gap between localization and true spatial understanding, and points toward the need for spatially aware foundation models in the community.
[295] PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning
Jiahao Zhang, Bowen Wang, Hong Liu, Yuta Nakashima, Hajime Nagahara
Main category: cs.CV
TL;DR: PANICL is a training-free framework that improves visual in-context learning by using multiple in-context pairs instead of relying on a single pair, reducing bias and improving stability across various vision tasks.
Details
Motivation: Visual In-Context Learning often suffers from over-reliance on single in-context pairs, leading to biased and unstable predictions.Method: PAtch-based k-Nearest neighbor visual In-Context Learning (PANICL) leverages multiple in-context pairs to smooth assignment scores across pairs without requiring additional training.
Result: Extensive experiments show consistent improvements on foreground segmentation, object detection, colorization, multi-object segmentation, and keypoint detection. PANICL also demonstrates strong robustness to domain shifts and generalizes well to other VICL models.
Conclusion: PANICL is a versatile and broadly applicable framework that effectively mitigates bias in visual in-context learning through multi-patch utilization.
Abstract: Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. However, VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs. PANICL smooths assignment scores across pairs, reducing bias without requiring additional training. Extensive experiments on a variety of tasks, including foreground segmentation, single object detection, colorization, multi-object segmentation, and keypoint detection, demonstrate consistent improvements over strong baselines. Moreover, PANICL exhibits strong robustness to domain shifts, including dataset-level shift (e.g., from COCO to Pascal) and label-space shift (e.g., FSS-1000), and generalizes well to other VICL models such as SegGPT, Painter, and LVM, highlighting its versatility and broad applicability.
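The patch-based k-nearest-neighbor smoothing can be pictured as follows: each query patch votes with the labels of its nearest patches pooled from several in-context pairs rather than trusting a single pair. A minimal sketch with illustrative shapes, not the paper's exact scoring:

```python
import torch

def panicl_style_smoothing(query_feats, context_feats, context_labels, k=5):
    """Hedged sketch of smoothing assignment scores over multiple in-context
    pairs. query_feats: (Q, D) query-image patch features; context_feats:
    (P, D) patch features pooled from several in-context examples;
    context_labels: (P, C) their per-patch label scores. Each query patch
    aggregates the labels of its k nearest context patches."""
    dists = torch.cdist(query_feats, context_feats)           # (Q, P) distances
    knn_dist, knn_idx = dists.topk(k, dim=1, largest=False)   # k nearest patches
    weights = torch.softmax(-knn_dist, dim=1)                 # closer => higher weight
    knn_labels = context_labels[knn_idx]                      # (Q, k, C)
    return (weights.unsqueeze(-1) * knn_labels).sum(dim=1)    # (Q, C) smoothed scores
```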
[296] SingRef6D: Monocular Novel Object Pose Estimation with a Single RGB Reference
Jiahui Wang, Haiyue Zhu, Haoren Guo, Abdullah Al Mamun, Cheng Xiang, Tong Heng Lee
Main category: cs.CV
TL;DR: SingRef6D is a lightweight 6D pose estimation pipeline that uses only a single RGB reference image, eliminating the need for depth sensors or multi-view acquisition. It improves depth prediction for challenging surfaces and integrates depth-aware matching to handle difficult materials and lighting conditions.
Details
Motivation: Existing 6D pose estimation methods have practical limitations: depth-based methods fail with transparent/reflective surfaces, while RGB-based methods struggle in low-light and texture-less scenes due to lack of geometry information.Method: Proposes a token-scaler-based fine-tuning mechanism with novel optimization loss on Depth-Anything v2 for better depth prediction, and a depth-aware matching process that integrates spatial relationships within LoFTR for handling challenging materials and lighting.
Result: Achieves 14.41% improvement in depth prediction on REAL275 compared to Depth-Anything v2, and surpasses state-of-the-art methods on REAL275, ClearPose, and Toyota-Light datasets with 6.1% improvement in average recall for pose estimation.
Conclusion: SingRef6D provides a robust and capable solution for 6D pose estimation in resource-limited settings, effectively handling challenging surface conditions and lighting scenarios without requiring depth sensors or dense templates.
Abstract: Recent 6D pose estimation methods demonstrate notable performance but still face some practical limitations. For instance, many of them rely heavily on sensor depth, which may fail with challenging surface conditions, such as transparent or highly reflective materials. In the meantime, RGB-based solutions provide less robust matching performance in low-light and texture-less scenes due to the lack of geometry information. Motivated by these, we propose SingRef6D, a lightweight pipeline requiring only a single RGB image as a reference, eliminating the need for costly depth sensors, multi-view image acquisition, or training view synthesis models and neural fields. This enables SingRef6D to remain robust and capable even under resource-limited settings where depth or dense templates are unavailable. Our framework incorporates two key innovations. First, we propose a token-scaler-based fine-tuning mechanism with a novel optimization loss on top of Depth-Anything v2 to enhance its ability to predict accurate depth, even for challenging surfaces. Our results show a 14.41% improvement (in $\delta_{1.05}$) on REAL275 depth prediction compared to Depth-Anything v2 (with fine-tuned head). Second, benefiting from depth availability, we introduce a depth-aware matching process that effectively integrates spatial relationships within LoFTR, enabling our system to handle matching for challenging materials and lighting conditions. Evaluations of pose estimation on the REAL275, ClearPose, and Toyota-Light datasets show that our approach surpasses state-of-the-art methods, achieving a 6.1% improvement in average recall.
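The token-scaler fine-tuning can be read as learning lightweight per-token affine parameters on top of a frozen depth backbone, leaving the pretrained weights untouched. A minimal sketch under that assumption, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class TokenScaler(nn.Module):
    """Hedged sketch of token-scaler fine-tuning: a frozen depth backbone is
    adapted by learning per-token scale and shift parameters applied to its
    intermediate tokens. Placement and parameterization are illustrative
    assumptions, not the paper's exact mechanism."""

    def __init__(self, num_tokens=1370, dim=1024):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_tokens, dim))
        self.shift = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self, tokens):            # tokens: (B, num_tokens, dim)
        return tokens * self.scale + self.shift
```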
[297] DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation
Jiahui Wang, Changhao Chen
Main category: cs.CV
TL;DR: DynaNav is a dynamic visual navigation framework that adaptively selects features and layers based on scene complexity, achieving significant efficiency gains while improving navigation performance.
Details
Motivation: Existing foundation models for visual navigation suffer from high computational overhead and lack interpretability, limiting deployment in resource-constrained scenarios.Method: Proposes DynaNav with trainable hard feature selector for sparse operations and integrates feature selection with early-exit mechanism using Bayesian Optimization for optimal exit thresholds.
Result: Achieves 2.26x reduction in FLOPs, 42.3% lower inference time, and 32.8% lower memory usage compared to ViNT, while improving navigation performance across four public datasets.
Conclusion: DynaNav effectively addresses computational efficiency and interpretability challenges in visual navigation, making it suitable for resource-tight deployment scenarios.
Abstract: Visual navigation is essential for robotics and embodied AI. However, existing foundation models, particularly those with transformer decoders, suffer from high computational overhead and lack interpretability, limiting their deployment in resource-tight scenarios. To address this, we propose DynaNav, a Dynamic Visual Navigation framework that adapts feature and layer selection based on scene complexity. It employs a trainable hard feature selector for sparse operations, enhancing efficiency and interpretability. Additionally, we integrate feature selection into an early-exit mechanism, with Bayesian Optimization determining optimal exit thresholds to reduce computational cost. Extensive experiments on real-world datasets and in simulated environments demonstrate the effectiveness of DynaNav. Compared to ViNT, DynaNav achieves a 2.26x reduction in FLOPs, 42.3% lower inference time, and 32.8% lower memory usage, while improving navigation performance across four public datasets.
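A minimal sketch of an early-exit loop of the kind described, where the per-layer confidence heads and the thresholds tuned by Bayesian Optimization are assumed interfaces:

```python
def early_exit_forward(blocks, exit_heads, x, thresholds):
    """Hedged sketch of early exit: after each block, a lightweight head
    scores its own confidence, and inference stops at the first layer whose
    confidence clears its threshold (the quantity Bayesian Optimization
    would tune). All interfaces here are illustrative assumptions."""
    pred = None
    for block, head, tau in zip(blocks, exit_heads, thresholds):
        x = block(x)
        pred, confidence = head(x)        # head returns action and a scalar confidence
        if confidence.item() >= tau:
            return pred                   # exit early, skipping deeper blocks
    return pred                           # fall through to the final layer's output
```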
[298] SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet
Woosung Joung, Daewon Chae, Jinkyu Kim
Main category: cs.CV
TL;DR: SemanticControl is a training-free method that enables text-to-image diffusion models to effectively use loosely aligned visual conditions (like human poses for animal scenes) by adaptively suppressing conflicting visual guidance and strengthening text guidance using attention masks from surrogate prompts.
Details
Motivation: ControlNet requires precisely aligned visual conditions with text prompts, which is impractical for uncommon or imaginative scenes where suitable visual conditions are unavailable. Loosely aligned conditions (e.g., human poses for animal scenes) exist but current methods struggle with them, causing low text fidelity or artifacts.Method: Run auxiliary denoising with a surrogate prompt aligned with the visual condition to extract attention masks, then use these masks during denoising of the target prompt to suppress conflicting visual guidance and strengthen text guidance.
Result: Experimental results show improved performance under loosely aligned conditions across various condition types (depth maps, edge maps, human skeletons), outperforming existing baselines.
Conclusion: SemanticControl effectively leverages misaligned but semantically relevant visual conditions without requiring training, enabling better text-to-image generation for uncommon scenes where precisely aligned visual conditions are unavailable.
Abstract: ControlNet has enabled detailed spatial control in text-to-image diffusion models by incorporating additional visual conditions such as depth or edge maps. However, its effectiveness heavily depends on the availability of visual conditions that are precisely aligned with the generation goal specified by the text prompt, a requirement that often fails in practice, especially for uncommon or imaginative scenes. For example, generating an image of a cat cooking in a specific pose may be infeasible due to the lack of suitable visual conditions. In contrast, structurally similar cues can often be found in more common settings; for instance, poses of humans cooking are widely available and can serve as rough visual guides. Unfortunately, existing ControlNet models struggle to use such loosely aligned visual conditions, often resulting in low text fidelity or visual artifacts. To address this limitation, we propose SemanticControl, a training-free method for effectively leveraging misaligned but semantically relevant visual conditions. Our approach adaptively suppresses the influence of the visual condition where it conflicts with the prompt, while strengthening guidance from the text. The key idea is to first run an auxiliary denoising process using a surrogate prompt aligned with the visual condition (e.g., “a human playing guitar” for a human pose condition) to extract informative attention masks, and then utilize these masks during the denoising of the actual target prompt (e.g., “a cat playing guitar”). Experimental results demonstrate that our method improves performance under loosely aligned conditions across various condition types, including depth maps, edge maps, and human skeletons, outperforming existing baselines. Our code is available at https://mung3477.github.io/semantic-control.
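A hedged sketch of the two steps: extract a spatial mask from the cross-attention of an auxiliary run with the surrogate prompt, then use it to gate the visual-condition residual. The `return_attn` interface is an assumption; real diffusion pipelines usually capture attention maps with forward hooks:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_attention_mask(unet, x_t, t, surrogate_prompt_emb, token_idx, hw=(64, 64)):
    """Run one denoising step with the surrogate prompt (e.g. 'a human playing
    guitar' for a human-pose condition) and pool cross-attention over the
    subject token into a spatial mask. `return_attn=True` is a hypothetical
    interface standing in for attention hooks."""
    _, attn = unet(x_t, t, surrogate_prompt_emb, return_attn=True)  # (B, heads, H*W, L)
    mask = attn.mean(dim=1)[..., token_idx]                         # (B, H*W)
    mask = mask.reshape(-1, 1, *hw)
    return mask / mask.amax(dim=(2, 3), keepdim=True)               # normalize to [0, 1]

def gate_visual_condition(controlnet_residual, mask):
    # Suppress the visual condition where it conflicts with the target prompt;
    # text guidance would be strengthened elsewhere in the sampler (not shown).
    return controlnet_residual * F.interpolate(mask, size=controlnet_residual.shape[-2:])
```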
[299] Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Main category: cs.CV
TL;DR: The paper proposes a new Emotion Statement Judgment task and automated pipeline to evaluate Multimodal Large Language Models’ ability to perceive emotions from images, revealing their strengths in emotion interpretation but limitations in understanding perception subjectivity compared to humans.
Details
Motivation: Existing evaluation methods for MLLMs' emotion perception have constraints including oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations, leading to inconsistent results in zero-shot scenarios.Method: Proposed an Emotion Statement Judgment task and developed an automated pipeline to efficiently construct emotion-centric statements with minimal human effort, enabling systematic evaluation of MLLMs’ visual emotion perception capabilities.
Result: MLLMs show stronger performance in emotion interpretation and context-based emotion judgment, but have relative limitations in comprehending perception subjectivity. Even top-performing models like GPT4o demonstrate significant performance gaps compared to humans.
Conclusion: The study provides a fundamental evaluation framework that contributes to advancing emotional intelligence in MLLMs by identifying key areas for future improvement in visual emotion perception capabilities.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
[300] MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning
Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang, Zequn Qin, Xi Li
Main category: cs.CV
TL;DR: MultiCrafter is a framework for multi-subject image generation that addresses attribute leakage through explicit positional supervision, uses Mixture-of-Experts architecture for better scenario handling, and employs online reinforcement learning for human preference alignment.
Details
Motivation: Existing multi-subject image generation methods suffer from attribute leakage that compromises subject fidelity and fail to align with nuanced human preferences, particularly those using In-Context-Learning with simple reconstruction-based objectives.Method: 1) Explicit positional supervision to separate attention regions and mitigate attribute leakage; 2) Mixture-of-Experts architecture to handle diverse scenarios; 3) Online reinforcement learning framework with scoring mechanism for multi-subject fidelity assessment and stable training for MoE architecture.
Result: Experiments show the framework significantly improves subject fidelity while better aligning with human preferences compared to existing methods.
Conclusion: MultiCrafter effectively addresses the limitations of existing multi-subject image generation methods by tackling attribute leakage through attention separation, enhancing scenario handling via MoE, and aligning with human preferences through reinforcement learning.
Abstract: Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, are limited by their reliance on simple reconstruction-based objectives, leading both to severe attribute leakage that compromises subject fidelity and to poor alignment with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is a significant entanglement of attention between different subjects during the generation process. Therefore, we introduce explicit positional supervision to explicitly separate attention regions for each subject, effectively mitigating attribute leakage. To enable the model to accurately plan the attention region of different subjects in diverse scenarios, we employ a Mixture-of-Experts architecture to enhance the model’s capacity, allowing different experts to focus on different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism to accurately assess multi-subject fidelity and a more stable training strategy tailored for the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.
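The explicit positional supervision can be illustrated as a loss that keeps each subject's cross-attention mass inside its assigned region; a minimal sketch (the paper's exact supervision may differ):

```python
import torch

def attention_separation_loss(attn_maps, subject_masks):
    """Hedged sketch of positional supervision against attribute leakage.

    attn_maps: (B, S, H, W) aggregated cross-attention for each of S subjects;
    subject_masks: (B, S, H, W) binary region masks assigned to each subject.
    Normalizing each subject's attention and maximizing the mass falling
    inside its own region implicitly pushes it out of the other subjects'."""
    attn = attn_maps / attn_maps.sum(dim=(2, 3), keepdim=True).clamp_min(1e-8)
    inside = (attn * subject_masks).sum(dim=(2, 3))   # (B, S) in-region attention mass
    return (1.0 - inside).mean()                      # penalize mass outside the region
```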
[301] PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data
Zhe Zhu, Le Wan, Rui Xu, Yiheng Zhang, Honghua Chen, Zhiyang Dou, Cheng Lin, Yuan Liu, Mingqiang Wei
Main category: cs.CV
TL;DR: PartSAM is the first promptable 3D part segmentation model trained on large-scale 3D data, using a triplane-based dual-branch encoder and achieving superior performance over existing methods.
Details
Motivation: To overcome limitations of existing open-world part segmentation methods that rely on 2D foundation models and fail to capture intrinsic 3D geometry, leading to surface-only understanding and limited generalization.Method: Uses encoder-decoder architecture with triplane-based dual-branch encoder for scalable part-aware representation learning, trained on over 5 million 3D shape-part pairs curated through model-in-the-loop annotation pipeline.
Result: Outperforms state-of-the-art methods by large margins across multiple benchmarks, achieving highly accurate part identification with single prompts and automatic decomposition into surface and internal structures.
Conclusion: PartSAM represents a decisive step toward foundation models for 3D part understanding, demonstrating emergent open-world capabilities through scalable architecture and diverse 3D data.
Abstract: Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a Segment-Every-Part mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding. Our code and model will be released soon.
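Triplane encoders represent a shape as three axis-aligned feature planes queried per 3D point; a minimal sketch of that lookup, independent of PartSAM's dual-branch specifics:

```python
import torch
import torch.nn.functional as F

def triplane_features(planes, points):
    """Minimal sketch of triplane feature lookup: a 3D point is projected onto
    three axis-aligned feature planes, bilinearly sampled, and the results
    summed. planes: dict of (1, C, R, R) tensors for 'xy', 'xz', 'yz';
    points: (N, 3) coordinates normalized to [-1, 1]."""
    coords = {
        "xy": points[:, [0, 1]],
        "xz": points[:, [0, 2]],
        "yz": points[:, [1, 2]],
    }
    feats = 0
    for key, plane in planes.items():
        grid = coords[key].view(1, -1, 1, 2)                       # (1, N, 1, 2)
        sampled = F.grid_sample(plane, grid, align_corners=True)   # (1, C, N, 1)
        feats = feats + sampled[0, :, :, 0].T                      # (N, C)
    return feats

# Usage with dummy planes and points:
planes = {k: torch.randn(1, 32, 128, 128) for k in ("xy", "xz", "yz")}
feats = triplane_features(planes, torch.rand(1000, 3) * 2 - 1)     # (1000, 32)
```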
[302] No-Reference Image Contrast Assessment with Customized EfficientNet-B0
Javad Hassannataj Joloudari, Bita Mesbahzadeh, Omid Zare, Emrah Arslan, Roohallah Alizadehsani, Hossein Moosaei
Main category: cs.CV
TL;DR: Proposed a deep learning framework for blind contrast quality assessment using customized pre-trained models (EfficientNet B0, ResNet18, MobileNetV2) and a Siamese network, achieving state-of-the-art performance on benchmark datasets.
Details
Motivation: Most no-reference image quality assessment models struggle to accurately evaluate contrast distortions under diverse real-world conditions, despite contrast being fundamental to visual perception and image quality.Method: Customized and fine-tuned three pre-trained architectures (EfficientNet B0, ResNet18, MobileNetV2) with contrast-aware regression heads, plus a Siamese network model. Trained end-to-end using targeted data augmentations on CID2013 and CCID2014 datasets containing synthetic and authentic contrast distortions.
Result: The customized EfficientNet B0 achieved state-of-the-art performance with PLCC=0.9286/SRCC=0.9178 on CCID2014 and PLCC=0.9581/SRCC=0.9369 on CID2013, surpassing traditional methods and other deep baselines. The Siamese network model showed a limited ability to capture perceptual contrast distortions.
Conclusion: Contrast-aware adaptation of lightweight pre-trained networks provides a high-performing, scalable solution for no-reference contrast quality assessment suitable for real-time and resource-constrained applications.
Abstract: Image contrast is a fundamental factor in visual perception and plays a vital role in overall image quality. However, most no-reference image quality assessment (NR-IQA) models struggle to accurately evaluate contrast distortions under diverse real-world conditions. In this study, we propose a deep learning-based framework for blind contrast quality assessment by customizing and fine-tuning three pre-trained architectures, EfficientNet B0, ResNet18, and MobileNetV2, to predict the perceptual Mean Opinion Score, along with an additional model built on a Siamese network, which showed a limited ability to capture perceptual contrast distortions. Each model is modified with a contrast-aware regression head and trained end to end using targeted data augmentations on two benchmark datasets, CID2013 and CCID2014, containing synthetic and authentic contrast distortions. Performance is evaluated using the Pearson Linear Correlation Coefficient and the Spearman Rank-Order Correlation Coefficient, which assess the alignment between predicted and human-rated scores. Among these three models, our customized EfficientNet B0 achieved state-of-the-art performance with PLCC = 0.9286 and SRCC = 0.9178 on CCID2014 and PLCC = 0.9581 and SRCC = 0.9369 on CID2013, surpassing traditional methods and outperforming other deep baselines. These results highlight the model’s robustness and effectiveness in capturing perceptual contrast distortion. Overall, the proposed method demonstrates that contrast-aware adaptation of lightweight pre-trained networks can yield a high-performing, scalable solution for no-reference contrast quality assessment suitable for real-time and resource-constrained applications.
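A plausible minimal version of the customized EfficientNet B0: swap the classification head for a small regression head that outputs a single MOS prediction. The head width and dropout rate are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# Contrast-aware regression head sketch: replace the 1000-way classifier of a
# pretrained EfficientNet-B0 with an MLP that regresses a scalar MOS value.
backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
backbone.classifier = nn.Sequential(
    nn.Dropout(p=0.3),
    nn.Linear(1280, 256),   # 1280 = EfficientNet-B0 feature dimension
    nn.ReLU(inplace=True),
    nn.Linear(256, 1),      # scalar perceptual contrast-quality score
)

scores = backbone(torch.randn(4, 3, 224, 224))  # (4, 1) predicted MOS
```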
[303] Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
Zilun Zhang, Zian Guan, Tiancheng Zhao, Haozhan Shen, Tianyu Li, Yuxiang Cai, Zhonggen Su, Zhaojun Liu, Jianwei Yin, Xiang Li
Main category: cs.CV
TL;DR: Geo-R1 is a reasoning-centric reinforcement fine-tuning paradigm for few-shot geospatial referring that generates explicit reasoning chains before localizing objects, improving performance in data-scarce scenarios.
Details
Motivation: Supervised fine-tuning struggles with poor generalization in data-scarce scenarios for geospatial referring expression understanding, which requires complex object-context reasoning.Method: Proposes a “reason first, then act” process where the model generates explicit, interpretable reasoning chains to decompose referring expressions, then uses these rationales to localize target objects through reinforcement fine-tuning.
Result: Geo-R1 consistently and substantially outperforms SFT baselines on three few-shot geospatial referring benchmarks and demonstrates strong cross-dataset generalization.
Conclusion: The reasoning-centric reinforcement fine-tuning paradigm enables more effective use of limited annotations, enhances generalization, and provides interpretability for geospatial referring tasks.
Abstract: Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While supervised fine-tuning (SFT) of multimodal large language models achieves strong performance with massive labeled datasets, it struggles in data-scarce scenarios, leading to poor generalization. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 enforces the model to first generate explicit, interpretable reasoning chains that decompose referring expressions, and then leverage these rationales to localize target objects. This “reason first, then act” process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at http://geo-r1.github.io.
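Reinforcement fine-tuning of this kind typically pairs a format reward (did the model emit an explicit reasoning chain?) with a verifiable accuracy reward (IoU against the ground-truth box). A hedged sketch; the tag names and weights are assumptions, not the paper's recipe:

```python
import re

def geo_r1_style_reward(response: str, pred_box, gt_box) -> float:
    """Hedged sketch of a verifiable reward for 'reason first, then act' RFT:
    a small bonus for an explicit reasoning chain plus IoU between the
    predicted and ground-truth boxes (x0, y0, x1, y1)."""
    format_ok = bool(re.search(r"<think>.*</think>", response, re.DOTALL))

    def iou(a, b):
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        iy = max(0.0, min(ay1, by1) - max(ay0, by0))
        inter = ix * iy
        union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
        return inter / union if union > 0 else 0.0

    return 0.2 * format_ok + 0.8 * iou(pred_box, gt_box)
```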
[304] Benchmarking and Mitigating Psychological Sycophancy in Medical Vision-Language Models
Zikun Guo, Xinyue Xu, Pei Xiang, Shu Yang, Xin Han, Di Wang, Lijie Hu
Main category: cs.CV
TL;DR: This paper evaluates sycophantic behavior in clinical vision-language models (VLMs) and proposes a mitigation framework called VIPER that filters non-evidentiary content to generate evidence-first responses.
Details
Motivation: VLMs are increasingly used in clinical workflows but often prioritize alignment with user phrasing and social cues over evidence-based reasoning, which can compromise medical decision-making.Method: Created a medical sycophancy benchmark from PathVQA, SLAKE, and VQA-RAD datasets using psychologically motivated pressure templates. Proposed VIPER framework that filters non-evidentiary content and generates constrained evidence-first answers.
Result: VLMs showed significant vulnerability to sycophantic behavior, with only weak correlation to model accuracy or size. Imitation and expert-provided corrections were the most effective triggers. The VIPER framework reduced sycophancy on average, outperforming baselines while maintaining interpretability.
Conclusion: The benchmark and VIPER mitigation framework provide groundwork for robust deployment of medical VLMs in real-world clinician interactions, emphasizing the need for evidence-anchored defenses against sycophantic behavior.
Abstract: Vision-language models (VLMs) are increasingly integrated into clinical workflows, but they often exhibit sycophantic behavior, prioritizing alignment with user phrasing, social cues, or perceived authority over evidence-based reasoning. This study evaluates clinical sycophancy in medical visual question answering through a novel, clinically grounded benchmark. We construct a medical sycophancy dataset from PathVQA, SLAKE, and VQA-RAD, stratified by question type, organ system, and modality, using psychologically motivated pressure templates covering various forms of sycophancy. In our adversarial experiments on various VLMs, we found that these models are generally vulnerable, exhibiting significant variation in the occurrence of sycophantic responses, with only weak correlation to model accuracy or size. Imitation and expert-provided corrections were found to be the most effective triggers, suggesting that the models possess a bias mechanism independent of visual evidence. To address this, we propose Visual Information Purification for Evidence-based Response (VIPER), a lightweight mitigation strategy that filters non-evidentiary content, for example social pressure, and then generates constrained, evidence-first answers. This framework reduces sycophancy on average, outperforming baselines while maintaining interpretability. Our benchmark, analysis, and mitigation framework lay the groundwork for robust deployment of medical VLMs in real-world clinician interactions, emphasizing the need for evidence-anchored defenses.
[305] Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm
Zeyu Wang, Baiyu Chen, Kun Yan, Hongjing Piao, Hao Xue, Flora D. Salim, Yuanchun Shi, Yuntao Wang
Main category: cs.CV
TL;DR: GLARIFY is a novel method that leverages spatiotemporal gaze information to enhance Vision-Language Models’ effectiveness in real-world applications by addressing ambiguity challenges in smart glasses interactions.
Details
Motivation: With smart glasses gaining popularity, users' attention integration into VLMs faces ambiguity challenges: ambiguous verbal questions using pronouns or skipping context, and noisy gaze patterns with complex spatiotemporal relationships. Previous works only used single images, failing to capture dynamic attention.Method: GLARIFY analyzes gaze patterns, uses GPT-4o to create GLARIFY-Ambi dataset with chain-of-thought process for handling noisy gaze, and designs a heatmap module to incorporate gaze information into VLMs while preserving pretrained knowledge.
Result: Experiments on hold-out test set demonstrate that GLARIFY significantly outperforms baselines in handling ambiguous queries and noisy gaze patterns.
Conclusion: GLARIFY robustly aligns VLMs with human attention, paving the way for usable and intuitive interaction paradigms with visual assistants in smart glasses applications.
Abstract: With the rise in popularity of smart glasses, users’ attention has been integrated into Vision-Language Models (VLMs) to streamline multi-modal querying in daily scenarios. However, leveraging gaze data to model users’ attention may introduce ambiguity challenges: (1) users’ verbal questions become ambiguous by using pronouns or skipping context, (2) humans’ gaze patterns can be noisy and exhibit complex spatiotemporal relationships with their spoken questions. Previous works consider only a single image as the visual modality input, failing to capture the dynamic nature of the user’s attention. In this work, we introduce GLARIFY, a novel method to leverage spatiotemporal gaze information to enhance the model’s effectiveness in real-world applications. Initially, we analyzed hundreds of querying samples with the gaze modality to demonstrate the noisy nature of users’ gaze patterns. We then utilized GPT-4o to design an automatic data synthesis pipeline to generate the GLARIFY-Ambi dataset, which includes a dedicated chain-of-thought (CoT) process to handle noisy gaze patterns. Finally, we designed a heatmap module to incorporate gaze information into cutting-edge VLMs while preserving their pretrained knowledge. We evaluated GLARIFY using a hold-out test set. Experiments demonstrate that GLARIFY significantly outperforms baselines. By robustly aligning VLMs with human attention, GLARIFY paves the way for a usable and intuitive interaction paradigm with a visual assistant.
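One way to picture the heatmap module is as a residual gate that re-weights image tokens by the gaze heatmap while leaving the pretrained representation reachable; a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class GazeHeatmapModule(nn.Module):
    """Hedged sketch of a gaze heatmap module: a spatiotemporal gaze heatmap,
    pooled to the vision-token grid, gates image token embeddings so gazed
    regions contribute more, while a residual path preserves the pretrained
    representation. Dimensions and wiring are illustrative assumptions."""

    def __init__(self, dim=1024):
        super().__init__()
        self.gate = nn.Linear(1, dim)

    def forward(self, img_tokens, gaze_heatmap):
        # img_tokens: (B, N, dim); gaze_heatmap: (B, N), values in [0, 1] per token
        g = torch.sigmoid(self.gate(gaze_heatmap.unsqueeze(-1)))   # (B, N, dim)
        return img_tokens + img_tokens * g                         # residual gaze gating
```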
[306] From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs
Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Weili Guan, Jun Yu, Min Zhang
Main category: cs.CV
TL;DR: LVLMs have spatial bias where identical visual information placed at different image locations produces inconsistent outputs, due to unbalanced position embeddings in the language model component rather than the vision encoder.
Details
Motivation: To systematically study and address the spatial robustness limitations of Large Vision-Language Models, which show inconsistent performance when key visual information appears at different spatial locations.Method: Introduces Balanced Position Assignment (BaPA), a simple mechanism that assigns identical position embeddings to all image tokens to promote balanced integration of visual information across spatial positions.
Result: BaPA enhances spatial robustness without retraining and boosts performance across multimodal benchmarks when combined with lightweight fine-tuning, achieving more balanced attention and holistic visual understanding.
Conclusion: The spatial bias in LVLMs stems from unbalanced position embedding designs, and BaPA effectively mitigates this issue by promoting equal treatment of visual information across all spatial locations.
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we present a systematic study of the spatial bias of LVLMs, focusing on how models respond when identical key visual information is placed at different locations within an image. Through a carefully designed probing dataset, we demonstrate that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a fundamental limitation in their spatial-semantic understanding. Further analysis shows that this phenomenon originates not from the vision encoder, which reliably perceives and interprets visual content across positions, but from the unbalanced design of position embeddings in the language model component. In particular, the widely adopted position embedding strategies, such as RoPE, introduce imbalance during cross-modal interaction, leading image tokens at different positions to exert unequal influence on semantic understanding. To mitigate this issue, we introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens, promoting a more balanced integration of visual information. Extensive experiments show that BaPA enhances the spatial robustness of LVLMs without retraining and further boosts their performance across diverse multimodal benchmarks when combined with lightweight fine-tuning. Further analysis of information flow reveals that BaPA yields balanced attention, enabling more holistic visual understanding.
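BaPA's core operation, giving all image tokens the same position id so RoPE imposes no spatial ordering among them, is easy to sketch; the compaction of subsequent text positions is an illustrative detail:

```python
import torch

def balanced_position_ids(input_ids, image_token_id):
    """Minimal sketch of Balanced Position Assignment: every image token gets
    the position id of the first image token, so RoPE treats them uniformly;
    text tokens keep ordinary sequential ids. Simplified to a single
    contiguous image span."""
    position_ids = torch.arange(input_ids.shape[-1])
    img_mask = input_ids == image_token_id
    if img_mask.any():
        first = int(position_ids[img_mask][0])
        position_ids[img_mask] = first                     # identical ids for all image tokens
        # Shift later text tokens so the position sequence stays compact.
        after = (position_ids > first) & ~img_mask
        position_ids[after] -= int(img_mask.sum()) - 1
    return position_ids

ids = torch.tensor([0, 1, 9, 9, 9, 9, 2, 3])              # 9 marks image tokens
print(balanced_position_ids(ids, image_token_id=9))        # tensor([0, 1, 2, 2, 2, 2, 3, 4])
```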
[307] Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation
Abdelrahman Eldesokey, Aleksandar Cvejic, Bernard Ghanem, Peter Wonka
Main category: cs.CV
TL;DR: Proposes a method to disentangle visual and semantic features from pre-trained diffusion models, enabling visual correspondence analysis and introducing a new metric (VSM) for quantifying visual inconsistencies in subject-driven image generation.
Details
Motivation: Diffusion model backbones contain both semantic and visual features needed for image synthesis, but isolating visual features is challenging due to lack of annotated datasets. This work aims to enable visual correspondence analysis similar to semantic correspondence.Method: Uses an automated pipeline to construct image pairs with annotated semantic and visual correspondences from existing datasets, and designs a contrastive architecture to separate the two feature types.
Result: The approach outperforms global feature-based metrics like CLIP, DINO, and vision-language models in quantifying visual inconsistencies while enabling spatial localization of inconsistent regions.
Conclusion: This is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, providing a valuable tool for advancing this task.
Abstract: We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision–language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project Page: https://abdo-eldesokey.github.io/mind-the-glitch/
[308] ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim
Main category: cs.CV
TL;DR: ERGO is an efficient vision-language model that uses a two-stage coarse-to-fine reasoning pipeline to reduce computational costs while maintaining accuracy by focusing only on task-relevant image regions.
Details
Motivation: Existing Large Vision-Language Models incur substantial computational overhead due to processing large numbers of vision tokens, especially for high-resolution images. The emergence of 'thinking with images' models enables visual reasoning, motivating a more efficient approach.Method: A two-stage ‘coarse-to-fine’ pipeline: first analyzes a downsampled image to identify task-relevant regions, then crops only these regions at full resolution for detailed processing. Uses reinforcement learning with reward components for coarse-to-fine perception.
Result: ERGO achieves higher accuracy than original models and competitive methods with greater efficiency. It surpasses Qwen2.5-VL-7B on V* benchmark by 4.7 points while using only 23% of vision tokens, achieving 3x inference speedup.
Conclusion: The proposed reasoning-driven perception approach effectively reduces computational costs while preserving accuracy by focusing on relevant image regions, demonstrating significant efficiency improvements in vision-language tasks.
Abstract: Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of “thinking with images” models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage “coarse-to-fine” reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation) that performs reasoning-driven perception-leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
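The two-stage pipeline reduces to: reason over a cheap low-resolution view, then re-encode only the selected crops at full resolution. A hedged sketch where `vlm.locate` and `vlm.answer` are hypothetical stand-ins for the model's region-proposal and answering behavior:

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_answer(vlm, image, question, down=4):
    """Hedged sketch of a 'coarse-to-fine' inference pipeline. `image` is a
    (C, H, W) tensor; `vlm.locate` and `vlm.answer` are assumed interfaces,
    and boxes are integer (x0, y0, x1, y1) in full-resolution coordinates."""
    # Stage 1: cheap pass on a low-resolution view to find task-relevant regions.
    small = F.interpolate(image.unsqueeze(0), scale_factor=1 / down, mode="bilinear")
    boxes = vlm.locate(small.squeeze(0), question)

    # Stage 2: process only full-resolution crops of the selected regions.
    crops = [image[:, y0:y1, x0:x1] for (x0, y0, x1, y1) in boxes]
    return vlm.answer(crops, question)
```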
[309] DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints
Sungmin Woo, Sangyoun Lee
Main category: cs.CV
TL;DR: DualFocus is a novel Depth-from-Focus framework that uses dual constraints (spatial and focal) to improve depth estimation by leveraging gradient patterns in focal stacks, addressing challenges in complex scenes with fine textures or abrupt depth changes.
Details
Motivation: Existing learning-based DFF methods struggle in complex scenes with fine textures or abrupt depth changes where focus cues become ambiguous or misleading, limiting their robustness and accuracy.
Method: Introduces a variational formulation with dual constraints: spatial constraints analyze gradient pattern changes across focus levels to distinguish depth edges from texture artifacts, and focal constraints enforce unimodal, monotonic focus probabilities aligned with physical focus behavior.
Result: Comprehensive experiments on four public datasets show that DualFocus consistently outperforms state-of-the-art methods in both depth accuracy and perceptual quality.
Conclusion: The proposed dual constraint framework with spatial and focal inductive biases significantly improves robustness and accuracy in challenging DFF scenarios, demonstrating superior performance over existing methods.
Abstract: Depth-from-Focus (DFF) enables precise depth estimation by analyzing focus cues across a stack of images captured at varying focal lengths. While recent learning-based approaches have advanced this field, they often struggle in complex scenes with fine textures or abrupt depth changes, where focus cues may become ambiguous or misleading. We present DualFocus, a novel DFF framework that leverages the focal stack’s unique gradient patterns induced by focus variation, jointly modeling focus changes over spatial and focal dimensions. Our approach introduces a variational formulation with dual constraints tailored to DFF: spatial constraints exploit gradient pattern changes across focus levels to distinguish true depth edges from texture artifacts, while focal constraints enforce unimodal, monotonic focus probabilities aligned with physical focus behavior. These inductive biases improve robustness and accuracy in challenging regions. Comprehensive experiments on four public datasets demonstrate that DualFocus consistently outperforms state-of-the-art methods in both depth accuracy and perceptual quality.
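The focal constraint lends itself to a small worked example: a penalty that is zero exactly when the focus-probability curve rises monotonically to a single peak along the focal dimension and falls afterwards. This is one plausible reading of the unimodality constraint, not the paper's exact variational term.

```python
import torch

def unimodality_penalty(p):
    """p: (B, S) focus probabilities over S focal slices (rows sum to 1)."""
    peak = p.argmax(dim=1, keepdim=True)            # (B, 1) peak slice index
    idx = torch.arange(p.size(1), device=p.device)  # (S,)
    diffs = p[:, 1:] - p[:, :-1]                    # (B, S-1) slice-to-slice change
    # Before the peak the curve should rise (diff >= 0), after it fall.
    rising = idx[:-1].unsqueeze(0) < peak           # (B, S-1)
    violation = torch.where(rising, (-diffs).clamp(min=0), diffs.clamp(min=0))
    return violation.mean()
```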
[310] Rate-Distortion Optimized Communication for Collaborative Perception
Genjia Liu, Anning Hu, Yue Hu, Wenjun Zhang, Siheng Chen
Main category: cs.CV
TL;DR: The paper introduces RDcomm, a communication-efficient collaborative perception framework based on information theory that optimizes the trade-off between task performance and communication volume in multi-agent systems.
Details
Motivation: Prior work lacks theoretical foundation for the performance-communication trade-off in collaborative perception. The authors aim to fill this gap using information theory to provide principled guidance for designing optimal communication strategies.
Method: Proposes RDcomm framework with two key innovations: 1) task entropy discrete coding that assigns task-relevant codeword-lengths to features, and 2) mutual-information-driven message selection using neural estimation to minimize redundancy.
Result: RDcomm achieves state-of-the-art accuracy on DAIR-V2X and OPV2V datasets for 3D object detection and BEV segmentation, while reducing communication volume by up to 108 times compared to previous methods.
Conclusion: The proposed rate-distortion theory provides theoretical foundation for collaborative perception, and RDcomm demonstrates practical effectiveness in balancing performance and communication efficiency through principled information-theoretic approaches.
Abstract: Collaborative perception emphasizes enhancing environmental understanding by enabling multiple agents to share visual information with limited bandwidth resources. While prior work has explored the empirical trade-off between task performance and communication volume, a significant gap remains in the theoretical foundation. To fill this gap, we draw on information theory and introduce a pragmatic rate-distortion theory for multi-agent collaboration, specifically formulated to analyze performance-communication trade-off in goal-oriented multi-agent systems. This theory concretizes two key conditions for designing optimal communication strategies: supplying pragmatically relevant information and transmitting redundancy-less messages. Guided by these two conditions, we propose RDcomm, a communication-efficient collaborative perception framework that introduces two key innovations: i) task entropy discrete coding, which assigns features with task-relevant codeword-lengths to maximize the efficiency in supplying pragmatic information; ii) mutual-information-driven message selection, which utilizes mutual information neural estimation to approach the optimal redundancy-less condition. Experiments on 3D object detection and BEV segmentation demonstrate that RDcomm achieves state-of-the-art accuracy on DAIR-V2X and OPV2V, while reducing communication volume by up to 108 times. The code will be released.
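The "task entropy discrete coding" idea follows classic source coding: features carrying rarer, more informative content warrant longer codewords, common ones shorter, per the -log2 p rule. A toy illustration with hypothetical codeword frequencies:

```python
import math
from collections import Counter

# Hypothetical usage counts for feature codewords in a driving scene.
codeword_usage = Counter({"road": 500, "car": 300, "pedestrian": 150, "debris": 50})
total = sum(codeword_usage.values())
for token, count in codeword_usage.items():
    p = count / total
    length = math.ceil(-math.log2(p))  # Shannon code length in bits
    print(f"{token:10s} p={p:.2f} -> {length} bits")
```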
[311] FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration
Muxi Chen, Zhaohua Zhang, Chenchen Zhao, Mingyang Chen, Wenyu Jiang, Tianwen Jiang, Jianhuan Zhuo, Yu Tang, Qiuyong Xiao, Jihong Zhang, Qiang Xu
Main category: cs.CV
TL;DR: FailureAtlas is a framework that autonomously explores and maps failure landscapes of Text-to-Image models through active exploration, discovering hundreds of thousands of error slices and linking them to training data scarcity.
Details
Motivation: Static benchmarks have limited diagnostic power for uncovering systematic failures in T2I models, so a complementary active exploration paradigm is needed to better understand failure landscapes and root causes.
Method: Frames error discovery as structured search for minimal failure-inducing concepts, using novel acceleration techniques to make the computationally explosive problem tractable.
Result: Applied to Stable Diffusion models, discovered over 247,000 previously unknown error slices in SD1.5 alone, providing large-scale evidence linking failures to training data scarcity.
Conclusion: Establishes a new diagnostic-first methodology for deep model auditing to guide development of more robust generative AI.
Abstract: Static benchmarks have provided a valuable foundation for comparing Text-to-Image (T2I) models. However, their passive design offers limited diagnostic power, struggling to uncover the full landscape of systematic failures or isolate their root causes. We argue for a complementary paradigm: active exploration. We introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscape of T2I models at scale. FailureAtlas frames error discovery as a structured search for minimal, failure-inducing concepts. While it is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI. The code is available at https://github.com/cure-lab/FailureAtlas
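Framing error discovery as a structured search for minimal failure-inducing concept sets can be illustrated with a toy search loop. Here `generates_failure` is a hypothetical oracle wrapping the T2I model plus an automatic failure checker, and the pruning rule (skip supersets of known failures) loosely stands in for the paper's acceleration techniques.

```python
from itertools import combinations

def find_minimal_failures(concepts, generates_failure, max_size=3):
    failures = []
    # Test small concept combinations first so any failure found is minimal.
    for size in range(1, max_size + 1):
        for combo in combinations(concepts, size):
            if any(set(f) <= set(combo) for f in failures):
                continue  # a smaller failing subset already explains this combo
            if generates_failure(combo):
                failures.append(combo)
    return failures
```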
[312] Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics
Saurav Jha, Stefan K. Ehrlich
Main category: cs.CV
TL;DR: A lightweight multimodal framework for healthcare robotics that combines Qwen2.5-VL-3B-Instruct with SmolAgent orchestration for enhanced video-based scene understanding, temporal reasoning, and structured outputs.
Details
Motivation: Current Vision-Language Models lack sufficient temporal reasoning, uncertainty estimation, and structured output capabilities needed for safe and robust robotic applications in dynamic clinical environments.
Method: Combines Qwen2.5-VL-3B-Instruct model with SmolAgent-based orchestration layer, supporting chain-of-thought reasoning, speech-vision fusion, dynamic tool invocation, and hybrid retrieval for structured scene graph generation.
Result: Achieves competitive accuracy on Video-MME benchmark and custom clinical dataset, showing improved robustness compared to state-of-the-art VLMs.
Conclusion: The framework demonstrates strong potential for applications in robot-assisted surgery, patient monitoring, and clinical decision support systems.
Abstract: Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.
[313] Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors
Youxu Shi, Suorong Yang, Dong Liu
Main category: cs.CV
TL;DR: A training-free method reduces hallucinations in multimodal LLMs by using text-to-image projection to create negative anchors and editing decoder hidden states to pull representations toward faithful semantics.
Details
Motivation: MLLMs suffer from hallucinations that produce content inconsistent with visual evidence, and existing mitigation approaches require additional finetuning, handcrafted priors, or compromise informativeness and scalability.
Method: Introduces hallucination amplification: projecting captions into visual space via text-to-image model to reveal implicit hallucination signals as negative anchors, while original image provides positive anchors. Edits decoder hidden states by pulling representations toward faithful semantics and pushing away from hallucination directions.
Result: Significantly reduces hallucinations at object, attribute, and relation levels while preserving recall and caption richness. Achieves over 5% hallucination reduction on CHAIR using LLaVA-v1.5-7B. Validates strong cross-architecture generalization on diverse models.
Conclusion: The training-free, self-supervised method effectively mitigates hallucinations without side effects on hallucination-free captions, demonstrating robustness and practical plug-and-play applicability.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet they remain highly susceptible to hallucinations, producing content that is fluent but inconsistent with visual evidence. Such hallucinations, spanning objects, attributes, and relations, persist even in larger models, while existing mitigation approaches often require additional finetuning, handcrafted priors, or trade-offs that compromise informativeness and scalability. To address this limitation, we propose a training-free, self-supervised method for hallucination mitigation. Our approach introduces a novel hallucination amplification mechanism: a caption is projected into the visual space via a text-to-image model to reveal implicit hallucination signals, serving as a negative anchor, while the original image provides a positive anchor. Leveraging these dual anchors, we edit decoder hidden states by pulling representations toward faithful semantics and pushing them away from hallucination directions. This correction requires no human priors or additional training costs, ensuring both effectiveness and efficiency. Extensive experiments across multiple benchmarks show that our method significantly reduces hallucinations at the object, attribute, and relation levels while largely preserving recall and caption richness, e.g., achieving a hallucination reduction by over 5% using LLaVA-v1.5-7B on CHAIR. Furthermore, results on diverse architectures, including LLaVA-NEXT-7B, Cambrian-8B, and InstructBLIP-7B, validate strong cross-architecture generalization. More importantly, when applied to hallucination-free captions, our method introduces almost no side effects, underscoring its robustness and practical plug-and-play applicability. The implementation will be publicly available.
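The dual-anchor editing step reduces to simple vector arithmetic on decoder hidden states. A minimal sketch, assuming the positive and negative anchors are already encoded into the same feature space and that the edit is a scaled pull/push along normalized directions (the coefficients are illustrative, not the paper's values):

```python
import torch

def edit_hidden_state(h, h_pos, h_neg, alpha=0.1, beta=0.1):
    """h: decoder hidden state; h_pos: features of the real image (positive
    anchor); h_neg: features of the caption re-rendered by a text-to-image
    model (negative anchor). All tensors share the last (feature) dim."""
    pull = h_pos - h  # direction toward faithful semantics
    push = h_neg - h  # direction toward the amplified hallucination
    pull = pull / (pull.norm(dim=-1, keepdim=True) + 1e-6)
    push = push / (push.norm(dim=-1, keepdim=True) + 1e-6)
    return h + alpha * pull - beta * push
```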
[314] CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, Mike Zheng Shou
Main category: cs.CV
TL;DR: CoFFT is a training-free approach that enhances VLMs’ visual reasoning by emulating human visual cognition through iterative foresight-focus thought cycles.
Details
Motivation: VLMs are constrained by complex and redundant visual input, leading to interference, excessive task-irrelevant reasoning, and hallucinations due to inability to precisely discover and process required regions during reasoning.
Method: Three-stage iterative process: (1) Diverse Sample Generation for exploring reasoning paths, (2) Dual Foresight Decoding to evaluate samples based on visual focus and reasoning progression, (3) Visual Focus Adjustment to precisely adjust focus toward beneficial regions for future reasoning.
Result: Consistent performance improvements of 3.1-5.8% across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next with controllable computational overhead.
Conclusion: CoFFT successfully enhances VLMs’ visual reasoning by creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning, addressing limitations of current VLMs.
Abstract: Despite significant advances in Vision Language Models (VLMs), they remain constrained by the complexity and redundancy of visual input. When images contain large amounts of irrelevant information, VLMs are susceptible to interference, thus generating excessive task-irrelevant reasoning processes or even hallucinations. This limitation stems from their inability to precisely discover and process the required regions during reasoning. To address this limitation, we present the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs’ visual reasoning by emulating human visual cognition. Each Foresight-Focus Thought consists of three stages: (1) Diverse Sample Generation: generates diverse reasoning samples to explore potential reasoning paths, where each sample contains several reasoning steps; (2) Dual Foresight Decoding: rigorously evaluates these samples based on both visual focus and reasoning progression, adding the first step of the optimal sample to the reasoning process; (3) Visual Focus Adjustment: precisely adjusts visual focus toward regions most beneficial for future reasoning, before returning to stage (1) to generate subsequent reasoning samples until reaching the final answer. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.
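The three stages form one loop per committed reasoning step. The skeleton below makes the control flow concrete; `sample`, `score`, and `adjust` are hypothetical callables standing in for the three stages, and the candidate objects' `steps` and `is_final` fields are assumed for illustration.

```python
def cofft(question, image, focus, sample, score, adjust, max_iters=8):
    reasoning = []
    for _ in range(max_iters):
        # Stage 1: diverse sample generation from the current visual focus.
        candidates = sample(question, image, focus, reasoning)
        # Stage 2: dual foresight decoding: pick the best candidate by
        # jointly scoring visual focus and reasoning progression.
        best = max(candidates, key=lambda c: score(c, focus))
        reasoning.append(best.steps[0])  # commit only its first step
        if best.is_final:
            break
        # Stage 3: move visual focus toward regions useful for what comes next.
        focus = adjust(focus, reasoning)
    return reasoning
```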
[315] EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking
Yuki Sakai, Ryosuke Furuta, Juichun Yen, Yoichi Sato
Main category: cs.CV
TL;DR: This paper introduces a new egocentric video dataset for face-to-face instruction analysis and benchmarks multimodal large language models (MLLMs) against conventional models on procedural step segmentation and conversation-state classification tasks.
Details
Motivation: Face-to-face instructional interactions are critical for educational support but haven't been systematically studied in computer vision due to lack of suitable datasets and limited analytical techniques.
Method: Created a new egocentric video dataset with ground-truth annotations, then benchmarked multimodal large language models (MLLMs) that jointly process images, audio, and text against conventional task-specific models.
Result: MLLMs outperformed specialized baselines even without task-specific fine-tuning, demonstrating their promise for holistic understanding of instructional interactions.
Conclusion: Multimodal large language models show strong potential for comprehensive understanding of face-to-face instructional scenes by effectively integrating verbal and nonverbal communication modalities.
Abstract: Analyzing instructional interactions between an instructor and a learner who are co-present in the same physical space is a critical problem for educational support and skill transfer. Yet such face-to-face instructional scenes have not been systematically studied in computer vision. We identify two key reasons: i) the lack of suitable datasets and ii) limited analytical techniques. To address this gap, we present a new egocentric video dataset of face-to-face instruction and provide ground-truth annotations for two fundamental tasks that serve as a first step toward a comprehensive understanding of instructional interactions: procedural step segmentation and conversation-state classification. Using this dataset, we benchmark multimodal large language models (MLLMs) against conventional task-specific models. Since face-to-face instruction involves multiple modalities (speech content and prosody, gaze and body motion, and visual context), effective understanding requires methods that handle verbal and nonverbal communication in an integrated manner. Accordingly, we evaluate recently introduced MLLMs that jointly process images, audio, and text. This evaluation quantifies the extent to which current machine learning models understand face-to-face instructional scenes. In experiments, MLLMs outperform specialized baselines even without task-specific fine-tuning, suggesting their promise for holistic understanding of instructional interactions.
[316] SpecXNet: A Dual-Domain Convolutional Network for Robust Deepfake Detection
Inzamamul Alam, Md Tanvir Islam, Simon S. Woo
Main category: cs.CV
TL;DR: SpecXNet is a dual-domain deepfake detection architecture that combines spatial and spectral features using Dual-Domain Feature Coupler and Dual Fourier Attention modules, achieving state-of-the-art performance with improved generalization.
Details
Motivation: GANs and diffusion models create increasingly realistic deepfakes, making detection challenging. Existing methods focus only on spatial or frequency features, limiting generalization to unseen manipulations.
Method: Proposes Spectral Cross-Attentional Network (SpecXNet) with Dual-Domain Feature Coupler (DDFC) that decomposes features into local spatial branch for texture anomalies and global spectral branch using FFT for periodic inconsistencies. Uses Dual Fourier Attention (DFA) to dynamically fuse spatial and spectral features. Built on modified XceptionNet backbone.
Result: Achieves state-of-the-art accuracy on multiple deepfake benchmarks, particularly under cross-dataset and unseen manipulation scenarios, while maintaining real-time feasibility.
Conclusion: Unified spatial-spectral learning is effective for robust and generalizable deepfake detection. Code released on GitHub for reproducibility.
Abstract: The increasing realism of content generated by GANs and diffusion models has made deepfake detection significantly more challenging. Existing approaches often focus solely on spatial or frequency-domain features, limiting their generalization to unseen manipulations. We propose the Spectral Cross-Attentional Network (SpecXNet), a dual-domain architecture for robust deepfake detection. The core Dual-Domain Feature Coupler (DDFC) decomposes features into a local spatial branch for capturing texture-level anomalies and a global spectral branch that employs the Fast Fourier Transform to model periodic inconsistencies. This dual-domain formulation allows SpecXNet to jointly exploit localized detail and global structural coherence, which are critical for distinguishing authentic from manipulated images. We also introduce the Dual Fourier Attention (DFA) module, which dynamically fuses spatial and spectral features in a content-aware manner. Built atop a modified XceptionNet backbone, we embed the DDFC and DFA modules within a separable convolution block. Extensive experiments on multiple deepfake benchmarks show that SpecXNet achieves state-of-the-art accuracy, particularly under cross-dataset and unseen manipulation scenarios, while maintaining real-time feasibility. Our results highlight the effectiveness of unified spatial-spectral learning for robust and generalizable deepfake detection. To ensure reproducibility, we release the full code on GitHub: https://github.com/inzamamulDU/SpecXNet
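A toy dual-domain block conveys the DDFC idea: one branch convolves in the spatial domain while the other mixes features in the Fourier domain before transforming back. Layer sizes and the additive fusion are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class DualDomainBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.spectral = nn.Conv2d(2 * channels, 2 * channels, 1)

    def forward(self, x):
        local = self.spatial(x)                      # texture-level branch
        freq = torch.fft.rfft2(x, norm="ortho")      # complex (B, C, H, W//2+1)
        z = torch.cat([freq.real, freq.imag], dim=1) # stack real/imag parts
        z = self.spectral(z)                         # mix in the Fourier domain
        real, imag = z.chunk(2, dim=1)
        glob = torch.fft.irfft2(torch.complex(real, imag),
                                s=x.shape[-2:], norm="ortho")
        return local + glob                          # simple additive fusion
```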
[317] Large Material Gaussian Model for Relightable 3D Generation
Jingrui Ye, Lingting Zhu, Runze Zhang, Zeyu Hu, Yingda Yin, Lanjiong Li, Lequan Yu, Qingmin Liao
Main category: cs.CV
TL;DR: MGM is a novel framework that generates 3D content with Physically Based Rendering (PBR) materials (albedo, roughness, metallic) instead of just RGB textures, enabling dynamic relighting in various environments.
Details
Motivation: Existing 3D reconstruction models fail to produce material properties needed for realistic rendering in diverse lighting environments, while current methods only produce RGB textures with uncontrolled light baking.
Method: Fine-tune a multiview material diffusion model conditioned on depth and normal maps, then use Gaussian material representation to model PBR material channels and reconstruct point clouds that can be rendered with PBR attributes.
Result: Extensive experiments show materials produced by MGM are more visually appealing than baseline methods and enhance material modeling for practical downstream rendering applications.
Conclusion: MGM successfully generates high-quality 3D content with PBR materials, enabling dynamic relighting and advancing practical rendering applications beyond current RGB-only approaches.
Abstract: The increasing demand for 3D assets across various industries necessitates efficient and automated methods for 3D content creation. Leveraging 3D Gaussian Splatting, recent large reconstruction models (LRMs) have demonstrated the ability to efficiently achieve high-quality 3D rendering by integrating multiview diffusion for generation and scalable transformers for reconstruction. However, existing models fail to produce the material properties of assets, which is crucial for realistic rendering in diverse lighting environments. In this paper, we introduce the Large Material Gaussian Model (MGM), a novel framework designed to generate high-quality 3D content with Physically Based Rendering (PBR) materials, i.e., albedo, roughness, and metallic properties, rather than merely producing RGB textures with uncontrolled light baking. Specifically, we first fine-tune a new multiview material diffusion model conditioned on input depth and normal maps. Utilizing the generated multiview PBR images, we explore a Gaussian material representation that not only aligns with 2D Gaussian Splatting but also models each channel of the PBR materials. The reconstructed point clouds can then be rendered to acquire PBR attributes, enabling dynamic relighting by applying various ambient light maps. Extensive experiments demonstrate that the materials produced by our method not only exhibit greater visual appeal compared to baseline methods but also enhance material modeling, thereby enabling practical downstream rendering applications.
[318] Self-Supervised Point Cloud Completion based on Multi-View Augmentations of Single Partial Point Cloud
Jingjing Lu, Huilong Pi, Yunchuan Qin, Zhuo Tang, Ruihui Li
Main category: cs.CV
TL;DR: A novel self-supervised point cloud completion method using multi-view augmentations and Mamba architecture to overcome limitations of existing supervised and unsupervised approaches.
Details
Motivation: Current point cloud completion methods have limitations: supervised methods rely on ground truth and suffer from synthetic-to-real domain gap, unsupervised methods need complete point clouds, weakly-supervised methods require multi-view observations, and existing self-supervised methods produce unsatisfactory predictions due to weak self-supervised signals.
Method: Proposes self-supervised signals based on multi-view augmentations of single partial point clouds and incorporates Mamba architecture to enhance learning capability for generating higher quality point clouds.
Result: Achieves state-of-the-art performance on both synthetic and real-world datasets.
Conclusion: The proposed self-supervised method with multi-view augmentations and Mamba architecture effectively addresses limitations of existing approaches and demonstrates superior performance in point cloud completion tasks.
Abstract: Point cloud completion aims to reconstruct complete shapes from partial observations. Although current methods have achieved remarkable performance, they still have some limitations: Supervised methods heavily rely on ground truth, which limits their generalization to real-world datasets due to the synthetic-to-real domain gap. Unsupervised methods require complete point clouds to compose unpaired training data, and weakly-supervised methods need multi-view observations of the object. Existing self-supervised methods frequently produce unsatisfactory predictions due to the limited capabilities of their self-supervised signals. To overcome these challenges, we propose a novel self-supervised point cloud completion method. We design a set of novel self-supervised signals based on multi-view augmentations of the single partial point cloud. Additionally, to enhance the model’s learning ability, we are the first to incorporate Mamba into the self-supervised point cloud completion task, encouraging the model to generate point clouds with better quality. Experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art results.
[319] REFINE-CONTROL: A Semi-supervised Distillation Method For Conditional Image Generation
Yicheng Jiang, Jin Yuan, Hua Yuan, Yao Zhang, Yong Rui
Main category: cs.CV
TL;DR: Refine-Control is a semi-supervised distillation framework that reduces computational costs and latency for conditional image generation models while maintaining high-fidelity generation and controllability.
Details
Motivation: High resource demands of conditional image generation models and scarcity of well-annotated data hinder deployment on edge devices, causing high costs and privacy concerns when user data is sent to third parties.
Method: Introduces a tri-level knowledge fusion loss to transfer different levels of knowledge and a semi-supervised distillation method utilizing both labeled and unlabeled data.
Result: Achieves significant reductions in computational cost and latency while maintaining high-fidelity generation capabilities and controllability.
Conclusion: Refine-Control effectively addresses deployment challenges of conditional image generation models on edge devices through efficient knowledge distillation.
Abstract: Conditional image generation models have achieved remarkable results by leveraging text-based control to generate customized images. However, the high resource demands of these models and the scarcity of well-annotated data have hindered their deployment on edge devices, leading to enormous costs and privacy concerns, especially when user data is sent to a third party. To overcome these challenges, we propose Refine-Control, a semi-supervised distillation framework. Specifically, we improve the performance of the student model by introducing a tri-level knowledge fusion loss to transfer different levels of knowledge. To enhance generalization and alleviate dataset scarcity, we introduce a semi-supervised distillation method utilizing both labeled and unlabeled data. Our experiments reveal that Refine-Control achieves significant reductions in computational cost and latency, while maintaining high-fidelity generation capabilities and controllability, as quantified by comparative metrics.
[320] Joint graph entropy knowledge distillation for point cloud classification and robustness against corruptions
Zhiqiang Tian, Weigang Li, Junwei Hu, Chunhua Deng
Main category: cs.CV
TL;DR: JGEKD is a classification strategy for non-IID 3D point clouds that uses joint graph entropy knowledge distillation to capture class correlations and improve robustness.
Details
Motivation: Traditional classification assumes IID data, which destroys class correlations in 3D point clouds. This work addresses the need for methods that preserve and leverage these correlations.
Method: Uses joint graphs to capture class relationships, constructs loss functions based on joint graph entropy, employs Siamese structures for spatial transformation invariance, and implements both self-knowledge and teacher-knowledge distillation frameworks.
Result: Extensive experiments on ScanObject, ModelNet40, ScanNetV2_cls and ModelNet-C demonstrate competitive performance.
Conclusion: JGEKD effectively handles non-IID 3D point cloud data by transferring class correlation knowledge through joint graph entropy distillation, achieving robust and competitive classification results.
Abstract: Classification tasks in 3D point clouds often assume that class events are independent and identically distributed (IID), although this assumption destroys the correlation between classes. This study proposes a classification strategy, Joint Graph Entropy Knowledge Distillation (JGEKD), suitable for non-independent and identically distributed 3D point cloud data, which achieves knowledge transfer of class correlations through knowledge distillation by constructing a loss function based on joint graph entropy. First, we employ joint graphs to capture the hidden relationships between classes and implement knowledge distillation to train our model by calculating the entropy of the graph. Subsequently, to handle 3D point clouds invariant to spatial transformations, we construct Siamese structures and develop two frameworks, self-knowledge distillation and teacher-knowledge distillation, to facilitate information transfer between different transformation forms of the same data. In addition, we use the above framework to achieve knowledge transfer between point clouds and their corrupted forms, and increase the model’s robustness against corruption. Extensive experiments on ScanObject, ModelNet40, ScanNetV2_cls and ModelNet-C demonstrate that the proposed strategy can achieve competitive results.
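One way to make the joint-graph-entropy idea concrete: treat the outer product of teacher and student class probabilities as a soft class co-occurrence graph and take its Shannon entropy. This is an illustrative reading of the loss, not the paper's exact formulation.

```python
import torch

def joint_graph_entropy(p_teacher, p_student, eps=1e-8):
    """p_teacher, p_student: (B, C) softmax outputs over C classes."""
    joint = p_teacher.T @ p_student  # (C, C) soft class co-occurrence graph
    joint = joint / joint.sum()      # normalize to a joint distribution
    return -(joint * (joint + eps).log()).sum()  # Shannon entropy of the graph
```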
[321] MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models
Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser
Main category: cs.CV
TL;DR: MultiMat is a multimodal program synthesis framework that uses large multimodal models to generate procedural material graphs by processing both visual and textual representations, outperforming text-only approaches.
Details
Motivation: Current neural program synthesis methods only use textual representations, failing to capture the visual-spatial nature of material node graphs that makes them intuitive for humans. There's a need to bridge this gap between visual understanding and program synthesis.
Method: Train large multimodal models on a dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures syntactic validity while efficiently navigating the program space.
Result: MultiMat is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
Conclusion: Multimodal program synthesis that leverages both visual and textual representations significantly improves the generation of procedural material graphs compared to text-only approaches.
Abstract: Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structures and intermediate states provide an intuitive understanding and workflow for interactive appearance modeling. Creating such graphs is a challenging task and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures syntactic validity while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
[322] DragGANSpace: Latent Space Exploration and Control for GANs
Kirsten Odendaal, Neela Kaushik, Spencer Halverson
Main category: cs.CV
TL;DR: Integrates StyleGAN, DragGAN and PCA to enhance latent space efficiency and controllability for GAN-generated images, showing improved optimization efficiency while maintaining visual quality.
Details
Motivation: To improve the efficiency and controllability of GAN latent spaces for more streamlined and interpretable image manipulation and cross-model alignment.
Method: Combines StyleGAN’s structured latent space with DragGAN’s intuitive manipulation and PCA’s dimensionality reduction, applied to AFHQ dataset with focus on W+ latent layers.
Result: PCA integration reduces optimization time while maintaining visual quality and boosting SSIM, particularly in shallower latent spaces (W+ layers = 3). Enables cross-domain alignment between AFHQ-Dog and AFHQ-Cat models.
Conclusion: Demonstrates efficient and interpretable latent space control for image synthesis and editing applications through the integration of PCA with GAN frameworks.
Abstract: This work integrates StyleGAN, DragGAN and Principal Component Analysis (PCA) to enhance the latent space efficiency and controllability of GAN-generated images. Style-GAN provides a structured latent space, DragGAN enables intuitive image manipulation, and PCA reduces dimensionality and facilitates cross-model alignment for more streamlined and interpretable exploration of latent spaces. We apply our techniques to the Animal Faces High Quality (AFHQ) dataset, and find that our approach of integrating PCA-based dimensionality reduction with the Drag-GAN framework for image manipulation retains performance while improving optimization efficiency. Notably, introducing PCA into the latent W+ layers of DragGAN can consistently reduce the total optimization time while maintaining good visual quality and even boosting the Structural Similarity Index Measure (SSIM) of the optimized image, particularly in shallower latent spaces (W+ layers = 3). We also demonstrate capability for aligning images generated by two StyleGAN models trained on similar but distinct data domains (AFHQ-Dog and AFHQ-Cat), and show that we can control the latent space of these aligned images to manipulate the images in an intuitive and interpretable manner. Our findings highlight the possibility for efficient and interpretable latent space control for a wide range of image synthesis and editing applications.
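The PCA integration can be sketched in a few lines: fit PCA on sampled W+ latents, edit in the reduced coefficient space, then map back through the inverse transform. The random latents below are placeholders for samples from StyleGAN's mapping network; three W+ layers of width 512 are flattened per sample.

```python
import numpy as np
from sklearn.decomposition import PCA

w_plus = np.random.randn(10_000, 3 * 512)   # placeholder W+ samples (3 layers x 512)
pca = PCA(n_components=64).fit(w_plus)      # low-dimensional latent basis

coeffs = pca.transform(w_plus[:1])          # (1, 64) reduced code for one latent
coeffs[0, 0] += 2.0                         # move along the top principal direction
w_edit = pca.inverse_transform(coeffs).reshape(3, 512)  # back to W+ layers
```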
[323] MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He
Main category: cs.CV
TL;DR: MinerU2.5 is a 1.2B-parameter document parsing model that uses a coarse-to-fine two-stage approach for efficient and accurate document analysis, achieving state-of-the-art performance with low computational overhead.
Details
Motivation: To develop an efficient document parsing model that can handle high-resolution inputs without computational overhead while maintaining accuracy for complex document elements like dense text, formulas, and tables.
Method: Two-stage parsing strategy: first stage performs layout analysis on downsampled images, second stage performs targeted content recognition on native-resolution crops guided by the layout. Uses comprehensive data engine for training data generation.
Result: Achieves state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks with significantly lower computational overhead.
Conclusion: MinerU2.5 demonstrates strong document parsing ability through its efficient two-stage approach, balancing computational efficiency with high recognition accuracy for complex document elements.
Abstract: We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
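The decoupled strategy amounts to a layout pass on a cheap downsampled page followed by per-element recognition on native-resolution crops. A minimal sketch of that control flow, with `detect_layout` and `recognize` as hypothetical model calls:

```python
from PIL import Image

def parse_page(page_path, detect_layout, recognize, coarse_longest=1024):
    page = Image.open(page_path)
    scale = coarse_longest / max(page.size)
    # Stage 1: global layout analysis on a downsampled page.
    coarse = page.resize((int(page.width * scale), int(page.height * scale)))
    elements = detect_layout(coarse)  # [(kind, (x0, y0, x1, y1)), ...] in coarse coords
    # Stage 2: targeted recognition on native-resolution crops.
    results = []
    for kind, (x0, y0, x1, y1) in elements:
        crop = page.crop((x0 / scale, y0 / scale, x1 / scale, y1 / scale))
        results.append((kind, recognize(crop, kind)))  # text, formula, or table
    return results
```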
[324] Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models
Jiaqi Liu, Lang Sun, Ronghao Fu, Bo Yang
Main category: cs.CV
TL;DR: Geo-CoT framework enables verifiable multi-step reasoning in remote sensing analysis through supervised fine-tuning and policy optimization, outperforming state-of-the-art models.
Details
Motivation: Current Vision-Language Models in remote sensing fail at complex analytical tasks due to end-to-end training that bypasses reasoning steps and produces unverifiable outputs.
Method: Two-stage alignment strategy: supervised fine-tuning to instill cognitive architecture, then Group Reward Policy Optimization to refine reasoning policy. Uses Geo-CoT380k dataset of structured rationales.
Result: RSThinker model significantly outperforms state-of-the-art models across comprehensive tasks, providing both final answers and verifiable analytical traces.
Conclusion: The framework enables transition from opaque perception to structured, verifiable reasoning in Earth Observation, with public release of dataset and model.
Abstract: Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks, a limitation stemming from their end-to-end training paradigm that bypasses crucial reasoning steps and leads to unverifiable outputs. To address this limitation, we introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process. We instill this analytical process through a two-stage alignment strategy, leveraging Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. This strategy first employs supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then leverages Group Reward Policy Optimization (GRPO) to refine the model’s reasoning policy towards factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace. This capability yields dominant performance, significantly outperforming state-of-the-art models across a comprehensive range of tasks. The public release of our Geo-CoT380k dataset and RSThinker model upon publication serves as a concrete pathway from opaque perception towards structured, verifiable reasoning for Earth Observation.
[325] Polysemous Language Gaussian Splatting via Matching-based Mask Lifting
Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, Ge Li
Main category: cs.CV
TL;DR: MUSplat is a training-free framework that lifts 2D open-vocabulary understanding into 3D Gaussian Splatting scenes without per-scene retraining, addressing monosemous limitations and cross-view inconsistencies.
Details
Motivation: Mainstream methods suffer from three key flaws: reliance on costly per-scene retraining, restrictive monosemous design that fails to represent multi-concept semantics, and vulnerability to cross-view semantic inconsistencies.
Method: Leverages pre-trained 2D segmentation model to generate multi-granularity 2D masks lifted into 3D, estimates foreground probability for Gaussian points, optimizes boundaries using semantic entropy and geometric opacity, and uses Vision-Language Model to distill robust textual features from representative viewpoints.
Result: Reduces scene adaptation time from hours to minutes, outperforms established training-based frameworks on open-vocabulary 3D object selection and semantic segmentation tasks.
Conclusion: MUSplat successfully eliminates costly per-scene training while addressing monosemous limitations and achieving superior performance in open-vocabulary 3D understanding.
Abstract: Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. However, mainstream methods suffer from three key flaws: (i) their reliance on costly per-scene retraining prevents plug-and-play application; (ii) their restrictive monosemous design fails to represent complex, multi-concept semantics; and (iii) their vulnerability to cross-view semantic inconsistencies corrupts the final semantic representation. To overcome these limitations, we introduce MUSplat, a training-free framework that abandons feature optimization entirely. Leveraging a pre-trained 2D segmentation model, our pipeline generates and lifts multi-granularity 2D masks into 3D, where we estimate a foreground probability for each Gaussian point to form initial object groups. We then optimize the ambiguous boundaries of these initial groups using semantic entropy and geometric opacity. Subsequently, by interpreting the object’s appearance across its most representative viewpoints, a Vision-Language Model (VLM) distills robust textual features that reconcile visual inconsistencies, enabling open-vocabulary querying via semantic matching. By eliminating the costly per-scene training process, MUSplat reduces scene adaptation time from hours to mere minutes. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, MUSplat outperforms established training-based frameworks while simultaneously addressing their monosemous limitations.
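The per-Gaussian foreground probability can be approximated by a voting scheme: project each Gaussian center into every view and count how often it lands inside the lifted 2D mask. A sketch under that assumption, with `project` as a hypothetical camera-projection helper:

```python
import numpy as np

def foreground_probability(points, masks, cameras, project):
    """points: (N, 3) Gaussian centers; masks: list of (H, W) binary arrays;
    cameras: per-view camera parameters matching `masks`."""
    votes = np.zeros(len(points))
    for mask, cam in zip(masks, cameras):
        uv = project(points, cam).astype(int)  # (N, 2) pixel coordinates
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < mask.shape[1]) \
           & (uv[:, 1] >= 0) & (uv[:, 1] < mask.shape[0])
        votes[ok] += mask[uv[ok, 1], uv[ok, 0]]  # count in-mask hits
    return votes / len(masks)
```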
[326] UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective
Jun He, Yi Lin, Zilong Huang, Jiacong Yin, Junyan Ye, Yuchuan Zhou, Weijia Li, Xiang Zhang
Main category: cs.CV
TL;DR: UrbanFeel is a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) on urban development understanding and subjective environmental perception, featuring 14.3K visual questions across three dimensions: Static Scene Perception, Temporal Change Understanding, and Subjective Environmental Perception.
Details
Motivation: Existing benchmarks for MLLMs in urban environments are limited, lacking systematic exploration of temporal evolution and subjective perception that aligns with human perception, despite urban development impacting over half of the global population.
Method: Collected multi-temporal single-view and panoramic street-view images from 11 representative cities worldwide, and generated high-quality question-answer pairs through a hybrid pipeline of spatial clustering, rule-based generation, model-assisted prompting, and manual annotation.
Result: Gemini-2.5 Pro achieved the best overall performance, approaching human expert levels with only 1.5% average gap. Most models performed well on scene understanding tasks, with some surpassing human annotators in pixel-level change detection. However, performance dropped notably in temporal reasoning tasks, while several models reached human-level consistency in subjective perception dimensions like beauty and safety.
Conclusion: UrbanFeel provides a comprehensive benchmark for evaluating MLLMs in urban environments, revealing that while models excel at scene understanding and some subjective perception tasks, they still struggle with temporal reasoning over urban development, highlighting areas for future improvement.
Abstract: Urban development impacts over half of the global population, making human-centered understanding of its structural and perceptual changes essential for sustainable development. While Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various domains, existing benchmarks that explore their performance in urban environments remain limited, lacking systematic exploration of temporal evolution and subjective perception of the urban environment that aligns with human perception. To address these limitations, we propose UrbanFeel, a comprehensive benchmark designed to evaluate the performance of MLLMs in urban development understanding and subjective environmental perception. UrbanFeel comprises 14.3K carefully constructed visual questions spanning three cognitively progressive dimensions: Static Scene Perception, Temporal Change Understanding, and Subjective Environmental Perception. We collect multi-temporal single-view and panoramic street-view images from 11 representative cities worldwide, and generate high-quality question-answer pairs through a hybrid pipeline of spatial clustering, rule-based generation, model-assisted prompting, and manual annotation. Through extensive evaluation of 20 state-of-the-art MLLMs, we observe that Gemini-2.5 Pro achieves the best overall performance, with its accuracy approaching human expert levels and narrowing the average gap to just 1.5%. Most models perform well on tasks grounded in scene understanding. In particular, some models even surpass human annotators in pixel-level change detection. However, performance drops notably in tasks requiring temporal reasoning over urban development. Additionally, in the subjective perception dimension, several models reach human-level or even higher consistency in evaluating dimensions such as beauty and safety.
[327] A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation
Jiaping Yu, Muli Yang, Jiapeng Ji, Jiexi Yan, Cheng Deng
Main category: cs.CV
TL;DR: EXCL proposes a dual experts framework with retrieval-augmentation-interaction pipeline for source-free unsupervised domain adaptation, achieving state-of-the-art performance without accessing source data.
Details
Motivation: Address privacy and cost concerns in domain adaptation by adapting source-trained models to target domains without accessing source data, overcoming limitations of existing methods that neglect complementary insights and target data structure.
Method: Dual Experts framework with frozen source-domain model (with Conv-Adapter) and pretrained vision-language model (with trainable text prompt), plus Retrieval-Augmented-Interaction pipeline that retrieves pseudo-source/complex samples, fine-tunes experts separately, and enforces learning consistency.
Result: Extensive experiments on four benchmark datasets demonstrate that the approach matches state-of-the-art performance in source-free unsupervised domain adaptation.
Conclusion: EXCL effectively addresses SFUDA challenges by leveraging dual experts and retrieval-augmentation interaction, providing a robust solution for domain adaptation without source data access.
Abstract: Source-Free Unsupervised Domain Adaptation (SFUDA) addresses the realistic challenge of adapting a source-trained model to a target domain without access to the source data, driven by concerns over privacy and cost. Existing SFUDA methods either exploit only the source model’s predictions or fine-tune large multimodal models, yet both neglect complementary insights and the latent structure of target data. In this paper, we propose the Experts Cooperative Learning (EXCL). EXCL contains the Dual Experts framework and Retrieval-Augmentation-Interaction optimization pipeline. The Dual Experts framework places a frozen source-domain model (augmented with Conv-Adapter) and a pretrained vision-language model (with a trainable text prompt) on equal footing to mine consensus knowledge from unlabeled target samples. To effectively train these plug-in modules under purely unsupervised conditions, we introduce Retrieval-Augmented-Interaction(RAIN), a three-stage pipeline that (1) collaboratively retrieves pseudo-source and complex target samples, (2) separately fine-tunes each expert on its respective sample set, and (3) enforces learning object consistency via a shared learning result. Extensive experiments on four benchmark datasets demonstrate that our approach matches state-of-the-art performance.
[328] FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing
Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong, Yulun Zhang, Xiaokang Yang
Main category: cs.CV
TL;DR: FlashEdit is a real-time image editing framework that achieves 150x speedup over previous methods while maintaining high fidelity through one-step inversion, background preservation, and sparsified attention mechanisms.
Details
Motivation: Existing text-guided image editing with diffusion models achieves high quality but suffers from prohibitive latency that hinders real-world applications, creating a need for efficient real-time editing solutions.
Method: Three key innovations: (1) One-Step Inversion-and-Editing (OSIE) pipeline that bypasses iterative processes; (2) Background Shield (BG-Shield) for selective feature modification only in edit regions; (3) Sparsified Spatial Cross-Attention (SSCA) to suppress semantic leakage and ensure precise localized edits.
Result: FlashEdit performs edits in under 0.2 seconds (150x speedup over prior methods) while maintaining superior background consistency and structural integrity.
Conclusion: FlashEdit enables high-fidelity, real-time image editing through its efficient framework design, making text-guided image editing practical for real-world applications.
Abstract: Text-guided image editing with diffusion models has achieved remarkable quality but suffers from prohibitive latency, hindering real-world applications. We introduce FlashEdit, a novel framework designed to enable high-fidelity, real-time image editing. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that guarantees background preservation by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. Extensive experiments demonstrate that FlashEdit maintains superior background consistency and structural integrity, while performing edits in under 0.2 seconds, which is an over 150× speedup compared to prior multi-step methods. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.
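The BG-Shield idea is essentially a masked feature blend: edited features are used only inside the edit region, original features everywhere else. The blend rule below is an illustrative assumption about how such a shield can be applied, not the paper's exact mechanism.

```python
import torch

def background_shield(feat_orig, feat_edit, edit_mask):
    """feat_orig, feat_edit: (B, C, H, W) features; edit_mask: (B, 1, H, W)
    in [0, 1], with 1 marking the edit region."""
    # Keep original features outside the edit region, edited ones inside.
    return edit_mask * feat_edit + (1.0 - edit_mask) * feat_orig
```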
[329] Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Lijun Wang, Yuanyuan Peng, Huan Gao, Mingkun Xu, Shangyang Li
Main category: cs.CV
TL;DR: Neural-MedBench is a compact but reasoning-intensive benchmark for evaluating multimodal clinical reasoning in neurology, revealing significant performance drops in state-of-the-art VLMs compared to conventional datasets.
Details
Motivation: Existing medical benchmarks focus on classification accuracy, creating an evaluation illusion where models appear proficient but fail at high-stakes diagnostic reasoning. The true clinical reasoning ability of VLMs remains unclear.
Method: Developed Neural-MedBench integrating multi-sequence MRI scans, EHRs, and clinical notes with three task families: differential diagnosis, lesion recognition, and rationale generation. Used hybrid scoring combining LLM-based graders, clinician validation, and semantic similarity metrics.
Result: Evaluation of GPT-4o, Claude-4, and MedGemma showed sharp performance drops compared to conventional datasets. Error analysis revealed reasoning failures dominate model shortcomings rather than perceptual errors.
Conclusion: Proposes a Two-Axis Evaluation Framework: breadth-oriented datasets for statistical generalization and depth-oriented benchmarks like Neural-MedBench for reasoning fidelity. Released as open diagnostic testbed for rigorous assessment of clinically trustworthy AI.
Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
[330] UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data
Yujian Yuan, Changjie Wu, Xinyuan Chang, Sijin Wang, Hang Zhang, Shiyi Liang, Shuang Zeng, Mu Xu
Main category: cs.CV
TL;DR: UniMapGen is a generative framework for large-scale map construction that represents lane lines as discrete sequences and uses multi-modal inputs to overcome satellite data limitations, achieving state-of-the-art performance.
Details
Motivation: Traditional map construction methods are costly and inefficient, while existing satellite-based methods suffer from data limitations (occlusions, outdatedness) and produce rough, discontinuous roads that require extensive post-processing.
Method: Represents lane lines as discrete sequences with iterative generation, supports multi-modal inputs (BEV, PV, text prompts), and uses a state update strategy for global continuity and consistency.
Result: Achieves state-of-the-art performance on OpenSatMap dataset, can infer occluded roads and predict missing roads from dataset annotations.
Conclusion: UniMapGen provides an efficient generative framework for large-scale map construction that overcomes limitations of traditional and satellite-based methods through discrete sequence representation and multi-modal input support.
Abstract: Large-scale map construction is foundational for critical applications such as autonomous driving and navigation systems. Traditional large-scale map construction approaches mainly rely on costly and inefficient special data collection vehicles and labor-intensive annotation processes. While existing satellite-based methods have demonstrated promising potential in enhancing the efficiency and coverage of map construction, they exhibit two major limitations: (1) inherent drawbacks of satellite data (e.g., occlusions, outdatedness) and (2) inefficient vectorization from perception-based methods, resulting in discontinuous and rough roads that require extensive post-processing. This paper presents a novel generative framework, UniMapGen, for large-scale map construction, offering three key innovations: (1) representing lane lines as \textbf{discrete sequence} and establishing an iterative strategy to generate more complete and smooth map vectors than traditional perception-based methods. (2) proposing a flexible architecture that supports \textbf{multi-modal} inputs, enabling dynamic selection among BEV, PV, and text prompt, to overcome the drawbacks of satellite data. (3) developing a \textbf{state update} strategy for global continuity and consistency of the constructed large-scale map. UniMapGen achieves state-of-the-art performance on the OpenSatMap dataset. Furthermore, UniMapGen can infer occluded roads and predict roads missing from dataset annotations. Our code will be released.
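To make the "discrete sequence" representation concrete, here is a minimal sketch of how a lane polyline might be quantized into tokens that an autoregressive generator can emit one by one. The vocabulary layout, grid resolution, and control tokens are our assumptions for illustration, not UniMapGen's actual scheme.

```python
import numpy as np

# Hypothetical vocabulary: coordinates quantized to a grid, plus control tokens.
BOS, EOS, PAD = 0, 1, 2
COORD_OFFSET = 3          # first token id used for quantized coordinates
GRID = 256                # quantization resolution per axis (assumed)

def polyline_to_tokens(points, extent):
    """Quantize an (N, 2) lane polyline in metres into a discrete token sequence."""
    pts = np.asarray(points, dtype=np.float64)
    # Normalize to [0, 1) within the map extent, then snap to the grid.
    q = np.clip((pts / extent * GRID).astype(int), 0, GRID - 1)
    tokens = [BOS]
    for x, y in q:
        tokens.append(COORD_OFFSET + x)          # x token
        tokens.append(COORD_OFFSET + GRID + y)   # y token
    tokens.append(EOS)
    return tokens

def tokens_to_polyline(tokens, extent):
    """Invert the quantization (up to grid resolution)."""
    coords = []
    body = tokens[1:-1]                          # strip BOS / EOS
    for i in range(0, len(body), 2):
        x = body[i] - COORD_OFFSET
        y = body[i + 1] - COORD_OFFSET - GRID
        coords.append(((x + 0.5) / GRID * extent, (y + 0.5) / GRID * extent))
    return coords

lane = [(1.0, 2.0), (5.5, 2.1), (10.2, 2.4)]
toks = polyline_to_tokens(lane, extent=100.0)
print(toks, tokens_to_polyline(toks, extent=100.0))
```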
[331] GS-2M: Gaussian Splatting for Joint Mesh Reconstruction and Material Decomposition
Dinh Minh Nguyen, Malte Avenhaus, Thomas Lindemeier
Main category: cs.CV
TL;DR: GS-2M is a unified framework for mesh reconstruction and material decomposition from multi-view images using 3D Gaussian Splatting, achieving high-quality results without relying on external priors or complex neural components.
Details
Motivation: Previous methods handle mesh reconstruction and material decomposition separately, struggle with reflective surfaces, and often rely on external priors. Existing joint approaches use sophisticated neural components that limit scalability.
Method: Joint optimization of attributes for rendered depth and normals quality, novel roughness supervision based on multi-view photometric variation, and carefully designed loss and optimization process using 3D Gaussian Splatting.
Result: Produces reconstruction results comparable to state-of-the-art methods, delivering triangle meshes and associated material components. Validated on widely used datasets with qualitative comparisons showing effectiveness.
Conclusion: GS-2M provides a unified solution that maintains geometric details, handles reflective surfaces well, and eliminates the need for complex neural components while achieving state-of-the-art performance.
Abstract: We propose a unified solution for mesh reconstruction and material decomposition from multi-view images based on 3D Gaussian Splatting, referred to as GS-2M. Previous works handle these tasks separately and struggle to reconstruct highly reflective surfaces, often relying on priors from external models to enhance the decomposition results. Conversely, our method addresses these two problems by jointly optimizing attributes relevant to the quality of rendered depth and normals, maintaining geometric details while being resilient to reflective surfaces. Although contemporary works effectively solve these tasks together, they often employ sophisticated neural components to learn scene properties, which hinders their performance at scale. To further eliminate these neural components, we propose a novel roughness supervision strategy based on multi-view photometric variation. When combined with a carefully designed loss and optimization process, our unified framework produces reconstruction results comparable to state-of-the-art methods, delivering triangle meshes and their associated material components for downstream tasks. We validate the effectiveness of our approach with widely used datasets from previous works and qualitative comparisons with state-of-the-art surface reconstruction methods.
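The roughness supervision idea lends itself to a toy illustration: if the same surface point is observed from many views, strong view-dependent color variation suggests a glossy (low-roughness) material. The sketch below is a simplified proxy under that assumption; the paper's actual supervision signal is derived differently inside the splatting pipeline.

```python
import numpy as np

def roughness_proxy(view_colors):
    """
    view_colors: (V, 3) RGB observations of one surface point across V views.
    Returns a scalar in [0, 1]. Low multi-view variation (diffuse-looking)
    maps to high roughness; strong view dependence maps to low roughness.
    The 0.5 normalization constant is an arbitrary choice for this toy.
    """
    c = np.asarray(view_colors, dtype=np.float64)
    variation = c.std(axis=0).mean()           # mean per-channel std across views
    return float(np.clip(1.0 - variation / 0.5, 0.0, 1.0))

diffuse = np.tile([[0.4, 0.3, 0.2]], (8, 1)) + 0.01 * np.random.randn(8, 3)
glossy  = np.tile([[0.4, 0.3, 0.2]], (8, 1)) + 0.30 * np.random.randn(8, 3)
print(roughness_proxy(diffuse), roughness_proxy(glossy))  # high vs. low roughness
```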
[332] MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang
Main category: cs.CV
TL;DR: The paper introduces MesaTask, an LLM-based framework for generating task-oriented tabletop scenes using a Spatial Reasoning Chain and DPO algorithms, achieving superior performance in creating realistic, task-conforming layouts.
Details
Motivation: Traditional methods for creating tabletop scenes for robot training rely on manual design or randomized layouts, which are limited in plausibility and task alignment. There's a need for automated generation of realistic tabletop scenes that match specific task instructions.
Method: Proposes a Spatial Reasoning Chain that decomposes scene generation into object inference, spatial interrelation reasoning, and scene graph construction. Uses an LLM-based framework enhanced with DPO algorithms to generate physically plausible 3D layouts.
Result: MesaTask-10K dataset with 10,700 synthetic tabletop scenes with manually crafted realistic layouts. MesaTask framework demonstrates superior performance compared to baselines in generating task-conforming scenes with realistic layouts.
Conclusion: The proposed MesaTask framework effectively bridges the gap between high-level task instructions and tabletop scene generation, producing physically plausible and task-aligned scenes that outperform existing methods.
Abstract: The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/
[333] Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models
Michael Jungo, Andreas Fischer
Main category: cs.CV
TL;DR: Rule-based reinforcement learning improves generalization for document image classification across out-of-distribution data, unseen classes, and different modalities.
Details
Motivation: To explore the benefits of rule-based reinforcement learning in document analysis tasks, particularly for enhancing reasoning capabilities and generalization to out-of-distribution data.
Method: Applied rule-based reinforcement learning to the Document Image Classification task, testing on three out-of-distribution scenarios: different images, unseen classes, and different modalities.
Result: Reinforcement learning demonstrated better generalization capabilities compared to traditional methods across all three out-of-distribution scenarios.
Conclusion: Rule-based reinforcement learning is effective for document analysis tasks and shows superior generalization to out-of-distribution data.
Abstract: Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from its emerging properties, particularly the enhanced reasoning capabilities. We study the effects of rule-based reinforcement learning on the task of Document Image Classification, which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distribution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at https://github.com/jungomi/vision-finetune.
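The "simple verifiable rewards" that make this style of RL attractive are easy to picture for classification. Below is a minimal sketch of such a reward function; the `<answer>` tag convention and the penalty values are assumptions, not the paper's exact recipe.

```python
import re

def rule_based_reward(response: str, gold_label: str) -> float:
    """
    Verifiable reward in the style popularized by DeepSeek-R1: the model must
    wrap its final class in <answer>...</answer>, and we score format and
    correctness with simple rules instead of a learned reward model.
    """
    m = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if m is None:
        return -1.0                                  # format violation
    predicted = m.group(1).strip().lower()
    return 1.0 if predicted == gold_label.lower() else 0.0

print(rule_based_reward("Reasoning... <answer>invoice</answer>", "invoice"))  # 1.0
print(rule_based_reward("it is a letter", "letter"))                          # -1.0
```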
[334] Jailbreaking on Text-to-Video Models via Scene Splitting Strategy
Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim
Main category: cs.CV
TL;DR: SceneSplit is a black-box jailbreak method for Text-to-Video models that fragments harmful narratives into multiple benign scenes, manipulating the generative output space to bypass safety filters and generate harmful content.
Details
Motivation: Text-to-Video models have significant safety risks that remain largely unexplored compared to other AI models like LLMs and T2I models, creating a critical safety gap that needs to be addressed.
Method: SceneSplit fragments harmful narratives into multiple individually benign scenes, uses sequential scene combination to constrain the output space to unsafe regions, employs iterative scene manipulation to bypass safety filters, and utilizes a strategy library to reuse successful attack patterns.
Result: SceneSplit achieves high Attack Success Rates: 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2 across 11 safety categories, significantly outperforming existing baselines.
Conclusion: Current T2V safety mechanisms are vulnerable to narrative structure exploitation attacks, providing new insights for understanding and improving T2V model safety.
Abstract: Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack’s overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the existing baseline. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.
[335] HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models
Seyedmorteza Sadat, Farnood Salehi, Romann M. Weber
Main category: cs.CV
TL;DR: HiGS is a momentum-based sampling technique that improves diffusion model outputs by integrating recent model predictions into each inference step, achieving state-of-the-art results with fewer sampling steps.
Details
Motivation: Diffusion models still produce unrealistic outputs lacking fine details when using fewer neural function evaluations or lower guidance scales.
Method: Leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process, requiring no additional computation, training, or fine-tuning.
Result: Achieves SOTA FID of 1.61 for unguided ImageNet generation at 256×256 with only 30 steps (vs standard 250), consistently improves image quality across diverse models and sampling budgets.
Conclusion: HiGS is a plug-and-play enhancement that enables faster generation with higher fidelity in diffusion sampling.
Abstract: While diffusion models have made remarkable progress in image generation, their outputs can still appear unrealistic and lack fine details, especially when using a small number of neural function evaluations (NFEs) or lower guidance scales. To address this issue, we propose a novel momentum-based sampling technique, termed history-guided sampling (HiGS), which enhances quality and efficiency of diffusion sampling by integrating recent model predictions into each inference step. Specifically, HiGS leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process toward more realistic outputs with better details and structure. Our approach introduces practically no additional computation and integrates seamlessly into existing diffusion frameworks, requiring neither extra training nor fine-tuning. Extensive experiments show that HiGS consistently improves image quality across diverse models and architectures and under varying sampling budgets and guidance scales. Moreover, using a pretrained SiT model, HiGS achieves a new state-of-the-art FID of 1.61 for unguided ImageNet generation at 256$\times$256 with only 30 sampling steps (instead of the standard 250). We thus present HiGS as a plug-and-play enhancement to standard diffusion sampling that enables faster generation with higher fidelity.
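The core update is stated directly in the abstract: steer the current prediction using its difference from a weighted average of past predictions. A minimal sketch follows; the `strength` coefficient and the exponential history weighting are placeholders for whatever schedule the paper actually uses.

```python
import torch

class HistoryGuidedSampler:
    """
    Toy sketch of history-guided sampling: nudge the current model prediction
    along its difference from a running weighted average of past predictions.
    """
    def __init__(self, strength=0.5, decay=0.8):
        self.strength, self.decay = strength, decay
        self.history = None   # running weighted average of past predictions

    def step(self, pred: torch.Tensor) -> torch.Tensor:
        if self.history is None:
            self.history = pred.clone()
            return pred
        guided = pred + self.strength * (pred - self.history)
        # Update the running average after using it for guidance.
        self.history = self.decay * self.history + (1 - self.decay) * pred
        return guided

sampler = HistoryGuidedSampler()
for t in range(4):
    x0_pred = torch.randn(1, 3, 8, 8)   # stand-in for the denoiser's x0 estimate
    x0_pred = sampler.step(x0_pred)     # use the guided estimate downstream
```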
[336] Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation
Jinpeng Lu, Linghan Cai, Yinda Chen, Guo Tang, Songhan Jiang, Haoyuan Shi, Zhiwei Xiong
Main category: cs.CV
TL;DR: VeloxSeg is a lightweight 3D medical image segmentation method that addresses the efficiency/robustness conflict through a dual-stream CNN-Transformer architecture with Paired Window Attention and Johnson-Lindenstrauss lemma-guided convolution, achieving significant performance improvements and computational efficiency.
Details
Motivation: To overcome the fundamental 'efficiency/robustness conflict' in lightweight 3D medical image segmentation, particularly for complex anatomical structures and heterogeneous modalities, by redesigning the framework based on high-dimensional 3D image characteristics.
Method: Uses a dual-stream CNN-Transformer architecture with Paired Window Attention (PWA) for multi-scale information retrieval and Johnson-Lindenstrauss lemma-guided convolution (JLC) for robust local feature extraction. Incorporates modal interaction and Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices to inject texture priors from self-supervised networks.
Result: Achieves 26% Dice improvement on multimodal benchmarks while increasing GPU throughput by 11x and CPU throughput by 48x compared to baselines.
Conclusion: VeloxSeg successfully addresses the efficiency/robustness trade-off in lightweight 3D medical image segmentation, providing significant performance gains and computational efficiency without extra inference cost.
Abstract: Lightweight 3D medical image segmentation remains constrained by a fundamental “efficiency / robustness conflict”, particularly when processing complex anatomical structures and heterogeneous modalities. In this paper, we study how to redesign the framework based on the characteristics of high-dimensional 3D images, and explore data synergy to overcome the fragile representation of lightweight methods. Our approach, VeloxSeg, begins with a deployable and extensible dual-stream CNN-Transformer architecture composed of Paired Window Attention (PWA) and Johnson-Lindenstrauss lemma-guided convolution (JLC). For each 3D image, we invoke a “glance-and-focus” principle, where PWA rapidly retrieves multi-scale information, and JLC ensures robust local feature extraction with minimal parameters, significantly enhancing the model’s ability to operate with low computational budget. We then extend the dual-stream architecture to incorporate modal interaction into the multi-scale image-retrieval process, enabling VeloxSeg to efficiently model heterogeneous modalities. Finally, Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices injects the texture prior extracted by a self-supervised network into the segmentation network, yielding stronger representations than baselines at no extra inference cost. Experimental results on multimodal benchmarks show that VeloxSeg achieves a 26% Dice improvement, alongside increasing GPU throughput by 11x and CPU throughput by 48x. Codes are available at https://github.com/JinPLu/VeloxSeg.
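The Johnson-Lindenstrauss lemma says random projections into a lower dimension approximately preserve pairwise distances, which motivates parameter-free channel reduction ahead of cheap depthwise convolutions. The block below is our speculative reading of a JLC-style layer, not the paper's implementation.

```python
import torch
import torch.nn as nn

class JLProjConv3d(nn.Module):
    """
    Sketch of a JL-lemma-guided block: a fixed random Gaussian 1x1x1 projection
    reduces channels while approximately preserving pairwise distances, and a
    cheap depthwise conv then extracts local features.
    """
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        proj = torch.randn(out_ch, in_ch, 1, 1, 1) / (out_ch ** 0.5)
        self.register_buffer("proj", proj)          # fixed, parameter-free
        self.dw = nn.Conv3d(out_ch, out_ch, kernel, padding=kernel // 2,
                            groups=out_ch, bias=False)

    def forward(self, x):
        x = nn.functional.conv3d(x, self.proj)      # random channel projection
        return self.dw(x)                           # depthwise local filtering

x = torch.randn(1, 64, 16, 16, 16)
print(JLProjConv3d(64, 16)(x).shape)   # torch.Size([1, 16, 16, 16, 16])
```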
[337] NIFTY: a Non-Local Image Flow Matching for Texture Synthesis
Pierrick Chatillon, Julien Rabin, David Tschumperlé
Main category: cs.CV
TL;DR: NIFTY is a hybrid framework combining diffusion models with patch-based texture synthesis, using non-parametric flow-matching to avoid neural network training while improving patch-based methods.
Details
Motivation: To address limitations of both neural network-based and classical patch-based texture synthesis methods, particularly poor initialization and visual artifacts in patch-based approaches.
Method: Non-parametric flow-matching model built on non-local patch matching, combining diffusion model insights with classical texture optimization techniques without requiring neural network training.
Result: Experimental results show NIFTY outperforms representative methods from the literature in exemplar-based texture synthesis.
Conclusion: NIFTY provides an effective hybrid approach that leverages strengths of both diffusion models and patch-based methods while avoiding their respective limitations.
Abstract: This paper addresses the problem of exemplar-based texture synthesis. We introduce NIFTY, a hybrid framework that combines recent insights on diffusion models trained with convolutional neural networks, and classical patch-based texture optimization techniques. NIFTY is a non-parametric flow-matching model built on non-local patch matching, which avoids the need for neural network training while alleviating common shortcomings of patch-based methods, such as poor initialization or visual artifacts. Experimental results demonstrate the effectiveness of the proposed approach compared to representative methods from the literature. Code is available at https://github.com/PierrickCh/Nifty.git
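A non-parametric flow built on non-local patch matching needs one basic primitive: pull each patch of the current estimate toward its nearest neighbour among exemplar patches. The brute-force sketch below shows that primitive only; NIFTY's actual flow-matching formulation wraps it in a proper velocity field.

```python
import numpy as np

def nearest_patch_field(current, exemplar, p=7):
    """
    One non-local matching step: replace each patch of the current estimate
    by its nearest neighbour among exemplar patches (brute force for clarity),
    averaging the votes of overlapping patches.
    """
    H, W = current.shape
    ex = np.lib.stride_tricks.sliding_window_view(exemplar, (p, p)).reshape(-1, p * p)
    out = np.zeros_like(current)
    weight = np.zeros_like(current)
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            patch = current[i:i + p, j:j + p].reshape(-1)
            idx = np.argmin(((ex - patch) ** 2).sum(axis=1))
            out[i:i + p, j:j + p] += ex[idx].reshape(p, p)
            weight[i:i + p, j:j + p] += 1.0
    return out / np.maximum(weight, 1.0)

exemplar = np.random.rand(32, 32)
estimate = np.random.rand(32, 32)
target = nearest_patch_field(estimate, exemplar)  # one step toward the patch manifold
```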
[338] RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer
Wangbo Zhao, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Pengfei Zhou, Kai Wang, Bohan Zhuang, Zhangyang Wang, Fan Wang, Yang You
Main category: cs.CV
TL;DR: RAPID3 is a training-free acceleration framework for Diffusion Transformers that uses three lightweight policy heads to achieve nearly 3x faster sampling while maintaining competitive generation quality.
Details
Motivation: Current diffusion transformer accelerators use uniform heuristics for all images, sacrificing quality, while dynamic neural networks require expensive fine-tuning. RAPID3 aims to provide per-image adaptive acceleration without updating the base generator.
Method: Three lightweight policy heads (Step-Skip, Cache-Reuse, Sparse-Attention) observe the denoising state and independently decide their speed-up at each timestep. Parameters are trained online via Group Relative Policy Optimization while the generator remains frozen, with an adversarial discriminator to prevent reward hacking.
Result: RAPID3 achieves nearly 3x faster sampling across state-of-the-art DiT backbones including Stable Diffusion 3 and FLUX, while maintaining competitive generation quality.
Conclusion: The framework successfully enables image-wise adaptive acceleration for diffusion transformers without requiring updates to the base generator, achieving significant speed improvements with preserved quality.
Abstract: Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a uniform heuristic or a manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers, a framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads - Step-Skip, Cache-Reuse, and Sparse-Attention - observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model’s distribution. Across state-of-the-art DiT backbones, including Stable Diffusion 3 and FLUX, RAPID3 achieves nearly 3x faster sampling with competitive generation quality.
[339] Pedestrian Attribute Recognition via Hierarchical Cross-Modality HyperGraph Learning
Xiao Wang, Shujuan Wu, Xiaoxia Cheng, Changwei Bi, Jin Tang, Bin Luo
Main category: cs.CV
TL;DR: This paper proposes a multi-modal knowledge graph and cross-modal hypergraph learning framework to enhance pedestrian attribute recognition by modeling relationships between visual features, attributes, and text.
Details
Motivation: Current PAR methods fail to fully exploit attribute knowledge and contextual information, and existing approaches using attribute text as additional input are still in their infancy.
Method: Constructs a multi-modal knowledge graph to mine relationships between local visual features and text, and between attributes and visual context samples. Introduces a knowledge graph-guided cross-modal hypergraph learning framework.
Result: Comprehensive experiments on multiple PAR benchmark datasets demonstrate the effectiveness of the proposed knowledge graph for PAR tasks.
Conclusion: The approach establishes a strong foundation for knowledge-guided pedestrian attribute recognition and the source code will be publicly released.
Abstract: Current Pedestrian Attribute Recognition (PAR) algorithms typically focus on mapping visual features to semantic labels or attempt to enhance learning by fusing visual and attribute information. However, these methods fail to fully exploit attribute knowledge and contextual information for more accurate recognition. Although recent works have started to consider using attribute text as additional input to enhance the association between visual and semantic information, these methods are still in their infancy. To address the above challenges, this paper proposes the construction of a multi-modal knowledge graph, which is utilized to mine the relationships between local visual features and text, as well as the relationships between attributes and extensive visual context samples. Specifically, we propose an effective multi-modal knowledge graph construction method that fully considers the relationships among attributes and the relationships between attributes and vision tokens. To effectively model these relationships, this paper introduces a knowledge graph-guided cross-modal hypergraph learning framework to enhance the standard pedestrian attribute recognition framework. Comprehensive experiments on multiple PAR benchmark datasets have thoroughly demonstrated the effectiveness of our proposed knowledge graph for the PAR task, establishing a strong foundation for knowledge-guided pedestrian attribute recognition. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
[340] CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
Arman Akbari, Jian Gao, Yifei Zou, Mei Yang, Jinru Duan, Dmitrii Torbunov, Yanzhi Wang, Yihui Ren, Xuan Zhang
Main category: cs.CV
TL;DR: CircuitSense is a benchmark that evaluates circuit understanding in MLLMs across hierarchical engineering design workflows, revealing significant gaps in visual-to-mathematical reasoning despite strong performance on perception tasks.
Details
Motivation: While MLLMs excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored, particularly in engineering design workflows that require hierarchical abstraction from system specifications to component implementations.
Method: Created the CircuitSense benchmark with 8,006+ problems spanning component-level schematics to system-level block diagrams, using a hierarchical synthetic generation pipeline with a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels.
Result: Closed-source models achieve over 85% accuracy on perception tasks but fall below 19% on symbolic derivation and analytical reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy.
Conclusion: Symbolic reasoning is established as the key metric for engineering competence in MLLMs, with fundamental limitations identified in visual-to-mathematical reasoning despite strong visual parsing capabilities.
Abstract: Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present \textbf{CircuitSense}, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of six state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.
[341] HierLight-YOLO: A Hierarchical and Lightweight Object Detection Network for UAV Photography
Defan Chen, Yaohua Hu, Luchan Zhang
Main category: cs.CV
TL;DR: HierLight-YOLO is a hierarchical feature fusion and lightweight model based on YOLOv8 that enhances real-time small object detection for drone imagery, achieving state-of-the-art performance on VisDrone2019 benchmark.
Details
Motivation: YOLO-series detectors struggle with high false negative rates for drone-based small object detection (<32 pixels) while maintaining real-time efficiency on resource-constrained platforms.
Method: Proposes the Hierarchical Extended Path Aggregation Network (HEPAN) for multi-scale feature fusion, plus lightweight modules (IRDCB and LDown) to reduce parameters/computation, and a specialized small object detection head for tiny objects (4 pixels).
Result: Demonstrates state-of-the-art performance on VisDrone2019 benchmark through comparison experiments and ablation studies.
Conclusion: HierLight-YOLO effectively addresses the dual challenges of detecting small targets in drone imagery while maintaining real-time efficiency through hierarchical feature fusion and lightweight design.
Abstract: The real-time detection of small objects in complex scenes, such as the unmanned aerial vehicle (UAV) photography captured by drones, poses dual challenges of detecting small targets (<32 pixels) and maintaining real-time efficiency on resource-constrained platforms. While YOLO-series detectors have achieved remarkable success in real-time large object detection, they suffer from significantly higher false negative rates for drone-based detection where small objects dominate, compared to large object scenarios. This paper proposes HierLight-YOLO, a hierarchical feature fusion and lightweight model that enhances the real-time detection of small objects, based on the YOLOv8 architecture. We propose the Hierarchical Extended Path Aggregation Network (HEPAN), a multi-scale feature fusion method using hierarchical cross-level connections, which enhances small object detection accuracy. HierLight-YOLO includes two innovative lightweight modules: the Inverted Residual Depthwise Convolution Block (IRDCB) and the Lightweight Downsample (LDown) module, which significantly reduce the model’s parameters and computational complexity without sacrificing detection capabilities. A dedicated small object detection head further enhances spatial resolution and feature fusion to tackle tiny object (4 pixels) detection. Comparison experiments and ablation studies on the VisDrone2019 benchmark demonstrate the state-of-the-art performance of HierLight-YOLO.
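The paper does not spell out the IRDCB internals here, but the name suggests a MobileNetV2-style inverted residual built around a depthwise convolution. A plausible PyTorch sketch, with the expansion ratio and activation chosen by us:

```python
import torch
import torch.nn as nn

class IRDCB(nn.Module):
    """
    Plausible form of an Inverted Residual Depthwise Convolution Block:
    expand -> depthwise 3x3 -> project, with a skip connection when the
    input and output shapes match.
    """
    def __init__(self, ch, expand=2):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)   # residual keeps gradients flowing cheaply

print(IRDCB(32)(torch.randn(1, 32, 40, 40)).shape)  # torch.Size([1, 32, 40, 40])
```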
[342] Effectiveness of Large Multimodal Models in Detecting Disinformation: Experimental Results
Yasmina Kheddache, Marc Lalonde
Main category: cs.CV
TL;DR: The study investigates using GPT-4o for multimodal disinformation detection, developing optimized prompts, structured analysis framework, and evaluation criteria to assess performance across multiple datasets.
Details
Motivation: The proliferation of multimodal disinformation combining text and images presents significant challenges across digital platforms, requiring effective detection methods.
Method: Leveraged GPT-4o with optimized prompt engineering, a structured multimodal analysis framework, preprocessing for token limitations, six evaluation criteria with self-assessment, and testing across multiple datasets (Gossipcop, Politifact, Fakeddit, MMFakeBench, AMMEBA).
Result: Comprehensive performance analysis revealed GPT-4o’s strengths and limitations in disinformation detection, with investigation of prediction variability and stability through repeated testing.
Conclusion: The study provides a robust and reproducible methodological framework for automated multimodal disinformation analysis using confidence-level and variability-based evaluation methods.
Abstract: The proliferation of disinformation, particularly in multimodal contexts combining text and images, presents a significant challenge across digital platforms. This study investigates the potential of large multimodal models (LMMs) in detecting and mitigating false information. We propose to approach multimodal disinformation detection by leveraging the advanced capabilities of the GPT-4o model. Our contributions include: (1) the development of an optimized prompt incorporating advanced prompt engineering techniques to ensure precise and consistent evaluations; (2) the implementation of a structured framework for multimodal analysis, including a preprocessing methodology for images and text to comply with the model’s token limitations; (3) the definition of six specific evaluation criteria that enable a fine-grained classification of content, complemented by a self-assessment mechanism based on confidence levels; (4) a comprehensive performance analysis of the model across multiple heterogeneous datasets (Gossipcop, Politifact, Fakeddit, MMFakeBench, and AMMEBA), highlighting GPT-4o’s strengths and limitations in disinformation detection; (5) an investigation of prediction variability through repeated testing, evaluating the stability and reliability of the model’s classifications; and (6) the introduction of confidence-level and variability-based evaluation methods. These contributions provide a robust and reproducible methodological framework for automated multimodal disinformation analysis.
[343] GPT-4 for Occlusion Order Recovery
Kaziwa Saleh, Zhyar Rzgar K Rostam, Sándor Szénási, Zoltán Vámossy
Main category: cs.CV
TL;DR: Using GPT-4’s reasoning capabilities to predict occlusion order relationships between objects in images through zero-shot analysis, achieving accurate predictions without training data.
Details
Motivation: Occlusion poses a major challenge for vision models in interpreting complex real-world scenes, requiring robust methods to determine object occlusion order relationships.
Method: Leverage pre-trained GPT-4 with specifically designed prompts to analyze images and generate occlusion order predictions, then parse responses to construct occlusion matrices for integration into occlusion handling frameworks.
Result: Evaluation on COCOA and InstaOrder datasets shows the model produces more accurate order predictions using semantic context, visual patterns, and commonsense knowledge compared to baseline methods.
Conclusion: GPT-4 enables zero-shot occlusion reasoning without annotated training data, providing an easily integrable solution for occlusion handling tasks and improving image understanding.
Abstract: Occlusion remains a significant challenge for current vision models to robustly interpret complex and dense real-world images and scenes. To address this limitation and to enable accurate prediction of the occlusion order relationship between objects, we propose leveraging the advanced capability of a pre-trained GPT-4 model to deduce the order. By providing a specifically designed prompt along with the input image, GPT-4 can analyze the image and generate order predictions. The response can then be parsed to construct an occlusion matrix which can be utilized in assisting with other occlusion handling tasks and image understanding. We report the results of evaluating the model on COCOA and InstaOrder datasets. The results show that by using semantic context, visual patterns, and commonsense knowledge, the model can produce more accurate order predictions. Unlike baseline methods, the model can reason about occlusion relationships in a zero-shot fashion, which requires no annotated training data and can easily be integrated into occlusion handling frameworks.
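The parsing step is mechanical once a response convention is fixed. A small sketch, assuming the prompt asks the model to emit one "X occludes Y" statement per line (the phrasing is our assumed convention, not the paper's exact prompt):

```python
import re
import numpy as np

def build_occlusion_matrix(response: str, objects: list[str]) -> np.ndarray:
    """
    Parse lines such as "person occludes chair" from the model's response into
    a binary matrix M where M[i, j] = 1 means object i occludes object j.
    """
    index = {name: i for i, name in enumerate(objects)}
    M = np.zeros((len(objects), len(objects)), dtype=int)
    for occluder, occludee in re.findall(r"(\w+) occludes (\w+)", response):
        if occluder in index and occludee in index:
            M[index[occluder], index[occludee]] = 1
    return M

reply = "person occludes chair\nchair occludes rug"
print(build_occlusion_matrix(reply, ["person", "chair", "rug"]))
```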
[344] Gradient-based multi-focus image fusion with focus-aware saliency enhancement
Haoyu Li, XiaoSong Li
Main category: cs.CV
TL;DR: A multi-focus image fusion method using significant boundary enhancement to generate sharp focus-defocus boundaries and preserve focused details through gradient-domain modeling and Tenengrad gradient detection.
Details
Motivation: Existing multi-focus image fusion methods struggle with preserving sharp focus-defocus boundaries, often resulting in blurred transitions and loss of focused details.
Method: Proposes a gradient-domain-based model for initial fusion with complete boundaries, uses Tenengrad gradient detection for salient feature extraction, and develops a focus metric integrating gradient and complementary information for boundary refinement.
Result: Extensive experiments on four public datasets show the method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations.
Conclusion: The proposed method effectively enhances boundary quality in multi-focus image fusion while preserving focused details, demonstrating superior performance compared to existing approaches.
Abstract: Multi-focus image fusion (MFIF) aims to yield an all-focused image from multiple partially focused inputs, which is crucial in applications covering surveillance, microscopy, and computational photography. However, existing methods struggle to preserve sharp focus-defocus boundaries, often resulting in blurred transitions and loss of focused details. To solve this problem, we propose an MFIF method based on significant boundary enhancement, which generates high-quality fused boundaries while effectively detecting focus information. In particular, we propose a gradient-domain-based model that can obtain initial fusion results with complete boundaries and effectively preserve the boundary details. Additionally, we introduce Tenengrad gradient detection to extract salient features from both the source images and the initial fused image, generating the corresponding saliency maps. For boundary refinement, we develop a focus metric based on gradient and complementary information, integrating the salient features with the complementary information across images to emphasize focused regions and produce a high-quality initial decision result. Extensive experiments on four public datasets demonstrate that our method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations. Our code is available at https://github.com/Lihyua/GICI
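Tenengrad is a classical focus measure: the squared Sobel gradient magnitude, optionally thresholded, is large in sharply focused regions. A minimal version of the saliency-map step (the threshold value is left to the user):

```python
import numpy as np
from scipy.ndimage import sobel

def tenengrad_map(image: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """
    Tenengrad focus measure: squared Sobel gradient magnitude per pixel.
    Focused regions produce strong gradients, so this map acts as a
    per-pixel focus/saliency cue.
    """
    gx = sobel(image.astype(np.float64), axis=1)   # horizontal gradient
    gy = sobel(image.astype(np.float64), axis=0)   # vertical gradient
    g2 = gx ** 2 + gy ** 2
    g2[g2 < threshold] = 0.0                       # suppress weak responses
    return g2

img = np.random.rand(64, 64)
focus = tenengrad_map(img)
```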
[345] Text Adversarial Attacks with Dynamic Outputs
Wenqiang Wang, Siyuan Liang, Xiao Yan, Xiaochun Cao
Main category: cs.CV
TL;DR: TDOA is a text adversarial attack method that handles dynamic output scenarios by converting them to static scenarios using clustering-based surrogate training, achieving up to 50.81% attack success rate with single queries.
Details
Motivation: Existing text adversarial attack methods are designed for static scenarios with fixed output labels and predefined label spaces, but real-world applications often involve dynamic outputs.
Method: Uses clustering-based surrogate model training to convert dynamic-output scenarios to static single-output scenarios, and employs a farthest-label targeted attack strategy to maximize disruption.
Result: Achieves maximum attack success rate of 50.81% with single query per text on dynamic scenarios, and 82.68% on static scenarios. Also extends to generative settings with improvements of up to 0.64 RDBLEU and 0.62 RDchrF.
Conclusion: TDOA effectively addresses dynamic output scenarios in text adversarial attacks, demonstrating strong performance across multiple datasets and models with limited access, while also excelling in conventional static scenarios.
Abstract: Text adversarial attack methods are typically designed for static scenarios with fixed numbers of output labels and a predefined label space, relying on extensive querying of the victim model (query-based attacks) or the surrogate model (transfer-based attacks). Real-world systems, however, often produce dynamic outputs whose labels and label spaces vary. To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model’s coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate of 50.81%. Additionally, we find that TDOA also achieves state-of-the-art performance in conventional static output scenarios, reaching a maximum ASR of 82.68%. Meanwhile, by conceptualizing translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.
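As we read it, TDOA rests on two moves: cluster the victim's free-form outputs into coarse pseudo-labels so that a static surrogate can be trained, and target the pseudo-label farthest from the current input. The sketch below illustrates both with generic embeddings; the actual embedding and clustering choices are the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

# (1) Cluster embeddings of the victim's outputs into coarse pseudo-labels.
rng = np.random.default_rng(0)
output_embs = rng.normal(size=(200, 64))        # stand-in output embeddings
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(output_embs)

# (2) Farthest-label targeting: attack toward the cluster whose centroid
#     deviates most from the current input's embedding.
def farthest_label(text_emb: np.ndarray) -> int:
    d = np.linalg.norm(kmeans.cluster_centers_ - text_emb, axis=1)
    return int(np.argmax(d))

print(farthest_label(rng.normal(size=64)))
```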
[346] RAU: Reference-based Anatomical Understanding with Vision Language Models
Yiwei Li, Yikang Liu, Jiaqi Guo, Lin Zhao, Zheyuan Zhang, Xiao Chen, Boris Mailhe, Ankush Mukherjee, Terrence Chen, Shanhui Sun
Main category: cs.CV
TL;DR: RAU is a framework that uses vision-language models for reference-based anatomical understanding in medical imaging, enabling identification, localization and segmentation of anatomical structures by leveraging spatial reasoning between reference and target images.
Details
Motivation: Anatomical understanding in medical imaging is crucial for automated workflows but limited by scarce expert-labeled data. Reference-based approaches using vision-language models offer a promising solution but current VLMs lack fine-grained localization capabilities.
Method: RAU combines VLM spatial reasoning (trained on moderately sized datasets) with SAM2’s segmentation capability. It uses relative spatial reasoning between reference and target images for identification, then integrates VLM-derived spatial cues with SAM2 for pixel-level segmentation.
Result: RAU outperforms SAM2 fine-tuning baselines across in-distribution and out-of-distribution datasets, achieving more accurate segmentations and reliable localization. It shows strong generalization to out-of-distribution data, crucial for medical applications.
Conclusion: RAU demonstrates the first successful use of VLMs for reference-based anatomical understanding in medical imaging, highlighting the potential of VLM-driven approaches for automated clinical workflows with strong generalization capabilities.
Abstract: Anatomical understanding through deep learning is critical for automatic report generation, intra-operative navigation, and organ localization in medical imaging; however, its progress is constrained by the scarcity of expert-labeled data. A promising remedy is to leverage an annotated reference image to guide the interpretation of an unlabeled target. Although recent vision-language models (VLMs) exhibit non-trivial visual reasoning, their reference-based understanding and fine-grained localization remain limited. We introduce RAU, a framework for reference-based anatomical understanding with VLMs. We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images, trained on a moderately sized dataset. We validate this capability through visual question answering (VQA) and bounding box prediction. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2, enabling localization and pixel-level segmentation of small anatomical regions, such as vessel segments. Across two in-distribution and two out-of-distribution datasets, RAU consistently outperforms a SAM2 fine-tuning baseline using the same memory setup, yielding more accurate segmentations and more reliable localization. More importantly, its strong generalization ability makes it scalable to out-of-distribution datasets, a property crucial for medical image applications. To the best of our knowledge, RAU is the first to explore the capability of VLMs for reference-based identification, localization, and segmentation of anatomical structures in medical images. Its promising performance highlights the potential of VLM-driven approaches for anatomical understanding in automated clinical workflows.
[347] Integrating Background Knowledge in Medical Semantic Segmentation with Logic Tensor Networks
Luca Bergamin, Giovanna Maria Dimitri, Fabio Aiolli
Main category: cs.CV
TL;DR: The paper proposes using Logic Tensor Networks (LTNs) to incorporate medical background knowledge into semantic segmentation models, improving performance especially with limited training data.
Details
Motivation: Current deep learning systems for medical semantic segmentation are imperfect and can benefit from incorporating domain-specific medical knowledge to improve performance.
Method: Use Logic Tensor Networks (LTNs) to encode medical background knowledge using first-order logic rules, integrated with SwinUNETR in an end-to-end framework for semantic segmentation.
Result: LTNs improved baseline segmentation performance for hippocampus segmentation in brain MRI scans, particularly when training data was scarce.
Conclusion: Neuro-symbolic methods like LTNs are general enough to be adapted to other medical semantic segmentation tasks and show promise for improving performance with limited data.
Abstract: Semantic segmentation is a fundamental task in medical image analysis, aiding medical decision-making by helping radiologists distinguish objects in an image. Research in this field has been driven by deep learning applications, which have the potential to scale these systems even in the presence of noise and artifacts. However, these systems are not yet perfected. We argue that performance can be improved by incorporating common medical knowledge into the segmentation model’s loss function. To this end, we introduce Logic Tensor Networks (LTNs) to encode medical background knowledge using first-order logic (FOL) rules. The encoded rules span from constraints on the shape of the produced segmentation, to relationships between different segmented areas. We apply LTNs in an end-to-end framework with a SwinUNETR for semantic segmentation. We evaluate our method on the task of segmenting the hippocampus in brain MRI scans. Our experiments show that LTNs improve the baseline segmentation performance, especially when training data is scarce. Despite being in its preliminary stages, we argue that neurosymbolic methods are general enough to be adapted and applied to other medical semantic segmentation tasks.
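LTNs turn first-order rules into differentiable loss terms via fuzzy logic. As an illustration of the mechanics (the rule itself is our invention, not taken from the paper), the constraint "every hippocampus pixel lies inside the brain" can be scored with the Łukasiewicz implication:

```python
import torch

def implication_loss(p_hippo: torch.Tensor, p_brain: torch.Tensor) -> torch.Tensor:
    """
    Differentiable encoding of the FOL rule
        forall x: Hippocampus(x) -> Brain(x)
    using the Lukasiewicz implication I(a, b) = min(1, 1 - a + b).
    The loss is 1 minus the rule's average truth value, so pixels where the
    hippocampus probability exceeds the brain probability are penalized.
    """
    truth = torch.clamp(1.0 - p_hippo + p_brain, max=1.0)
    return 1.0 - truth.mean()

p_h = torch.rand(1, 1, 16, 16, requires_grad=True)   # predicted hippocampus prob.
p_b = torch.rand(1, 1, 16, 16)                       # predicted brain-mask prob.
loss = implication_loss(p_h, p_b)                    # add to the usual Dice/CE loss
loss.backward()
```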
[348] Explaining multimodal LLMs via intra-modal token interactions
Jiawei Liang, Ruoyu Chen, Xianghao Jiao, Siyuan Liang, Shiming Liu, Qunli Zhang, Zheng Hu, Xiaochun Cao
Main category: cs.CV
TL;DR: The paper proposes methods to improve interpretability of Multimodal Large Language Models by addressing intra-modal dependencies in both visual and textual modalities.
Details
Motivation: Existing interpretability research focuses on cross-modal attribution but overlooks intra-modal dependencies, leading to fragmented visual explanations and spurious textual activations.
Method: Proposes Multi-Scale Explanation Aggregation (MSEA) for the visual branch to aggregate attributions over multi-scale inputs, and Activation Ranking Correlation (ARC) for the textual branch to measure token relevance via prediction ranking alignment.
Result: Extensive experiments show the approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations.
Conclusion: The proposed methods effectively enhance MLLM interpretability by leveraging intra-modal interactions, addressing limitations of current attribution approaches.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to effectively mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce \textit{Multi-Scale Explanation Aggregation} (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose \textit{Activation Ranking Correlation} (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-$k$ prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
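The ARC idea, measuring a context token's relevance by how well its top-k prediction ranking aligns with the current token's, can be illustrated with a toy statistic. The rank weighting below is our own; the paper's exact correlation measure may differ.

```python
import torch

def topk_rank_alignment(logits_a: torch.Tensor, logits_b: torch.Tensor, k: int = 50) -> float:
    """
    Toy alignment score between two tokens' top-k next-token rankings:
    shared entries count more when they appear early and at similar ranks.
    """
    top_a = logits_a.topk(k).indices.tolist()
    top_b = logits_b.topk(k).indices.tolist()
    rank_b = {tok: r for r, tok in enumerate(top_b)}
    score, norm = 0.0, 0.0
    for r_a, tok in enumerate(top_a):
        w = 1.0 / (r_a + 1)                      # earlier ranks matter more
        norm += w
        if tok in rank_b:
            score += w / (1 + abs(r_a - rank_b[tok]))
    return score / norm

a, b = torch.randn(32000), torch.randn(32000)   # stand-in vocab logits
print(topk_rank_alignment(a, b))
```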
[349] Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Ke Xu
Main category: cs.CV
TL;DR: VARE framework enables stable concept erasure in visual autoregressive models using auxiliary visual tokens and filtered cross entropy loss to address safety concerns while maintaining generation quality.
Details
Motivation: Existing concept erasure techniques designed for diffusion models fail to generalize to visual autoregressive (VAR) models due to their next-scale token prediction paradigm, creating safety gaps in text-to-image generation.
Method: Proposes the VARE framework with auxiliary visual tokens to reduce fine-tuning intensity, and the S-VARE method with a filtered cross entropy loss to precisely identify unsafe tokens and a preservation loss to maintain semantic fidelity.
Result: Extensive experiments show the approach achieves surgical concept erasure while preserving generation quality, effectively closing safety gaps in autoregressive text-to-image generation.
Conclusion: The proposed VARE and S-VARE methods successfully address concept erasure challenges in VAR models, providing stable safety improvements without compromising generation capabilities.
Abstract: The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose a novel VAR Erasure framework, VARE, that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left by earlier methods in autoregressive text-to-image generation.
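The masking mechanics of a filtered cross entropy are simple to show, even though identifying which visual tokens are unsafe (and the paired preservation loss) is the paper's actual contribution. A sketch:

```python
import torch
import torch.nn.functional as F

def filtered_cross_entropy(logits, targets, unsafe_mask):
    """
    Apply the erasure loss only at positions flagged as unsafe visual tokens
    (unsafe_mask == True), leaving all other token predictions untouched.
    How the mask is obtained is the method's contribution; this shows only
    the masking mechanics.
    """
    per_token = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view(targets.shape)
    masked = per_token * unsafe_mask.float()
    return masked.sum() / unsafe_mask.float().sum().clamp(min=1.0)

logits = torch.randn(2, 16, 4096)                 # (batch, seq, vocab)
targets = torch.randint(0, 4096, (2, 16))
unsafe = torch.zeros(2, 16, dtype=torch.bool)
unsafe[:, 5:8] = True                             # pretend these tokens are unsafe
print(filtered_cross_entropy(logits, targets, unsafe))
```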
[350] Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe
Main category: cs.CV
TL;DR: 2D Gaussian Splatting (2DGS) is proposed as an alternative visual representation to RGB images for vision-language pipelines, offering 3-20x compression while maintaining semantic capabilities.
Details
Motivation: Current RGB vision pipelines have structural inefficiencies: (i) transmitting dense RGB images from edge to cloud is energy intensive, and (ii) patch-based tokenization creates long sequences that stress attention mechanisms.
Method: Developed a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and optimized CUDA kernels. Adapted CLIP training by reusing a frozen RGB transformer backbone with a splat-aware input stem and a perceiver resampler, training only 7% of parameters.
Result: Achieved 90x faster fitting and 97% GPU utilization compared to prior implementations. 2DGS encoders yield meaningful zero-shot ImageNet-1K performance while compressing inputs 3-20x relative to pixels. Accuracy currently trails RGB encoders but establishes viability.
Conclusion: 2DGS is established as a viable multimodal substrate that opens a path toward representations that are both semantically powerful and transmission-efficient for edge-cloud learning, though current accuracy needs improvement.
Abstract: Modern vision language pipelines are driven by RGB vision encoders trained on massive image text corpora. While these pipelines have enabled impressive zero shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance aware pruning, and batched CUDA kernels, achieving over 90x faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero shot ImageNet-1K performance while compressing inputs 3 to 20x relative to pixels. While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge cloud learning.
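The representation itself is easy to state: an image becomes a set of colored anisotropic Gaussians. A minimal (unoptimized) renderer clarifies why the encoding is compact: pixels are reconstructed from a handful of splat parameters rather than stored densely. This is a toy version without the pruning and batched CUDA kernels the paper relies on.

```python
import numpy as np

def render_2dgs(means, covs, colors, H=64, W=64):
    """
    Minimal 2D Gaussian Splatting renderer: accumulate colored anisotropic
    Gaussians over the pixel grid (no tiling, sorting, or pruning).
    """
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2)
    img = np.zeros((H, W, 3))
    for mu, cov, col in zip(means, covs, colors):
        d = grid - mu                                        # offsets to the mean
        inv = np.linalg.inv(cov)
        # Mahalanobis distance of every pixel to this Gaussian.
        m = np.einsum("hwi,ij,hwj->hw", d, inv, d)
        img += np.exp(-0.5 * m)[..., None] * np.asarray(col)
    return np.clip(img, 0, 1)

means = [(20.0, 20.0), (44.0, 36.0)]
covs = [np.array([[40.0, 12.0], [12.0, 10.0]]), np.array([[8.0, 0.0], [0.0, 30.0]])]
colors = [(1.0, 0.2, 0.1), (0.1, 0.4, 1.0)]
img = render_2dgs(means, covs, colors)   # two anisotropic splats on a 64x64 canvas
```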
[351] FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing
Hossein Kashiani, Niloufar Alipour Talemi, Fatemeh Afghah
Main category: cs.CV
TL;DR: FreqDebias is a frequency debiasing framework that addresses spectral bias in deepfake detectors through Forgery Mixup augmentation and dual consistency regularization, improving cross-domain generalization.
Details
Motivation: Deepfake detectors often fail to generalize to novel forgery types due to spectral bias, an over-reliance on specific frequency bands learned from limited training data.
Method: Proposes FreqDebias with two strategies: 1) Forgery Mixup (Fo-Mixup) augmentation to diversify frequency characteristics, and 2) dual consistency regularization using class activation maps (local) and a von Mises-Fisher distribution on hyperspherical embeddings (global).
Result: Extensive experiments show FreqDebias significantly enhances cross-domain generalization and outperforms state-of-the-art methods in both cross-domain and in-domain settings.
Conclusion: The proposed frequency debiasing framework effectively mitigates spectral bias and improves deepfake detector generalization across different forgery types.
Abstract: Deepfake detectors often struggle to generalize to novel forgery types due to biases learned from limited training data. In this paper, we identify a new type of model bias in the frequency domain, termed spectral bias, where detectors overly rely on specific frequency bands, restricting their ability to generalize across unseen forgeries. To address this, we propose FreqDebias, a frequency debiasing framework that mitigates spectral bias through two complementary strategies. First, we introduce a novel Forgery Mixup (Fo-Mixup) augmentation, which dynamically diversifies frequency characteristics of training samples. Second, we incorporate a dual consistency regularization (CR), which enforces both local consistency using class activation maps (CAMs) and global consistency through a von Mises-Fisher (vMF) distribution on a hyperspherical embedding space. This dual CR mitigates over-reliance on certain frequency components by promoting consistent representation learning under both local and global supervision. Extensive experiments show that FreqDebias significantly enhances cross-domain generalization and outperforms state-of-the-art methods in both cross-domain and in-domain settings.
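The paper does not spell out here how Fo-Mixup perturbs frequency content; one plausible reading, sketched below, blends the Fourier amplitude spectra of two training samples while keeping the first sample's phase, so semantic layout is preserved while frequency statistics change. The function name and blending form are assumptions.

```python
import numpy as np

def fo_mixup(x1, x2, lam=0.5):
    """Blend Fourier amplitude spectra of two images, keeping x1's phase.
    x1, x2: float arrays of shape (H, W, C) in [0, 1]."""
    F1 = np.fft.fft2(x1, axes=(0, 1))
    F2 = np.fft.fft2(x2, axes=(0, 1))
    amp = lam * np.abs(F1) + (1.0 - lam) * np.abs(F2)   # blended amplitudes
    mixed = amp * np.exp(1j * np.angle(F1))             # original phase
    return np.real(np.fft.ifft2(mixed, axes=(0, 1))).clip(0.0, 1.0)
```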
[352] LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
Song Fei, Tian Ye, Lujia Wang, Lei Zhu
Main category: cs.CV
TL;DR: LucidFlux is a caption-free universal image restoration framework that adapts Flux.1 diffusion transformer without needing image captions, using dual-branch conditioning and adaptive modulation for robust restoration.
Details
Motivation: Current universal image restoration methods often oversmooth, hallucinate, or drift when dealing with unknown degradation mixtures while preserving semantics. Existing approaches rely on text prompts or MLLM captions which introduce latency and instability.
Method: Uses lightweight dual-branch conditioner to inject signals from degraded input and lightly restored proxy; timestep- and layer-adaptive modulation schedule; caption-free semantic alignment via SigLIP features; scalable curation pipeline for structure-rich supervision.
Result: Consistently outperforms strong open-source and commercial baselines across synthetic and in-the-wild benchmarks. Ablation studies verify necessity of each component.
Conclusion: For large diffusion transformers, the key to robust caption-free universal image restoration is determining when, where, and what to condition on, rather than adding parameters or relying on text prompts.
Abstract: Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics – conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone’s hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or MLLM captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on – rather than adding parameters or relying on text prompts – is the governing lever for robust and caption-free universal image restoration in the wild.
[353] LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision
Debargha Ganguly, Sumit Kumar, Ishwar Balappanawar, Weicong Chen, Shashank Kambhatla, Srinivasan Iyengar, Shivkumar Kalyanaraman, Ponnurangam Kumaraguru, Vipin Chaudhary
Main category: cs.CV
TL;DR: Labeling Copilot is an AI agent for automated computer vision data curation that combines calibrated discovery, controllable synthesis, and consensus annotation to efficiently create high-quality datasets at scale.
Details
Motivation: Curating high-quality domain-specific datasets is a major bottleneck for vision systems, requiring complex trade-offs between data quality, diversity, and cost when dealing with large unlabeled data lakes.
Method: A central orchestrator agent powered by a multimodal language model uses multi-step reasoning to execute specialized tools: Calibrated Discovery for sourcing relevant data, Controllable Synthesis for generating rare scenario data, and Consensus Annotation for accurate labeling using multiple foundation models with novel consensus mechanisms.
Result: Consensus Annotation achieved 14.2 candidate proposals per image (vs 7.4 ground-truth) on COCO with 37.1% mAP, discovered 903 new categories on Open Images. Calibrated Discovery was 40x more computationally efficient at 10-million sample scale with equivalent sample efficiency.
Conclusion: Agentic workflows with optimized, scalable tools provide a robust foundation for curating industrial-scale datasets, overcoming traditional data curation bottlenecks.
Abstract: Curating high-quality, domain-specific datasets is a major bottleneck for deploying robust vision systems, requiring complex trade-offs between data quality, diversity, and cost when researching vast, unlabeled data lakes. We introduce Labeling Copilot, the first data curation deep research agent for computer vision. A central orchestrator agent, powered by a large multimodal language model, uses multi-step reasoning to execute specialized tools across three core capabilities: (1) Calibrated Discovery sources relevant, in-distribution data from large repositories; (2) Controllable Synthesis generates novel data for rare scenarios with robust filtering; and (3) Consensus Annotation produces accurate labels by orchestrating multiple foundation models via a novel consensus mechanism incorporating non-maximum suppression and voting. Our large-scale validation proves the effectiveness of Labeling Copilot’s components. The Consensus Annotation module excels at object discovery: on the dense COCO dataset, it averages 14.2 candidate proposals per image, nearly double the 7.4 ground-truth objects, achieving a final annotation mAP of 37.1%. On the web-scale Open Images dataset, it navigated extreme class imbalance to discover 903 new bounding box categories, expanding its capability to over 1500 total. Concurrently, our Calibrated Discovery tool, tested at a 10-million sample scale, features an active learning strategy that is up to 40x more computationally efficient than alternatives with equivalent sample efficiency. These experiments validate that an agentic workflow with optimized, scalable tools provides a robust foundation for curating industrial-scale datasets.
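As a rough illustration of the consensus idea, cross-model box clustering plus voting in the spirit of non-maximum suppression, here is a hedged sketch; the actual Labeling Copilot mechanism is more elaborate, and every name below is hypothetical.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def consensus_annotate(detections, iou_thr=0.5, min_votes=2):
    """detections: list over models, each a list of (box, score) pairs.
    Greedily cluster cross-model boxes by IoU (NMS-style) and keep clusters
    that enough distinct models voted for."""
    flat = sorted(((b, s, m) for m, dets in enumerate(detections)
                   for b, s in dets), key=lambda t: -t[1])
    kept, used = [], [False] * len(flat)
    for i, (box, score, model) in enumerate(flat):
        if used[i]:
            continue
        votes = {model}
        for j in range(i + 1, len(flat)):
            if not used[j] and iou(box, flat[j][0]) > iou_thr:
                used[j] = True
                votes.add(flat[j][2])
        if len(votes) >= min_votes:
            kept.append((box, score))
    return kept
```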
[354] U-MAN: U-Net with Multi-scale Adaptive KAN Network for Medical Image Segmentation
Bohan Huang, Qianyun Bao, Haoyuan Ma
Main category: cs.CV
TL;DR: U-MAN is a novel medical image segmentation architecture that enhances U-Net with Multi-scale Adaptive KAN modules to address semantic gaps and multi-scale feature extraction limitations, achieving superior boundary accuracy and detail preservation.
Details
Motivation: To overcome limitations in conventional U-Net architectures that struggle with preserving fine-grained details and precise boundaries in medical images due to simple skip connections ignoring encoder-decoder semantic gaps and lack of multi-scale feature extraction in deep layers.
Method: Proposed U-MAN architecture with two key modules: Progressive Attention-Guided Feature Fusion (PAGF) to replace simple skip connections using attention mechanisms, and Multi-scale Adaptive KAN (MAN) to enable adaptive multi-scale feature processing for segmenting objects of various sizes.
Result: Experiments on three public datasets (BUSI, GLAS, and CVC) demonstrate that U-MAN outperforms state-of-the-art methods, particularly excelling in defining accurate boundaries and preserving fine details.
Conclusion: U-MAN effectively addresses the core limitations of conventional U-Nets through its attention-guided feature fusion and multi-scale adaptive processing, achieving superior performance in medical image segmentation tasks.
Abstract: Medical image segmentation faces significant challenges in preserving fine-grained details and precise boundaries due to complex anatomical structures and pathological regions. These challenges primarily stem from two key limitations of conventional U-Net architectures: (1) their simple skip connections ignore the encoder-decoder semantic gap between various features, and (2) they lack the capability for multi-scale feature extraction in deep layers. To address these challenges, we propose the U-Net with Multi-scale Adaptive KAN (U-MAN), a novel architecture that enhances the emerging Kolmogorov-Arnold Network (KAN) with two specialized modules: Progressive Attention-Guided Feature Fusion (PAGF) and the Multi-scale Adaptive KAN (MAN). Our PAGF module replaces the simple skip connection, using attention to fuse features from the encoder and decoder. The MAN module enables the network to adaptively process features at multiple scales, improving its ability to segment objects of various sizes. Experiments on three public datasets (BUSI, GLAS, and CVC) show that U-MAN outperforms state-of-the-art methods, particularly in defining accurate boundaries and preserving fine details.
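The paper describes PAGF only at a high level (attention replaces the plain skip connection). A minimal PyTorch guess at such a module might look like the following; the gating structure is my assumption, not the authors' design.

```python
import torch
import torch.nn as nn

class PAGF(nn.Module):
    """Attention-guided replacement for a plain skip connection: decoder
    features gate the encoder features before fusion (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, enc, dec):
        # enc, dec: (B, C, H, W) encoder / decoder features at the same scale
        attn = self.gate(torch.cat([enc, dec], dim=1))   # spatial attention map
        return self.fuse(torch.cat([attn * enc, dec], dim=1))
```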
[355] γ-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition
Mishal Fatima, Shashank Agnihotri, Marius Bock, Kanchana Vaishnavi Gandikota, Kristof Van Laerhoven, Michael Moeller, Margret Keuper
Main category: cs.CV
TL;DR: The paper proposes γ-Quant, a task-specific learnable non-linear quantization method for pattern recognition that enables using low-bit-depth (4-bit) raw sensor data while maintaining performance comparable to high-bit (12-bit) data, addressing energy and bandwidth constraints in computer vision and human activity recognition.
Details
Motivation: Current pattern recognition models use pre-processed data optimized for human perception, but this is inefficient for automated analysis and energy-constrained devices. High-bit-depth data transmission significantly impacts battery life in wearable devices.
Method: The authors propose γ-Quant, which learns a non-linear quantization specifically for pattern recognition tasks. The approach is demonstrated on raw-image object detection and human activity recognition using wearable sensor data.
Result: The method shows that raw data with learnable quantization using only 4-bits can perform on par with raw 12-bit data, achieving comparable performance while significantly reducing data requirements.
Conclusion: Task-specific learnable quantization enables efficient pattern recognition in low-bandwidth and energy-constrained settings, making it possible to use low-bit-depth sensor data without sacrificing performance.
Abstract: Most pattern recognition models are developed on pre-processed data. In computer vision, for instance, RGB images processed through image signal processing (ISP) pipelines designed to cater to human perception are the most frequent input to image analysis networks. However, many modern vision tasks operate without a human in the loop, raising the question of whether such pre-processing is optimal for automated analysis. Similarly, human activity recognition (HAR) on body-worn sensor data commonly takes normalized floating-point data arising from a high-bit analog-to-digital converter (ADC) as an input, despite such an approach being highly inefficient in terms of data transmission, significantly affecting the battery life of wearable devices. In this work, we target low-bandwidth and energy-constrained settings where sensors are limited to low-bit-depth capture. We propose $\gamma$-Quant, i.e., the task-specific learning of a non-linear quantization for pattern recognition. We exemplify our approach on raw-image object detection as well as HAR of wearable data, and demonstrate that raw data with a learnable quantization using as few as 4-bits can perform on par with the use of raw 12-bit data. All code to reproduce our experiments is publicly available via https://github.com/Mishalfatima/Gamma-Quant
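A learnable non-linear quantizer of the kind described can be sketched in a few lines of PyTorch: a gamma-style power curve with a trainable exponent, followed by uniform low-bit rounding made differentiable with a straight-through estimator. The module name and exact curve are assumptions; the paper's learned non-linearity may differ.

```python
import torch
import torch.nn as nn

class GammaQuant(nn.Module):
    """Learnable non-linear quantization sketch: x in [0, 1] is remapped by a
    power curve with trainable exponent, then rounded to a low-bit grid."""
    def __init__(self, bits=4):
        super().__init__()
        self.bits = bits
        self.log_gamma = nn.Parameter(torch.zeros(1))   # exponent learned in log space

    def forward(self, x):
        gamma = self.log_gamma.exp()
        y = x.clamp(0.0, 1.0).pow(gamma)                # learnable non-linear curve
        levels = 2 ** self.bits - 1
        q = torch.round(y * levels) / levels            # quantize to the 4-bit grid
        # straight-through estimator: forward uses q, backward flows through y
        return y + (q - y).detach()
```

Because the rounding step is bypassed in the backward pass, gradients reach `log_gamma`, so the quantization curve is learned jointly with the downstream recognition task.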
[356] Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs
Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, Chris Callison-Burch
Main category: cs.CV
TL;DR: DeeptraceReward is a benchmark for detecting AI-generated videos by identifying spatiotemporal fake traces, with 4.3K annotations across 3.3K videos, used to train reward models that outperform GPT-5 by 34.7%.
Details
Motivation: To address the gap in detecting fine-grained deepfake traces in AI-generated videos, focusing on human-perceived visual artifacts that reveal machine generation.
Method: Created a dataset with detailed annotations including natural-language explanations, bounding-box regions, and timestamps; trained multimodal language models as reward models to mimic human judgments.
Result: The 7B reward model outperformed GPT-5 by 34.7% across fake clue identification, grounding, and explanation tasks, with performance degrading from explanations to spatial and temporal grounding.
Conclusion: DeeptraceReward provides a rigorous framework for improving trustworthy video generation by emphasizing human-perceived deepfake traces.
Abstract: Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension – whether humans can detect deepfake traces within a generated video, i.e., spatiotemporal grounded visual artifacts that reveal a video as machine generated – has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially- and temporally- aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake v.s. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
[357] SSVIF: Self-Supervised Segmentation-Oriented Visible and Infrared Image Fusion
Zixian Zhao, Xingchen Zhang
Main category: cs.CV
TL;DR: Proposes SSVIF, a self-supervised framework for segmentation-oriented visible-infrared image fusion that eliminates the need for labeled segmentation data by leveraging cross-segmentation consistency between feature-level and pixel-level fusion.
Details
Motivation: Application-oriented VIF methods require expensive labeled datasets for downstream tasks like segmentation, making data acquisition labor-intensive. This work aims to develop a self-supervised approach that achieves similar performance without segmentation labels.
Method: Uses cross-segmentation consistency between feature-level and pixel-level fusion-based segmentation as a self-supervised task. Implements a two-stage training strategy with dynamic weight adjustment for effective joint learning.
Result: Extensive experiments show SSVIF outperforms traditional VIF methods and rivals supervised segmentation-oriented methods, despite being trained only on unlabeled visible-infrared image pairs.
Conclusion: SSVIF provides an effective self-supervised solution for segmentation-oriented VIF that eliminates the need for expensive labeled data while achieving competitive performance with supervised approaches.
Abstract: Visible and infrared image fusion (VIF) has gained significant attention in recent years due to its wide application in tasks such as scene segmentation and object detection. VIF methods can be broadly classified into traditional VIF methods and application-oriented VIF methods. Traditional methods focus solely on improving the quality of fused images, while application-oriented VIF methods additionally consider the performance of downstream tasks on fused images by introducing task-specific loss terms during training. However, compared to traditional methods, application-oriented VIF methods require datasets labeled for downstream tasks (e.g., semantic segmentation or object detection), making data acquisition labor-intensive and time-consuming. To address this issue, we propose a self-supervised training framework for segmentation-oriented VIF methods (SSVIF). Leveraging the consistency between feature-level fusion-based segmentation and pixel-level fusion-based segmentation, we introduce a novel self-supervised task, cross-segmentation consistency, that enables the fusion model to learn high-level semantic features without the supervision of segmentation labels. Additionally, we design a two-stage training strategy and a dynamic weight adjustment method for effective joint learning within our self-supervised framework. Extensive experiments on public datasets demonstrate the effectiveness of our proposed SSVIF. Remarkably, although trained only on unlabeled visible-infrared image pairs, our SSVIF outperforms traditional VIF methods and rivals supervised segmentation-oriented ones. Our code will be released upon acceptance.
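The core self-supervised signal is agreement between the two segmentation paths. One plausible loss form, a symmetric KL divergence between the feature-level and pixel-level predictions, is sketched below; the paper's exact formulation is not given here.

```python
import torch
import torch.nn.functional as F

def cross_seg_consistency(logits_feat, logits_pix):
    """Symmetric KL between feature-level-fusion and pixel-level-fusion
    segmentation predictions (hypothetical loss form).
    logits_*: (B, num_classes, H, W) raw segmentation logits."""
    p = F.log_softmax(logits_feat, dim=1)
    q = F.log_softmax(logits_pix, dim=1)
    return 0.5 * (F.kl_div(p, q.exp(), reduction='batchmean')
                  + F.kl_div(q, p.exp(), reduction='batchmean'))
```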
[358] CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin
Main category: cs.CV
TL;DR: CapRL applies Reinforcement Learning with Verifiable Rewards to image captioning, using a vision-free LLM’s question-answering accuracy as an objective reward metric to overcome limitations of supervised fine-tuning.
Details
Motivation: Current SFT-based captioning models rely on expensive human annotations, leading to memorization and limited generalization. The subjective nature of caption quality makes objective reward design challenging.
Method: Two-stage pipeline: LVLM generates caption, then vision-free LLM answers multiple-choice questions based on caption alone. Reward is derived from QA accuracy, defining caption quality by its utility for downstream tasks.
Result: CapRL significantly improves performance across 12 benchmarks. Pretraining on CapRL-5M dataset yields substantial gains. Achieves comparable performance to Qwen2.5-VL-72B with 8.4% average improvement over baseline in Prism Framework.
Conclusion: CapRL successfully applies RLVR to subjective image captioning, demonstrating that utility-based reward design enables training more generalizable and diverse captioning models without expensive human annotations.
Abstract: Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a “good” caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
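The reward itself is easy to state in code: caption utility is the accuracy of a vision-free LLM on multiple-choice questions given only the caption. In this sketch `caption_fn` and `qa_fn` are hypothetical wrappers around the LVLM and the judge LLM, not the paper's released API.

```python
def caprl_reward(image, questions, answers, caption_fn, qa_fn):
    """Utility-based caption reward: the fraction of MCQs a vision-free LLM
    answers correctly given only the generated caption.
    caption_fn(image) -> str; qa_fn(caption, question) -> chosen option."""
    caption = caption_fn(image)
    correct = sum(qa_fn(caption, q) == a for q, a in zip(questions, answers))
    return correct / max(len(questions), 1)
```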
[359] Bézier Meets Diffusion: Robust Generation Across Domains for Medical Image Segmentation
Chen Li, Meilong Xu, Xiaoling Hu, Weimin Lyu, Chao Chen
Main category: cs.CV
TL;DR: Proposes the Bézier Meets Diffusion framework for cross-domain medical image generation using Bézier-curve style transfer and conditional diffusion models to improve domain adaptation and segmentation performance.
Details
Motivation: Training robust learning algorithms across different medical imaging modalities is challenging due to large domain gaps. Existing GAN-based style transfer methods struggle with high variability regions.
Method: Uses Bézier-curve-based style transfer to reduce domain gap, trains segmentation model on transferred images, then uses pseudo-labels to train conditional diffusion model with uncertainty-guided score matching for robustness.
Result: Extensive experiments show the approach generates realistic labeled images, significantly augments target domain data, and improves segmentation performance on public datasets.
Conclusion: The unified framework effectively addresses domain adaptation challenges in medical imaging through combined Bézier style transfer and diffusion modeling, producing high-quality synthetic data that enhances segmentation across domains.
Abstract: Training robust learning algorithms across different medical imaging modalities is challenging due to the large domain gap. Unsupervised domain adaptation (UDA) mitigates this problem by using annotated images from the source domain and unlabeled images from the target domain to train the deep models. Existing approaches often rely on GAN-based style transfer, but these methods struggle to capture cross-domain mappings in regions with high variability. In this paper, we propose a unified framework, Bézier Meets Diffusion, for cross-domain image generation. First, we introduce a Bézier-curve-based style transfer strategy that effectively reduces the domain gap between source and target domains. The transferred source images enable the training of a more robust segmentation model across domains. Thereafter, using pseudo-labels generated by this segmentation model on the target domain, we train a conditional diffusion model (CDM) to synthesize high-quality, labeled target-domain images. To mitigate the impact of noisy pseudo-labels, we further develop an uncertainty-guided score matching method that improves the robustness of CDM training. Extensive experiments on public datasets demonstrate that our approach generates realistic labeled images, significantly augmenting the target domain and improving segmentation performance.
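Bézier-curve intensity remapping is a known lightweight style-transfer trick; a minimal version with endpoints pinned at 0 and 1, treating normalized intensity as the curve parameter, is sketched below. The control-point parameterization is an assumption, not the paper's exact transform.

```python
import numpy as np

def bezier_remap(x, p1=0.3, p2=0.7):
    """Push normalized intensities through a cubic Bézier curve whose
    endpoints are fixed at 0 and 1, with scalar control values p1, p2.
    Randomizing (p1, p2) per sample yields diverse intensity styles."""
    t = np.clip(x, 0.0, 1.0)
    return 3 * (1 - t) ** 2 * t * p1 + 3 * (1 - t) * t ** 2 * p2 + t ** 3
```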
[360] PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning
Xiangmo Zhao, Nan Yang, Yang Wang, Zhanwen Liu
Main category: cs.CV
TL;DR: PSTTS is a plug-and-play module that reduces computational overhead in event-based vision by identifying and removing spatio-temporal redundant tokens from event frame sequences, achieving significant efficiency gains without accuracy loss.
Details
Motivation: Existing event-based methods convert event streams to frame sequences but ignore spatial sparsity and inter-frame motion redundancy, causing computational inefficiency. Token sparsification methods for RGB videos don't work well for event data due to unreliable intermediate representations and event noise.
Method: PSTTS uses two stages: Spatial Token Purification removes noise and non-event regions by assessing spatio-temporal consistency within frames, and Temporal Token Selection evaluates motion pattern similarity between adjacent frames to remove redundant temporal information.
Result: Applied to four backbones on three datasets, PSTTS reduces FLOPs by 29-43.6% and increases FPS by 21.6-41.3% on DailyDVS-200 while maintaining accuracy.
Conclusion: PSTTS effectively addresses computational inefficiency in event-based vision by leveraging raw event data characteristics for token selection, achieving optimal accuracy-efficiency trade-off without additional parameters.
Abstract: Mainstream event-based spatio-temporal representation learning methods typically process event streams by converting them into sequences of event frames, achieving remarkable performance. However, they neglect the high spatial sparsity and inter-frame motion redundancy inherent in event frame sequences, leading to significant computational overhead. Existing token sparsification methods for RGB videos rely on unreliable intermediate token representations and neglect the influence of event noise, making them ineffective for direct application to event data. In this paper, we propose Progressive Spatio-Temporal Token Selection (PSTTS), a Plug-and-Play module for event data without introducing any additional parameters. PSTTS exploits the spatio-temporal distribution characteristics embedded in raw event data to effectively identify and discard spatio-temporal redundant tokens, achieving an optimal trade-off between accuracy and efficiency. Specifically, PSTTS consists of two stages, Spatial Token Purification and Temporal Token Selection. Spatial Token Purification discards noise and non-event regions by assessing the spatio-temporal consistency of events within each event frame to prevent interference with subsequent temporal redundancy evaluation. Temporal Token Selection evaluates the motion pattern similarity between adjacent event frames, precisely identifying and removing redundant temporal information. We apply PSTTS to four representative backbones (UniformerV2, VideoSwin, EVMamba, and ExACT) on the HARDVS, DailyDVS-200, and SeACT datasets. Experimental results demonstrate that PSTTS achieves significant efficiency improvements. Specifically, PSTTS reduces FLOPs by 29-43.6% and increases FPS by 21.6-41.3% on the DailyDVS-200 dataset, while maintaining task accuracy. Our code will be available.
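To illustrate the Temporal Token Selection idea, here is a simplified stand-in that scores each token's dissimilarity to the matching token in the previous frame and keeps the least redundant ones. The real module's scoring and the `keep_ratio` parameter are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_token_select(tokens, keep_ratio=0.6):
    """tokens: (T, N, C) per-frame token features. For each frame t > 0,
    drop tokens whose motion pattern barely differs from frame t-1 and
    keep the most dissimilar ones (simplified sketch)."""
    T, N, C = tokens.shape
    keep = [torch.arange(N)]                       # keep the first frame whole
    for t in range(1, T):
        sim = F.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)   # (N,)
        k = max(1, int(keep_ratio * N))
        keep.append((-sim).topk(k).indices)        # least similar = most informative
    return keep                                    # per-frame indices of retained tokens
```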
[361] Group Critical-token Policy Optimization for Autoregressive Image Generation
Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, Feng Zhao
Main category: cs.CV
TL;DR: GCPO is a method that identifies critical tokens in autoregressive visual generation and applies targeted policy optimization to improve training efficiency and performance.
Details
Motivation: Existing RLVR methods apply uniform optimization across all image tokens, ignoring that different tokens contribute differently to training effectiveness. The challenge is identifying which tokens are more critical and optimizing them specifically.
Method: GCPO identifies critical tokens from three perspectives: causal dependency (early tokens determine later ones), entropy-induced spatial structure (high entropy gradient tokens), and RLVR-focused token diversity (low visual similarity tokens). It then applies dynamic token-wise advantage weighting for these critical tokens.
Result: GCPO achieves better performance than GRPO using only 30% of image tokens, demonstrating effectiveness across multiple text-to-image benchmarks for both AR models and unified multimodal models.
Conclusion: Targeted optimization of critical tokens in autoregressive visual generation significantly improves training efficiency and performance compared to uniform optimization across all tokens.
Abstract: Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens for RLVR’s training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose Group Critical-token Policy Optimization (GCPO), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: (1) Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; (2) Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions; (3) RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model. By leveraging 30% of the image tokens, GCPO achieves better performance than GRPO with full tokens. Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.
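Of GCPO's three criteria, the entropy-based one is the easiest to sketch: keep the top fraction of highest-entropy image tokens for targeted optimization. The helper below is illustrative only; the paper combines this cue with causal position and cross-sample diversity, plus a dynamic advantage weight.

```python
import torch

def critical_token_mask(entropy, keep_ratio=0.3):
    """entropy: per-token policy entropy, any shape. Returns a boolean mask
    selecting the top keep_ratio fraction of highest-entropy tokens
    (one of GCPO's three criteria; sketch)."""
    flat = entropy.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[flat.topk(k).indices] = True
    return mask.view_as(entropy)
```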
[362] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao
Main category: cs.CV
TL;DR: EAGLE is a lightweight black-box framework that explains token generation in multimodal LLMs by attributing tokens to compact perceptual regions and quantifying language vs. perceptual influence.
Details
Motivation: Current MLLMs lack understanding of how generated tokens depend on visual modalities, limiting interpretability and reliability.
Method: EAGLE uses an objective function unifying sufficiency and indispensability scores, optimized via greedy search over sparsified image regions for efficient attribution. It performs modality-aware analysis to disentangle token dependencies.
Result: EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis while requiring substantially less GPU memory.
Conclusion: EAGLE effectively advances MLLM interpretability through faithful and efficient attribution of token generation to visual inputs.
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.
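The greedy search over sparsified regions can be sketched generically: repeatedly add the region that most improves the combined sufficiency/indispensability score. Here `score_fn` is a hypothetical black-box scorer that masks the image to the given region subset and queries the MLLM; it is not EAGLE's released API.

```python
def eagle_attribution(num_regions, score_fn, k=5):
    """Greedy search for a small set of image regions maximizing a combined
    sufficiency ('insight') + indispensability ('necessity') objective.
    score_fn(subset: list[int]) -> float is a hypothetical black-box scorer."""
    selected = []
    for _ in range(k):
        rest = [r for r in range(num_regions) if r not in selected]
        if not rest:
            break
        best = max(rest, key=lambda r: score_fn(selected + [r]))
        selected.append(best)
    return selected   # region indices attributed to the chosen tokens
```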
[363] Color Names in Vision-Language Models
Alexandra Gomez-Villa, Pablo Hernández-Cámara, Muhammad Atif Butt, Valero Laparra, Jesus Malo, Javier Vazquez-Corral
Main category: cs.CV
TL;DR: This paper presents the first systematic evaluation of color naming capabilities in vision-language models (VLMs), showing they perform well on prototypical colors but struggle with non-prototypical colors, revealing training imbalances across languages and architectural influences.
Details
Motivation: Understanding whether VLMs name colors like humans is crucial for effective human-AI interaction, as color is a fundamental dimension of visual perception and communication.
Method: Replicated classic color naming methodologies using 957 color samples across five representative VLMs, conducted cross-linguistic analysis across nine languages, and performed ablation studies on language model architecture.
Result: VLMs achieve high accuracy on prototypical colors but performance drops significantly on expanded, non-prototypical color sets. Identified 21 common color terms across all models, with constrained models using basic terms and expansive models using systematic lightness modifiers. Cross-linguistic analysis revealed severe training imbalances favoring English and Chinese, with hue as the primary driver of color naming decisions.
Conclusion: Language model architecture significantly influences color naming independent of visual processing capabilities, highlighting the need for more balanced training across languages and better handling of non-prototypical colors in VLMs.
Abstract: Color serves as a fundamental dimension of human visual perception and a primary means of communicating about objects and scenes. As vision-language models (VLMs) become increasingly prevalent, understanding whether they name colors like humans is crucial for effective human-AI interaction. We present the first systematic evaluation of color naming capabilities across VLMs, replicating classic color naming methodologies using 957 color samples across five representative models. Our results show that while VLMs achieve high accuracy on prototypical colors from classical studies, performance drops significantly on expanded, non-prototypical color sets. We identify 21 common color terms that consistently emerge across all models, revealing two distinct approaches: constrained models using predominantly basic terms versus expansive models employing systematic lightness modifiers. Cross-linguistic analysis across nine languages demonstrates severe training imbalances favoring English and Chinese, with hue serving as the primary driver of color naming decisions. Finally, ablation studies reveal that language model architecture significantly influences color naming independent of visual processing capabilities.
[364] EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model
Andrii Litvynchuk, Ivan Livinsky, Anand Ravi, Nima Kalantari, Andrii Tsarov
Main category: cs.CV
TL;DR: EfficientDepth is a monocular depth estimation system that combines transformer architecture with lightweight convolutional decoder and bimodal density head to achieve geometric consistency, fine details, and efficiency for edge devices.
Details
Motivation: Existing MDE methods fail to meet requirements for 3D reconstruction and view synthesis, including geometric consistency, fine details, robustness to real-world challenges, and efficiency for edge devices.
Method: Combines transformer architecture with lightweight convolutional decoder and bimodal density head; trained on labeled synthetic/real images and pseudo-labeled real images; uses multi-stage optimization strategy and LPIPS-based loss function.
Result: Achieves performance comparable to or better than state-of-the-art models with significantly reduced computational resources.
Conclusion: EfficientDepth successfully addresses key challenges in monocular depth estimation while maintaining computational efficiency suitable for edge devices.
Abstract: Monocular depth estimation (MDE) plays a pivotal role in various computer vision applications, such as robotics, augmented reality, and autonomous driving. Despite recent advancements, existing methods often fail to meet key requirements for 3D reconstruction and view synthesis, including geometric consistency, fine details, robustness to real-world challenges like reflective surfaces, and efficiency for edge devices. To address these challenges, we introduce a novel MDE system, called EfficientDepth, which combines a transformer architecture with a lightweight convolutional decoder, as well as a bimodal density head that allows the network to estimate detailed depth maps. We train our model on a combination of labeled synthetic and real images, as well as pseudo-labeled real images, generated using a high-performing MDE method. Furthermore, we employ a multi-stage optimization strategy to improve training efficiency and produce models that emphasize geometric consistency and fine detail. Finally, in addition to commonly used objectives, we introduce a loss function based on LPIPS to encourage the network to produce detailed depth maps. Experimental results demonstrate that EfficientDepth achieves performance comparable to or better than existing state-of-the-art models, with significantly reduced computational resources.
[365] Category Discovery: An Open-World Perspective
Zhenqi He, Yuanpei Liu, Kai Han
Main category: cs.CV
TL;DR: This paper provides a comprehensive survey of category discovery (CD) methods, categorizing them into novel category discovery (NCD) and generalized category discovery (GCD) settings, and analyzing key components including representation learning, label assignment, and class number estimation.
Details
Motivation: Category discovery is an emerging open-world learning task that aims to automatically categorize unlabeled data containing instances from unseen classes, given some labeled data from seen classes. The field has seen significant growth but lacks systematic organization and analysis.
Method: The survey organizes literature into a taxonomy with two base settings (NCD and GCD) and several derived settings for real-world scenarios. It analyzes methods through three fundamental components: representation learning, label assignment, and estimation of class number.
Result: Benchmarking reveals that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training benefit category discovery. Challenges remain in label assignment design, class number estimation, and scaling to complex multi-object scenarios.
Conclusion: The survey provides key insights from current literature and identifies promising future research directions, while maintaining a living survey repository for ongoing updates in the category discovery field.
Abstract: Category discovery (CD) is an emerging open-world learning task, which aims at automatically categorizing unlabelled data containing instances from unseen classes, given some labelled data from seen classes. This task has attracted significant attention over the years and leads to a rich body of literature trying to address the problem from different perspectives. In this survey, we provide a comprehensive review of the literature, and offer detailed analysis and in-depth discussion on different methods. Firstly, we introduce a taxonomy for the literature by considering two base settings, namely novel category discovery (NCD) and generalized category discovery (GCD), and several derived settings that are designed to address the extra challenges in different real-world application scenarios, including continual category discovery, skewed data distribution, federated category discovery, etc. Secondly, for each setting, we offer a detailed analysis of the methods encompassing three fundamental components, representation learning, label assignment, and estimation of class number. Thirdly, we benchmark all the methods and distill key insights showing that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training are all beneficial for category discovery, while challenges remain in the design of label assignment, the estimation of class numbers, and scaling to complex multi-object scenarios. Finally, we discuss the key insights from the literature so far and point out promising future research directions. We compile a living survey of the category discovery literature at https://github.com/Visual-AI/Category-Discovery.
[366] HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection
Mohammad Mahdi Hemmatyar, Mahdi Jafari, Mohammad Amin Yousefi, Mohammad Reza Nemati, Mobin Azadani, Hamid Reza Rastad, Amirmohammad Akbari
Main category: cs.CV
TL;DR: HyCoVAD is a hybrid SSL-LLM model for complex video anomaly detection that combines self-supervised learning for temporal analysis with LLM validation for semantic reasoning, achieving 72.5% AUC on ComplexVAD dataset.
Details
Motivation: Existing methods struggle with complex anomalies involving intricate relationships and temporal dependencies. SSL methods lack semantic understanding, while LLMs are computationally expensive and lack spatial localization.
Method: Hybrid approach combining multi-task SSL temporal analyzer (using nnFormer backbone) with LLM validator. SSL identifies suspicious frames, then LLM applies structured rule-based reasoning for semantic validation.
Result: Achieves 72.5% frame-level AUC on ComplexVAD dataset, outperforming baselines by 12.5% while reducing LLM computation.
Conclusion: HyCoVAD effectively addresses complex video anomaly detection by leveraging complementary strengths of SSL and LLMs, with released taxonomy and tools for future research.
Abstract: Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies, which are events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD, Hybrid Complex Video Anomaly Detection, a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with an LLM validator. The SSL module is built upon an nnFormer backbone, a transformer-based model for image segmentation. It is trained with multiple proxy tasks and learns from video frames to identify those suspected of containing anomalies. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.
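The division of labor is simple to express in code: the cheap SSL analyzer screens every frame, and only flagged frames reach the expensive LLM validator. In this sketch `ssl_score` and `llm_validate` are hypothetical model wrappers, and the threshold is illustrative.

```python
def hycovad_pipeline(frames, ssl_score, llm_validate, thr=0.7):
    """Two-stage detection sketch: ssl_score(frame) -> float anomaly score;
    llm_validate(frame) -> bool semantic confirmation. Only frames the SSL
    analyzer flags are sent to the LLM, keeping LLM computation low."""
    suspects = [f for f in frames if ssl_score(f) > thr]
    return [f for f in suspects if llm_validate(f)]
```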
[367] JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, Xing Wei
Main category: cs.CV
TL;DR: JanusVLN is a novel Vision-and-Language Navigation framework that uses dual implicit neural memory to model spatial-geometric and visual-semantic information as compact neural representations, achieving state-of-the-art performance while avoiding computational redundancy and memory bloat.
Details
Motivation: Current VLN methods relying on explicit semantic memory (textual cognitive maps or stored visual frames) suffer from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. The work is inspired by human navigation's implicit scene representation, analogous to left brain semantic understanding and right brain spatial cognition.
Method: Proposes JanusVLN with dual implicit neural memory: 1) Extends MLLM to incorporate 3D prior knowledge from spatial-geometric encoder, enhancing spatial reasoning from RGB input; 2) Constructs dual implicit memory using historical key-value caches from both encoders; 3) Retains only KVs of tokens in initial and sliding window to avoid redundant computation and enable efficient incremental updates.
Result: Outperforms over 20 recent methods to achieve SOTA performance. Success rate improves by 10.5-35.5 compared to methods using multiple data types as input, and by 3.6-10.8 compared to methods using more RGB training data.
Conclusion: The dual implicit neural memory serves as a novel paradigm that explores promising new directions for future VLN research, demonstrating that compact neural representations can effectively capture both spatial and semantic information for efficient navigation.
Abstract: Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain’s semantic understanding and the right brain’s spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research. Our project page: https://miv-xjtu.github.io/JanusVLN.github.io/.
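The cache-retention rule (keep the initial tokens plus a sliding window of recent ones) can be shown in miniature. Real KV caches are per-layer tensors rather than Python lists, and the sizes below are illustrative.

```python
def prune_kv_cache(kv, sink=32, window=512):
    """Keep only the key/value entries of the initial `sink` tokens plus the
    most recent `window` tokens, so the memory stays fixed-size while the
    episode grows (list-based sketch of the retention rule)."""
    if len(kv) <= sink + window:
        return kv
    return kv[:sink] + kv[-window:]
```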
[368] SpikeMatch: Semi-Supervised Learning with Temporal Dynamics of Spiking Neural Networks
Jini Yang, Beomseok Oh, Seungryong Kim, Sunok Kim
Main category: cs.CV
TL;DR: SpikeMatch is the first semi-supervised learning framework for spiking neural networks that uses temporal dynamics and co-training to generate reliable pseudo-labels from unlabeled data.
Details
Motivation: Semi-supervised learning methods for spiking neural networks are underexplored compared to artificial neural networks, despite SNNs' biological plausibility and energy efficiency advantages.
Method: Uses temporal dynamics through SNN leakage factors for diverse pseudo-labeling in a co-training framework. Generates reliable pseudo-labels from weakly-augmented unlabeled samples to train on strongly-augmented ones, mitigating confirmation bias.
Result: Outperforms existing SSL methods adapted to SNN backbones across various standard benchmarks.
Conclusion: SpikeMatch successfully demonstrates the effectiveness of leveraging SNN temporal dynamics for semi-supervised learning, addressing the gap in SSL methods for spiking neural networks.
Abstract: Spiking neural networks (SNNs) have recently been attracting significant attention for their biological plausibility and energy efficiency, but semi-supervised learning (SSL) methods for SNN-based models remain underexplored compared to those for artificial neural networks (ANNs). In this paper, we introduce SpikeMatch, the first SSL framework for SNNs that leverages the temporal dynamics through the leakage factor of SNNs for diverse pseudo-labeling within a co-training framework. By utilizing agreement among multiple predictions from a single SNN, SpikeMatch generates reliable pseudo-labels from weakly-augmented unlabeled samples to train on strongly-augmented ones, effectively mitigating confirmation bias by capturing discriminative features with limited labels. Experiments show that SpikeMatch outperforms existing SSL methods adapted to SNN backbones across various standard benchmarks.
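Agreement-based pseudo-labeling over an SNN's timesteps might look like the sketch below: accept a pseudo-label only when the time-averaged prediction is confident and most timesteps agree with it. Both thresholds are illustrative, not the paper's values.

```python
import torch

def agreement_pseudo_labels(probs, conf_thr=0.95, agree_thr=0.8):
    """probs: (T, B, C) softmax outputs across T timesteps of one SNN on
    weakly-augmented inputs. Returns pseudo-labels and a mask of samples
    to train on with strongly-augmented views (sketch)."""
    mean = probs.mean(dim=0)                                      # (B, C) time-averaged
    conf, label = mean.max(dim=1)
    agree = (probs.argmax(dim=-1) == label).float().mean(dim=0)   # fraction of steps agreeing
    mask = (conf > conf_thr) & (agree > agree_thr)
    return label, mask
```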
[369] Hierarchical Representation Matching for CLIP-based Class-Incremental Learning
Zhen-Hao Wen, Yan Wang, Ji Feng, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou
Main category: cs.CV
TL;DR: HERMAN introduces hierarchical representation matching for CLIP-based class-incremental learning, using LLM-generated descriptors to capture both coarse and fine-grained visual concepts, achieving state-of-the-art performance.
Details
Motivation: Existing CLIP-based CIL methods use simplistic templates that ignore hierarchical visual concepts and rely only on last-layer representations, missing valuable hierarchical information from earlier layers.
Method: Leverages LLMs to recursively generate discriminative textual descriptors, matches them to different semantic hierarchy levels, and adaptively routes them based on task requirements to enable precise discrimination while reducing catastrophic forgetting.
Result: Extensive experiments on multiple benchmarks demonstrate consistent state-of-the-art performance.
Conclusion: HERMAN effectively addresses hierarchical concept learning in CIL by augmenting semantic space with explicit hierarchical cues and leveraging multi-layer representations, achieving superior performance.
Abstract: Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task. However, existing approaches often rely on simplistic templates, such as “a photo of a [CLASS]”, which overlook the hierarchical nature of visual concepts. For example, recognizing “cat” versus “car” depends on coarse-grained cues, while distinguishing “cat” from “lion” requires fine-grained details. Similarly, the current feature mapping in CLIP relies solely on the representation from the last layer, neglecting the hierarchical information contained in earlier layers. In this work, we introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL. Our approach leverages LLMs to recursively generate discriminative textual descriptors, thereby augmenting the semantic space with explicit hierarchical cues. These descriptors are matched to different levels of the semantic hierarchy and adaptively routed based on task-specific requirements, enabling precise discrimination while alleviating catastrophic forgetting in incremental tasks. Extensive experiments on multiple benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
[370] LongLive: Real-time Interactive Long Video Generation
Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
Main category: cs.CV
TL;DR: LongLive is a frame-level autoregressive framework for real-time long video generation that addresses efficiency and quality challenges through KV-recache mechanism, streaming long tuning, and frame sink attention.
Details
Motivation: To overcome the limitations of existing methods: diffusion models have low efficiency due to bidirectional attention, while causal AR models degrade in quality on long videos and lack interactive capabilities for dynamic content creation.
Method: Uses causal frame-level AR design with KV-recache mechanism for prompt switching, streaming long tuning for long video training, and short window attention with frame sink for long-range consistency.
Result: Achieves 20.7 FPS on single H100 GPU, supports up to 240-second videos, strong VBench performance on both short and long videos, and INT8-quantized inference with minimal quality loss. Fine-tunes 1.3B model to minute-long generation in 32 GPU-days.
Conclusion: LongLive provides an efficient, high-quality solution for real-time interactive long video generation with strong performance metrics and practical deployment capabilities.
Abstract: We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink (frame sink for short), preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
[371] SPARK: Synergistic Policy And Reward Co-Evolving Framework
Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
Main category: cs.CV
TL;DR: SPARK is a synergistic framework that recycles rollouts and correctness data to train LLMs/LVLMs as their own reward models, eliminating need for separate reward models and human preference data while creating a co-evolving feedback loop between policy and reward.
Details
Motivation: Address limitations of RLHF (high costs, reward-policy mismatch) and RLVR (wasted supervision from discarded rollouts) by developing a more efficient approach that leverages existing data.
Method: Recycles rollouts and correctness data to simultaneously train the model as a generative reward model using mixed objectives (pointwise reward, pairwise comparison, reflection-conditioned evaluation), creating a co-evolving feedback loop between policy and reward.
Result: Significant performance gains: SPARK-VL-7B achieved 9.7% average gain on reasoning benchmarks, 12.1% on reward benchmarks, and 1.5% on general benchmarks over baselines, demonstrating robustness and broad generalization.
Conclusion: SPARK provides an efficient, unified framework that eliminates the need for separate reward models and human preference data while achieving substantial performance improvements across multiple benchmarks and model types.
Abstract: Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward score, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models across reasoning, reward, and general benchmarks. For example, SPARK-VL-7B achieves an average 9.7% gain on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
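As a rough illustration of the recycled-rollout idea, here is a hedged sketch of what the mixed auxiliary reward objectives could look like, assuming a scalar reward head and verifiable 0/1 correctness labels. The function name and unit loss weighting are assumptions, not the paper's exact formulation.

```python
# Sketch: reuse verifiable correctness labels from the RLVR stage to train a
# reward head with mixed objectives (pointwise scoring + pairwise ranking).
import torch
import torch.nn.functional as F

def auxiliary_reward_loss(scores: torch.Tensor, correct: torch.Tensor) -> torch.Tensor:
    """scores: (N,) scalar reward-head outputs for N recycled rollouts.
    correct: (N,) 0/1 verifiable-correctness labels for the same rollouts."""
    # Pointwise objective: score each rollout's correctness directly.
    pointwise = F.binary_cross_entropy_with_logits(scores, correct.float())
    # Pairwise objective: every correct rollout should outrank every incorrect one.
    pos, neg = scores[correct.bool()], scores[~correct.bool()]
    if len(pos) and len(neg):
        margin = pos.unsqueeze(1) - neg.unsqueeze(0)  # (P, Q) score gaps
        pairwise = -F.logsigmoid(margin).mean()       # Bradley-Terry-style ranking
    else:
        pairwise = scores.new_zeros(())
    return pointwise + pairwise

scores = torch.randn(6, requires_grad=True)
labels = torch.tensor([1, 0, 1, 1, 0, 0])
loss = auxiliary_reward_loss(scores, labels)
loss.backward()
print(float(loss))
```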
[372] CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach
Alexandre Lopes, Roberto Souza, Helio Pedrini
Main category: cs.CV
TL;DR: CCNeXt is a novel self-supervised CNN architecture for depth estimation that outperforms existing CNNs and ViTs while being computationally efficient, achieving state-of-the-art results on multiple datasets.
Details
Motivation: Depth estimation is crucial for robotics, autonomous vehicles, and AR applications that operate under computational constraints. Stereo image pairs provide an effective solution, but acquiring reliable ground-truth depth data is difficult, making self-supervised approaches valuable.
Method: Proposed CCNeXt architecture uses a modern CNN feature extractor with a novel windowed epipolar cross-attention module in the encoder, along with a comprehensive redesign of the depth estimation decoder.
Result: CCNeXt achieves competitive metrics on KITTI Eigen Split test data while being 10.18× faster than current best models, and achieves state-of-the-art results on KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets.
Conclusion: CCNeXt provides an efficient and effective self-supervised solution for depth estimation that balances computational cost with performance, outperforming both CNN and ViT approaches.
Abstract: Depth estimation plays a crucial role in recent applications in robotics, autonomous vehicles, and augmented reality. These scenarios commonly operate under constraints imposed by computational power. Stereo image pairs offer an effective solution for depth estimation, since they only require estimating the disparity of pixels between the image pair to determine the depth in a known rectified system. Due to the difficulty in acquiring reliable ground-truth depth data across diverse scenarios, self-supervised techniques emerge as a solution, particularly when large unlabeled datasets are available. We propose a novel self-supervised convolutional approach that outperforms existing state-of-the-art Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) while balancing computational cost. The proposed CCNeXt architecture employs a modern CNN feature extractor with a novel windowed epipolar cross-attention module in the encoder, complemented by a comprehensive redesign of the depth estimation decoder. Our experiments demonstrate that CCNeXt achieves competitive metrics on the KITTI Eigen Split test data while being 10.18× faster than the current best model, and achieves state-of-the-art results in all metrics on the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets when compared to recently proposed techniques. To ensure complete reproducibility, our project is accessible at https://github.com/alelopes/CCNext.
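For rectified stereo, correspondences lie along the same image row, so a windowed epipolar cross-attention can restrict each left-view query to a horizontal band of right-view keys. The sketch below illustrates that standard construction under assumed shapes; it is not the released CCNeXt module.

```python
# Minimal sketch of windowed epipolar cross-attention for rectified stereo:
# each left-view pixel attends only to right-view features within a
# horizontal window on its own row (the epipolar line).
import torch

def windowed_epipolar_cross_attention(left, right, window: int = 16):
    """left, right: (B, C, H, W) feature maps from the two rectified views."""
    B, C, H, W = left.shape
    q = left.permute(0, 2, 3, 1)                   # (B, H, W, C) queries
    k = right.permute(0, 2, 1, 3)                  # (B, H, C, W) keys per row
    # Row-wise attention logits: each left pixel vs. each right pixel on its row.
    logits = torch.einsum('bhwc,bhcx->bhwx', q, k) / C ** 0.5  # (B, H, W, W)
    # Epipolar window: mask right-view positions farther than `window` columns.
    cols = torch.arange(W)
    mask = (cols[None, :] - cols[:, None]).abs() > window      # (W, W)
    logits = logits.masked_fill(mask, float('-inf'))
    attn = torch.softmax(logits, dim=-1)
    v = right.permute(0, 2, 3, 1)                  # (B, H, W, C) values
    out = torch.einsum('bhwx,bhxc->bhwc', attn, v)
    return out.permute(0, 3, 1, 2)                 # back to (B, C, H, W)

left, right = torch.randn(2, 32, 24, 80), torch.randn(2, 32, 24, 80)
print(windowed_epipolar_cross_attention(left, right).shape)  # (2, 32, 24, 80)
```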
[373] UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning
Hongyu Chen, Guangrun Wang
Main category: cs.CV
TL;DR: UML-CoT is a structured reasoning framework that uses Unified Modeling Language (UML) to create symbolic chain-of-thought reasoning and executable action plans, outperforming traditional unstructured CoT in embodied tasks.
Details
Motivation: Traditional Chain-of-Thought prompting relies on unstructured text, which limits interpretability and executability in embodied tasks. Existing structured CoTs using scene or logic graphs are limited to low-order relations and lack constructs for inheritance, behavioral abstraction, and standardized planning semantics.
Method: UML-CoT uses UML class diagrams to capture compositional object semantics and activity diagrams to model procedural control flow. The approach employs a three-stage training pipeline combining supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data.
Result: Evaluated on MRoom-30k benchmark for cluttered room-cleaning scenarios, UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success.
Conclusion: UML provides a more expressive and actionable structured reasoning formalism compared to existing approaches, demonstrating superior performance in embodied reasoning tasks.
Abstract: Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), but its reliance on unstructured text limits interpretability and executability in embodied tasks. Prior work has explored structured CoTs using scene or logic graphs, yet these remain fundamentally limited: they model only low-order relations, lack constructs like inheritance or behavioral abstraction, and provide no standardized semantics for sequential or conditional planning. We propose UML-CoT, a structured reasoning and planning framework that leverages Unified Modeling Language (UML) to generate symbolic CoTs and executable action plans. UML class diagrams capture compositional object semantics, while activity diagrams model procedural control flow. Our three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data. We evaluate UML-CoT on MRoom-30k, a new benchmark of cluttered room-cleaning scenarios. UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success, highlighting UML as a more expressive and actionable structured reasoning formalism.
[374] Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance
Luc Boudier, Loris Manganelli, Eleftherios Tsonis, Nicolas Dufour, Vicky Kalogeiton
Main category: cs.CV
TL;DR: DIPSY is a training-free few-shot image classification method that uses IP-Adapter for image-to-image translation to generate discriminative synthetic images without model fine-tuning or external tools.
Details
Motivation: Few-shot image classification is challenging due to limited labeled data. Existing text-to-image diffusion methods require extensive fine-tuning or external information sources, which DIPSY aims to eliminate.
Method: Uses IP-Adapter for image-to-image translation with three innovations: extended classifier-free guidance for independent positive/negative conditioning, class similarity-based sampling for contrastive examples, and a simple pipeline requiring no fine-tuning or external captioning/filtering.
Result: Achieves state-of-the-art or comparable performance across ten benchmark datasets, particularly effective for fine-grained classification tasks.
Conclusion: DIPSY demonstrates the effectiveness of dual image prompting with positive-negative guidance for generating class-discriminative features without requiring generative model adaptation or external tools.
Abstract: Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.
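The extended classifier-free guidance of innovation (1) can be illustrated with plain guidance arithmetic. A minimal sketch, assuming the usual CFG formulation with independent scales for the positive and negative image conditions; the weights and denoiser interface are assumptions, not DIPSY's code.

```python
# Sketch of dual-image classifier-free guidance: independent scales pull the
# prediction toward the positive reference class and away from a contrastive one.
import torch

def dual_image_cfg(eps_uncond, eps_pos, eps_neg, w_pos: float = 5.0, w_neg: float = 2.0):
    """Each eps_* is the denoiser's noise prediction under one conditioning:
    none, the positive reference image, or the negative contrastive image."""
    return (eps_uncond
            + w_pos * (eps_pos - eps_uncond)   # pull toward the target class
            - w_neg * (eps_neg - eps_uncond))  # push away from the confusable class

e_u, e_p, e_n = torch.randn(3, 1, 4, 8, 8)
print(dual_image_cfg(e_u, e_p, e_n).shape)  # torch.Size([1, 4, 8, 8])
```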
[375] Scale-Wise VAR is Secretly Discrete Diffusion
Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel
Main category: cs.CV
TL;DR: The paper shows that Visual Autoregressive Generation (VAR) transformers with Markovian attention masks are mathematically equivalent to discrete diffusion models, enabling the integration of diffusion advantages like iterative refinement into VAR for improved efficiency and performance.
Details
Motivation: To bridge autoregressive transformers and diffusion models by uncovering their mathematical equivalence, allowing VAR to benefit from diffusion model advantages while maintaining its computational efficiency and scalability.
Method: Reinterpret VAR with Markovian attention masks as Scalable Visual Refinement with Discrete Diffusion (SRDD), importing diffusion techniques like iterative refinement into the VAR framework to reduce architectural inefficiencies.
Result: The diffusion-based perspective of VAR leads to faster convergence, lower inference cost, improved zero-shot reconstruction, and consistent gains in efficiency and generation across multiple datasets.
Conclusion: Establishing the mathematical equivalence between VAR and discrete diffusion provides a principled bridge that enables combining the strengths of both approaches, yielding more efficient and effective visual generation models.
Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency, and unified architecture with language and vision. Among them, next-scale-prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion process. We term this reinterpretation Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR while reducing its architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion-based perspective of VAR leads to consistent gains in efficiency and generation.
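The Markovian attention mask at the heart of the equivalence is easy to picture in code: each scale's tokens attend within their own scale and to the immediately preceding scale only, rather than to all coarser scales. A small sketch with made-up scale sizes:

```python
# Build a block-wise Markovian attention mask over scale-wise token groups.
import torch

def markovian_scale_mask(scale_lens):
    """scale_lens: tokens per scale, e.g. [1, 4, 16] for 1x1, 2x2, 4x4 grids.
    Returns a (T, T) boolean mask; True = attention allowed."""
    T = sum(scale_lens)
    mask = torch.zeros(T, T, dtype=torch.bool)
    starts = [0]
    for n in scale_lens:
        starts.append(starts[-1] + n)
    for s in range(len(scale_lens)):
        a, b = starts[s], starts[s + 1]
        mask[a:b, a:b] = True                        # attend within own scale
        if s > 0:                                    # ...and to the previous scale only
            mask[a:b, starts[s - 1]:starts[s]] = True
    return mask

print(markovian_scale_mask([1, 4, 16]).shape)  # torch.Size([21, 21])
```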
[376] RefAM: Attention Magnets for Zero-Shot Referral Segmentation
Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari, Jan Eric Lenssen, Bernt Schiele
Main category: cs.CV
TL;DR: A training-free method that uses diffusion transformer features for referring segmentation without architectural changes or fine-tuning, achieving state-of-the-art performance.
Details
Motivation: Existing referring segmentation methods require fine-tuning or multiple pre-trained models, while diffusion models contain rich semantic information that can be directly exploited.
Method: Extracts features and attention scores from diffusion transformers, filters stop words as attention magnets, handles global attention sinks, and uses attention redistribution with appended stop words to create sharper grounding maps.
Result: Outperforms prior methods across zero-shot referring image and video segmentation benchmarks without fine-tuning or additional components.
Conclusion: RefAM framework demonstrates that diffusion transformer features can be effectively leveraged for vision-language grounding tasks, establishing new state-of-the-art performance in training-free settings.
Abstract: Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features and attention scores from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach consistently outperforms prior methods, establishing a new state of the art without fine-tuning or additional components.
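The stop-word filtering step lends itself to a short sketch: per-token cross-attention maps for stop words are treated as attention magnets and discarded before the remaining (content-word) maps are averaged into a grounding heatmap. The map shapes, tokenization, and normalization below are illustrative assumptions.

```python
# Sketch: drop stop-word attention maps, average the rest into a grounding map.
import torch

def grounding_map(cross_attn: torch.Tensor, tokens: list,
                  stop_words=frozenset({'a', 'an', 'the', 'of', 'on'})):
    """cross_attn: (T, H, W), one spatial attention map per text token."""
    keep = [i for i, tok in enumerate(tokens) if tok.lower() not in stop_words]
    filtered = cross_attn[keep].mean(dim=0)            # average content-token maps
    filtered -= filtered.min()
    return filtered / filtered.max().clamp(min=1e-8)   # normalize to [0, 1]

attn = torch.rand(5, 32, 32)
print(grounding_map(attn, ['the', 'cat', 'on', 'a', 'sofa']).shape)  # (32, 32)
```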
[377] Multi-View Hypercomplex Learning for Breast Cancer Screening
Eleonora Lopez, Eleonora Grassucci, Danilo Comminiello
Main category: cs.CV
TL;DR: Proposes multi-view hypercomplex learning using parameterized hypercomplex neural networks (PHNNs) for breast cancer classification, with architectures PHResNets (2-view), PHYBOnet (efficient 4-view), and PHYSEnet (accurate 4-view), outperforming state-of-the-art models.
Details
Motivation: Current multi-view mammography analysis methods suffer from view dominance, training instability, and computational overhead when capturing dependencies between views, which are crucial for accurate diagnosis.
Method: Uses hypercomplex algebra in parameterized hypercomplex neural networks (PHNNs) to intrinsically capture intra- and inter-view relations, proposing specific architectures for 2-view (PHResNets) and 4-view (PHYBOnet, PHYSEnet) exams.
Result: The approach consistently outperforms state-of-the-art multi-view models and generalizes across radiographic modalities and tasks including chest X-ray disease classification and brain tumor segmentation.
Conclusion: Multi-view hypercomplex learning provides an effective solution for multi-view medical image analysis, addressing key limitations of existing fusion methods while achieving superior performance and generalization.
Abstract: Radiologists interpret mammography exams by jointly analyzing all four views, as correlations among them are crucial for accurate diagnosis. Recent methods employ dedicated fusion blocks to capture such dependencies, but these are often hindered by view dominance, training instability, and computational overhead. To address these challenges, we introduce multi-view hypercomplex learning, a novel learning paradigm for multi-view breast cancer classification based on parameterized hypercomplex neural networks (PHNNs). Thanks to hypercomplex algebra, our models intrinsically capture both intra- and inter-view relations. We propose PHResNets for two-view exams and two complementary four-view architectures: PHYBOnet, optimized for efficiency, and PHYSEnet, optimized for accuracy. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art multi-view models, while also generalizing across radiographic modalities and tasks such as disease classification from chest X-rays and multimodal brain tumor segmentation. Full code and pretrained models are available at https://github.com/ispamm/PHBreast.
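The parameterized hypercomplex layer that PHNNs build on is typically written as a sum of Kronecker products, W = Σᵢ Aᵢ ⊗ Sᵢ, which lets the learned algebra share parameters across the n "views". A minimal sketch of such a layer, with n = 4 loosely mirroring a four-view exam; all dimensions are illustrative.

```python
# Minimal parameterized hypercomplex (PHM) linear layer:
# the weight matrix is a sum of Kronecker products, W = sum_i kron(A_i, S_i).
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, n: int = 4):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.1)  # learned algebra rules
        self.S = nn.Parameter(torch.randn(n, out_features // n, in_features // n) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Build W of shape (out_features, in_features) from the Kronecker sum.
        W = sum(torch.kron(self.A[i], self.S[i]) for i in range(self.A.shape[0]))
        return x @ W.T

layer = PHMLinear(16, 8, n=4)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 8])
```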
[378] Leveraging Model Guidance to Extract Training Data from Personalized Diffusion Models
Xiaoyu Wu, Jiaru Zhang, Zhiwei Steven Wu
Main category: cs.CV
TL;DR: FineXtract is a framework that extracts fine-tuning data from diffusion models by modeling fine-tuning as a distribution shift and using extrapolation to recover training images.
Details
Motivation: To address data leakage risks and copyright concerns when fine-tuned diffusion models are shared online, as model owners may overlook the potential for training data extraction.
Method: Models fine-tuning as a gradual distribution shift from pretrained to fine-tuned data, uses extrapolation between pre- and post-fine-tuning models to guide generation toward fine-tuning data distribution, and applies clustering to extract most probable images.
Result: Successfully extracted about 20% of fine-tuning data from models trained on WikiArt, DreamBooth, and real-world online checkpoints.
Conclusion: Fine-tuned diffusion models can leak their training data, posing privacy and copyright risks, and FineXtract provides an effective method for detecting such data extraction.
Abstract: Diffusion Models (DMs) have become powerful image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small image set to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the data leakage risks when releasing fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: “Can training data be extracted from these fine-tuned DMs shared online?” A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model’s learned distribution – from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets including WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting about 20% of fine-tuning data in most cases. The code is available at https://github.com/Nicholas0228/FineXtract.
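The extrapolation between the pre- and post-fine-tuning models reads naturally as a guidance rule. A hedged sketch, assuming noise-prediction models and a guidance scale w > 1; the exact form and scale are assumptions, not the paper's code.

```python
# Sketch: extrapolate past the fine-tuned model, away from the pretrained one,
# to steer sampling toward high-probability regions of the fine-tuning data.
import torch

def extrapolated_guidance(eps_pretrained, eps_finetuned, w: float = 2.0):
    """Both inputs are noise predictions for the same x_t and t; w > 1
    amplifies the shift that fine-tuning introduced."""
    return eps_pretrained + w * (eps_finetuned - eps_pretrained)

e_pre, e_ft = torch.randn(2, 1, 4, 16, 16)
print(extrapolated_guidance(e_pre, e_ft).shape)  # torch.Size([1, 4, 16, 16])
```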
[379] Pose Prior Learner: Unsupervised Categorical Prior Learning for Pose Estimation
Ziyu Wang, Shuangpeng Han, Mengmi Zhang
Main category: cs.CV
TL;DR: Pose Prior Learner (PPL) learns general pose priors for object categories from images without supervision, using hierarchical memory to store prototypical poses and improving pose estimation through template transformation and reconstruction.
Details
Motivation: Priors help in pose estimation but are difficult to acquire. The paper aims to learn pose priors for any object category in an unsupervised manner, without human annotations.
Method: PPL uses hierarchical memory to store compositional parts of prototypical poses, distills a general pose prior, and refines poses through iterative inference and template transformation.
Result: PPL outperforms baselines on human and animal pose datasets, shows effectiveness on occluded images, and refines poses by regressing to prototypical poses in memory.
Conclusion: PPL successfully learns meaningful pose priors without supervision, improving pose estimation accuracy and handling occlusions through iterative refinement.
Abstract: A prior represents a set of beliefs or assumptions about a system, aiding inference and decision-making. In this paper, we introduce the challenge of unsupervised categorical prior learning in pose estimation, where AI models learn a general pose prior for an object category from images in a self-supervised manner. Although priors are effective in estimating pose, acquiring them can be difficult. We propose a novel method, named Pose Prior Learner (PPL), to learn a general pose prior for any object category. PPL uses a hierarchical memory to store compositional parts of prototypical poses, from which we distill a general pose prior. This prior improves pose estimation accuracy through template transformation and image reconstruction. PPL learns meaningful pose priors without any additional human annotations or interventions, outperforming competitive baselines on both human and animal pose estimation datasets. Notably, our experimental results reveal the effectiveness of PPL using learned prototypical poses for pose estimation on occluded images. Through iterative inference, PPL leverages the pose prior to refine estimated poses, regressing them to any prototypical poses stored in memory. Our code, model, and data will be publicly available.
[380] Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion
Yijun Liang, Shweta Bhardwaj, Tianyi Zhou
Main category: cs.CV
TL;DR: The paper proposes Diffusion Curriculum (DisCL), a method that uses diffusion models with image guidance to generate synthetic data at different proximity levels to original data, creating a curriculum learning approach that adapts to model training stages to improve performance on hard samples in long-tail classification and low-quality data scenarios.
Details
Motivation: Low-quality or scarce data challenges deep neural network training. Text-only guidance in diffusion models creates out-of-distribution synthetic data that harms model performance. There's a need to control synthetic images' proximity to original data to create effective training data.
Method: DisCL uses image guidance in diffusion models to create a spectrum of interpolations between synthetic and real images. It adjusts image guidance levels during training stages, focusing on hard samples and assessing optimal guidance levels. The curriculum starts with lower-guidance high-quality images to learn prototypical features, then progresses to higher-guidance images.
Result: On iWildCam dataset: 2.7% gain in OOD macro-accuracy and 2.1% gain in ID macro-accuracy. On ImageNet-LT: tail-class accuracy improved from 4.4% to 23.64%, with 4.02% improvement in all-class accuracy.
Conclusion: DisCL effectively addresses data scarcity and quality issues by creating a curriculum of synthetic data with controlled proximity to original data, significantly improving model performance on challenging tasks like long-tail classification and learning from low-quality data.
Abstract: Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. While classical data augmentation cannot contribute substantially different new data, diffusion models open up a new door to build self-evolving AI by generating high-quality and diverse synthetic data through text-guided prompts. However, text-only guidance cannot control synthetic images’ proximity to the original images, resulting in out-of-distribution data detrimental to the model performance. To overcome the limitation, we study image guidance to achieve a spectrum of interpolations between synthetic and real images. With stronger image guidance, the generated images are similar to the training data but hard to learn. With weaker image guidance, the synthetic images are easier for the model but contribute to a larger distribution gap with the original data. The generated full spectrum of data enables us to build a novel “Diffusion Curriculum (DisCL)”. DisCL adjusts the image guidance level of image synthesis for each training stage: it identifies and focuses on hard samples for the model and assesses the most effective guidance level of synthetic images to improve hard data learning. We apply DisCL to two challenging tasks: long-tail (LT) classification and learning from low-quality data. It focuses on lower-guidance images of high quality to learn prototypical features as a warm-up for learning higher-guidance images that might be weak in diversity or quality. Extensive experiments showcase a gain of 2.7% and 2.1% in OOD and ID macro-accuracy when applying DisCL to the iWildCam dataset. On ImageNet-LT, DisCL improves the base model’s tail-class accuracy from 4.4% to 23.64% and leads to a 4.02% improvement in all-class accuracy.
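The curriculum itself can be pictured as a schedule over image-guidance levels. A toy sketch, assuming synthetic variants are pre-generated at a few discrete guidance strengths; the levels and the linear schedule are made-up illustrations, not DisCL's adaptive selection.

```python
# Toy curriculum: early epochs use weakly guided (easier, prototypical)
# synthetic images, later epochs move toward strongly guided, near-real ones.
def pick_guidance_level(epoch: int, total_epochs: int,
                        levels=(0.2, 0.4, 0.6, 0.8)) -> float:
    """Return the image-guidance strength to sample training data from."""
    frac = epoch / max(total_epochs - 1, 1)
    idx = min(int(frac * len(levels)), len(levels) - 1)
    return levels[idx]

for epoch in (0, 3, 6, 9):
    print(epoch, pick_guidance_level(epoch, total_epochs=10))
# 0 -> 0.2, 3 -> 0.4, 6 -> 0.6, 9 -> 0.8
```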
[381] Large Pre-Training Datasets Don’t Always Guarantee Robustness after Fine-Tuning
Jaedong Hwang, Brian Cheung, Zhang-Wei Hong, Akhilan Boopathy, Pulkit Agrawal, Ila Fiete
Main category: cs.CV
TL;DR: Fine-tuning large pretrained models on specialized tasks causes catastrophic forgetting and loss of out-of-distribution (OOD) generalization, with models trained on larger datasets showing worse robustness preservation.
Details
Motivation: To assess whether fine-tuning preserves the overall robustness of pretrained models and understand how different pretraining datasets affect robustness inheritance in specialized tasks.
Method: Proposed ImageNet-RIB benchmark with related OOD tasks, fine-tuning on one task and testing on others. Evaluated various pretrained models including those from LAION-2B.
Result: Fine-tuning reduces robustness across all pretrained models. Models pretrained on the largest datasets (e.g., LAION-2B) show larger robustness losses and lower absolute robustness after fine-tuning on small datasets.
Conclusion: Starting with the strongest foundation model is not necessarily best for specialist tasks due to significant robustness degradation during fine-tuning.
Abstract: Large-scale pretrained models are widely leveraged as foundations for learning new specialized tasks via fine-tuning, with the goal of maintaining the general performance of the model while allowing it to gain new skills. A valuable goal for all such models is robustness: the ability to perform well on out-of-distribution (OOD) tasks. We assess whether fine-tuning preserves the overall robustness of the pretrained model and observe that models pretrained on large datasets exhibit strong catastrophic forgetting and loss of OOD generalization. To systematically assess robustness preservation in fine-tuned models, we propose the Robustness Inheritance Benchmark (ImageNet-RIB). The benchmark, which can be applied to any pretrained model, consists of a set of related but distinct OOD (downstream) tasks and involves fine-tuning on one of the OOD tasks in the set then testing on the rest. We find that though continual learning methods help, fine-tuning reduces robustness across pretrained models. Surprisingly, models pretrained on the largest and most diverse datasets (e.g., LAION-2B) exhibit both larger robustness losses and lower absolute robustness after fine-tuning on small datasets, relative to models pretrained on smaller datasets. These findings suggest that starting with the strongest foundation model is not necessarily the best approach for performance on specialist tasks. https://jd730.github.io/projects/ImageNet-RIB
[382] TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
Jinyuan Qu, Hongyang Li, Shilong Liu, Tianhe Ren, Zhaoyang Zeng, Lei Zhang
Main category: cs.CV
TL;DR: TAPTRv3 improves upon TAPTRv2 by addressing feature querying issues in long videos through spatial and temporal context enhancements, achieving state-of-the-art performance.
Details
Motivation: TAPTRv2 works well in regular videos but fails in long videos due to poor feature querying quality and feature drifting caused by increasing target variation over time.
Method: Proposes Context-aware Cross-Attention (CCA) for better spatial feature querying by incorporating spatial context, and Visibility-aware Long-Temporal Attention (VLTA) for temporal feature querying that considers frame visibilities to prevent feature drifting.
Result: TAPTRv3 significantly outperforms TAPTRv2 on challenging datasets and achieves state-of-the-art performance, even surpassing methods trained on large-scale extra data.
Conclusion: The proposed spatial and temporal context enhancements in TAPTRv3 effectively address long-video tracking challenges and demonstrate superior performance compared to existing methods.
Abstract: In this paper, built upon TAPTRv2, we present TAPTRv3. TAPTRv2 is a simple yet effective DETR-like point tracking framework that works fine in regular videos but tends to fail in long videos. TAPTRv3 improves TAPTRv2 by addressing its shortcomings in querying high-quality features from long videos, where the target tracking points normally undergo increasing variation over time. In TAPTRv3, we propose to utilize both spatial and temporal context to bring better feature querying along the spatial and temporal dimensions for more robust tracking in long videos. For better spatial feature querying, we identify that off-the-shelf attention mechanisms struggle with point-level tasks and present Context-aware Cross-Attention (CCA). CCA introduces spatial context into the attention mechanism to enhance the quality of attention scores when querying image features. For better temporal feature querying, we introduce Visibility-aware Long-Temporal Attention (VLTA), which conducts temporal attention over past frames while considering their corresponding visibilities. This effectively addresses the feature drifting problem in TAPTRv2 caused by its RNN-like long-term modeling. TAPTRv3 surpasses TAPTRv2 by a large margin on most of the challenging datasets and obtains state-of-the-art performance. Even when compared with methods trained on large-scale extra internal data, TAPTRv3 still demonstrates superiority.
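VLTA's core mechanic, masking temporal attention by predicted visibility, is straightforward to sketch. The interface below (a single point query attending over its own past-frame features) is an illustrative assumption, not the TAPTRv3 module.

```python
# Sketch of visibility-aware temporal attention: past frames predicted as
# occluded are masked out so they cannot contaminate the point query.
import torch

def visibility_aware_attention(query, past_feats, visibility):
    """query: (B, D); past_feats: (B, T, D); visibility: (B, T) in {0, 1}."""
    logits = (past_feats @ query.unsqueeze(-1)).squeeze(-1) / query.shape[-1] ** 0.5
    logits = logits.masked_fill(visibility == 0, float('-inf'))  # drop occluded frames
    attn = torch.softmax(logits, dim=-1)                         # (B, T)
    return (attn.unsqueeze(-1) * past_feats).sum(dim=1)          # (B, D)

q = torch.randn(2, 64)
feats = torch.randn(2, 10, 64)
vis = torch.ones(2, 10)
vis[:, 3:6] = 0  # frames 3-5 marked occluded
print(visibility_aware_attention(q, feats, vis).shape)  # torch.Size([2, 64])
```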
[383] DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
Qingdong He, Jinlong Peng, Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Yong Liu, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang
Main category: cs.CV
TL;DR: DynamicControl is a novel framework that enables dynamic combinations of multiple control signals for text-to-image diffusion models, addressing limitations of existing methods that handle conditions inefficiently or use fixed numbers of conditions.
Details
Motivation: Current ControlNet-like models have limitations in handling multiple control conditions efficiently and dealing with potential conflicts between different conditions, highlighting the need for more effective multi-condition management in image synthesis.
Method: The approach uses a double-cycle controller to generate initial condition rankings, integrates a Multimodal Large Language Model (MLLM) as a condition evaluator to optimize condition ordering, and employs a parallel multi-control adapter that learns feature maps from dynamic visual conditions to modulate ControlNet.
Result: DynamicControl demonstrates superior performance over existing methods in terms of controllability, generation quality, and composability under various conditional controls, as shown through both quantitative and qualitative comparisons.
Conclusion: The proposed DynamicControl framework effectively addresses the challenges of managing multiple conditions in text-to-image synthesis, providing enhanced control and better handling of complex multi-condition scenarios through its innovative combination of double-cycle control, MLLM reasoning, and parallel multi-control adaptation.
Abstract: To enhance the controllability of text-to-image diffusion models, current ControlNet-like models have explored various control signals to dictate image attributes. However, existing methods either handle conditions inefficiently or use a fixed number of conditions, which does not fully address the complexity of multiple conditions and their potential conflicts. This underscores the need for innovative approaches to manage multiple conditions effectively for more reliable and detailed image synthesis. To address this issue, we propose a novel framework, DynamicControl, which supports dynamic combinations of diverse control signals, allowing adaptive selection of different numbers and types of conditions. Our approach begins with a double-cycle controller that generates an initial real score sorting for all input conditions by leveraging pre-trained conditional generation models and discriminative models. This controller evaluates the similarity between extracted conditions and input conditions, as well as the pixel-level similarity with the source image. Then, we integrate a Multimodal Large Language Model (MLLM) to build an efficient condition evaluator. This evaluator optimizes the ordering of conditions based on the double-cycle controller’s score ranking. Our method jointly optimizes MLLMs and diffusion models, utilizing MLLMs’ reasoning capabilities to facilitate multi-condition text-to-image (T2I) tasks. The final sorted conditions are fed into a parallel multi-control adapter, which learns feature maps from dynamic visual conditions and integrates them to modulate ControlNet, thereby enhancing control over generated images. Through both quantitative and qualitative comparisons, DynamicControl demonstrates its superiority over existing methods in terms of controllability, generation quality and composability under various conditional controls.
[384] Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Kaipeng Zhang, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
Main category: cs.CV
TL;DR: InternVL 2.5 is an advanced multimodal large language model that builds on InternVL 2.0 with improved training strategies and data quality, achieving competitive performance against leading commercial models and setting new open-source standards.
Details
Motivation: To advance multimodal AI systems by exploring the relationship between model scaling and performance, and to contribute to the open-source community with state-of-the-art capabilities.
Method: Systematic exploration of performance trends in vision encoders, language models, dataset sizes, and test-time configurations, using Chain-of-Thought reasoning for enhanced performance.
Result: Achieves competitive performance across multiple benchmarks including multi-discipline reasoning, document understanding, and multilingual capabilities, becoming the first open-source MLLM to surpass 70% on MMMU benchmark with 3.7-point improvement through CoT reasoning.
Conclusion: InternVL 2.5 sets new standards for developing and applying multimodal AI systems, demonstrating strong potential for test-time scaling and rivaling leading commercial models.
Abstract: We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A HuggingFace demo is available at https://huggingface.co/spaces/OpenGVLab/InternVL
[385] Self-Guidance: Boosting Flow and Diffusion Generation on Their Own
Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, Guo-Jun Qi
Main category: cs.CV
TL;DR: Self-Guidance (SG) is a plug-and-play method that improves image/video generation quality in diffusion/flow models by detecting artifact outliers through density changes between noise levels, without requiring retraining.
Details
Motivation: Existing guidance methods require specific training or strong model biases, limiting their flexibility and application scope. The authors observed that artifact outliers can be detected by density declines between noise levels.
Method: SG uses only the sampling score function of original diffusion/flow models at different noise levels to suppress low-quality samples. SG-prev variant reuses previous step outputs for efficiency.
Result: SG outperforms existing methods on metrics like FID and Human Preference Score with Stable Diffusion 3.5 and FLUX. SG-prev achieves strong results with 50% more efficiency. Both methods effectively eliminate human body artifacts.
Conclusion: SG provides a flexible, training-free guidance approach that significantly improves generation quality and handles human body artifacts effectively, with efficient variants available.
Abstract: Proper guidance strategies are essential to achieve high-quality generation results without retraining diffusion and flow-based text-to-image models. Existing guidance either requires specific training or strong inductive biases of diffusion model networks, which potentially limits their ability and application scope. Motivated by the observation that artifact outliers can be detected by a significant decline in the density from a noisier to a cleaner noise level, we propose Self-Guidance (SG), which can significantly improve the quality of the generated image by suppressing the generation of low-quality samples. The biggest difference from existing guidance is that SG only relies on the sampling score function of the original diffusion or flow model at different noise levels, with no need for any tricky and expensive guidance-specific training. This makes SG highly flexible to be used in a plug-and-play manner by any diffusion or flow models. We also introduce an efficient variant of SG, named SG-prev, which reuses the output from the immediately previous diffusion step to avoid additional forward passes of the diffusion network. We conduct extensive experiments on text-to-image and text-to-video generation with different architectures, including UNet and transformer models. With open-sourced diffusion models such as Stable Diffusion 3.5 and FLUX, SG exceeds existing algorithms on multiple metrics, including both FID and Human Preference Score. SG-prev also achieves strong results over both the baseline and SG, with 50 percent more efficiency. Moreover, we find that SG and SG-prev both have a surprisingly positive effect on the generation of physiologically correct human body structures such as hands, faces, and arms, showing their ability to eliminate human body artifacts with minimal effort. We have released our code at https://github.com/maple-research-lab/Self-Guidance.
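One plausible reading of SG in code, hedged heavily: the same noisy sample is evaluated at its own noise level and at a noisier one, and the difference steers sampling away from regions whose density declines toward the cleaner level. The combination rule and scale below are assumptions, not the paper's exact formula.

```python
# Heavily hedged sketch of the self-guidance idea: query the *same* model at
# two noise levels and use the difference as an artifact-suppressing direction.
import torch

def self_guidance(denoiser, x_t, t, t_noisier, w: float = 1.0):
    """denoiser(x, t) -> predicted noise. Only the original model is queried;
    no extra networks or guidance-specific training are involved."""
    eps_clean = denoiser(x_t, t)
    eps_noisy = denoiser(x_t, t_noisier)
    return eps_clean + w * (eps_clean - eps_noisy)  # push away from density decline

fake_denoiser = lambda x, t: x * 0.1 + t            # stand-in for a real model
x = torch.randn(1, 4, 8, 8)
print(self_guidance(fake_denoiser, x, t=0.3, t_noisier=0.5).shape)
```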
[386] LOGen: Toward Lidar Object Generation by Point Diffusion
Ellington Kirby, Mickael Chen, Renaud Marlet, Nermin Samet
Main category: cs.CV
TL;DR: A diffusion-based model for generating LiDAR object point clouds with intensity and extensive control via conditioning information, evaluated on nuScenes and KITTI-360 datasets.
Details
Motivation: LiDAR scan generation is challenging compared to image and 3D object generation, with applications in autonomous driving. Focusing on object generation leverages advancements in 3D generative methods.
Method: Novel diffusion-based model to produce LiDAR point clouds of dataset objects including intensity, with extensive control through conditioning information.
Result: High-quality generations demonstrated on nuScenes and KITTI-360 datasets, measured using new 3D metrics developed specifically for LiDAR objects.
Conclusion: The proposed method successfully generates realistic LiDAR object point clouds with control capabilities, advancing LiDAR scan generation for autonomous driving applications.
Abstract: The generation of LiDAR scans is a growing topic with diverse applications to autonomous driving. However, scan generation remains challenging, especially when compared to the rapid advancement of image and 3D object generation. We consider the task of LiDAR object generation, requiring models to produce 3D objects as viewed by a LiDAR scan. It focuses LiDAR scan generation on a key aspect of scenes, the objects, while also benefiting from advancements in 3D object generative methods. We introduce a novel diffusion-based model to produce LiDAR point clouds of dataset objects, including intensity, with extensive control of the generation via conditioning information. Our experiments on nuScenes and KITTI-360 show the quality of our generations, measured with new 3D metrics developed to suit LiDAR objects. The code is available at https://github.com/valeoai/LOGen.
[387] UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint
Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hofmann, Federico Tombari
Main category: cs.CV
TL;DR: Unsupervised instruction-based image editing using Edit Reversibility Constraint (ERC) that eliminates need for ground-truth edited images during training.
Details
Motivation: Existing methods require costly supervised training with ground-truth edited images, which are either biased from existing editing methods or expensive human annotations, limiting generalization.
Method: Proposes Edit Reversibility Constraint (ERC) that applies forward and reverse edits in one training step and enforces alignment in image, text, and attention spaces, enabling training on real image-caption pairs or triplets.
Result: Approach performs better across broader range of edits with high-fidelity and precision compared to existing methods.
Conclusion: Eliminates need for pre-existing triplet datasets, reduces biases, and enables scaling of instruction-based image editing through unsupervised training.
Abstract: We propose an unsupervised instruction-based image editing approach that removes the need for ground-truth edited images during training. Existing methods rely on supervised learning with triplets of input images, ground-truth edited images, and edit instructions. These triplets are typically generated either by existing editing methods, introducing biases, or through human annotations, which are costly and limit generalization. Our approach addresses these challenges by introducing a novel editing mechanism called Edit Reversibility Constraint (ERC), which applies forward and reverse edits in one training step and enforces alignment in image, text, and attention spaces. This allows us to bypass the need for ground-truth edited images and unlock training for the first time on datasets comprising either real image-caption pairs or image-caption-instruction triplets. We empirically show that our approach performs better across a broader range of edits with high fidelity and precision. By eliminating the need for pre-existing datasets of triplets, reducing biases associated with current methods, and proposing ERC, our work represents a significant advancement in unlocking the scaling of instruction-based image editing.
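The image-space part of ERC reduces to a cycle-consistency loss: apply the edit, apply its reverse, and require the round trip to reproduce the input (the paper additionally aligns text and attention spaces, omitted here). A toy sketch with an assumed editor interface:

```python
# Sketch of the image-space Edit Reversibility Constraint as a cycle loss.
import torch
import torch.nn.functional as F

def erc_image_loss(edit_model, image, instr_emb, rev_instr_emb):
    """edit_model(image, instruction_embedding) -> edited image."""
    edited = edit_model(image, instr_emb)              # forward edit
    reconstructed = edit_model(edited, rev_instr_emb)  # reverse edit
    return F.mse_loss(reconstructed, image)            # round trip must match input

# Toy editor: adds a per-channel shift, so the negated instruction undoes it.
toy_edit = lambda img, e: img + 0.01 * e.view(1, -1, 1, 1)
img = torch.randn(1, 3, 16, 16)
fwd = torch.randn(3)
rev = -fwd
print(float(erc_image_loss(toy_edit, img, fwd, rev)))  # ~0 for this reversible toy
```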
[388] Unforgettable Lessons from Forgettable Images: Intra-Class Memorability Matters in Computer Vision
Jie Jing, Yongjian Huang, Serena J. -W. Wang, Shuangpeng Han, Lucia Schiatti, Yen-Ling Kuo, Qing Lin, Mengmi Zhang
Main category: cs.CV
TL;DR: The paper introduces intra-class memorability, showing that certain images within the same object class are more memorable than others. It proposes ICMscore metric, creates ICMD dataset, and demonstrates applications in AI tasks including memorability prediction and image editing.
Details
Motivation: To understand what fine-grained visual features make certain object instances more memorable than others within the same category, and to enable real-world applications in computer vision.
Method: Conducted human behavior experiments with sequential image matching tasks, developed ICMscore metric incorporating temporal intervals, curated ICMD dataset with 5,000+ images across 10 classes, trained AI models for various downstream tasks, and fine-tuned diffusion models for memorability-controlled image editing.
Result: High-ICMscore images impair AI performance in image recognition and continual learning, while low-ICMscore images improve these tasks. Diffusion models can successfully manipulate image elements to enhance or reduce memorability.
Conclusion: The work opens new pathways for understanding intra-class memorability through fine-grained visual features and lays groundwork for real-world computer vision applications, with all code, data, and models to be publicly released.
Abstract: We introduce intra-class memorability, where certain images within the same class are more memorable than others despite shared category characteristics. To investigate what features make one object instance more memorable than others, we design and conduct human behavior experiments, where participants are shown a series of images, and they must identify when the current image matches the image presented a few steps back in the sequence. To quantify memorability, we propose the Intra-Class Memorability score (ICMscore), a novel metric that incorporates the temporal intervals between repeated image presentations into its calculation. Furthermore, we curate the Intra-Class Memorability Dataset (ICMD), comprising over 5,000 images across ten object classes with their ICMscores derived from 2,000 participants’ responses. Subsequently, we demonstrate the usefulness of ICMD by training AI models on this dataset for various downstream tasks: memorability prediction, image recognition, continual learning, and memorability-controlled image editing. Surprisingly, high-ICMscore images impair AI performance in image recognition and continual learning tasks, while low-ICMscore images improve outcomes in these tasks. Additionally, we fine-tune a state-of-the-art image diffusion model on ICMD image pairs with and without masked semantic objects. The diffusion model can successfully manipulate image elements to enhance or reduce memorability. Our contributions open new pathways in understanding intra-class memorability by scrutinizing fine-grained visual features behind the most and least memorable images and laying the groundwork for real-world applications in computer vision. We will release all code, data, and models publicly.
[389] LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo
Main category: cs.CV
TL;DR: This paper proposes Linear Diffusion Transformer (LiT), which converts pre-trained Diffusion Transformers (DiT) into linear variants with 5 practical guidelines for efficient image generation while maintaining comparable performance.
Details
Motivation: To create simpler, more parallelizable, and efficient image generation models by converting pre-trained Diffusion Transformers into linear variants while preserving performance.
Method: Proposes 5 guidelines: 1) depth-wise convolution in linear attention, 2) fewer attention heads, 3) weight inheritance from pre-trained DiT, 4) loading all parameters except linear attention, 5) hybrid knowledge distillation with noise and variance supervision.
Result: LiT achieves comparable performance to DiT with only 20% training steps for 256×256 and 33% for 512×512 ImageNet generation. It also rivals Mamba and Gated Linear Attention methods, and generalizes to text-to-image generation with PixArt-Σ.
Conclusion: LiT provides a safe and efficient baseline for DiT with pure linear attention, enabling faster training while maintaining performance across both class-conditional and text-to-image generation tasks.
Abstract: In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, given its simplicity, parallelism, and efficiency for image generation. Through detailed exploration, we offer a suite of ready-to-use solutions, ranging from linear attention design to optimization strategies. Our core contributions include 5 practical guidelines: 1) Applying depth-wise convolution within simple linear attention is sufficient for image generation. 2) Using fewer heads in linear attention provides a free-lunch performance boost without increasing latency. 3) Inheriting weights from a fully converged, pre-trained DiT. 4) Loading all parameters except those related to linear attention. 5) Hybrid knowledge distillation: using a pre-trained teacher DiT to help the training of the student linear DiT, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), which serves as a safe and efficient alternative baseline for DiT with pure linear attention. In class-conditional 256×256 and 512×512 ImageNet generation, LiT can be quickly adapted from DiT using only 20% and 33% of DiT’s training steps, respectively, while achieving comparable performance. LiT also rivals methods based on Mamba or Gated Linear Attention. Moreover, the same guidelines generalize to text-to-image generation: LiT can be swiftly converted from PixArt-Σ to generate high-quality images, maintaining comparable GenEval scores.
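Guideline 1 (depth-wise convolution inside simple linear attention) can be sketched directly. The block below uses a standard kernelized linear attention with an ELU+1 feature map plus a depth-wise 3×3 convolution on the spatial output; sizes and details are illustrative, not the released LiT block.

```python
# Kernelized O(N) linear attention with a depth-wise convolution on the output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttnDWConv(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        B, N, D = x.shape                                 # N = h * w tokens
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1                 # positive feature maps
        kv = torch.einsum('bnd,bne->bde', k, v)           # O(N) global summary
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        out = torch.einsum('bnd,bde->bne', q, kv) * z     # linear attention output
        grid = out.transpose(1, 2).reshape(B, D, h, w)    # tokens -> spatial grid
        return (grid + self.dwconv(grid)).flatten(2).transpose(1, 2)

blk = LinearAttnDWConv(dim=32)
print(blk(torch.randn(2, 64, 32), h=8, w=8).shape)  # torch.Size([2, 64, 32])
```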
[390] Single-weight Model Editing for Post-hoc Spurious Correlation Neutralization
Shahin Hakemi, Naveed Akhtar, Ghulam Mubashar Hassan, Ajmal Mian
Main category: cs.CV
TL;DR: Proposes a post-hoc method to neutralize spurious feature impact by treating spurious features as fictitious sub-classes and removing them through single-weight modifications, achieving competitive performance with minimal training overhead.
Details
Motivation: Existing methods for addressing spurious correlations require additional training costs and are often deployed after discovering model misbehavior. Spuriousness is subjective, so methods should proportionally distract model attention from spurious features for reliable predictions.
Method: Conceptualizes spurious features as fictitious sub-classes within original classes and eliminates them using a unique precise class removal technique that makes a single-weight modification, enabling post-hoc neutralization controllable to arbitrary degrees.
Result: Extensive experiments show that by editing just a single weight in a post-hoc manner, the method achieves highly competitive or better performance compared to state-of-the-art methods with negligible performance compromise for remaining classes.
Conclusion: The proposed post-hoc approach effectively neutralizes spurious feature impact through minimal weight modifications, offering practical utility without the training overhead of existing methods while maintaining competitive performance.
Abstract: Neural network training tends to exploit the simplest features as shortcuts to greedily minimize training loss. However, some of these features might be spuriously correlated with the target labels, leading to incorrect predictions by the model. Several methods have been proposed to address this issue. Because they focus on suppressing spurious correlations during model training, they not only incur additional training cost but also have limited practical utility, as model misbehavior due to spurious relations is usually discovered after deployment. It is also often overlooked that spuriousness is a subjective notion. Hence, the precise questions that must be investigated are: to what degree a feature is spurious, and how we can proportionally distract the model’s attention from it for reliable prediction. To this end, we propose a method that enables post-hoc neutralization of spurious feature impact, controllable to an arbitrary degree. We conceptualize spurious features as fictitious sub-classes within the original classes, which can be eliminated by a class removal scheme. We then propose a unique precise class removal technique that makes a single-weight modification, which entails negligible performance compromise for the remaining classes. We perform extensive experiments, demonstrating that by editing just a single weight in a post-hoc manner, our method achieves highly competitive, or better performance against the state-of-the-art methods.
[391] Calibrated Multi-Preference Optimization for Aligning Diffusion Models
Kyungmin Lee, Xiaohang Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, Yinxiao Li
Main category: cs.CV
TL;DR: CaPO is a novel method that aligns text-to-image diffusion models using calibrated preferences from multiple reward models without human annotations, outperforming prior methods like DPO.
Details
Motivation: Current preference optimization methods for T2I models rely on costly human annotations or underutilize reward model information by only considering pairwise preferences, lacking generalization to multi-preference scenarios and struggling with reward inconsistencies.
Method: Proposes Calibrated Preference Optimization (CaPO) with: 1) reward calibration using expected win-rate against pretrained model samples, 2) frontier-based pair selection from Pareto frontiers for multi-preference management, and 3) regression loss fine-tuning to match calibrated reward differences.
Result: Experimental results show CaPO consistently outperforms prior methods like DPO in both single and multi-reward settings, validated on T2I benchmarks including GenEval and T2I-Compbench.
Conclusion: CaPO effectively aligns T2I diffusion models by leveraging general preferences from multiple reward models without human annotations, demonstrating superior performance over existing preference optimization approaches.
Abstract: Aligning text-to-image (T2I) diffusion models with preference optimization is valuable for human-annotated datasets, but the heavy cost of manual data collection limits scalability. Using reward models offers an alternative, however, current preference optimization methods fall short in exploiting the rich information, as they only consider pairwise preference distribution. Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human annotated data. The core of our approach involves a reward calibration method to approximate the general preference by computing the expected win-rate against the samples generated by the pretrained models. Additionally, we propose a frontier-based pair selection method that effectively manages the multi-preference distribution by selecting pairs from Pareto frontiers. Finally, we use regression loss to fine-tune diffusion models to match the difference between calibrated rewards of a selected pair. Experimental results show that CaPO consistently outperforms prior methods, such as Direct Preference Optimization (DPO), in both single and multi-reward settings validated by evaluation on T2I benchmarks, including GenEval and T2I-Compbench.
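The reward-calibration step is easy to picture in code. Below is a minimal sketch, assuming per-prompt reward scores have already been computed; the function name and the tie-handling convention are our own assumptions, not the paper's implementation.

```python
import numpy as np

def calibrated_reward(sample_reward: float, ref_rewards: np.ndarray) -> float:
    """Approximate a sample's general preference as its expected win-rate
    against generations from the pretrained (reference) model.
    Counting ties as half a win is an assumption of this sketch."""
    wins = float((sample_reward > ref_rewards).mean())
    ties = float((sample_reward == ref_rewards).mean())
    return wins + 0.5 * ties

# Toy usage: one candidate scored against five reference samples.
print(calibrated_reward(0.8, np.array([0.5, 0.7, 0.9, 0.6, 0.8])))  # 0.7
```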
[392] PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval
Osman Tursun, Sinan Kalkan, Simon Denman, Clinton Fookes
Main category: cs.CV
TL;DR: PDV is a training-free enhancement for Zero-shot Composed Image Retrieval that addresses limitations in current methods by creating dynamic composed embeddings through prompt directional vectors, enabling better text-image fusion and semantic transfer.
Details
Motivation: Current ZS-CIR methods suffer from static query embeddings, insufficient image embedding utilization, and suboptimal text-image fusion, limiting their effectiveness in composed image retrieval tasks.
Method: Introduces the Prompt Directional Vector (PDV), a training-free approach that captures semantic modifications from user prompts, enabling dynamic composed text embeddings, composed image embeddings via semantic transfer, and weighted fusion of text and image embeddings (a toy sketch follows the abstract).
Result: PDV consistently improves retrieval performance across multiple benchmarks when integrated with state-of-the-art ZS-CIR approaches, particularly benefiting methods with accurate compositional embeddings.
Conclusion: PDV serves as an effective plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead, addressing key limitations in current approaches and improving retrieval performance.
Abstract: Zero-shot Composed Image Retrieval (ZS-CIR) enables image search using a reference image and a text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches suffer from three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the \textbf{Prompt Directional Vector (PDV)}, a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) Dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be released upon publication.
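Since PDV is training-free, its core is a small amount of vector arithmetic on CLIP-style embeddings. Here is a minimal sketch under our own assumptions (L2-normalized embeddings, a scalar scaling factor, a simple weighted fusion); it illustrates the idea rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pdv_query(img_emb, base_txt_emb, prompt_txt_emb, alpha=1.0, w=0.5):
    """Form a retrieval query from a Prompt Directional Vector (PDV).

    pdv captures the semantic shift the user prompt induces in text space;
    alpha scales the modification, and w balances text vs. image embeddings.
    """
    pdv = prompt_txt_emb - base_txt_emb                        # direction of change
    composed_txt = F.normalize(base_txt_emb + alpha * pdv, dim=-1)
    composed_img = F.normalize(img_emb + alpha * pdv, dim=-1)  # semantic transfer
    return F.normalize(w * composed_txt + (1 - w) * composed_img, dim=-1)
```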
[393] GeoDANO: Geometric VLM with Domain Agnostic Vision Encoder
Seunghyuk Cho, Zhenyue Qin, Yang Liu, Youngbin Choi, Seungbeom Lee, Dongwoo Kim
Main category: cs.CV
TL;DR: GeoDANO is a geometric vision-language model with a domain-agnostic vision encoder that outperforms specialized methods and GPT-4o on plane geometry problems.
Details
Motivation: Existing vision-language models struggle to recognize geometric features in diagrams and fail to generalize across domains, limiting their effectiveness for solving geometry problems.
Method: Developed GeoCLIP (a CLIP-based model trained on synthetic geometric diagram-caption pairs) and GeoDANO (which augments GeoCLIP with domain adaptation for unseen diagram styles).
Result: GeoCLIP outperforms existing vision encoders in recognizing geometric features. GeoDANO outperforms specialized methods and GPT-4o on MathVerse benchmark.
Conclusion: The proposed geometric vision-language model with domain-agnostic vision encoder effectively solves plane geometry problems by improving geometric feature recognition and domain adaptation.
Abstract: We introduce GeoDANO, a geometric vision-language model (VLM) with a domain-agnostic vision encoder, for solving plane geometry problems. Although VLMs have been employed for solving geometry problems, their ability to recognize geometric features remains insufficiently analyzed. To address this gap, we propose a benchmark that evaluates the recognition of visual geometric features, including primitives such as dots and lines, and relations such as orthogonality. Our preliminary study shows that vision encoders often used in general-purpose VLMs, e.g., OpenCLIP, fail to detect these features and struggle to generalize across domains. To overcome the limitation, we develop GeoCLIP, a CLIP-based model trained on synthetic geometric diagram–caption pairs. Benchmark results show that GeoCLIP outperforms existing vision encoders in recognizing geometric features. We then propose our VLM, GeoDANO, which augments GeoCLIP with a domain adaptation strategy for unseen diagram styles. GeoDANO outperforms specialized methods for plane geometry problems and GPT-4o on MathVerse. The implementation is available at https://github.com/ml-postech/GeoDANO.
[394] LiteGS: A High-performance Framework to Train 3DGS in Subminutes via System and Algorithm Codesign
Kaimin Liao, Hua Wang, Zhi Chen, Luchao Wang, Yaohua Tang
Main category: cs.CV
TL;DR: LiteGS is a high-performance framework that accelerates 3D Gaussian Splatting training by up to 13.4x through multi-level optimizations including warp-based rasterization, dynamic spatial sorting, and improved densification criteria.
Details
Motivation: 3D Gaussian Splatting suffers from high training costs, limiting its practical applications despite being a promising 3D representation method.
Method: Three-level optimization: 1) low level: a warp-based rasterizer with hardware-aware optimizations to reduce gradient-reduction overhead; 2) mid level: dynamic spatial sorting using Morton coding for better data locality (a toy sketch follows the abstract); 3) top level: a new densification criterion based on opacity-gradient variance with stable opacity control.
Result: Achieves up to 13.4x speedup over original 3DGS with comparable or superior quality, surpasses SOTA lightweight models by 1.4x, and sets new accuracy records for high-quality reconstruction while reducing training time by an order of magnitude.
Conclusion: LiteGS provides a comprehensive optimization framework that significantly accelerates 3DGS training while maintaining or improving reconstruction quality, making it more practical for real-world applications.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising alternative in 3D representation. However, it still suffers from high training cost. This paper introduces LiteGS, a high-performance framework that systematically optimizes the 3DGS training pipeline from multiple aspects. At the low-level computation layer, we design a "warp-based raster" associated with two hardware-aware optimizations to significantly reduce gradient reduction overhead. At the mid-level data management layer, we introduce dynamic spatial sorting based on Morton coding to enable a performant "Cluster-Cull-Compact" pipeline and improve data locality, therefore reducing cache misses. At the top-level algorithm layer, we establish a new robust densification criterion based on the variance of the opacity gradient, paired with a more stable opacity control mechanism, to achieve more precise parameter growth. Experimental results demonstrate that LiteGS accelerates the original 3DGS training by up to 13.4x with comparable or superior quality and surpasses the current SOTA in lightweight models by up to 1.4x speedup. For high-quality reconstruction tasks, LiteGS sets a new accuracy record and decreases the training time by an order of magnitude.
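The mid-level sorting idea is straightforward to illustrate: Morton (Z-order) codes interleave the bits of quantized 3D coordinates, so sorting by code keeps spatial neighbors adjacent in memory. A generic sketch (not the paper's CUDA implementation):

```python
def morton3d(x: int, y: int, z: int) -> int:
    """Interleave three 10-bit coordinates into one 30-bit Morton code.
    Sorting Gaussians by this key groups spatial neighbors together,
    improving cache behavior during rasterization."""
    def spread(v: int) -> int:
        v &= 0x3FF                        # keep 10 bits
        v = (v | (v << 16)) & 0x030000FF
        v = (v | (v << 8)) & 0x0300F00F
        v = (v | (v << 4)) & 0x030C30C3
        v = (v | (v << 2)) & 0x09249249
        return v
    return spread(x) | (spread(y) << 1) | (spread(z) << 2)

# Quantize positions to a 1024^3 grid, then sort splats by morton3d(x, y, z).
```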
[395] SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models
Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng
Main category: cs.CV
TL;DR: SPEED is an efficient concept erasure method for text-to-image diffusion models that directly edits model parameters using null space optimization to erase target concepts while preserving non-target concepts, erasing 100 concepts in just 5 seconds.
Details
Motivation: Growing concerns over copyright infringement, offensive content, and privacy violations in text-to-image models call for efficient concept erasure methods. Existing fine-tuning methods are time-consuming for multiple concepts, while real-time editing methods degrade non-target concept quality.
Method: SPEED searches for a null space in which parameter updates do not affect non-target concepts (a toy projection sketch follows the abstract). It adds three strategies: Influence-based Prior Filtering to select the most affected non-target concepts, Directed Prior Augmentation to enrich the retain set with variations, and Invariant Equality Constraints to preserve generation invariants.
Result: SPEED consistently outperforms existing methods in non-target preservation while achieving efficient and high-fidelity concept erasure. It successfully erases 100 concepts within only 5 seconds across multiple concept erasure tasks.
Conclusion: SPEED provides an effective solution for scalable concept erasure in text-to-image diffusion models, addressing the trade-off between erasure precision and non-target concept preservation through null space optimization and complementary strategies.
Abstract: Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. In scalable applications, fine-tuning-based methods are time-consuming to precisely erase multiple target concepts, while real-time editing-based methods often degrade the generation quality of non-target concepts due to conflicting optimization objectives. To address this dilemma, we introduce SPEED, an efficient concept erasure approach that directly edits model parameters. SPEED searches for a null space, a model editing space where parameter updates do not affect non-target concepts, to achieve scalable and precise erasure. To facilitate accurate null space optimization, we incorporate three complementary strategies: Influence-based Prior Filtering (IPF) to selectively retain the most affected non-target concepts, Directed Prior Augmentation (DPA) to enrich the filtered retain set with semantically consistent variations, and Invariant Equality Constraints (IEC) to preserve key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in non-target preservation while achieving efficient and high-fidelity concept erasure, successfully erasing 100 concepts within only 5 seconds. Our code and models are available at: https://github.com/Ouxiang-Li/SPEED.
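The null-space constraint can be made concrete with a closed-form projection: restrict the weight edit so it annihilates the feature directions of the concepts to retain. The following is a generic sketch of that constraint, not SPEED's actual optimization with IPF/DPA/IEC:

```python
import torch

def project_to_null_space(delta_w: torch.Tensor, keep_feats: torch.Tensor,
                          eps: float = 1e-5) -> torch.Tensor:
    """Project a raw weight update (out_dim, d) onto the null space of the
    retained-concept features (n, d), so the edited layer behaves
    identically on those features: (delta_w @ P) @ f ~= 0 for retained f."""
    d = keep_feats.shape[1]
    gram = keep_feats.T @ keep_feats          # (d, d), spans retained directions
    U, S, _ = torch.linalg.svd(gram)
    rank = int((S > eps * S.max()).sum())
    B = U[:, :rank]                            # orthonormal basis of retained span
    P = torch.eye(d) - B @ B.T                 # projector onto the null space
    return delta_w @ P
```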
[396] Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying
Hairong Yin, Huangying Zhan, Yi Xu, Raymond A. Yeh
Main category: cs.CV
TL;DR: A novel point-level querying framework for 3D Gaussian Splatting that uses tracking on segmentation masks to establish semantic ground-truth and a GT-anchored querying approach, achieving state-of-the-art performance on three benchmark datasets.
Details
Motivation: Existing methods for querying 3D Gaussian Splatting struggle with inconsistent 2D mask supervision and lack a robust 3D point-level retrieval mechanism, which is crucial for robotics applications like natural language-driven manipulation and autonomous navigation.
Method: Two key components: (i) a point-level querying framework with tracking on segmentation masks to establish semantically consistent ground-truth for distilling language Gaussians; (ii) GT-anchored querying that first retrieves the distilled ground-truth and then uses it to query individual Gaussians.
Result: Outperforms state-of-the-art on three benchmark datasets with mIoU improvements of +4.14 on LERF, +20.42 on 3D-OVS, and +1.7 on Replica datasets.
Conclusion: The framework represents a promising step toward open-vocabulary understanding in real-world robotic systems, validated by significant performance improvements across multiple datasets.
Abstract: Open-vocabulary 3D scene understanding is crucial for robotics applications, such as natural language-driven manipulation, human-robot interaction, and autonomous navigation. Existing methods for querying 3D Gaussian Splatting often struggle with inconsistent 2D mask supervision and lack a robust 3D point-level retrieval mechanism. In this work, (i) we present a novel point-level querying framework that performs tracking on segmentation masks to establish a semantically consistent ground-truth for distilling the language Gaussians; (ii) we introduce a GT-anchored querying approach that first retrieves the distilled ground-truth and subsequently uses the ground-truth to query the individual Gaussians. Extensive experiments on three benchmark datasets demonstrate that the proposed method outperforms the state of the art. Our method achieves an mIoU improvement of +4.14, +20.42, and +1.7 on the LERF, 3D-OVS, and Replica datasets. These results validate our framework as a promising step toward open-vocabulary understanding in real-world robotic systems.
[397] DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images
Zhenyu Yu, Mohd Yamani Idna Idris, Hua Wang, Pei Wang, Rizwan Qureshi, Shaina Raza, Aman Chadha, Yong Xiang, Zhixiang Chen
Main category: cs.CV
TL;DR: DanceText is a training-free framework for multilingual text editing in images that supports complex geometric transformations while maintaining layout consistency and seamless foreground-background integration.
Details
Motivation: Existing diffusion-based models for text-guided image synthesis lack controllability and fail to preserve layout consistency under complex manipulations like rotation, translation, scaling, and warping.
Method: Uses a layered editing strategy that separates text from background, performs geometric transformations modularly, and employs a depth-aware module to align appearance and perspective between transformed text and reconstructed background.
Result: Achieves superior performance on the AnyWord-3M benchmark, especially in large-scale and complex transformation scenarios, with improved visual quality and spatial consistency.
Conclusion: DanceText provides an effective training-free solution for multilingual text editing in images that handles complex geometric transformations while maintaining photorealistic results and layout consistency.
Abstract: We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios. Code is available at https://github.com/YuZhenyuLindy/DanceText.git.
[398] Event2Vec: Processing Neuromorphic Events directly by Representations in Vector Space
Wei Fang, Priyadarshini Panda
Main category: cs.CV
TL;DR: event2vec is a novel representation method that treats neuromorphic camera events like words in natural language, enabling direct processing by neural networks with high efficiency and compatibility with Transformer architectures.
Details
Motivation: Neuromorphic event cameras have superior capabilities, but their asynchronous, sparse data format is incompatible with conventional deep learning methods, requiring solutions that maintain temporal resolution and leverage GPU acceleration.
Method: Inspired by word-to-vector models, event2vec draws an analogy between words and events, creating a representation that allows neural networks to process events directly while being compatible with Transformer architectures and self-supervised learning (a toy sketch follows the abstract).
Result: Event2vec achieves high accuracy on DVS Gesture, ASL-DVS, and DVS-Lip benchmarks with remarkable parameter efficiency, high throughput, and maintains performance even with extremely low event counts.
Conclusion: Event2vec introduces a paradigm shift enabling neural networks to process event streams as natural language, paving way for native integration of event cameras with large language models and multimodal models.
Abstract: Neuromorphic event cameras possess superior temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, their asynchronous and sparse data format poses a significant challenge for conventional deep learning methods. Existing solutions to this incompatibility often sacrifice temporal resolution, require extensive pre-processing, and do not fully leverage GPU acceleration. Inspired by word-to-vector models, we draw an analogy between words and events to introduce event2vec, a novel representation that allows neural networks to process events directly. This approach is fully compatible with the parallel processing and self-supervised learning capabilities of Transformer architectures. We demonstrate the effectiveness of event2vec on the DVS Gesture, ASL-DVS, and DVS-Lip benchmarks. A comprehensive ablation study further analyzes our method’s features and contrasts them with existing representations. The experimental results show that event2vec is remarkably parameter-efficient, has high throughput, and can achieve high accuracy even with an extremely low number of events. Beyond its performance, the most significant contribution of event2vec is a new paradigm that enables neural networks to process event streams as if they were natural language. This paradigm shift paves the way for the native integration of event cameras with large language models and multimodal models. Code, model, and training logs are provided in https://github.com/Intelligent-Computing-Lab-Panda/event2vec.
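Under the word/event analogy, the core operation is just a token lookup. A minimal sketch, with a vocabulary layout of our own choosing (one token id per pixel-polarity pair), not the paper's implementation:

```python
import torch
import torch.nn as nn

class EventEmbedding(nn.Module):
    """Map raw events (x, y, polarity) to learned vectors, the way a word
    embedding maps token ids, so a Transformer can consume the stream
    directly."""

    def __init__(self, width: int, height: int, dim: int):
        super().__init__()
        self.width = width
        self.table = nn.Embedding(width * height * 2, dim)

    def forward(self, x, y, polarity):           # integer tensors, shape (N,)
        token_id = (y * self.width + x) * 2 + polarity
        return self.table(token_id)               # (N, dim) event vectors
```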
[399] CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis
Alexander Baumann, Leonardo Ayala, Silvia Seidlitz, Jan Sellner, Alexander Studier-Fischer, Berkin Özdemir, Lena Maier-Hein, Slobodan Ilic
Main category: cs.CV
TL;DR: CARL is a camera-agnostic representation learning model that converts spectral images of any channel dimensionality into unified representations, addressing spectral camera variability across RGB, multispectral, and hyperspectral imaging.
Details
Motivation: Spectral imaging faces challenges due to variability in channel dimensionality and captured wavelengths among different cameras, leading to camera-specific AI models with limited generalizability and cross-camera applicability.
Method: Introduces a novel spectral encoder with a self-attention-cross-attention mechanism to distill salient spectral information, combined with spatio-spectral pre-training using a feature-based self-supervision strategy tailored for camera-agnostic learning.
Result: Large-scale experiments across medical imaging, autonomous driving, and satellite imaging domains demonstrate superior robustness to spectral heterogeneity, outperforming on datasets with both simulated and real-world cross-camera spectral variations.
Conclusion: CARL’s scalability and versatility position it as a backbone for future spectral foundation models, effectively addressing the bottleneck of spectral camera variability in AI-driven methodologies.
Abstract: Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing. However, variability in channel dimensionality and captured wavelengths among spectral cameras impede the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability. To address this bottleneck, we introduce CARL, a model for Camera-Agnostic Representation Learning across RGB, multispectral, and hyperspectral imaging modalities. To enable the conversion of a spectral image with any channel dimensionality to a camera-agnostic representation, we introduce a novel spectral encoder, featuring a self-attention-cross-attention mechanism, to distill salient spectral information into learned spectral representations. Spatio-spectral pre-training is achieved with a novel feature-based self-supervision strategy tailored to CARL. Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model’s unique robustness to spectral heterogeneity, outperforming on datasets with simulated and real-world cross-camera spectral variations. The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models.
[400] Image Recognition with Online Lightweight Vision Transformer: A Survey
Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu
Main category: cs.CV
TL;DR: This paper surveys lightweight vision transformer strategies for image recognition, focusing on efficient component design, dynamic networks, and knowledge distillation, evaluated on ImageNet-1K with analysis of trade-offs.
Details
Motivation: Vision transformers face computational and memory challenges that limit real-world applicability, despite their success in capturing long-range dependencies and enabling parallel processing.
Method: Survey and evaluation of three key lightweight strategies: Efficient Component Design, Dynamic Network, and Knowledge Distillation (a toy distillation sketch follows the abstract), analyzed on the ImageNet-1K benchmark.
Result: Comprehensive analysis of trade-offs among precision, parameters, throughput, and other metrics to highlight advantages, disadvantages, and flexibility of different approaches.
Conclusion: Proposes future research directions and potential challenges in lightweight vision transformers to inspire further exploration and provide practical guidance for the community.
Abstract: The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet lack inductive biases and efficiency benefits, facing significant computational and memory challenges that limit its real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT
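Of the three strategy families, knowledge distillation is the most self-contained to illustrate. Below is the classic soft-label distillation recipe (Hinton-style), shown as a generic example rather than any specific surveyed method:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Blend the hard-label cross-entropy with a temperature-softened
    KL term that transfers the teacher's output distribution."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # T^2 keeps gradient magnitudes comparable across temperatures
    return alpha * hard + (1 - alpha) * soft
```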
[401] Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Main category: cs.CV
TL;DR: This paper proposes UnifiedReward-Think, the first unified multimodal chain-of-thought (CoT) based reward model that incorporates explicit long chains of thought reasoning to improve reward signal accuracy and robustness for vision tasks.
Details
Motivation: Current multimodal reward models provide limited reasoning depth, leading to inaccurate reward signals. The authors posit that incorporating explicit long chain-of-thought reasoning can significantly improve reliability and robustness, and that once internalized, CoT reasoning can also enhance direct response accuracy through implicit reasoning capabilities.
Method: Three-phase approach: (1) use a small amount of image generation preference data to distill GPT-4o's reasoning process for cold-start learning of the CoT format; (2) leverage the model's prior knowledge to prepare large-scale unified multimodal preference data that elicits reasoning across vision tasks, retaining correct outputs for rejection sampling; (3) use incorrect samples for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning (a toy advantage sketch follows the abstract) to explore diverse reasoning paths and optimize for correct solutions.
Result: Extensive experiments across various vision reward tasks demonstrate the superiority of the proposed model over existing approaches.
Conclusion: UnifiedReward-Think successfully incorporates explicit long chains of thought reasoning into multimodal reward models, significantly improving their reliability, robustness, and accuracy across visual understanding and generation tasks.
Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model’s latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model’s cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model’s prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model’s reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model (3) while incorrect predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
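The GRPO step in phase (3) replaces a learned critic with group-relative advantages: each sampled response is scored against the other responses to the same prompt. A minimal sketch of that advantage computation (the policy-update machinery is omitted):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each response's reward by the mean and std of its own
    group, so 'better than the group' becomes a positive advantage."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Toy usage: four sampled responses to one prompt.
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.5, 0.5])))
```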
[402] OS-W2S: An Automatic Labeling Engine for Language-Guided Open-Set Aerial Object Detection
Guoting Wei, Yu Liu, Xia Yuan, Xizhe Xue, Linlin Guo, Yifan Yang, Chunxia Zhao, Zongwen Bai, Haokui Zhang, Rong Xiao
Main category: cs.CV
TL;DR: The paper introduces MI-OAD, a large-scale language-guided open-set aerial detection dataset with 163,023 images and 2 million image-caption pairs, which is 40 times larger than comparable datasets. It addresses limitations in fine-grained open-world detection by providing three levels of language guidance (words, phrases, sentences) through an automatic annotation pipeline called OS-W2S Label Engine.
Details
Motivation: Existing language-guided methods for aerial object detection primarily focus on vocabulary-level descriptions, which fail to meet the demands of fine-grained open-world detection due to limited datasets. More comprehensive language guidance is needed, from words to sentences.
Method: Proposed the OS-W2S Label Engine, an automatic annotation pipeline centered around an open-source large vision-language model that integrates image-operation-based preprocessing with BERT-based postprocessing. This pipeline handles diverse scene annotations for aerial images and expands existing datasets with rich textual annotations.
Result: MI-OAD dataset contains 163,023 images and 2 million image-caption pairs (40x larger than comparable datasets). Training on MI-OAD lifts Grounding DINO by +31.1 AP$_{50}$ and +34.7 Recall@10 with sentence-level inputs under zero-shot transfer. Pre-training with MI-OAD yields state-of-the-art performance on multiple existing benchmarks.
Conclusion: MI-OAD effectively addresses limitations in current remote sensing grounding data and enables effective language-guided open-set aerial detection. The dataset and OS-W2S annotations demonstrate high quality and effectiveness across multiple evaluation tasks.
Abstract: In recent years, language-guided open-set aerial object detection has gained significant attention due to its better alignment with real-world application needs. However, due to limited datasets, most existing language-guided methods primarily focus on vocabulary-level descriptions, which fail to meet the demands of fine-grained open-world detection. To address this limitation, we propose constructing a large-scale language-guided open-set aerial detection dataset, encompassing three levels of language guidance: from words to phrases, and ultimately to sentences. Centered around an open-source large vision-language model and integrating image-operation-based preprocessing with BERT-based postprocessing, we present the OS-W2S Label Engine, an automatic annotation pipeline capable of handling diverse scene annotations for aerial images. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark dataset, called MI-OAD, addressing the limitations of current remote sensing grounding data and enabling effective language-guided open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, approximately 40 times larger than comparable datasets. To demonstrate the effectiveness and quality of MI-OAD, we evaluate three representative tasks. On language-guided open-set aerial detection, training on MI-OAD lifts Grounding DINO by +31.1 AP$_{50}$ and +34.7 Recall@10 with sentence-level inputs under zero-shot transfer. Moreover, using MI-OAD for pre-training yields state-of-the-art performance on multiple existing open-vocabulary aerial detection and remote sensing visual grounding benchmarks, validating both the effectiveness of the dataset and the high quality of its OS-W2S annotations. More details are available at https://github.com/GT-Wei/MI-OAD.
[403] Intentional Gesture: Deliver Your Intentions with Gestures for Speech
Pinxin Liu, Haiyang Liu, Luchuan Song, Jason J. Corso, Chenliang Xu
Main category: cs.CV
TL;DR: Intentional-Gesture is a framework that generates co-speech gestures by understanding communicative intentions rather than just linguistic cues, achieving state-of-the-art performance on BEAT-2 benchmark.
Details
Motivation: Current gesture generation methods rely only on superficial linguistic cues (speech audio/text) and neglect communicative intentions, resulting in rhythmically synchronized but semantically shallow gestures.
Method: Created the InG dataset by augmenting BEAT-2 with gesture-intention annotations using vision-language models, and developed the Intentional Gesture Motion Tokenizer to inject communicative functions into motion representations for intention-aware synthesis.
Result: Achieved new state-of-the-art performance on BEAT-2 benchmark, producing gestures that are both temporally aligned and semantically meaningful.
Conclusion: The framework provides a modular foundation for expressive gesture generation in digital humans and embodied AI by grounding gestures in high-level communicative functions.
Abstract: When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g. speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but are semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that are both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: https://andypinxinliu.github.io/Intentional-Gesture
[404] Octic Vision Transformers: Quicker ViTs Through Equivariance
David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Main category: cs.CV
TL;DR: Octic Vision Transformers (octic ViTs) use octic group equivariance to capture geometric symmetries like rotations and reflections, achieving significant computational efficiency gains (5.33x FLOPs reduction, 8x memory reduction) while maintaining baseline accuracy on ImageNet-1K.
Details
Motivation: Current Vision Transformers do not exploit natural geometric symmetries like 90-degree rotations and reflections, even though there is no fundamental obstacle to doing so efficiently.
Method: Introduce octic group equivariance in ViTs through octic linear layers, creating two families: fully equivariant networks and networks that break equivariance in later stages.
Result: Octic ViTs match baseline DeiT-III and DINOv2 accuracy on ImageNet-1K while providing substantial efficiency improvements - 5.33x FLOPs reduction and up to 8x memory reduction compared to ordinary linear layers.
Conclusion: Octic group equivariance enables Vision Transformers to efficiently capture geometric symmetries without sacrificing accuracy, offering a promising direction for more efficient computer vision models.
Abstract: Why are state-of-the-art Vision Transformers (ViTs) not designed to exploit natural geometric symmetries such as 90-degree rotations and reflections? In this paper, we argue that there is no fundamental reason, and what has been missing is an efficient implementation. To this end, we introduce Octic Vision Transformers (octic ViTs) which rely on octic group equivariance to capture these symmetries. In contrast to prior equivariant models that increase computational cost, our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory compared to ordinary linear layers. In full octic ViT blocks the computational reductions approach the reductions in the linear layers with increased embedding dimension. We study two new families of ViTs, built from octic blocks, that are either fully octic equivariant or break equivariance in the last part of the network. Training octic ViTs supervised (DeiT-III) and unsupervised (DINOv2) on ImageNet-1K, we find that they match baseline accuracy while at the same time providing substantial efficiency gains.
[405] PhyMAGIC: Physical Motion-Aware Generative Inference with Confidence-guided LLM
Siwei Meng, Yawei Luo, Ping Liu
Main category: cs.CV
TL;DR: PhyMAGIC is a training-free framework that generates physically consistent 3D motion from a single image by integrating video diffusion models, LLM reasoning, and physics simulation.
Details
Motivation: Current video diffusion models often produce physically implausible results like momentum violations and object interpenetrations, while existing physics-aware methods require task-specific fine-tuning or supervised data, limiting scalability.
Method: Integrates a pre-trained image-to-video diffusion model with confidence-guided LLM reasoning and a differentiable physics simulator. Uses iterative motion-prompt refinement with LLM-derived confidence scores and simulation feedback to steer generation toward physical consistency (a pseudocode skeleton follows the abstract).
Result: Outperforms state-of-the-art video generators and physics-aware baselines, improving physical property inference and motion-text alignment while maintaining visual fidelity.
Conclusion: PhyMAGIC provides a scalable, training-free solution for generating physically consistent 3D motion from single images, producing assets ready for downstream physical simulation without fine-tuning or manual supervision.
Abstract: Recent advances in 3D content generation have amplified demand for dynamic models that are both visually realistic and physically consistent. However, state-of-the-art video diffusion models frequently produce implausible results such as momentum violations and object interpenetrations. Existing physics-aware approaches often rely on task-specific fine-tuning or supervised data, which limits their scalability and applicability. To address the challenge, we present PhyMAGIC, a training-free framework that generates physically consistent motion from a single image. PhyMAGIC integrates a pre-trained image-to-video diffusion model, confidence-guided reasoning via LLMs, and a differentiable physics simulator to produce 3D assets ready for downstream physical simulation without fine-tuning or manual supervision. By iteratively refining motion prompts using LLM-derived confidence scores and leveraging simulation feedback, PhyMAGIC steers generation toward physically consistent dynamics. Comprehensive experiments demonstrate that PhyMAGIC outperforms state-of-the-art video generators and physics-aware baselines, enhancing physical property inference and motion-text alignment while maintaining visual fidelity.
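The training-free loop can be summarized as a skeleton. Every object and method name below (`llm`, `simulator`, `video_model`, and their calls) is a hypothetical placeholder for the components the paper describes, not a real API:

```python
def phymagic_loop(image, llm, simulator, video_model, max_iters=5, tau=0.8):
    """Confidence-guided refinement: generate, simulate, and refine the
    motion prompt until the LLM is confident and the physics check passes.
    All callees are hypothetical stand-ins for the described components."""
    prompt = llm.propose_motion_prompt(image)        # hypothetical call
    video = None
    for _ in range(max_iters):
        video = video_model.generate(image, prompt)  # pre-trained I2V model
        feedback = simulator.check(video)            # differentiable physics
        if llm.confidence(prompt, feedback) >= tau and feedback.ok:
            break
        prompt = llm.refine(prompt, feedback)        # feedback-driven update
    return video
```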
[406] Deeper Diffusion Models Amplify Bias
Shahin Hakemi, Naveed Akhtar, Ghulam Mubashar Hassan, Ajmal Mian
Main category: cs.CV
TL;DR: This paper explores the bias-variance tradeoff in diffusion models, showing they can amplify training data bias and compromise privacy, expanding beyond simple generalization to reveal bias amplification risks in deeper models.
Details
Motivation: Despite diffusion models' strong performance, their internal workings are not well understood, which is potentially problematic. The paper aims to explore the important concept of the bias-variance tradeoff in these models.
Method: Provides a systematic foundation for exploring the bias-variance tradeoff, establishing theoretical frameworks and conducting empirical validations.
Result: The research shows diffusion models can amplify inherent training data bias at one extreme, and compromise training sample privacy at the other extreme. Deeper models reveal increased risk of bias amplification.
Conclusion: The study expands the memorization-generalization understanding of generative models beyond just generalization, revealing significant bias amplification risks in diffusion models that are validated both theoretically and empirically.
Abstract: Despite the remarkable performance of generative Diffusion Models (DMs), their internal working is still not well understood, which is potentially problematic. This paper focuses on exploring the important notion of bias-variance tradeoff in diffusion models. Providing a systematic foundation for this exploration, it establishes that at one extreme, the diffusion models may amplify the inherent bias in the training data, and on the other, they may compromise the presumed privacy of the training samples. Our exploration aligns with the memorization-generalization understanding of the generative models, but it also expands further along this spectrum beyond “generalization”, revealing the risk of bias amplification in deeper models. Our claims are validated both theoretically and empirically.
[407] DVD-Quant: Data-free Video Diffusion Transformers Quantization
Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Main category: cs.CV
TL;DR: DVD-Quant is a data-free quantization framework for Video Diffusion Transformers that achieves 2x speedup while maintaining visual quality, enabling W4A4 post-training quantization without performance degradation.
Details
Motivation: Video DiTs have high computational and memory demands that hinder practical deployment. Existing PTQ methods suffer from computation-heavy calibration procedures and significant performance deterioration after quantization.
Method: Integrates three innovations: Bounded-init Grid Refinement (BGR) and Auto-scaling Rotated Quantization (ARQ) for calibration-data-free error reduction, plus δ-Guided Bit Switching (δ-GBS) for adaptive bit-width allocation (a baseline quantizer sketch follows the abstract).
Result: Achieves approximately 2x speedup over full-precision baselines while maintaining visual fidelity. First method to enable W4A4 PTQ for Video DiTs without compromising video quality.
Conclusion: DVD-Quant provides an effective data-free quantization solution for Video DiTs that balances computational efficiency with maintained performance, enabling practical deployment of state-of-the-art video generation models.
Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on computation-heavy and inflexible calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Bounded-init Grid Refinement (BGR) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on advanced DiT models while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at https://github.com/lhxcs/DVD-Quant.
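For context, the baseline that BGR/ARQ refine is plain uniform quantization of weights. A generic symmetric fake-quantizer (an illustration, not the paper's method) looks like this:

```python
import torch

def fake_quantize_symmetric(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round weights onto a symmetric uniform grid and dequantize,
    simulating W4 storage. DVD-Quant's contribution is choosing better
    grids and rotations without calibration data; this sketch does not."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for 4-bit signed
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale
```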
[408] ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation
Zhen Li, Duan Li, Yukai Guo, Xinyuan Guo, Bowen Li, Lanxi Xiao, Shenyu Qiao, Jiashu Chen, Zijian Wu, Hui Zhang, Xinhuan Shu, Shixia Liu
Main category: cs.CV
TL;DR: ChartGalaxy is a million-scale dataset for improving LVLMs’ understanding and generation of infographic charts by capturing their visual and structural complexity through synthetic data creation.
Details
Motivation: Infographic charts combine visual and textual elements, posing challenges for LVLMs trained on plain charts and creating a gap in multimodal reasoning capabilities.
Method: Constructed through an inductive process that identifies 75 chart types, 440 chart variations, and 68 layout templates from real infographics, then programmatically creates synthetic charts.
Result: Enables three key applications: improved chart understanding via fine-tuning, benchmarking code generation, and example-based chart generation.
Conclusion: ChartGalaxy provides a valuable resource for enhancing multimodal reasoning and generation in LVLMs by capturing real design complexity.
Abstract: Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 440 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.
[409] CARE: Confidence-aware Ratio Estimation for Medical Biomarkers
Jiameng Li, Teodora Popordanoska, Aleksei Tiulpin, Sebastian G. Gruber, Frederik Maes, Matthew B. Blaschko
Main category: cs.CV
TL;DR: A framework for estimating uncertainty in ratio-based biomarkers from medical image segmentation, addressing both statistical confidence intervals and model miscalibration.
Details
Motivation: Ratio-based biomarkers are crucial for clinical decision making, but existing methods provide only point estimates without uncertainty measures, which is problematic for high-stakes medical applications.
Method: Proposes a unified confidence-aware framework that analyzes error propagation in the segmentation-to-biomarker pipeline, identifies model miscalibration as the main source of uncertainty, and uses tunable parameters to control confidence levels (a toy interval sketch follows the abstract).
Result: Extensive experiments show the method produces statistically sound confidence intervals with tunable confidence levels that can adapt to clinical practice requirements.
Conclusion: The framework enables more trustworthy application of predictive biomarkers in clinical workflows by providing reliable uncertainty quantification for ratio-based biomarkers.
Abstract: Ratio-based biomarkers – such as the proportion of necrotic tissue within a tumor – are widely used in clinical practice to support diagnosis, prognosis, and treatment planning. These biomarkers are typically estimated from soft segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified confidence-aware framework for estimating ratio-based biomarkers. Our uncertainty analysis stems from two observations: i) the probability ratio estimator inherently admits a statistical confidence interval regarding local randomness (bias and variance), ii) the segmentation network is not perfectly calibrated. We conduct a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty. We leverage tunable parameters to control the confidence level of the derived bounds, allowing adaptation towards clinical practice. Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of predictive biomarkers in clinical workflows.
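The statistical half of the pipeline can be sketched with a first-order delta method: treat each voxel as an independent Bernoulli with the predicted probability and propagate variance through the ratio. This toy version assumes a perfectly calibrated network (exactly the assumption the paper shows to fail) and ignores numerator/denominator covariance for brevity:

```python
import numpy as np

def ratio_biomarker_ci(p_num: np.ndarray, p_den: np.ndarray, z: float = 1.96):
    """Point estimate and approximate CI for a ratio biomarker, e.g.
    necrotic volume / whole-tumor volume, from soft segmentation maps."""
    num, den = p_num.sum(), p_den.sum()
    var_num = (p_num * (1.0 - p_num)).sum()      # Bernoulli voxel variances
    var_den = (p_den * (1.0 - p_den)).sum()
    r = num / den
    var_r = var_num / den**2 + (num**2) * var_den / den**4   # delta method
    half = z * np.sqrt(var_r)
    return r, (r - half, r + half)
```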
[410] Mamba-Driven Topology Fusion for Monocular 3D Human Pose Estimation
Zenghao Zheng, Lianping Yang, Jinshan Pan, Hegui Zhu
Main category: cs.CV
TL;DR: Proposes Mamba-Driven Topology Fusion framework for 3D human pose estimation, addressing computational challenges of Transformers and limitations of Mamba in handling joint topology by incorporating bone-aware modules and graph convolutional networks.
Details
Motivation: Transformer-based 3D pose estimation methods are computationally inefficient due to the quadratic complexity of self-attention. While Mamba reduces computational overhead for long sequences, it struggles to process joint sequences with topological structure and lacks insight into local joint relationships.
Method: Mamba-Driven Topology Fusion framework with: 1) a Bone Aware Module that infers bone vector direction and length in spherical coordinates as topological guidance (a toy coordinate sketch follows the abstract); 2) Mamba enhanced with forward/backward graph convolutional networks to capture local joint dependencies; 3) a Spatiotemporal Refinement Module for modeling temporal and spatial relationships.
Result: Extensive experiments on Human3.6M and MPI-INF-3DHP datasets show the method greatly reduces computational cost while achieving higher accuracy compared to existing approaches. Ablation studies confirm effectiveness of each proposed module.
Conclusion: The proposed framework effectively addresses Mamba’s limitations in capturing human structural relationships by incorporating skeletal topology, achieving both computational efficiency and improved accuracy in 3D human pose estimation.
Abstract: Transformer-based methods for 3D human pose estimation face significant computational challenges due to the quadratic growth of self-attention mechanism complexity with sequence length. Recently, the Mamba model has substantially reduced computational overhead and demonstrated outstanding performance in modeling long sequences by leveraging state space model (SSM). However, the ability of SSM to process sequential data is not suitable for 3D joint sequences with topological structures, and the causal convolution structure in Mamba also lacks insight into local joint relationships. To address these issues, we propose the Mamba-Driven Topology Fusion framework in this paper. Specifically, the proposed Bone Aware Module infers the direction and length of bone vectors in the spherical coordinate system, providing effective topological guidance for the Mamba model in processing joint sequences. Furthermore, we enhance the convolutional structure within the Mamba model by integrating forward and backward graph convolutional network, enabling it to better capture local joint dependencies. Finally, we design a Spatiotemporal Refinement Module to model both temporal and spatial relationships within the sequence. Through the incorporation of skeletal topology, our approach effectively alleviates Mamba’s limitations in capturing human structural relationships. We conduct extensive experiments on the Human3.6M and MPI-INF-3DHP datasets for testing and comparison, and the results show that the proposed method greatly reduces computational cost while achieving higher accuracy. Ablation studies further demonstrate the effectiveness of each proposed module. The code and models will be released.
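The Bone Aware Module's output, bone direction and length in spherical coordinates, amounts to a coordinate transform over the kinematic tree. A minimal sketch, assuming a standard parent-index skeleton layout:

```python
import torch

def bones_to_spherical(joints: torch.Tensor, parents: list) -> torch.Tensor:
    """joints: (J, 3) positions; parents[j] is j's parent index (root maps
    to itself, giving a zero-length bone). Returns (J, 3) of (r, theta, phi)."""
    bones = joints - joints[parents]                 # child-minus-parent vectors
    x, y, z = bones.unbind(-1)
    r = bones.norm(dim=-1).clamp_min(1e-8)
    theta = torch.acos((z / r).clamp(-1.0, 1.0))     # polar angle
    phi = torch.atan2(y, x)                          # azimuth
    return torch.stack([r, theta, phi], dim=-1)
```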
[411] Towards Scalable Language-Image Pre-training for 3D Medical Imaging
Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, Todd Hollon
Main category: cs.CV
TL;DR: HLIP introduces hierarchical attention for language-image pre-training on uncurated 3D medical imaging studies, achieving state-of-the-art performance on brain MRI and head CT benchmarks.
Details
Motivation: Current language-image pre-training for 3D medical imaging requires manual curation by radiologists, limiting scalability. Pre-training directly on uncurated studies aligns better with the radiologist workflow and enables natural scalability.
Method: Proposes the HLIP framework with a hierarchical attention mechanism that models the intrinsic hierarchy of radiology data: slice, scan, and study levels. Trained on 220K brain MRI studies (3.13M scans) and 240K head CT studies (1.44M scans).
Result: Achieves +10.5% balanced ACC on Pub-Brain-5 brain MRI benchmark; +8.3% and +1.7% macro AUC on CQ500 and RSNA head CT benchmarks; +4.3% macro AUC on Rad-ChestCT benchmark when pre-trained on CT-RATE.
Conclusion: Direct pre-training on uncurated clinical datasets with HLIP is a scalable and effective approach for language-image pre-training in 3D medical imaging, demonstrating strong generalization across benchmarks.
Abstract: The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the radiologist’s workflow and provides a natural path to scalability. However, the unique structure of such data presents new challenges for existing model architectures, which were originally designed for 2D slices or single 3D scans. To address this, we introduce a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study. We denote our framework as Hierarchical attention for Language-Image Pre-training (HLIP). Trained on 220K studies with 3.13 million scans for brain MRI and 240K studies with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +10.5% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +8.3% and +1.7% macro AUC on head CT benchmarks CQ500 and RSNA, respectively. HLIP also exhibits strong generalizability on existing 3D medical language-image pre-training benchmarks, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip.
[412] Pose-free 3D Gaussian splatting via shape-ray estimation
Youngju Na, Taeyeon Kim, Jumin Lee, Kyu Beom Han, Woo Jae Kim, Sung-eui Yoon
Main category: cs.CV
TL;DR: SHARE is a pose-free Gaussian splatting framework that jointly estimates shape and camera rays to handle noisy pose estimates in real-world scenarios.
Details
Motivation: Generalizable 3D Gaussian splatting depends heavily on precise camera poses, which are challenging to obtain accurately in real-world scenarios, leading to geometric misalignments.
Method: SHARE builds a pose-aware canonical volume representation that integrates multi-view information without explicit 3D transformations, and uses anchor-aligned Gaussian prediction to refine local geometry around coarse anchors.
Result: Extensive experiments on diverse real-world datasets show robust performance in pose-free generalizable Gaussian splatting.
Conclusion: SHARE effectively addresses pose ambiguity in Gaussian splatting through joint shape and camera ray estimation, achieving reliable performance without requiring accurate pose estimates.
Abstract: While generalizable 3D Gaussian splatting enables efficient, high-quality rendering of unseen scenes, it heavily depends on precise camera poses for accurate geometry. In real-world scenarios, obtaining accurate poses is challenging, leading to noisy pose estimates and geometric misalignments. To address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting framework that overcomes these ambiguities through joint shape and camera ray estimation. Instead of relying on explicit 3D transformations, SHARE builds a pose-aware canonical volume representation that seamlessly integrates multi-view information, reducing misalignment caused by inaccurate pose estimates. Additionally, anchor-aligned Gaussian prediction enhances scene reconstruction by refining local geometry around coarse anchors, allowing for more precise Gaussian placement. Extensive experiments on diverse real-world datasets show that our method achieves robust performance in pose-free generalizable Gaussian splatting.
[413] Physics-Guided Motion Loss for Video Generation Model
Bowen Xue, Giuseppe Claudio Guarnera, Shuang Zhao, Zahra Montazeri
Main category: cs.CV
TL;DR: A frequency-domain physics prior improves motion plausibility in video diffusion models by decomposing rigid motions into lightweight spectral losses, requiring only 2.7% of frequency coefficients while preserving 97%+ spectral energy.
Details
Motivation: Current video diffusion models generate visually compelling content but often violate basic laws of physics, producing subtle artifacts like rubber-sheet deformations and inconsistent object motion.
Method: Introduces a frequency-domain physics prior that decomposes common rigid motions (translation, rotation, scaling) into lightweight spectral losses without modifying model architectures.
Result: Improves motion accuracy and action recognition by ~11% on average on OpenVID-1M, reduces warping error by 22-37%, and user studies show 74-83% preference for physics-enhanced videos while maintaining visual quality.
Conclusion: Simple, global spectral cues are an effective drop-in regularizer for physically plausible motion in video diffusion models.
Abstract: Current video diffusion models generate visually compelling content but often violate basic laws of physics, producing subtle artifacts like rubber-sheet deformations and inconsistent object motion. We introduce a frequency-domain physics prior that improves motion plausibility without modifying model architectures. Our method decomposes common rigid motions (translation, rotation, scaling) into lightweight spectral losses, requiring only 2.7% of frequency coefficients while preserving 97%+ of spectral energy. Applied to Open-Sora, MVDIT, and Hunyuan, our approach improves both motion accuracy and action recognition by ~11% on average on OpenVID-1M (relative), while maintaining visual quality. User studies show 74–83% preference for our physics-enhanced videos. It also reduces warping error by 22–37% (depending on the backbone) and improves temporal consistency scores. These results indicate that simple, global spectral cues are an effective drop-in regularizer for physically plausible motion in video diffusion.
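As a concrete illustration of this kind of loss, the sketch below compares motion spectra, taken as FFTs of frame differences, on only the top fraction of coefficients. The 2.7% keep ratio comes from the summary above; frame differences as the motion signal and an L1 penalty on the selected coefficients are illustrative assumptions.

```python
import torch

def spectral_motion_loss(gen, ref, keep_ratio=0.027):
    # gen, ref: (T, H, W) grayscale frames; frame differences as a motion proxy
    gen_motion = gen[1:] - gen[:-1]
    ref_motion = ref[1:] - ref[:-1]
    G = torch.fft.rfft2(gen_motion)
    R = torch.fft.rfft2(ref_motion)
    # keep only the top-k reference coefficients by magnitude (shared mask)
    mag = R.abs().mean(dim=0).flatten()
    k = max(1, int(keep_ratio * mag.numel()))
    idx = mag.topk(k).indices
    return (G.flatten(1)[:, idx] - R.flatten(1)[:, idx]).abs().mean()

gen = torch.randn(16, 64, 64, requires_grad=True)
ref = torch.randn(16, 64, 64)
loss = spectral_motion_loss(gen, ref)
loss.backward()  # differentiable, so it can regularize a video model's output
print(float(loss))
```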
[414] ReSpace: Text-Driven 3D Indoor Scene Synthesis and Editing with Preference Alignment
Martin JJ. Bucher, Iro Armeni
Main category: cs.CV
TL;DR: ReSpace is a text-driven 3D indoor scene synthesis and editing framework using autoregressive language models with explicit room boundaries and asset-agnostic deployment.
Details
Motivation: Current methods oversimplify object semantics, require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM-based methods enable richer semantics but lack editing functionality or have limited spatial reasoning.
Method: Uses a compact structured scene representation with explicit room boundaries, frames scene editing as next-token prediction task, employs dual-stage training (supervised fine-tuning + preference alignment), and uses zero-shot LLM for object removal and addition prompts.
Result: Surpasses state-of-the-art on addition tasks and achieves superior human-perceived quality on full scene synthesis.
Conclusion: ReSpace provides an effective framework for text-driven 3D indoor scene synthesis and editing that addresses limitations of existing methods through structured representations and language model capabilities.
Abstract: Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scenes either oversimplify object semantics through one-hot class encodings (e.g., ‘chair’ or ‘table’), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM-based methods enable richer semantics via natural language (e.g., ‘modern studio with light wood furniture’), but lack editing functionality, are limited to rectangular layouts, or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models. Our approach features a compact structured scene representation with explicit room boundaries that enables asset-agnostic deployment and frames scene editing as a next-token prediction task. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment, enabling a specially trained language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition. We further introduce a voxelization-based evaluation capturing fine-grained geometry beyond 3D bounding boxes. Experimental results surpass state-of-the-art on addition and achieve superior human-perceived quality on full scene synthesis.
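To illustrate how a compact structured scene representation makes editing a next-token prediction problem, here is a toy serialization of a room boundary plus objects into a flat token string; the schema, tags, and quantization are illustrative assumptions, not ReSpace's actual format.

```python
def serialize_scene(scene, step=0.1):
    """Flatten a room boundary and its objects into a token-friendly string."""
    def q(v):  # coarse quantization keeps the effective vocabulary small
        return round(round(v / step) * step, 2)
    parts = ["<boundary>"] + [f"({q(x)},{q(y)})" for x, y in scene["boundary"]]
    for obj in scene["objects"]:
        parts += ["<obj>", obj["desc"],
                  f"pos=({q(obj['x'])},{q(obj['y'])})", f"yaw={round(obj['yaw'])}"]
    return " ".join(parts)

scene = {
    "boundary": [(0, 0), (4.05, 0), (4.05, 3.1), (0, 3.1)],
    "objects": [{"desc": "light-wood desk", "x": 1.02, "y": 2.48, "yaw": 90}],
}
print(serialize_scene(scene))  # an editing model would continue this with "<obj> ..."
```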
[415] Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection
Damith Chamalke Senadeera, Xiaoyun Yang, Shibo Li, Muhammad Awais, Dimitrios Kollias, Gregory Slabaugh
Main category: cs.CV
TL;DR: Proposes Dual Branch VideoMamba with Gated Class Token Fusion (GCTF) for efficient violence detection in surveillance videos, combining spatial and temporal branches with state-space models to handle long-term dependencies while maintaining computational efficiency.
Details
Motivation: Address limitations of CNNs and Transformers in handling long-term dependencies and computational efficiency for automated violence detection in surveillance systems, especially in challenging scenarios.
Method: Uses dual-branch architecture with state-space model backbone - one branch captures spatial features, the other focuses on temporal dynamics, with continuous fusion via gating mechanism between branches. Also creates new benchmark by merging multiple datasets with strict train/test separation.
Result: Achieves state-of-the-art performance on the new benchmark and DVD dataset, offering optimal balance between accuracy and computational efficiency for near real-time surveillance violence detection.
Conclusion: Demonstrates the promise of state-space models for scalable, near real-time surveillance violence detection, providing an efficient alternative to CNNs and Transformers.
Abstract: The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone where one branch captures spatial features, while the other focuses on temporal dynamics. The model performs continuous fusion via a gating mechanism between the branches to enhance the model’s ability to detect violent activities even in challenging surveillance scenarios. We also present a new benchmark by merging RWF-2000, RLVS, SURV and VioPeru datasets in video violence detection, ensuring strict separation between training and testing sets. Experimental results demonstrate that our model achieves state-of-the-art performance on this benchmark and also on DVD dataset which is another novel dataset on video violence detection, offering an optimal balance between accuracy and computational efficiency, demonstrating the promise of SSMs for scalable, near real-time surveillance violence detection.
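The gated fusion of the two branches' class tokens can be sketched in a few lines; the sigmoid gate and the shapes below are illustrative assumptions rather than the paper's exact GCTF module.

```python
import torch
import torch.nn as nn

class GatedClassTokenFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, cls_spatial, cls_temporal):
        # cls_*: (batch, dim) class tokens from the spatial / temporal branches
        g = self.gate(torch.cat([cls_spatial, cls_temporal], dim=-1))
        return g * cls_spatial + (1 - g) * cls_temporal  # element-wise gating

fuse = GatedClassTokenFusion()
print(fuse(torch.randn(2, 512), torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```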
[416] Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers
Haosong Liu, Yuge Cheng, Wenxuan Miao, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, Minyi Guo
Main category: cs.CV
TL;DR: Astraea is a framework that optimizes video diffusion transformers (vDiTs) for faster inference by using token selection and sparse attention, achieving up to 13.2x speedup with minimal quality loss.
Details
Motivation: Video diffusion transformers have high computational demands that limit practical deployment, and existing acceleration methods rely on heuristics with limited applicability.
Method: Proposes lightweight token selection mechanism, memory-efficient sparse attention strategy, and uses evolutionary algorithm to automatically determine optimal token reduction across timesteps.
Result: Achieves up to 2.4x speedup on single GPU and 13.2x on 8 GPUs, with less than 0.5% quality loss on VBench compared to baselines.
Conclusion: Astraea provides an effective framework for accelerating vDiT-based video generation while maintaining high quality, enabling practical deployment.
Abstract: Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their high compute demands pose a major challenge for practical deployment. While studies propose acceleration methods to reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target. At its core, Astraea proposes a lightweight token selection mechanism and a memory-efficient, GPU-friendly sparse attention strategy, enabling linear savings on execution time with minimal impact on generation quality. Meanwhile, to determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, Astraea achieves up to 2.4× inference speedup on a single GPU with great scalability (up to 13.2× speedup on 8 GPUs), while achieving up to 10 dB higher video quality than state-of-the-art methods (<0.5% loss on VBench compared to baselines).
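The search component can be illustrated with a toy evolutionary loop that evolves a per-timestep token-keep schedule under a global budget. The fitness function is a synthetic stand-in (Astraea would score actual generation quality), and all constants are illustrative.

```python
import random
random.seed(0)

STEPS, BUDGET = 20, 0.35  # 20 denoising steps, keep 35% of tokens on average

def fitness(schedule):
    # synthetic stand-in: pretend later steps tolerate more pruning than early ones
    return -sum((keep - (0.2 + 0.6 * (1 - t / STEPS))) ** 2
                for t, keep in enumerate(schedule))

def normalize(s):  # rescale toward the global budget, clamp to a sane range
    scale = BUDGET * STEPS / sum(s)
    return [min(1.0, max(0.05, x * scale)) for x in s]

pop = [normalize([random.uniform(0.05, 1.0) for _ in range(STEPS)]) for _ in range(16)]
for _ in range(50):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:4]                                         # elitist selection
    children = []
    for _ in range(12):
        a, b = random.sample(parents, 2)
        child = [random.choice(pair) for pair in zip(a, b)]   # uniform crossover
        child = [x + random.gauss(0, 0.05) for x in child]    # Gaussian mutation
        children.append(normalize(child))
    pop = parents + children
print([round(x, 2) for x in max(pop, key=fitness)])  # evolved keep-ratio schedule
```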
[417] DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models
Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Xianpeng Lang
Main category: cs.CV
TL;DR: DriveAction is the first action-driven benchmark for VLA models in autonomous driving, featuring 16,185 QA pairs from 2,610 real-world driving scenarios with driver-collected action labels and a tree-structured evaluation framework.
Details
Motivation: Existing VLA benchmarks lack scenario diversity, reliable action-level annotation, and human-preference-aligned evaluation protocols for autonomous driving applications.
Method: Leveraged real-world driving data collected by autonomous vehicle drivers to ensure broad scenario coverage, collected high-level discrete action labels from actual driving operations, and implemented an action-rooted tree-structured evaluation framework linking vision, language, and action tasks.
Result: Experiments show state-of-the-art VLMs require both vision and language guidance for accurate action prediction: accuracy drops by 3.3% without vision, 4.1% without language, and 8.0% without either input. The evaluation framework provides precise bottleneck identification with robust results.
Conclusion: DriveAction provides new insights and a rigorous foundation for advancing human-like decisions in autonomous driving through comprehensive scenario coverage, reliable action annotations, and structured evaluation.
Abstract: Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by drivers of autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from drivers’ actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.
[418] Structure before the Machine: Input Space is the Prerequisite for Concepts
Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li
Main category: cs.CV
TL;DR: The paper proposes the Input-Space Linearity Hypothesis (ISLH) and introduces the Spectral Principal Path (SPP) framework to explain how deep networks progressively distill linear representations along dominant spectral directions, demonstrating multimodal robustness in Vision-Language Models.
Details
Motivation: To enhance AI transparency and control by shifting focus from individual neurons to structured semantic directions aligned with human-interpretable concepts, motivated by the Linear Representation Hypothesis.
Method: Proposes the Input-Space Linearity Hypothesis (ISLH) and introduces the Spectral Principal Path (SPP) framework to formalize how deep networks progressively distill linear representations along dominant spectral directions.
Result: Demonstrates the multimodal robustness of these representations in Vision-Language Models (VLMs) and bridges theoretical insights with empirical validation.
Conclusion: Advances a structured theory of representation formation in deep networks, paving the way for improving AI robustness, fairness, and transparency.
Abstract: High-level representations have become a central focus in enhancing AI transparency and control, shifting attention from individual neurons or circuits to structured semantic directions that align with human-interpretable concepts. Motivated by the Linear Representation Hypothesis (LRH), we propose the Input-Space Linearity Hypothesis (ISLH), which posits that concept-aligned directions originate in the input space and are selectively amplified with increasing depth. We then introduce the Spectral Principal Path (SPP) framework, which formalizes how deep networks progressively distill linear representations along a small set of dominant spectral directions. Building on this framework, we further demonstrate the multimodal robustness of these representations in Vision-Language Models (VLMs). By bridging theoretical insights with empirical validation, this work advances a structured theory of representation formation in deep networks, paving the way for improving AI robustness, fairness, and transparency.
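The spectral intuition can be checked with a toy experiment: an input direction aligned with a layer's dominant singular direction is amplified more strongly than a random direction as it propagates. This illustrates the premise only, not the SPP framework itself.

```python
import torch

torch.manual_seed(0)
depth, dim = 6, 64
layers = [torch.randn(dim, dim) / dim ** 0.5 for _ in range(depth)]

def propagate(v):
    norms = []
    for W in layers:
        v = W @ v
        norms.append(round(v.norm().item(), 2))
    return norms

# direction aligned with the first layer's dominant spectral direction
_, S, Vh = torch.linalg.svd(layers[0])
aligned = propagate(Vh[0])
rand = propagate(torch.nn.functional.normalize(torch.randn(dim), dim=0))
print("aligned:", aligned)  # amplified by the top singular value at layer 1
print("random :", rand)     # grows roughly like an average singular value
```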
[419] VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks
Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Weihong Lin, Zekun Wang, Bohan Zeng, Yang Shi, Sihan Yang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan
Main category: cs.CV
TL;DR: Proposes VidBridge-R1, a novel training framework using proxy tasks (DarkEventInfer and MixVidQA) to resolve conflict between video QA and captioning, enabling one model to excel at both tasks.
Details
Motivation: Current video models specialize in either QA or captioning but struggle with both due to conflicting task natures, causing performance degradation when combined.
Method: Uses two proxy tasks: DarkEventInfer (inferring masked video events) and MixVidQA (reasoning about interleaved video clips) to develop both holistic understanding and precise reasoning capabilities.
Result: VidBridge-R1 achieves significant performance gains on both QA and captioning tasks within a single model, demonstrating effective paradigm conflict resolution.
Conclusion: The proposed framework successfully bridges the conflict between QA and captioning, fostering more generalizable and powerful video understanding models.
Abstract: The “Reason-Then-Respond” paradigm, enhanced by Reinforcement Learning, has shown great promise in advancing Multimodal Large Language Models. However, its application to the video domain has led to specialized models that excel at either question answering (QA) or captioning tasks, but struggle to master both. Naively combining reward signals from these tasks results in mutual performance degradation, which we attribute to a conflict between their opposing task natures. To address this challenge, we propose a novel training framework built upon two intermediate proxy tasks: DarkEventInfer, which presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues; and MixVidQA, which presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. These proxy tasks compel the model to simultaneously develop both holistic, divergent understanding and precise, convergent reasoning capabilities. Embodying this framework, we present VidBridge-R1, the first versatile video reasoning model that effectively bridges the paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model, demonstrating the efficacy of our approach in fostering more generalizable and powerful video understanding models.
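To make the MixVidQA construction concrete, the toy sketch below interleaves two clips in fixed-size chunks so a model must reason about one clip while ignoring the other; chunk size and pairing are illustrative assumptions.

```python
def interleave_clips(clip_a, clip_b, chunk=4):
    """Alternate fixed-size chunks of frames from two distinct clips."""
    frames = []
    for i in range(0, max(len(clip_a), len(clip_b)), chunk):
        frames += clip_a[i:i + chunk] + clip_b[i:i + chunk]
    return frames

a = [f"A{i}" for i in range(8)]
b = [f"B{i}" for i in range(8)]
print(interleave_clips(a, b))
# ['A0'..'A3', 'B0'..'B3', 'A4'..'A7', 'B4'..'B7'] -- then ask about clip A only
```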
[420] LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
Main category: cs.CV
TL;DR: Proposes LEO-VL, a 3D vision-language model using condensed feature grids for efficient scene representation, trained on 700k 3D-VL data across multiple domains and tasks, achieving SOTA performance on 3D QA benchmarks.
Details
Motivation: Current 3D vision-language models lag behind 2D counterparts due to inefficient scene representations that trade performance for heavy token overhead, limiting scalability.
Method: Introduces condensed feature grid (CFG) for efficient scene representation, LEO-VL model trained on diverse 3D-VL data, and SceneDPO for post-training robustness through answer and scene contrasts.
Result: LEO-VL achieves state-of-the-art performance on SQA3D, MSQA, and Beacon3D benchmarks, demonstrating efficiency, scalability benefits, and robustness improvements over SFT and GRPO.
Conclusion: The proposed methods advance 3D VLMs in efficiency, scalability, and robustness, with condensed feature grids enabling competitive performance without heavy token overhead.
Abstract: Developing vision-language models (VLMs) capable of understanding 3D scenes has been a longstanding goal in the 3D-VL community. Despite recent progress, 3D VLMs still fall short of their 2D counterparts in capability and robustness. A key bottleneck is that current scene representations struggle to balance performance and efficiency: competitive performance comes at the cost of heavy token overhead, which in turn hampers the scalability of 3D-VL learning. To address this, we propose the condensed feature grid (CFG), an efficient scene representation featuring significantly reduced token overhead and strong perception capability. Building on CFG, we introduce LEO-VL, a 3D VLM trained on 700k 3D-VL data spanning four real-world indoor domains and five tasks such as captioning and dialogue. To enhance the robustness of 3D VLM, we further propose SceneDPO for post-training, which involves contrasts across answers and scenes. LEO-VL achieves state-of-the-art performance on various 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Our extensive experiments highlight the efficiency of our representation, the benefit of task and scene diversity, consistent scaling effects, and the advantages of SceneDPO compared to SFT and GRPO. We hope our findings advance the efficiency, scalability, and robustness of future 3D VLMs.
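The token-reduction idea behind a condensed feature grid can be sketched by average-pooling per-point features into a coarse voxel grid, collapsing a 50k-point scene to at most a few hundred tokens. The grid resolution and mean pooling are illustrative assumptions, not LEO-VL's exact design.

```python
import torch

def condensed_feature_grid(xyz, feats, grid=8):
    # xyz: (N, 3) point coordinates, feats: (N, D) per-point features
    lo, hi = xyz.min(0).values, xyz.max(0).values
    cell = ((xyz - lo) / (hi - lo + 1e-6) * grid).long().clamp(max=grid - 1)
    flat = cell[:, 0] * grid * grid + cell[:, 1] * grid + cell[:, 2]
    D = feats.shape[1]
    sums = torch.zeros(grid ** 3, D).index_add_(0, flat, feats)
    counts = torch.zeros(grid ** 3).index_add_(0, flat, torch.ones(len(flat)))
    occupied = counts > 0
    return sums[occupied] / counts[occupied].unsqueeze(1)  # (tokens, D)

tokens = condensed_feature_grid(torch.rand(50_000, 3), torch.randn(50_000, 256))
print(tokens.shape)  # at most 8^3 = 512 scene tokens instead of 50k points
```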
[421] Think With Videos For Agentic Long-Video Understanding
Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou
Main category: cs.CV
TL;DR: VideoExplorer is a framework for long-video understanding that iteratively formulates sub-questions, locates relevant moments, and performs task-oriented video analysis, enabling efficient and interpretable reasoning.
Details
Motivation: Existing methods for long-video understanding either sacrifice fine-grained details by downsampling frames or rely on textual reasoning over task-agnostic representations, which hinders task-specific perception and exploration.
Method: VideoExplorer uses iterative reasoning with sub-question formulation, temporal grounding, and scalable perception. It employs a two-stage training pipeline with supervised trajectory initialization followed by trajectory-level preference optimization, trained on a constructed long-video reasoning dataset using difficulty-adaptive sampling.
Result: Extensive evaluations on long-video understanding benchmarks show VideoExplorer’s significant advantage over existing baselines, demonstrating robustness, adaptability, and efficiency.
Conclusion: VideoExplorer provides a faithful, efficient, and interpretable approach to long-video understanding by naturally integrating planning, temporal grounding, and scalable perception into a coherent reasoning process.
Abstract: Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of “thinking with video”, which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer’s significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (https://github.com/yhy-2000/VideoDeepResearch).
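The iterative control flow reads naturally as a loop: plan a sub-question, ground it temporally, perceive the located segment, repeat until enough evidence is gathered. The helper functions below are stubs standing in for model calls; only the loop structure mirrors the description above.

```python
def answer_long_video(question, video, max_steps=5):
    notes = []
    for _ in range(max_steps):
        sub_q = propose_subquestion(question, notes)        # plan the next probe
        if sub_q is None:                                   # enough evidence gathered
            break
        start, end = ground_moment(video, sub_q)            # temporal grounding
        notes.append((sub_q, perceive(video, start, end)))  # task-oriented viewing
    return synthesize_answer(question, notes)

# --- stubs standing in for the planner / grounder / perceiver models ---
def propose_subquestion(question, notes):
    return None if len(notes) >= 2 else f"probe {len(notes) + 1} for: {question}"

def ground_moment(video, sub_q):
    return 0.0, 10.0  # pretend the relevant moment is the first 10 seconds

def perceive(video, start, end):
    return f"observation for [{start:.0f}s, {end:.0f}s]"

def synthesize_answer(question, notes):
    return f"answer synthesized from {len(notes)} observations"

print(answer_long_video("Who wins the final match?", video=None))
```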
[422] TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
Zhongbin Guo, Yuhao Wang, Ping Jian, Chengzhi Li, Xinyue Chen, Zhen Yang, Ertai E
Main category: cs.CV
TL;DR: TAMMs is a unified framework that jointly performs Temporal Change Description and Future Satellite Image Forecasting using MLLM-diffusion architecture with temporal adaptation modules and semantic-fused control injection.
Details
Motivation: To address the disjointed nature of Temporal Change Description and Future Satellite Image Forecasting tasks in Satellite Image Time Series analysis, and overcome their shared limitation in modeling long-range temporal dynamics.
Method: Introduces Temporal Adaptation Modules to enhance MLLM’s long-range temporal understanding, and Semantic-Fused Control Injection mechanism to translate change understanding into generative control.
Result: Extensive experiments show TAMMs significantly outperforms state-of-the-art specialist baselines on both tasks.
Conclusion: The synergistic design enables understanding from TCD task to directly improve consistency of FSIF task, demonstrating effective joint performance on both tasks.
Abstract: Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) are critical, yet historically disjointed tasks in Satellite Image Time Series (SITS) analysis. Both are fundamentally limited by the common challenge of modeling long-range temporal dynamics. To explore how to improve the performance of methods on both tasks simultaneously by enhancing long-range temporal understanding capabilities, we introduce TAMMs, the first unified framework designed to jointly perform TCD and FSIF within a single MLLM-diffusion architecture. TAMMs introduces two key innovations: Temporal Adaptation Modules (TAM) enhance a frozen MLLM’s ability to comprehend long-range dynamics, and a Semantic-Fused Control Injection (SFCI) mechanism translates this change understanding into fine-grained generative control. This synergistic design allows the understanding from the TCD task to directly inform and improve the consistency of the FSIF task. Extensive experiments demonstrate TAMMs significantly outperforms state-of-the-art specialist baselines on both tasks.
[423] Ctrl-Z Sampling: Diffusion Sampling with Controlled Random Zigzag Explorations
Shunqi Mao, Wei Guo, Chaoyi Zhang, Jieting Long, Ke Xie, Weidong Cai
Main category: cs.CV
TL;DR: Ctrl-Z Sampling is a novel diffusion model sampling strategy that adaptively detects and escapes local optima through controlled exploration, improving generation quality with moderate computational overhead.
Details
Motivation: Diffusion models often converge to local optima with suboptimal generations due to latent space complexity and poor initialization, while existing methods have limited capacity to escape steep local maxima.
Method: Proposes Controlled Random Zigzag Sampling (Ctrl-Z Sampling) that uses a reward model to detect local maxima, then injects noise and reverts to previous states to escape plateaus, evaluating candidate trajectories and performing deeper explorations when needed.
Result: Experimental results show Ctrl-Z Sampling substantially improves generation quality while requiring only about 7.72 times the NFEs (Number of Function Evaluations) of the original diffusion process.
Conclusion: The proposed method is model-agnostic and compatible with existing diffusion frameworks, enabling dynamic alternation between forward refinement and backward exploration to enhance both alignment and visual quality in generated outputs.
Abstract: Diffusion models have shown strong performance in conditional generation by progressively denoising Gaussian samples toward a target data distribution. This denoising process can be interpreted as a form of hill climbing in a learned representation space, where the model iteratively refines a sample toward regions of higher probability. However, this learned climbing often converges to local optima with plausible but suboptimal generations due to latent space complexity and suboptimal initialization. While prior efforts often strengthen guidance signals or introduce fixed exploration strategies to address this, they exhibit limited capacity to escape steep local maxima. In contrast, we propose Controlled Random Zigzag Sampling (Ctrl-Z Sampling), a novel sampling strategy that adaptively detects and escapes such traps through controlled exploration. In each diffusion step, we first identify potential local maxima using a reward model. Upon such detection, we inject noise and revert to a previous, noisier state to escape the current plateau. The reward model then evaluates candidate trajectories, accepting only those that offer improvement, otherwise scheduling progressively deeper explorations when nearby alternatives fail. This controlled zigzag process allows dynamic alternation between forward refinement and backward exploration, enhancing both alignment and visual quality in the generated outputs. The proposed method is model-agnostic and also compatible with existing diffusion frameworks. Experimental results show that Ctrl-Z Sampling substantially improves generation quality while requiring only about 7.72 times the NFEs of the original diffusion process.
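A toy one-dimensional version captures the zigzag logic: climb while the reward improves, and on a plateau inject noise around the current state, accepting an escape only if the reward model scores it higher. The dynamics and reward below are toy stand-ins for a diffusion model and its reward model.

```python
import random
random.seed(0)

def denoise_step(x):   # toy stand-in for one reverse-diffusion step
    return x + 0.2 * (1.0 - x)        # deterministically climbs toward x = 1

def reward(x):         # local optimum near x = 1, global optimum at x = 2
    return max(-(x - 1.0) ** 2, 1.0 - (x - 2.0) ** 2)

x = 0.0
for _ in range(40):
    cand = denoise_step(x)
    if reward(cand) > reward(x) + 1e-3:
        x = cand                                   # forward refinement accepted
    else:                                          # plateau: likely a local maximum
        probes = [x + random.gauss(0, 0.6) for _ in range(8)]  # inject noise, explore
        best = max(probes, key=reward)
        if reward(best) > reward(x):               # accept only improving escapes
            x = best
print(round(x, 3), round(reward(x), 3))  # ends near the global optimum x = 2
```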
[424] Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy
Yuhao Liu, Tengfei Wang, Fang Liu, Zhenwei Wang, Rynson W. H. Lau
Main category: cs.CV
TL;DR: Shape-for-Motion is a novel framework that uses 3D proxies for precise and consistent video editing, enabling users to perform edits on 3D meshes that automatically propagate across frames.
Details
Motivation: Users need tools for faithful creative editing with precise and consistent control in video synthesis, but existing methods struggle with fine-grained alignment with user intentions.
Method: Converts target objects to time-consistent 3D meshes, uses Dual-Propagation Strategy to propagate edits from single frame to others, projects 3D meshes to 2D renderings, and employs decoupled video diffusion model for final generation.
Result: Supports various precise and physically-consistent manipulations including pose editing, rotation, scaling, translation, texture modification, and object composition across video frames.
Conclusion: The approach represents a key step toward high-quality, controllable video editing workflows, with extensive experiments demonstrating superiority and effectiveness.
Abstract: Recent advances in deep generative modeling have unlocked unprecedented opportunities for video synthesis. In real-world applications, however, users often seek tools to faithfully realize their creative editing intentions with precise and consistent control. Despite the progress achieved by existing methods, ensuring fine-grained alignment with user intentions remains an open and challenging problem. In this work, we present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing. Shape-for-Motion achieves this by converting the target object in the input video to a time-consistent mesh, i.e., a 3D proxy, allowing edits to be performed directly on the proxy and then inferred back to the video frames. To simplify the editing process, we design a novel Dual-Propagation Strategy that allows users to perform edits on the 3D mesh of a single frame, and the edits are then automatically propagated to the 3D meshes of the other frames. The 3D meshes for different frames are further projected onto the 2D space to produce the edited geometry and texture renderings, which serve as inputs to a decoupled video diffusion model for generating edited results. Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition. Our approach marks a key step toward high-quality, controllable video editing workflows. Extensive experiments demonstrate the superiority and effectiveness of our approach. Project page: https://shapeformotion.github.io/
[425] Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning
Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang
Main category: cs.CV
TL;DR: Proposes Reason50K dataset and ReasonBrain framework for complex hypothetical instruction reasoning in image editing, addressing limitations of current methods that struggle with implicit instructions requiring deeper reasoning.
Details
Motivation: Current instruction-based image editing methods focus on simple explicit instructions and lack capabilities for complex implicit hypothetical instructions that require deeper reasoning about visual changes and user intent.
Method: ReasonBrain framework uses MLLMs for editing guidance generation and diffusion models for image synthesis, with Fine-grained Reasoning Cue Extraction (FRCE) module and Cross-Modal Enhancer (CME) to capture detailed semantics and enable rich feature interactions.
Result: Extensive experiments show ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios and exhibits strong zero-shot generalization to conventional IIE tasks.
Conclusion: The proposed Reason50K dataset and ReasonBrain framework effectively address the limitations of current methods in handling complex hypothetical instructions requiring reasoning, demonstrating superior performance and generalization capabilities.
Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.
[426] SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment
Qi Xu, Dongxu Wei, Lingzhe Zhao, Wenpu Li, Zhangchi Huang, Shunping Ji, Peidong Liu
Main category: cs.CV
TL;DR: SIU3R is an alignment-free framework for simultaneous 3D reconstruction and understanding from unposed images, using pixel-aligned 3D representation and unified learnable queries to avoid 2D-3D feature alignment limitations.
Details
Motivation: Current 2D-to-3D feature alignment approaches for simultaneous understanding and 3D reconstruction suffer from limited 3D understanding capability and semantic information loss.
Method: Uses pixel-aligned 3D representation and unifies multiple understanding tasks into learnable queries, with two lightweight modules to facilitate interaction between reconstruction and understanding tasks.
Result: Achieves state-of-the-art performance on individual 3D reconstruction and understanding tasks, as well as simultaneous understanding and 3D reconstruction.
Conclusion: The alignment-free framework with mutual benefit designs effectively enables native 3D understanding without 2D model alignment, demonstrating superior performance across tasks.
Abstract: Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to a 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges reconstruction and understanding tasks via pixel-aligned 3D representation, and unifies multiple understanding (segmentation) tasks into a set of unified learnable queries, enabling native 3D understanding without the need for alignment with 2D models. To encourage collaboration between the two tasks with shared representation, we further conduct in-depth analyses of their mutual benefits, and propose two lightweight modules to facilitate their interaction. Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs. Project page: https://insomniaaac.github.io/siu3r/
[427] Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders
Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi
Main category: cs.CV
TL;DR: This paper challenges the assumption that using multiple vision encoders in multimodal LLMs improves performance, showing that encoder redundancy is common and that removing specific encoders can actually boost performance.
Details
Motivation: To investigate whether the common practice of integrating multiple vision encoders in MLLMs actually provides complementary benefits, as current models assume diverse pretraining objectives yield better performance.
Method: Systematic encoder masking across representative multi-encoder MLLMs, introducing two metrics: Conditional Utilization Rate (CUR) to measure marginal encoder contribution, and Information Gap (IG) to capture encoder utility heterogeneity.
Result: Found pervasive encoder redundancy - performance often degrades gracefully or even improves when masking selected encoders. Single/dual encoder variants recover over 90% of baseline on most non-OCR tasks. Masking specific encoders can yield up to 16% higher accuracy on specific tasks and 3.6% overall performance boost.
Conclusion: Challenges the ‘more encoders are better’ heuristic in MLLMs and provides actionable diagnostics for developing more efficient multimodal architectures by identifying redundant and detrimental encoders.
Abstract: Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully and sometimes even improves when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder’s marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model. Using these tools, we observe (i) strong specialization on tasks like OCR and Chart, where a single encoder can dominate with a CUR greater than 90%, (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable, and (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to 16% higher accuracy on a specific task category and a 3.6% overall performance boost compared to the full model. Furthermore, single- and dual-encoder variants recover over 90% of baseline on most non-OCR tasks. Our analysis challenges the “more encoders are better” heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
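Both diagnostics can be computed directly from masked evaluations. In the sketch below, evaluate() is a stand-in for benchmark accuracy with a given encoder subset active; the normalization of CUR and the spread-based IG follow the verbal definitions above and are otherwise assumptions.

```python
ENCODERS = ["clip", "dino", "ocr"]

def evaluate(active):   # stand-in: benchmark accuracy with this encoder subset
    base = {"clip": 0.42, "dino": 0.05, "ocr": 0.31}
    return 0.2 + sum(base[e] for e in active) * 0.9 ** (len(active) - 1)

full = evaluate(ENCODERS)
curs = {}
for e in ENCODERS:
    without = evaluate([x for x in ENCODERS if x != e])
    curs[e] = (full - without) / full        # marginal contribution of encoder e
    print(f"CUR({e}) = {curs[e]:+.3f}")
print(f"IG = {max(curs.values()) - min(curs.values()):.3f}")  # utility spread
```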
[428] pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani
Main category: cs.CV
TL;DR: pFedMMA is a personalized federated learning framework that uses multi-modal adapters for vision-language tasks, achieving better trade-offs between personalization and generalization than existing methods.
Details
Motivation: Adapting Vision-Language Models (VLMs) like CLIP to decentralized, heterogeneous data efficiently while maintaining generalization remains challenging, as existing prompt tuning methods often sacrifice generalization for personalization.
Method: Each client uses modality-specific up- and down-projection layers with a globally shared projection that aligns cross-modal features. Clients locally adapt to personalized data while collaboratively training the shared projection, with only the shared component exchanged during communication.
Result: Extensive experiments across eleven datasets show pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods in domain- and label-shift scenarios.
Conclusion: pFedMMA successfully addresses the challenge of adapting VLMs to decentralized data by leveraging multi-modal adapters, enabling both effective personalization and improved global generalization in a communication-efficient manner.
Abstract: Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during communication rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods.
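The communication pattern is easy to sketch: each client's adapter holds private up/down projections plus a shared projection, and only the shared weight is averaged and broadcast each round. The module shapes and ReLU bottleneck are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MMAdapter(nn.Module):
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # private, modality-specific
        self.up = nn.Linear(bottleneck, dim)     # private, modality-specific
        self.shared = nn.Linear(bottleneck, bottleneck, bias=False)  # exchanged

    def forward(self, x):
        return x + self.up(self.shared(torch.relu(self.down(x))))

clients = [MMAdapter() for _ in range(4)]
# ... each client trains locally on its own data here ...
with torch.no_grad():
    avg = torch.stack([c.shared.weight for c in clients]).mean(0)
    for c in clients:              # broadcast the aggregated shared projection
        c.shared.weight.copy_(avg)
print(torch.allclose(clients[0].shared.weight, clients[3].shared.weight))  # True
```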
[429] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation
X. Feng, H. Yu, M. Wu, S. Hu, J. Chen, C. Zhu, J. Wu, X. Chu, K. Huang
Main category: cs.CV
TL;DR: NarrLV is the first benchmark for evaluating narrative expression capabilities in long video generation models, using Temporal Narrative Atoms (TNAs) to measure narrative richness and MLLM-based metrics aligned with human judgment.
Details
Motivation: Current long video generation models lack proper evaluation benchmarks for narrative content expression, as existing benchmarks use simple prompts that don't capture richer narrative capabilities needed for longer videos.
Method: Introduces Temporal Narrative Atoms (TNAs) as basic narrative units, creates an automatic prompt generation pipeline for flexible TNA expansion, and designs a three-level evaluation metric using MLLM-based question generation and answering.
Result: The proposed metric aligns closely with human judgments and reveals detailed capability boundaries of current video generation models in narrative content expression.
Conclusion: NarrLV successfully addresses the gap in evaluating narrative expression for long video generation and provides a comprehensive benchmark that can guide future model development.
Abstract: With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the Temporal Narrative Atom (TNA) as the basic narrative unit that maintains continuous visual presentation in a video, and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.
[430] DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking
Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao
Main category: cs.CV
TL;DR: DriveAgent-R1 is the first autonomous driving agent with active perception for planning, using visual reasoning tools and hybrid thinking to switch between text-only and visual reasoning based on scene complexity.
Details
Motivation: Existing Vision-Language Models for autonomous driving rely on passive perception with text-based reasoning, limiting their ability to actively seek visual evidence when uncertain.
Method: Proposes a hybrid thinking framework with three-stage progressive training including Cascaded Reinforcement Learning, enabling adaptive switching between text-only reasoning and tool-augmented visual reasoning.
Result: With only 3B parameters, DriveAgent-R1 achieves competitive performance comparable to GPT-5 and human driving proficiency on Drive-Internal and nuScenes datasets.
Conclusion: DriveAgent-R1 offers a proven path toward more intelligent autonomous driving systems by enabling active perception and grounding decisions in visual evidence.
Abstract: The advent of Vision-Language Models (VLMs) has significantly advanced end-to-end autonomous driving, demonstrating powerful reasoning abilities for high-level behavior planning tasks. However, existing methods are often constrained by a passive perception paradigm, relying solely on text-based reasoning. This passivity restricts the model’s capacity to actively seek crucial visual evidence when faced with uncertainty. To address this, we introduce DriveAgent-R1, the first autonomous driving agent capable of active perception for planning. In complex scenarios, DriveAgent-R1 proactively invokes tools to perform visual reasoning, firmly grounding its decisions in visual evidence, thereby enhancing both interpretability and reliability. Furthermore, we propose a hybrid thinking framework, inspired by human driver cognitive patterns, allowing the agent to adaptively switch between efficient text-only reasoning and robust tool-augmented visual reasoning based on scene complexity. This capability is cultivated through a three-stage progressive training strategy, featuring a core Cascaded Reinforcement Learning (Cascaded RL) phase. Extensive experiments on the Drive-Internal dataset, which is rich in long-tail scenarios, and the public nuScenes dataset show that, with only 3B parameters, DriveAgent-R1 achieves competitive performance comparable to top closed model systems such as GPT-5 and to human driving proficiency while remaining deployment-friendly, offering a proven path toward building more intelligent autonomous driving systems.
[431] $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement
Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
Main category: cs.CV
TL;DR: A^2R^2 framework improves Img2LaTeX conversion by integrating attention localization and iterative refinement within visual reasoning, enabling self-correction and progressive improvement of LaTeX generation quality.
Details
Motivation: Current vision-language models perform suboptimally on Img2LaTeX tasks, struggling with fine-grained visual elements like subscripts and superscripts, leading to inaccurate LaTeX generation.
Method: Proposed A^2R^2 framework combines attention localization and iterative refinement in a visual reasoning framework, allowing models to perform self-correction. Also introduced Img2LaTeX-Hard-1K dataset with 1,100 challenging examples for evaluation.
Result: Significant performance improvements across various metrics, with increasing inference rounds yielding notable gains. Ablation studies confirm effectiveness and synergy of core components.
Conclusion: A^2R^2 demonstrates strong potential for test-time scaling scenarios and effectively addresses fine-grained visual reasoning challenges in Img2LaTeX conversion.
Abstract: Img2LaTeX is a practically important task that involves translating mathematical expressions and structured visual content from images into LaTeX code. In recent years, vision-language models (VLMs) have achieved remarkable progress across a range of visual understanding tasks, largely due to their strong generalization capabilities. However, despite initial efforts to apply VLMs to the Img2LaTeX task, their performance remains suboptimal. Empirical evidence shows that VLMs can be challenged by fine-grained visual elements, such as subscripts and superscripts in mathematical expressions, which results in inaccurate LaTeX generation. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve LaTeX generation quality. For effective evaluation, we introduce a new dataset, Img2LaTeX-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across various evaluation metrics spanning both textual and visual levels; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and further evaluations confirm the effectiveness of our approach and the synergy of its core components during inference.
[432] RelMap: Enhancing Online Map Construction with Class-Aware Spatial Relation and Semantic Priors
Tianhui Cai, Yun Zhang, Zewei Zhou, Zhiyu Huang, Jiaqi Ma
Main category: cs.CV
TL;DR: RelMap is an end-to-end framework for online HD map construction that explicitly models spatial relations and semantic priors to improve accuracy and generalization.
Details
Motivation: Existing Transformer-based methods for online HD map construction overlook inherent spatial dependencies and semantic relationships among map elements, limiting their accuracy and generalization capabilities.
Method: Proposes Class-aware Spatial Relation Prior to encode relative positional dependencies using learnable class-aware relation encoder, and Mixture-of-Experts-based Semantic Prior that routes features to class-specific experts based on predicted class probabilities.
Result: Achieves state-of-the-art performance on both nuScenes and Argoverse 2 datasets, compatible with both single-frame and temporal perception backbones.
Conclusion: Explicit modeling of spatial relations and semantic priors significantly enhances online HD map construction performance.
Abstract: Online high-definition (HD) map construction is crucial for scaling autonomous driving systems. While Transformer-based methods have become prevalent in online HD map construction, most existing approaches overlook the inherent spatial dependencies and semantic relationships among map elements, which constrains their accuracy and generalization capabilities. To address this, we propose RelMap, an end-to-end framework that explicitly models both spatial relations and semantic priors to enhance online HD map construction. Specifically, we introduce a Class-aware Spatial Relation Prior, which explicitly encodes relative positional dependencies between map elements using a learnable class-aware relation encoder. Additionally, we design a Mixture-of-Experts-based Semantic Prior, which routes features to class-specific experts based on predicted class probabilities, refining instance feature decoding. RelMap is compatible with both single-frame and temporal perception backbones, achieving state-of-the-art performance on both the nuScenes and Argoverse 2 datasets.
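The semantic prior can be sketched as a mixture over class-specific experts weighted by predicted class probabilities; the linear experts and soft mixing below are illustrative assumptions, not RelMap's exact decoder.

```python
import torch
import torch.nn as nn

class SemanticMoE(nn.Module):
    def __init__(self, dim=256, num_classes=3):  # e.g. divider / boundary / crossing
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_classes))

    def forward(self, feats, class_logits):
        # feats: (num_instances, dim); class_logits: (num_instances, num_classes)
        probs = class_logits.softmax(dim=-1)
        expert_out = torch.stack([e(feats) for e in self.experts], dim=1)  # (N, C, dim)
        return (probs.unsqueeze(-1) * expert_out).sum(dim=1)  # probability-weighted mix

moe = SemanticMoE()
refined = moe(torch.randn(30, 256), torch.randn(30, 3))
print(refined.shape)  # torch.Size([30, 256])
```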
[433] Content-Aware Mamba for Learned Image Compression
Yunuo Chen, Zezheng Lyu, Bing He, Hongwei Hu, Qi Wang, Yuan Tian, Li Song, Wenjun Zhang, Guo Lu
Main category: cs.CV
TL;DR: CAM introduces content-aware Mamba with adaptive token permutation and global priors to overcome rigid scans in LIC, achieving SOTA compression performance.
Details
Motivation: Standard Mamba's content-agnostic raster scans and strict causality hinder effective redundancy elimination between content-correlated but spatially distant tokens.
Method: Two novel mechanisms: content-adaptive token permutation to prioritize interactions between similar tokens, and injection of sample-specific global priors to mitigate strict causality without multi-directional scans.
Result: CMIC achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by 15.91%, 21.34%, and 17.58% in BD-rate on Kodak, Tecnick, and CLIC datasets respectively.
Conclusion: CAM enables better global redundancy capture while preserving computational efficiency, demonstrating superior performance in learned image compression.
Abstract: Recent learned image compression (LIC) leverages Mamba-style state-space models (SSMs) for global receptive fields with linear complexity. However, the standard Mamba adopts content-agnostic, predefined raster (or multi-directional) scans under strict causality. This rigidity hinders its ability to effectively eliminate redundancy between tokens that are content-correlated but spatially distant. We introduce Content-Aware Mamba (CAM), an SSM that dynamically adapts its processing to the image content. Specifically, CAM overcomes prior limitations with two novel mechanisms. First, it replaces the rigid scan with a content-adaptive token permutation strategy to prioritize interactions between content-similar tokens regardless of their location. Second, it overcomes the sequential dependency by injecting sample-specific global priors into the state-space model, which effectively mitigates the strict causality without multi-directional scans. These innovations enable CAM to better capture global redundancy while preserving computational efficiency. Our Content-Aware Mamba-based LIC model (CMIC) achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by 15.91%, 21.34%, and 17.58% in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively. Code and checkpoints will be released later.
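The content-adaptive permutation can be sketched as: score each token with a content key, sort so that similar tokens become sequence neighbors for the causal model, process, then invert the permutation. The similarity key and the GRU standing in for a Mamba block are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContentAwareScan(nn.Module):
    def __init__(self, dim=192):
        super().__init__()
        self.seq = nn.GRU(dim, dim, batch_first=True)  # stand-in for a Mamba block

    def forward(self, tokens):
        # tokens: (batch, n, dim) flattened image tokens
        key = (tokens * tokens.mean(dim=1, keepdim=True)).sum(-1)  # content score
        order = key.argsort(dim=1)                                 # adaptive "scan"
        permuted = torch.gather(tokens, 1, order.unsqueeze(-1).expand_as(tokens))
        out, _ = self.seq(permuted)
        inverse = order.argsort(dim=1)                             # undo permutation
        return torch.gather(out, 1, inverse.unsqueeze(-1).expand_as(out))

print(ContentAwareScan()(torch.randn(2, 64, 192)).shape)  # torch.Size([2, 64, 192])
```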
[434] RAAG: Ratio Aware Adaptive Guidance
Shangwen Zhu, Qianyu Peng, Yuting Hu, Zhantao Yang, Han Zhang, Zhao Pu, Andy Zheng, Zhilei Shu, Ruili Feng, Fan Cheng
Main category: cs.CV
TL;DR: The paper identifies a fundamental sampling instability in flow-based generative models where early steps are highly sensitive to classifier-free guidance (CFG), leading to error amplification. It proposes an adaptive guidance schedule that automatically adjusts guidance scale during early sampling steps to enable faster generation while maintaining quality.
Details
Motivation: Current practice of using fixed, strong guidance scales throughout inference is poorly suited for fast, few-step sampling in modern applications, causing image quality degradation due to early-step sensitivity.
Method: Proposes a simple, theoretically grounded adaptive guidance schedule that automatically dampens guidance scale at early steps based on the evolving ratio of conditional to unconditional predictions, requiring no inference overhead.
Result: Experiments across state-of-the-art image (SD3.5, Qwen-Image) and video (WAN2.1) models show up to 3x faster sampling while maintaining or improving quality, robustness, and semantic alignment.
Conclusion: Adapting guidance to the sampling process rather than fixing it is critical for unlocking the full potential of fast, flow-based models.
Abstract: Flow-based generative models have achieved remarkable progress, with classifier-free guidance (CFG) becoming the standard for high-fidelity generation. However, the conventional practice of applying a strong, fixed guidance scale throughout inference is poorly suited for the rapid, few-step sampling required by modern applications. In this work, we uncover the root cause of this conflict: a fundamental sampling instability where the earliest steps are acutely sensitive to guidance. We trace this to a significant spike in the ratio of conditional to unconditional predictions, a spike that we prove to be an inherent property of the training data distribution itself, making it an almost inevitable challenge. Applying a high, static guidance value during this volatile initial phase leads to an exponential amplification of error, degrading image quality. To resolve this, we propose a simple, theoretically grounded, adaptive guidance schedule that automatically dampens the guidance scale at early steps based on the evolving ratio. Our method is lightweight, incurs no inference overhead, and is compatible with standard frameworks. Experiments across state-of-the-art image (SD3.5, Qwen-Image) and video (WAN2.1) models show our approach enables up to 3x faster sampling while maintaining or improving quality, robustness, and semantic alignment. Our findings highlight that adapting guidance to the sampling process, rather than fixing it, is critical for unlocking the full potential of fast, flow-based models.
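The paper's central prescription, dampening guidance while the conditional/unconditional ratio spikes, can be illustrated in a few lines. This is a hedged sketch rather than the authors' schedule: the norm-ratio statistic, the threshold `tau`, and the clamping rule are assumptions.

```python
import torch

def adaptive_cfg(cond, uncond, base_scale=7.5, tau=1.5):
    """Ratio-aware classifier-free guidance: shrink the scale whenever the
    conditional/unconditional prediction ratio exceeds tau (typical of the
    volatile early steps); otherwise keep the base scale."""
    r = cond.flatten(1).norm(dim=1) / uncond.flatten(1).norm(dim=1).clamp_min(1e-8)
    scale = base_scale * torch.clamp(tau / r, max=1.0)   # dampen only when r > tau
    scale = scale.view(-1, *([1] * (cond.dim() - 1)))    # broadcast over latent dims
    return uncond + scale * (cond - uncond)
```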
[435] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, Bo Zhang
Main category: cs.CV
TL;DR: TempFlow-GRPO introduces temporal-aware reinforcement learning for flow matching models, addressing inefficient credit assignment in text-to-image generation by using trajectory branching, noise-aware weighting, and seed grouping.
Details
Motivation: Existing flow matching models for text-to-image generation have suboptimal integration with reinforcement learning for human preference alignment, particularly due to temporal uniformity assumptions that fail to capture varying decision criticality across timesteps.
Method: Three key innovations: (1) trajectory branching mechanism for process rewards, (2) noise-aware weighting scheme for temporal optimization, and (3) seed group strategy to isolate exploration effects.
Result: Achieves state-of-the-art performance in human preference alignment and text-to-image benchmarks through temporally-aware optimization that respects generative dynamics.
Conclusion: TempFlow-GRPO provides a principled framework that effectively captures temporal structure in flow-based generation, enabling more efficient exploration and convergence for preference alignment.
Abstract: Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce \textbf{TempFlow-GRPO} (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces three key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases; and (iii) a seed group strategy that controls for initialization effects to isolate exploration contributions. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and text-to-image benchmarks.
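Of the three innovations, the noise-aware weighting is the most self-contained. The sketch below shows one plausible instantiation under assumptions: each timestep's policy-gradient term is weighted by its noise level so high-noise early steps dominate. The paper's exact weighting may differ.

```python
import torch

def noise_aware_pg_loss(logprobs, advantages, sigmas, alpha=1.0):
    """logprobs, advantages: (B, T) per-timestep quantities from rollouts;
    sigmas: (T,) noise levels. Higher-noise (earlier) steps get more weight."""
    w = sigmas.pow(alpha)
    w = w / w.sum()                                  # normalized timestep weights
    return -(w * logprobs * advantages).sum(dim=1).mean()
```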
[436] Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration
Haoran Bai, Xiaoxu Chen, Canqian Yang, Zongyao He, Sibin Deng, Ying Chen
Main category: cs.CV
TL;DR: Vivid-VR is a DiT-based generative video restoration method that uses ControlNet for content consistency and proposes concept distillation training to preserve texture and temporal quality, along with an enhanced control architecture for better generation controllability.
Details
Motivation: Conventional fine-tuning of controllable video restoration pipelines suffers from distribution drift due to imperfect multimodal alignment, compromising texture realism and temporal coherence.
Method: Proposes a concept distillation training strategy using a pretrained T2V model to synthesize training samples with embedded textual concepts. Also redesigns the control architecture with a control feature projector to filter degradation artifacts and a dual-branch ControlNet connector combining MLP-based feature mapping with cross-attention for dynamic control feature retrieval.
Result: Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as on AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency.
Conclusion: The proposed method effectively addresses distribution drift issues in video restoration while maintaining high-quality texture and temporal coherence through concept distillation and enhanced control architecture.
Abstract: We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to limitations in imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The codes and checkpoints are publicly available at https://github.com/csbhr/Vivid-VR.
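The dual-branch connector is the most concrete architectural piece. A rough PyTorch sketch follows; the dimensions, the mean-pooled fusion, and the single attention layer are illustrative assumptions, not the released design.

```python
import torch
import torch.nn as nn

class DualBranchConnector(nn.Module):
    """Sketch: an MLP path maps control features directly, while a
    cross-attention path lets generation tokens retrieve control features."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, gen_tokens, ctrl_tokens):
        mapped = self.mlp(ctrl_tokens)                        # static feature mapping
        retrieved, _ = self.attn(gen_tokens, ctrl_tokens, ctrl_tokens)
        # fuse a pooled summary of the mapped branch with the retrieved branch
        return retrieved + mapped.mean(dim=1, keepdim=True)
```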
[437] Small Dents, Big Impact: A Dataset and Deep Learning Approach for Vehicle Dent Detection
Danish Zia Baig, Mohsin Kamal, Zahid Ullah
Main category: cs.CV
TL;DR: The paper presents a YOLOv8-based deep learning solution for automatically detecting microscopic surface flaws like tiny dents on car exteriors, achieving high detection accuracy with precision of 0.86, recall of 0.84, and F1-score of 0.85 using the YOLOv8m-t42 model.
Details
Motivation: Traditional automotive damage inspection is manual, time-consuming, and unreliable for detecting tiny surface imperfections like microscopic dents. There's increasing demand for faster and more precise inspection methods.
Method: Used YOLOv8 object recognition framework with custom variants YOLOv8m-t4 and YOLOv8m-t42. Created a bespoke dataset with annotated car surface photos under various conditions. Employed real-time data augmentation for robustness.
Result: YOLOv8m-t42 model achieved precision: 0.86, recall: 0.84, F1-score: 0.85, mAP@0.5: 0.60, and PR curve area: 0.88, outperforming YOLOv8m-t4 model (precision: 0.81, recall: 0.79, F1-score: 0.80, PR curve: 0.82).
Conclusion: The deep learning-based approach provides excellent detection accuracy and low inference latency, making it suitable for real-time applications like automated insurance evaluations and car inspections, with YOLOv8m-t42 being more appropriate for practical dent detection despite slower convergence.
Abstract: Conventional car damage inspection techniques are labor-intensive, manual, and frequently overlook tiny surface imperfections like microscopic dents. Machine learning provides an innovative solution to the increasing demand for quicker and more precise inspection methods. The paper uses the YOLOv8 object recognition framework to provide a deep learning-based solution for automatically detecting microscopic surface flaws, notably tiny dents, on car exteriors. Traditional automotive damage inspection procedures are manual, time-consuming, and frequently unreliable at detecting tiny flaws. To solve this, a bespoke dataset containing annotated photos of car surfaces under various lighting circumstances, angles, and textures was created. To improve robustness, the YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, were trained employing real-time data augmentation approaches. Experimental results show that the technique has excellent detection accuracy and low inference latency, making it well suited for real-time applications such as automated insurance evaluations and automobile inspections. Evaluation parameters such as mean Average Precision (mAP), precision, recall, and F1-score verified the model's efficacy. With a precision of 0.86, recall of 0.84, and F1-score of 0.85, the YOLOv8m-t42 model outperformed the YOLOv8m-t4 model (precision: 0.81, recall: 0.79, F1-score: 0.80) in identifying microscopic surface defects. The mAP@0.5 for YOLOv8m-t42 stabilized at 0.60, with a comparatively low mAP@0.5:0.95 of 0.20. Furthermore, YOLOv8m-t42's PR curve area was 0.88, suggesting more consistent performance than YOLOv8m-t4 (0.82). YOLOv8m-t42 has greater accuracy and is more appropriate for practical dent detection applications, even though its convergence is slower.
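Since the custom t4/t42 variants are not described in detail, a stock YOLOv8m pipeline via the Ultralytics API approximates the workflow; the dataset YAML and image paths below are placeholders.

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                        # pretrained medium model
model.train(data="car_dents.yaml",                # hypothetical dataset config
            epochs=100, imgsz=640)                # augmentation set via hyperparameters

results = model.predict("car_surface.jpg", conf=0.25)
for r in results:
    print(r.boxes.xyxy, r.boxes.conf)             # dent boxes and confidences
```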
[438] Re-Densification Meets Cross-Scale Propagation: Real-Time Neural Compression of LiDAR Point Clouds
Pengpeng Yu, Haoran Li, Runqing Jiang, Jing Wang, Liang Lin, Yulan Guo
Main category: cs.CV
TL;DR: A novel LiDAR point cloud compression method that uses geometry re-densification and cross-scale feature propagation to achieve efficient predictive coding with state-of-the-art compression ratios and real-time performance.
Details
Motivation: High-precision LiDAR scans incur substantial storage and transmission overhead, while existing methods struggle with extreme sparsity of geometric details that hinders efficient context modeling and limits compression performance and speed.
Method: Proposes two lightweight modules: 1) Geometry Re-Densification Module that re-densifies encoded sparse geometry, extracts features at a denser scale, then re-sparsifies for predictive coding; 2) Cross-scale Feature Propagation Module that leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation and enable cross-scale information sharing.
Result: Achieves state-of-the-art compression ratios on KITTI dataset with real-time performance (26 FPS for encoding/decoding at 12-bit quantization).
Conclusion: The proposed framework generates compact feature representations that provide efficient context modeling and accelerate the coding process, demonstrating superior compression performance while maintaining lightweight computation.
Abstract: LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for encoding/decoding at 12-bit quantization. Code is available at https://github.com/pengpeng-yu/FastPCC.
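The re-densification idea, computing features on a dilated (denser) support and keeping them only at the original sparse voxels, can be mimicked with dense 3D ops. The sketch below is a toy dense-tensor analogue under assumptions; the actual method presumably operates on sparse structures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def redensify_extract(occ, feat_conv, k=3):
    """occ: (B, 1, D, H, W) binary occupancy. Dilate to a denser support,
    extract features there, then re-sparsify to the original voxels."""
    dense = F.max_pool3d(occ, kernel_size=k, stride=1, padding=k // 2)
    feats = feat_conv(dense)          # feature extraction at the denser scale
    return feats * occ                # keep features only where geometry exists

feat_conv = nn.Conv3d(1, 16, kernel_size=3, padding=1)   # illustrative head
out = redensify_extract((torch.rand(1, 1, 32, 32, 32) > 0.95).float(), feat_conv)
```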
[439] Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
Ziyun Zeng, Junhao Zhang, Wei Li, Mike Zheng Shou
Main category: cs.CV
TL;DR: DIM addresses imbalanced responsibilities in unified multimodal models by creating a dataset with enhanced instruction comprehension and explicit design blueprints, enabling better image editing performance with smaller models.
Details
Motivation: Current unified multimodal models struggle with precise image editing due to imbalanced division of responsibilities - understanding modules act as translators while generation modules must handle layout inference, region identification, and content rendering simultaneously.
Method: Introduces Draw-In-Mind (DIM) dataset with two subsets: DIM-T2I (14M long-context image-text pairs) for enhanced instruction comprehension, and DIM-Edit (233K chain-of-thought imaginations from GPT-4o) as explicit design blueprints. Connects frozen Qwen2.5-VL-3B with trainable SANA1.5-1.6B via lightweight MLP.
Result: DIM-4.6B-Edit achieves SOTA or competitive performance on ImgEdit and GEdit-Bench benchmarks, outperforming larger models like UniWorld-V1 and Step1X-Edit despite modest parameter scale.
Conclusion: Explicitly assigning design responsibility to the understanding module provides significant benefits for image editing, demonstrating that balanced task allocation is more effective than model scaling alone.
Abstract: In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models are available at https://github.com/showlab/DIM.
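The reported bridge is deliberately simple: a trainable two-layer MLP between the frozen understanding model and the generator. A minimal sketch is below; the hidden sizes are placeholders, not the actual Qwen2.5-VL or SANA dimensions.

```python
import torch
import torch.nn as nn

class VLM2GenConnector(nn.Module):
    """Two-layer MLP projecting frozen-VLM hidden states into the
    generator's conditioning space."""
    def __init__(self, vlm_dim=2048, gen_dim=1152):   # placeholder dims
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vlm_dim, gen_dim), nn.GELU(),
                                  nn.Linear(gen_dim, gen_dim))

    def forward(self, vlm_hidden):     # (B, N, vlm_dim) understanding tokens
        return self.proj(vlm_hidden)   # (B, N, gen_dim) generator conditions
```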
[440] GLEAM: Learning to Match and Explain in Cross-View Geo-Localization
Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Peifeng Ma, Xue Yang, Hongsheng Li
Main category: cs.CV
TL;DR: GLEAM-C is a foundational cross-view geo-localization model that unifies multiple views and modalities by aligning them with satellite imagery, while GLEAM-X introduces explainable reasoning using MLLMs to address interpretability issues in traditional CVGL methods.
Details
Motivation: Existing CVGL approaches are restricted to single views/modalities and lack interpretability - they only determine image correspondence without explaining the rationale behind matches.
Method: GLEAM-C uses a two-phase training strategy to align multiple views/modalities (UAV, street maps, panoramic views, ground photos) with satellite imagery. GLEAM-X leverages MLLMs for explainable reasoning and constructs a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro.
Result: GLEAM-C achieves accuracy comparable to prior modality-specific CVGL models with enhanced training efficiency. GLEAM-X enables systematic evaluation of explainable cross-view reasoning through human-refined test data.
Conclusion: The GLEAM framework integrates multi-modal, multi-view alignment with interpretable correspondence analysis, advancing geo-localization by unifying accurate cross-view matching with explainable reasoning.
Abstract: Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they only determine whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities-including UAV imagery, street maps, panoramic views, and ground photographs-by aligning them exclusively with satellite imagery. Our framework enhances training efficiency through optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models through a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.
[441] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data
Pengxu Wen, Tingting Yu, Ziwei Nie, Cheng Jiang, Zhenyu Yin, Mingyang He, Bo Liao, Xiaoping Yang
Main category: cs.CV
TL;DR: A fully automatic two-stage framework for non-invasive intracranial pressure (ICP) grading using optic nerve sheath diameter (ONSD) measurements from fundus ultrasound videos, combined with clinical data, achieving superior accuracy over conventional methods.
Details
Motivation: Current clinical practices for ICP measurement via ONSD suffer from inconsistency in manual operation, subjectivity in view selection, and variability in thresholding, limiting reliability. The invasiveness of lumbar puncture drives the need for non-invasive alternatives.
Method: Two-stage framework: 1) Fundus ultrasound video processing with frame-level anatomical segmentation, rule-based keyframe identification guided by international consensus, and precise ONSD measurement; 2) ICP grading stage that fuses ONSD metrics with clinical features to predict ICP grades.
Result: Achieved validation accuracy of 0.845 ± 0.071 (five-fold cross-validation) and independent test accuracy of 0.786, significantly outperforming conventional threshold-based method (0.637 ± 0.111 validation accuracy, 0.429 test accuracy).
Conclusion: The framework establishes a reliable non-invasive approach for clinical ICP evaluation by reducing operator variability and integrating multi-source information, holding promise for improving patient management in acute neurological conditions.
Abstract: Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. While lumbar puncture is the gold standard for ICP measurement, its invasiveness and associated risks drive the need for non-invasive alternatives. Optic nerve sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP directly correlates with increased ONSD. However, current clinical practices for ONSD measurement suffer from inconsistency in manual operation, subjectivity in optimal view selection, and variability in thresholding, limiting their reliability. To address these challenges, we introduce a fully automatic two-stage framework for ICP grading, integrating keyframe identification, ONSD measurement and clinical data. Specifically, the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement. The intracranial pressure grading stage then fuses ONSD metrics with clinical features to enable the prediction of ICP grades, thereby demonstrating an innovative blend of interpretable ultrasound analysis and multi-source data integration for objective clinical evaluation. Experimental results demonstrate that our method achieves a validation accuracy of $0.845 \pm 0.071$ (with standard deviation from five-fold cross-validation) and an independent test accuracy of 0.786, significantly outperforming conventional threshold-based method ($0.637 \pm 0.111$ validation accuracy, $0.429$ test accuracy). Through effectively reducing operator variability and integrating multi-source information, our framework establishes a reliable non-invasive approach for clinical ICP evaluation, holding promise for improving patient management in acute neurological conditions.
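The grading stage is a conventional tabular-fusion problem. The sketch below illustrates it with synthetic data and an off-the-shelf classifier; the feature set and the gradient-boosting choice are assumptions, not the paper's model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
onsd = rng.normal(5.0, 0.8, size=(200, 1))          # ONSD in mm (synthetic)
clinical = rng.normal(size=(200, 4))                # stand-in clinical features
X = np.hstack([onsd, clinical])                     # fused feature vector
y = (onsd[:, 0] + 0.3 * clinical[:, 0] > 5.2).astype(int)   # toy ICP grade

clf = GradientBoostingClassifier()
print(cross_val_score(clf, X, y, cv=5).mean())      # five-fold accuracy
```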
[442] Group Evidence Matters: Tiling-based Semantic Gating for Dense Object Detection
Yilun Xiao
Main category: cs.CV
TL;DR: A detector-agnostic post-processing framework that improves detection of dense small objects in UAV imagery by converting overlap-induced redundancy into group evidence through spatial and semantic clustering.
Details
Motivation: Dense small objects in UAV imagery are often missed due to long-range viewpoints, occlusion, and clutter, requiring improved detection methods.
Method: Uses overlapping tiling to recover low-confidence candidates, then applies Spatial Gate (DBSCAN on box centroids) and Semantic Gate (DBSCAN on ResNet-18 embeddings) to validate group evidence, followed by controlled confidence reweighting and class-aware NMS fusion.
Result: On VisDrone dataset, recall increased from 0.685 to 0.778 (+0.093), precision adjusted from 0.801 to 0.595, yielding F1=0.669 with post-processing latency of 0.095s per image.
Conclusion: The framework provides recall-first, precision-trade-off behavior beneficial for recall-sensitive applications like far-field counting and monitoring, requires no retraining, and integrates with modern detectors.
Abstract: Dense small objects in UAV imagery are often missed due to long-range viewpoints, occlusion, and clutter. This paper presents a detector-agnostic post-processing framework that converts overlap-induced redundancy into group evidence. Overlapping tiling first recovers low-confidence candidates. A Spatial Gate (DBSCAN on box centroids) and a Semantic Gate (DBSCAN on ResNet-18 embeddings) then validate group evidence. Validated groups receive controlled confidence reweighting before class-aware NMS fusion. Experiments on VisDrone show a recall increase from 0.685 to 0.778 (+0.093) and a precision adjustment from 0.801 to 0.595, yielding F1=0.669. Post-processing latency averages 0.095 s per image. These results indicate recall-first, precision-trade-off behavior that benefits recall-sensitive applications such as far-field counting and monitoring. Ablation confirms that tiling exposes missed objects, spatial clustering stabilizes geometry, semantic clustering enforces appearance coherence, and reweighting provides calibrated integration with the baseline. The framework requires no retraining and integrates with modern detectors. Future work will reduce semantic gating cost and extend the approach with temporal cues.
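Both gates are standard DBSCAN runs over different feature spaces, which makes the pipeline easy to reproduce in outline. The epsilons, min-samples, and boost factor below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def gate_and_reweight(boxes, scores, embeddings, boost=1.2):
    """boxes: (N, 4) xyxy; scores: (N,); embeddings: (N, D) appearance features."""
    centroids = (boxes[:, :2] + boxes[:, 2:]) / 2
    spatial = DBSCAN(eps=40.0, min_samples=3).fit_predict(centroids)      # Spatial Gate
    semantic = DBSCAN(eps=0.5, min_samples=3,
                      metric="cosine").fit_predict(embeddings)            # Semantic Gate
    validated = (spatial >= 0) & (semantic >= 0)   # in a cluster under both gates
    out = scores.copy()
    out[validated] = np.minimum(out[validated] * boost, 1.0)  # controlled reweighting
    return out, validated
```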
[443] SPATIALGEN: Layout-guided 3D Indoor Scene Generation
Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, Ping Tan
Main category: cs.CV
TL;DR: SpatialGen is a multi-view multi-modal diffusion model that generates realistic 3D indoor scenes using a new large-scale synthetic dataset, producing appearance, geometry, and semantic information while maintaining spatial consistency.
Details
Motivation: Manual 3D modeling is time-consuming, and existing generative methods struggle with balancing visual quality, diversity, semantic consistency, and user control. There's a lack of large-scale, high-quality datasets for this task.
Method: Created a synthetic dataset with 12,328 annotated scenes and 4.7M renderings. Developed SpatialGen, a multi-view multi-modal diffusion model that takes 3D layout and reference image as input to synthesize appearance, geometry, and semantic maps from arbitrary viewpoints.
Result: SpatialGen consistently generates superior results compared to previous methods, producing realistic and semantically consistent 3D indoor scenes while preserving spatial consistency across modalities.
Conclusion: The approach addresses the dataset bottleneck and enables high-quality 3D scene generation. The data and models are being open-sourced to advance indoor scene understanding and generation.
Abstract: Creating high-fidelity 3D models of indoor environments is essential for applications in design, virtual reality, and robotics. However, manual 3D modeling remains time-consuming and labor-intensive. While recent advances in generative AI have enabled automated scene synthesis, existing methods often face challenges in balancing visual quality, diversity, semantic consistency, and user control. A major bottleneck is the lack of a large-scale, high-quality dataset tailored to this task. To address this gap, we introduce a comprehensive synthetic dataset, featuring 12,328 structured annotated scenes with 57,440 rooms, and 4.7M photorealistic 2D renderings. Leveraging this dataset, we present SpatialGen, a novel multi-view multi-modal diffusion model that generates realistic and semantically consistent 3D indoor scenes. Given a 3D layout and a reference image (derived from a text prompt), our model synthesizes appearance (color image), geometry (scene coordinate map), and semantic (semantic segmentation map) from arbitrary viewpoints, while preserving spatial consistency across modalities. SpatialGen consistently generates superior results to previous methods in our experiments. We are open-sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.
[444] Efficient Multimodal Dataset Distillation via Generative Models
Zhenghao Zhao, Haoxuan Wang, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan
Main category: cs.CV
TL;DR: EDGE is a generative method for efficient multimodal dataset distillation that addresses correlation and diversity challenges in image-text synthesis, achieving 18x faster performance than state-of-the-art methods.
Details
Motivation: Existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases computing resource requirements and takes days to complete distillation.
Method: Proposes a generative model training workflow with a bi-directional contrastive loss and a diversity loss, plus a caption synthesis strategy to improve text-to-image retrieval performance.
Result: Superior performance and efficiency on Flickr30K, COCO, and CC3M datasets, achieving results 18x faster than state-of-the-art methods.
Conclusion: EDGE provides an efficient solution for multimodal dataset distillation by addressing key challenges in generative modeling for image-text datasets.
Abstract: Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirement, and takes days to process the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges of distilling multimodal datasets with generative models: 1) The lack of correlation between generated images and captions. 2) The lack of diversity among generated samples. To address the aforementioned issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy to further improve text-to-image retrieval performance by introducing more text information. Our method is evaluated on Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18x faster than the state-of-the-art method.
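The bi-directional contrastive loss is the standard symmetric InfoNCE pairing used in CLIP-style training; a compact sketch follows, with the temperature as an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarities
    targets = torch.arange(len(img), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> matching image
    return (loss_i2t + loss_t2i) / 2
```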
[445] MS-GS: Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild
Deming Li, Kaiwen Jiang, Yutao Tang, Ravi Ramamoorthi, Rama Chellappa, Cheng Peng
Main category: cs.CV
TL;DR: MS-GS is a novel 3D Gaussian Splatting framework that addresses sparse-view reconstruction and multi-appearance challenges in in-the-wild photo collections by leveraging geometric priors from monocular depth and multi-view constraints.
Details
Motivation: In-the-wild photo collections often have limited imagery with varying appearances (different times, seasons), making scene reconstruction and novel view synthesis challenging. Existing NeRF and 3DGS adaptations tend to oversmooth and overfit in these sparse-view scenarios.
Method: Uses geometric priors from monocular depth estimations, extracts local semantic regions with SfM points anchoring for alignment, and applies geometry-guided supervision at virtual views with fine-grained and coarse schemes to ensure 3D consistency and reduce overfitting.
Result: MS-GS achieves photorealistic renderings under challenging sparse-view and multi-appearance conditions, significantly outperforming existing approaches across different datasets.
Conclusion: The proposed MS-GS framework effectively handles sparse-view reconstruction with multiple appearances using geometric priors and multi-view constraints, setting new benchmarks for realistic in-the-wild scenarios.
Abstract: In-the-wild photo collections often contain limited volumes of imagery and exhibit multiple appearances, e.g., taken at different times of day or seasons, posing significant challenges to scene reconstruction and novel view synthesis. Although recent adaptations of Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have improved in these areas, they tend to oversmooth and are prone to overfitting. In this paper, we present MS-GS, a novel framework designed with Multi-appearance capabilities in Sparse-view scenarios using 3DGS. To address the lack of support due to sparse initializations, our approach is built on the geometric priors elicited from monocular depth estimations. The key lies in extracting and utilizing local semantic regions with a Structure-from-Motion (SfM) points anchored algorithm for reliable alignment and geometry cues. Then, to introduce multi-view constraints, we propose a series of geometry-guided supervision at virtual views in a fine-grained and coarse scheme to encourage 3D consistency and reduce overfitting. We also introduce a dataset and an in-the-wild experiment setting to set up more realistic benchmarks. We demonstrate that MS-GS achieves photorealistic renderings under various challenging sparse-view and multi-appearance conditions and outperforms existing approaches significantly across different datasets.
[446] GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition
Tianyue Wang, Shuang Yang, Shiguang Shan, Xilin Chen
Main category: cs.CV
TL;DR: GLip is a Global-Local Integrated Progressive framework for robust visual speech recognition that addresses real-world visual challenges like illumination variations, occlusions, blurring, and pose changes through dual-path feature extraction and progressive learning.
Details
Motivation: Existing VSR methods pay limited attention to real-world visual challenges such as illumination variations, occlusions, blurring, and pose changes, which significantly impact performance in practical applications.
Method: GLip uses a dual-path feature extraction architecture integrating global and local features within a two-stage progressive learning framework. Stage 1 learns coarse alignment between visual features and speech units using audio-visual data. Stage 2 introduces a Contextual Enhancement Module to dynamically integrate local features with global context across spatial and temporal dimensions.
Result: The framework consistently outperforms existing methods on LRS2 and LRS3 benchmarks and demonstrates enhanced robustness against various visual challenges. It also shows effectiveness on a newly introduced challenging Mandarin dataset.
Conclusion: GLip’s progressive learning strategy that uniquely exploits discriminative local regions provides enhanced robustness against visual challenges and represents an effective approach for robust visual speech recognition in real-world scenarios.
Abstract: Visual speech recognition (VSR), also known as lip reading, is the task of recognizing speech from silent video. Despite significant advancements in VSR over recent decades, most existing methods pay limited attention to real-world visual challenges such as illumination variations, occlusions, blurring, and pose changes. To address these challenges, we propose GLip, a Global-Local Integrated Progressive framework designed for robust VSR. GLip is built upon two key insights: (i) learning an initial coarse alignment between visual features across varying conditions and corresponding speech content facilitates the subsequent learning of precise visual-to-speech mappings in challenging environments; (ii) under adverse conditions, certain local regions (e.g., non-occluded areas) often exhibit more discriminative cues for lip reading than global features. To this end, GLip introduces a dual-path feature extraction architecture that integrates both global and local features within a two-stage progressive learning framework. In the first stage, the model learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data, establishing a coarse yet semantically robust foundation. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context across both spatial and temporal dimensions, refining the coarse representations into precise visual-speech mappings. Our framework uniquely exploits discriminative local regions through a progressive learning strategy, demonstrating enhanced robustness against various visual challenges and consistently outperforming existing methods on the LRS2 and LRS3 benchmarks. We further validate its effectiveness on a newly introduced challenging Mandarin dataset.
[447] Automated Facility Enumeration for Building Compliance Checking using Door Detection and Large Language Models
Licheng Zhang, Bach Le, Naveed Akhtar, Tuan Ngo
Main category: cs.CV
TL;DR: This paper introduces automated facility enumeration for building compliance checking using LLMs with Chain-of-Thought reasoning, combining door detection with reasoning capabilities to validate facility quantities against requirements.
Details
Motivation: Manual facility enumeration for building compliance checking is time-consuming and labor-intensive, creating a critical gap in existing workflows that has been overlooked in literature despite its importance.
Method: Proposes a novel method integrating door detection with LLM-based reasoning using a Chain-of-Thought pipeline, being the first to apply LLMs to this task.
Result: Experiments on real-world and synthetic floor plan data demonstrate the method’s effectiveness, robustness, and good generalization across diverse datasets and facility types.
Conclusion: The proposed LLM-based approach with CoT reasoning successfully automates facility enumeration for building compliance checking, addressing a previously overlooked but critical component of the process.
Abstract: Building compliance checking (BCC) is a critical process for ensuring that constructed facilities meet regulatory standards. A core component of BCC is the accurate enumeration of facility types and their spatial distribution. Despite its importance, this problem has been largely overlooked in the literature, posing a significant challenge for BCC and leaving a critical gap in existing workflows. Performing this task manually is time-consuming and labor-intensive. Recent advances in large language models (LLMs) offer new opportunities to enhance automation by combining visual recognition with reasoning capabilities. In this paper, we introduce a new task for BCC: automated facility enumeration, which involves validating the quantity of each facility type against statutory requirements. To address it, we propose a novel method that integrates door detection with LLM-based reasoning. We are the first to apply LLMs to this task and further enhance their performance through a Chain-of-Thought (CoT) pipeline. Our approach generalizes well across diverse datasets and facility types. Experiments on both real-world and synthetic floor plan data demonstrate the effectiveness and robustness of our method.
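The enumeration check itself reduces to comparing detected counts against statutory minimums, with the LLM prompted to reason step by step. The counts, requirement table, and prompt wording below are illustrative only.

```python
detected = {"toilet": 4, "fire_exit": 2, "accessible_parking": 1}   # from door detection
required = {"toilet": 4, "fire_exit": 3, "accessible_parking": 2}   # statutory minimums

prompt = (
    "You are checking a floor plan for building-code compliance.\n"
    f"Detected facility counts: {detected}\n"
    f"Statutory minimums: {required}\n"
    "Think step by step: for each facility type, compare the detected count "
    "with the minimum, then report every shortfall."
)
# `prompt` is then sent to any chat-completion LLM endpoint
```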
[448] 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
Zihan Zheng, Zhenlong Wu, Houqiang Zhong, Yuan Tian, Ning Cao, Lan Xu, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Wenjun Zhang
Main category: cs.CV
TL;DR: 4DGCPro is a hierarchical 4D Gaussian compression framework that enables real-time mobile decoding and high-quality rendering of volumetric video through progressive streaming in a single bitstream.
Details
Motivation: Existing volumetric video compression methods lack flexibility for quality/bitrate adjustment in a single model and struggle with real-time decoding on mobile devices, preventing seamless viewing experiences comparable to 2D video.
Method: Proposes perceptually-weighted hierarchical 4D Gaussian representation with motion-aware adaptive grouping, plus end-to-end entropy-optimized training with layer-wise rate-distortion supervision and attribute-specific entropy modeling.
Result: Enables flexible quality and multiple bitrates within a single model, achieves real-time decoding and rendering on mobile devices, and outperforms existing methods in rate-distortion performance across multiple datasets.
Conclusion: 4DGCPro successfully addresses the challenges of volumetric video compression by providing efficient streaming, mobile compatibility, and superior compression performance in a unified framework.
Abstract: Achieving seamless viewing of high-fidelity volumetric video, comparable to 2D video experiences, remains an open challenge. Existing volumetric video compression methods either lack the flexibility to adjust quality and bitrate within a single model for efficient streaming across diverse networks and devices, or struggle with real-time decoding and rendering on lightweight mobile platforms. To address these challenges, we introduce 4DGCPro, a novel hierarchical 4D Gaussian compression framework that facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming in a single bitstream. Specifically, we propose a perceptually-weighted and compression-friendly hierarchical 4D Gaussian representation with motion-aware adaptive grouping to reduce temporal redundancy, preserve coherence, and enable scalable multi-level detail streaming. Furthermore, we present an end-to-end entropy-optimized training scheme, which incorporates layer-wise rate-distortion (RD) supervision and attribute-specific entropy modeling for efficient bitstream generation. Extensive experiments show that 4DGCPro enables flexible quality and multiple bitrate within a single model, achieving real-time decoding and rendering on mobile devices while outperforming existing methods in RD performance across multiple datasets. Project Page: https://mediax-sjtu.github.io/4DGCPro
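Layer-wise rate-distortion supervision amounts to summing a distortion term and a lambda-weighted rate estimate at every level of the hierarchy. A schematic version under assumptions (MSE distortion, per-layer lambdas):

```python
import torch

def layerwise_rd_loss(recons, target, bits_per_layer, lambdas):
    """recons: per-level reconstructions; bits_per_layer: rate estimates from
    the entropy model; lambdas: per-layer rate weights."""
    loss = torch.zeros((), device=target.device)
    for recon, bits, lam in zip(recons, bits_per_layer, lambdas):
        distortion = torch.mean((recon - target) ** 2)   # MSE distortion term
        loss = loss + distortion + lam * bits            # RD trade-off per layer
    return loss
```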
[449] Degradation-Aware All-in-One Image Restoration via Latent Prior Encoding
S M A Sharif, Abdur Rehman, Fayaz Ali Dharejo, Radu Timofte, Rizwan Ali Naqvi
Main category: cs.CV
TL;DR: Proposes a new all-in-one image restoration method that uses learned latent priors for adaptive feature selection, spatial localization, and degradation semantics, outperforming SOTA methods with 1.68 dB PSNR improvement and 3x efficiency.
Details
Motivation: Existing all-in-one restoration approaches rely on external text prompts or hand-crafted architectural priors, which impose brittle assumptions that weaken generalization to unseen or mixed degradations.
Method: Reframes AIR as learned latent prior inference, formulating it as a structured reasoning paradigm with adaptive feature selection, spatial localization, and degradation semantics. Uses a lightweight decoding module that leverages latent encoded cues for spatially-adaptive restoration.
Result: Outperforms state-of-the-art approaches across six common degradation tasks, five compound settings, and previously unseen degradations, achieving average PSNR improvement of 1.68 dB while being three times more efficient.
Conclusion: The proposed method successfully addresses limitations of existing AIR approaches by automatically inferring degradation-aware representations from input without explicit task cues, demonstrating superior performance and efficiency.
Abstract: Real-world images often suffer from spatially diverse degradations such as haze, rain, snow, and low-light, significantly impacting visual quality and downstream vision tasks. Existing all-in-one restoration (AIR) approaches either depend on external text prompts or embed hand-crafted architectural priors (e.g., frequency heuristics); both impose discrete, brittle assumptions that weaken generalization to unseen or mixed degradations. To address this limitation, we propose to reframe AIR as learned latent prior inference, where degradation-aware representations are automatically inferred from the input without explicit task cues. Based on latent priors, we formulate AIR as a structured reasoning paradigm: (1) which features to route (adaptive feature selection), (2) where to restore (spatial localization), and (3) what to restore (degradation semantics). We design a lightweight decoding module that efficiently leverages these latent encoded cues for spatially-adaptive restoration. Extensive experiments across six common degradation tasks, five compound settings, and previously unseen degradations demonstrate that our method outperforms state-of-the-art (SOTA) approaches, achieving an average PSNR improvement of 1.68 dB while being three times more efficient.
[450] Deep Learning for Clouds and Cloud Shadow Segmentation in Methane Satellite and Airborne Imaging Spectroscopy
Manuel Perez-Carrasco, Maya Nasr, Sebastien Roche, Chris Chan Miller, Zhan Zhang, Core Francisco Park, Eleanor Walker, Cecilia Garraffo, Douglas Finkbeiner, Ritesh Gautam, Steven Wofsy
Main category: cs.CV
TL;DR: Machine learning methods for cloud and cloud shadow detection in hyperspectral remote sensing, comparing conventional techniques with deep learning models to improve methane emission quantification.
Details
Motivation: Effective cloud and cloud shadow detection is critical for accurate atmospheric methane retrieval in hyperspectral remote sensing, especially for MethaneSAT and MethaneAIR missions, as clouds bias methane retrievals and impact emission quantification.
Method: Deployed and evaluated conventional methods (Iterative Logistic Regression and Multilayer Perceptron) against advanced deep learning architectures (UNet and Spectral Channel Attention Network) for cloud and cloud shadow detection.
Result: Conventional methods struggled with spatial coherence and boundary definition. Deep learning models substantially improved detection quality: UNet performed best in preserving spatial structure, while SCAN excelled at capturing fine boundary details and surpassed UNet on MethaneSAT data.
Conclusion: Advanced deep learning architectures provide robust, scalable solutions for cloud and cloud shadow screening, enhancing methane emission quantification capacity for hyperspectral missions. Spectral attention mechanisms are particularly beneficial for satellite-specific features.
Abstract: Effective cloud and cloud shadow detection is a critical prerequisite for accurate retrieval of concentrations of atmospheric methane or other trace gases in hyperspectral remote sensing. This challenge is especially pertinent for MethaneSAT and for its airborne companion mission, MethaneAIR. In this study, we use machine learning methods to address the cloud and cloud shadow detection problem for these high-spatial-resolution instruments. Clouds and cloud shadows in remote sensing data need to be effectively screened out as they bias methane retrievals in remote sensing imagery and impact the quantification of emissions. We deploy and evaluate conventional techniques, including Iterative Logistic Regression (ILR) and Multilayer Perceptron (MLP), alongside advanced deep learning architectures, namely UNet and a Spectral Channel Attention Network (SCAN). Our results show that conventional methods struggle with spatial coherence and boundary definition, affecting the detection of clouds and cloud shadows. Deep learning models substantially improve detection quality: UNet performs best in preserving spatial structure, while SCAN excels at capturing fine boundary details. Notably, SCAN surpasses UNet on MethaneSAT data, underscoring the benefits of incorporating spectral attention for satellite-specific features. This in-depth assessment of disparate machine learning techniques demonstrates the strengths and effectiveness of advanced deep learning architectures in providing robust, scalable solutions for cloud and cloud shadow screening towards enhancing the methane emission quantification capacity of existing and next-generation hyperspectral missions. Our data and code are publicly available at https://doi.org/10.7910/DVN/IKLZOJ
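"Spectral channel attention" plausibly follows the squeeze-and-excitation pattern applied over spectral bands; the block below is that generic pattern, not the authors' exact SCAN design.

```python
import torch
import torch.nn as nn

class SpectralChannelAttention(nn.Module):
    def __init__(self, bands, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(bands, bands // reduction), nn.ReLU(inplace=True),
            nn.Linear(bands // reduction, bands), nn.Sigmoid())

    def forward(self, x):                   # x: (B, bands, H, W)
        w = x.mean(dim=(2, 3))              # squeeze: global average per band
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                        # excite: reweight spectral bands
```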
[451] HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy
Dayu Tan, Zhenpeng Xu, Yansen Su, Xin Peng, Chunhou Zheng, Weimin Zhong
Main category: cs.CV
TL;DR: HiPerformer is a novel medical image segmentation method that uses modular hierarchical architecture and local-global feature fusion to better integrate local details and global context, outperforming existing methods on 11 datasets.
Details
Motivation: Existing CNN-Transformer hybrid methods use simple feature fusion techniques that struggle with feature inconsistencies, leading to information conflict and loss in medical image segmentation.
Method: Proposes HiPerformer with modular hierarchical encoder for parallel multi-source feature fusion, Local-Global Feature Fusion (LGFF) module for precise integration, and Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections.
Result: Experiments on eleven public datasets show HiPerformer outperforms existing segmentation techniques with higher accuracy and robustness.
Conclusion: HiPerformer effectively addresses feature inconsistency problems in medical image segmentation through its innovative architecture and fusion modules, achieving superior performance compared to current methods.
Abstract: Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address the aforementioned challenges, we innovatively propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.
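One common way to fuse local (CNN) and global (Transformer) feature maps without naive addition is a learned per-pixel gate; the sketch below shows that pattern as a stand-in for the paper's LGFF module, whose internals are not reproduced here.

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, local_feat, global_feat):   # both (B, C, H, W)
        g = self.gate(torch.cat([local_feat, global_feat], dim=1))
        return g * local_feat + (1 - g) * global_feat   # per-pixel blend
```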
[452] FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis
Xichen Xu, Yanshu Wang, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Main category: cs.CV
TL;DR: FAST is a foreground-aware diffusion framework for industrial anomaly segmentation that uses accelerated sampling and foreground-aware reconstruction to generate high-quality anomalies efficiently.
Details
Motivation: Existing anomaly synthesis methods struggle with balancing sampling efficiency and generation quality, and treat all spatial regions uniformly without considering statistical differences between anomaly and background areas.
Method: Proposes FAST with two modules: Anomaly-Informed Accelerated Sampling (AIAS) for training-free accelerated sampling in 10 steps, and Foreground-Aware Reconstruction Module (FARM) that adaptively adjusts anomaly-aware noise in masked foreground regions.
Result: Extensive experiments on multiple industrial benchmarks show FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks.
Conclusion: FAST enables efficient, high-quality synthesis of structure-specific anomalies for industrial segmentation tasks through its foreground-aware approach and accelerated sampling.
Abstract: Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis.
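FARM's key move, treating foreground and background noise differently at each step, can be sketched as a masked adjustment of the predicted noise before the standard update. The damping rule and factor below are assumptions, not the paper's formulation.

```python
import torch

def foreground_aware_step(x_t, eps_pred, mask, step_fn, gamma=0.8):
    """x_t: current latent; eps_pred: predicted noise; mask: (B, 1, H, W)
    anomaly foreground in {0, 1}; step_fn: a standard denoising update."""
    eps_adj = eps_pred * (mask * gamma + (1 - mask))   # damp denoising in foreground
    return step_fn(x_t, eps_adj)                       # proceed with the usual sampler step
```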
[453] EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, Daniil Pakhomov, Zhe Lin, Soo Ye Kim, Qiang Xu
Main category: cs.CV
TL;DR: EditVerse is a unified framework for image and video generation and editing within a single model that represents all modalities as unified token sequences, enabling cross-modal knowledge transfer and flexible handling of arbitrary resolutions and durations.
Details
Motivation: Video generation and editing remain fragmented due to architectural limitations and data scarcity, while image generation has successfully transitioned to unified frameworks. There is a need for a unified approach that can handle both image and video tasks.
Method: Represent all modalities (text, image, video) as unified token sequences and leverage self-attention for in-context learning and cross-modal knowledge transfer. Address data scarcity by creating a scalable pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training.
Result: EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models. It exhibits emergent editing and generation abilities across modalities and performs well on the newly created EditVerseBench benchmark for instruction-based video editing.
Conclusion: EditVerse successfully demonstrates that a unified framework can effectively handle both image and video generation and editing tasks, overcoming previous fragmentation and data scarcity issues through innovative token representation and scalable data curation.
Abstract: Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.
[454] Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection
Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, Yuguang Fang
Main category: cs.CV
TL;DR: Neptune-X is a data-centric framework that uses synthetic data generation and task-aware sample selection to improve maritime object detection, addressing data scarcity and generalization challenges.
Details
Motivation: Maritime object detection faces challenges due to scarce annotated data and poor generalization across different maritime attributes like object category, viewpoint, location, and imaging environment.
Method: Proposes Neptune-X framework with X-to-Maritime generative model using Bidirectional Object-Water Attention for realistic scene synthesis, and Attribute-correlated Active Sampling for dynamic sample selection based on task relevance.
Result: The approach sets new benchmarks in maritime scene synthesis and significantly improves detection accuracy, especially in challenging and underrepresented settings.
Conclusion: Neptune-X effectively addresses maritime data scarcity and generalization issues through synthetic data generation and intelligent sample selection, with code publicly available.
Abstract: Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream task performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.
[455] Real-Time Object Detection Meets DINOv3
Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
Main category: cs.CV
TL;DR: DEIMv2 extends DEIM with DINOv3 features across eight model sizes, using Spatial Tuning Adapter for larger models and HGNetv2 with pruning for smaller ones, achieving superior performance-cost trade-offs and new SOTA results.
Details
Motivation: To extend the successful DEIM framework with DINOv3 features and create a unified design that spans from GPU to mobile deployment with optimal performance-cost balance.
Method: For X/L/M/S variants: DINOv3-pretrained backbones + Spatial Tuning Adapter (converts single-scale to multi-scale features). For Nano/Pico/Femto/Atto: HGNetv2 with depth/width pruning + simplified decoder + upgraded Dense O2O.
Result: DEIMv2-X achieves 57.8 AP with 50.3M parameters (surpassing 56.5 AP with 60M+ parameters). DEIMv2-S is first sub-10M model (9.71M) to exceed 50 AP (50.9 AP). DEIMv2-Pico (1.5M) matches YOLOv10-Nano (2.3M) with 38.5 AP.
Conclusion: DEIMv2 establishes new SOTA across diverse scenarios with superior performance-cost trade-off, demonstrating effectiveness of unified design with DINOv3 features and efficient architecture adaptations.
Abstract: Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3’s single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters. Our code and pre-trained models are available at https://github.com/Intellindust-AI-Lab/DEIMv2
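The paper describes the Spatial Tuning Adapter only at a high level; a minimal sketch of the general pattern, assuming one channel projection plus up- and down-sampling branches that turn a stride-16 DINOv3 map into a three-level pyramid, could look like this. Layer choices and dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class SpatialTuningAdapter(nn.Module):
    """Hypothetical adapter: turns a single-scale ViT feature map (stride 16)
    into a three-level pyramid (strides 8/16/32) for a detection decoder."""
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_dim, out_dim, 1)                        # channels
        self.up = nn.ConvTranspose2d(out_dim, out_dim, 2, stride=2)      # 16 -> 8
        self.down = nn.Conv2d(out_dim, out_dim, 3, stride=2, padding=1)  # 16 -> 32
        self.refine = nn.ModuleList(
            [nn.Conv2d(out_dim, out_dim, 3, padding=1) for _ in range(3)])

    def forward(self, x):                      # x: (B, C, H, W), single scale
        mid = self.proj(x)
        feats = [self.up(mid), mid, self.down(mid)]
        return [conv(f) for conv, f in zip(self.refine, feats)]

tokens = torch.randn(2, 768, 40, 40)           # a 640x640 image at stride 16
pyramid = SpatialTuningAdapter()(tokens)
print([tuple(f.shape) for f in pyramid])       # strides 8, 16, 32
```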
[456] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Sicheng Tao, Jungang Li, Yibo Yan, Junyan Zhang, Yubo Gao, Hanqian Li, ShuHang Xun, Yuxuan Fan, Hong Chen, Jianxiang He, Xuming Hu
Main category: cs.CV
TL;DR: MOSS-ChatV is a reinforcement learning framework with DTW-based process reward that addresses process inconsistency in video reasoning by aligning reasoning traces with temporal dynamics, improving interpretability and robustness.
Details
Motivation: Existing multimodal LLMs exhibit process inconsistency where intermediate reasoning drifts from video dynamics even when final answers are correct, undermining interpretability and robustness in video reasoning tasks.
Method: Introduces MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward that aligns reasoning traces with temporally grounded references, and constructs the MOSS-Video benchmark with annotated reasoning traces for training and evaluation.
Result: Achieves 87.2% on MOSS-Video test split, improves performance on MVBench and MMVU benchmarks, and shows consistent gains across different architectures (Qwen2.5-VL, Phi-2). GPT-4o-as-judge confirms more consistent and stable reasoning traces.
Conclusion: The DTW-based process reward framework effectively addresses process inconsistency in video reasoning, demonstrating broad applicability across architectures and producing more interpretable and robust reasoning traces.
Abstract: Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
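Because the process reward is rule-based DTW alignment, the core computation is easy to sketch: classic dynamic-programming DTW over per-step embeddings, mapped to a bounded reward. The length normalization and exponential shaping below are assumptions, not the paper's exact formula.

```python
import numpy as np

def dtw_distance(trace, ref):
    """Classic dynamic-programming DTW over per-step embeddings (rows),
    with Euclidean step cost."""
    n, m = len(trace), len(ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(trace[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def process_reward(trace, ref, scale=1.0):
    """Map alignment cost to a bounded [0, 1] reward; the normalization and
    shaping are illustrative."""
    d = dtw_distance(trace, ref) / max(len(trace), len(ref))
    return float(np.exp(-scale * d))

trace = np.random.randn(6, 32)  # embeddings of 6 generated reasoning steps
ref = np.random.randn(5, 32)    # temporally grounded reference states
print(process_reward(trace, ref))
```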
cs.AI
[457] Towards mitigating information leakage when evaluating safety monitors
Gerard Boxo, Aman Neelappa, Shivam Raval
Main category: cs.AI
TL;DR: A framework for evaluating white box monitors that detect harmful behaviors in LLMs, addressing performance inflation caused by data leakage during training/evaluation.
Details
Motivation: To address the challenge where monitor performance is inflated due to leakage of elicitation artifacts from prompts used to generate harmful behaviors, rather than genuine detection of model behavior.
Method: Proposed three evaluation strategies: content filtering (removing deception-related text), score filtering (aggregating only task-relevant tokens), and prompt distilled fine-tuned model organisms (models trained to exhibit deceptive behavior without explicit prompting).
Result: Content filtering decreased probe AUROC by 30%, score filtering reduced AUROC by 15%, and fine-tuned model organisms reduced monitor performance by up to 40% even when re-trained.
Conclusion: The framework reveals significant performance inflation in white box monitors due to elicitation and reasoning leakage, and provides effective mitigation strategies to evaluate genuine detection capabilities.
Abstract: White box monitors that analyze model internals offer promising advantages for detecting potentially harmful behaviors in large language models, including lower computational costs and integration into layered defense systems. However, training and evaluating these monitors requires response exemplars that exhibit the target behaviors, typically elicited through prompting or fine-tuning. This presents a challenge when the information used to elicit behaviors inevitably leaks into the data that monitors ingest, inflating their effectiveness. We present a systematic framework for evaluating a monitor’s performance in terms of its ability to detect genuine model behavior rather than superficial elicitation artifacts. Furthermore, we propose three novel strategies to evaluate the monitor: content filtering (removing deception-related text from inputs), score filtering (aggregating only over task-relevant tokens), and prompt-distilled fine-tuned model organisms (models trained to exhibit deceptive behavior without explicit prompting). Using deception detection as a representative case study, we identify two forms of leakage that inflate monitor performance: elicitation leakage from prompts that explicitly request harmful behavior, and reasoning leakage from models that verbalize their deceptive actions. Through experiments on multiple deception benchmarks, we apply our proposed mitigation strategies and measure performance retention. Our evaluation of the monitors reveals three crucial findings: (1) content filtering is a good mitigation strategy that allows for a smooth removal of the elicitation signal and can decrease probe AUROC by 30%; (2) score filtering was found to reduce AUROC by 15%, though the effect is harder to attribute; (3) a fine-tuned model organism improves monitor evaluations but reduces their performance by up to 40%, even when re-trained.
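Of the three strategies, score filtering is the most mechanical. A toy version, which aggregates a probe's per-token scores only over task-relevant tokens, might look like the following; how the relevance mask is built is left abstract, and the mean aggregator is an assumption.

```python
import numpy as np

def filtered_monitor_score(token_scores, task_mask):
    """Aggregate a probe's per-token scores only over task-relevant tokens,
    dropping tokens tied to the elicitation prompt or verbalized deception."""
    token_scores = np.asarray(token_scores, dtype=float)
    task_mask = np.asarray(task_mask, dtype=bool)
    if not task_mask.any():
        return 0.0
    return float(token_scores[task_mask].mean())

scores = [0.9, 0.8, 0.2, 0.3, 0.25]          # per-token probe outputs
mask = [False, False, True, True, True]      # True = task-relevant token
print(filtered_monitor_score(scores, mask))  # 0.25, vs. a naive mean of 0.49
```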
[458] Correct Reasoning Paths Visit Shared Decision Pivots
Dongkyu Cho, Amy B. Z. Zhang, Bilel Fehri, Sheng Wang, Rumi Chunara, Rui Song, Hengrui Cai
Main category: cs.AI
TL;DR: Proposes decision pivots - minimal verifiable checkpoints that correct reasoning paths must visit, and a self-training pipeline to align LLM reasoning without ground truth data.
Details
Motivation: Chain-of-thought reasoning exposes LLM thinking processes but verifying these traces at scale remains unsolved, requiring a method to validate reasoning paths efficiently.
Method: Self-training pipeline that samples diverse reasoning paths, mines shared decision pivots, compresses traces into pivot-focused short-path reasoning using an auxiliary verifier, and post-trains the model using self-generated outputs.
Result: Experiments on LogiQA, MedQA, and MATH500 benchmarks demonstrate the method’s effectiveness in aligning reasoning without ground truth data or external metrics.
Conclusion: Decision pivots provide a scalable way to verify and align LLM reasoning by identifying essential checkpoints that correct reasoning paths must satisfy, enabling effective self-training without external supervision.
Abstract: Chain-of-thought (CoT) reasoning exposes the intermediate thinking process of large language models (LLMs), yet verifying those traces at scale remains unsolved. In response, we introduce the idea of decision pivots: minimal, verifiable checkpoints that any correct reasoning path must visit. We hypothesize that correct reasoning paths, though stylistically diverse, converge on the same pivot set, while incorrect ones violate at least one pivot. Leveraging this property, we propose a self-training pipeline that (i) samples diverse reasoning paths and mines shared decision pivots, (ii) compresses each trace into pivot-focused short-path reasoning using an auxiliary verifier, and (iii) post-trains the model using its self-generated outputs. The proposed method aligns reasoning without ground-truth reasoning data or external metrics. Experiments on standard benchmarks such as LogiQA, MedQA, and MATH500 show the effectiveness of our method.
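The pivot-mining step can be sketched as: sample several reasoning paths, canonicalize each step into a verifiable claim, and keep the claims shared by (nearly) all paths. The `verify` function and the support threshold below are illustrative assumptions.

```python
from collections import Counter

def mine_decision_pivots(paths, verify, min_support=0.9):
    """Keep claims visited by at least `min_support` of the sampled paths.
    `paths` is a list of step lists; `verify` canonicalizes a step into a
    checkable claim, or returns None for filler steps."""
    counts = Counter()
    for path in paths:
        claims = {verify(step) for step in path} - {None}
        counts.update(claims)
    return {c for c, k in counts.items() if k / len(paths) >= min_support}

verify = lambda step: step if "=" in step else None  # toy claim extractor
paths = [
    ["restate the problem", "2x = 10", "x = 5"],
    ["draw a diagram", "2x = 10", "x = 5"],
    ["guess and check", "x = 5"],
]
print(mine_decision_pivots(paths, verify))  # {'x = 5'}
```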
[459] AutoClimDS: Climate Data Science Agentic AI – A Knowledge Graph is All You Need
Ahmed Jaber, Wangshu Zhu, Karthick Jayavelu, Justin Downes, Sameer Mohamed, Candace Agonafir, Linnia Hawkins, Tian Zheng
Main category: cs.AI
TL;DR: A knowledge graph integrated with AI agents lowers barriers in climate data science by enabling natural language interaction and automated workflows.
Details
Motivation: Climate data science faces challenges from fragmented data sources, heterogeneous formats, and high technical expertise requirements that limit participation and reproducibility.
Method: Integration of a curated knowledge graph with AI agents powered by generative AI services, leveraging cloud-native workflows and existing API data portals.
Result: The system drastically lowers technical thresholds, enabling non-specialist users to identify and analyze relevant climate datasets through natural language interaction.
Conclusion: This approach demonstrates a pathway toward democratizing climate data access and establishing reproducible, extensible human-AI collaboration frameworks in scientific research.
Abstract: Climate data science faces persistent barriers stemming from the fragmented nature of data sources, heterogeneous formats, and the steep technical expertise required to identify, acquire, and process datasets. These challenges limit participation, slow discovery, and reduce the reproducibility of scientific workflows. In this paper, we present a proof of concept for addressing these barriers through the integration of a curated knowledge graph (KG) with AI agents designed for cloud-native scientific workflows. The KG provides a unifying layer that organizes datasets, tools, and workflows, while AI agents – powered by generative AI services – enable natural language interaction, automated data access, and streamlined analysis. Together, these components drastically lower the technical threshold for engaging in climate data science, enabling non-specialist users to identify and analyze relevant datasets. By leveraging existing cloud-ready API data portals, we demonstrate that “a knowledge graph is all you need” to unlock scalable and agentic workflows for scientific inquiry. The open-source design of our system further supports community contributions, ensuring that the KG and associated tools can evolve as a shared commons. Our results illustrate a pathway toward democratizing access to climate data and establishing a reproducible, extensible framework for human–AI collaboration in scientific research.
[460] EEG-Based Consumer Behaviour Prediction: An Exploration from Classical Machine Learning to Graph Neural Networks
Mohammad Parsa Afshar, Aryan Azimi
Main category: cs.AI
TL;DR: This paper compares classical machine learning models and Graph Neural Networks (GNNs) for predicting consumer behavior using EEG data from the NeuMa dataset, finding GNNs generally perform better in certain criteria.
Details
Motivation: To predict consumer behavior using EEG data for applications in marketing, cognitive neuroscience, and human-computer interaction by analyzing brain neural activity during decision processes.
Method: Extracted and cleaned EEG features from the NeuMa dataset, created brain connectivity features for GNN models, and compared various machine learning models including classical models (ensemble models, SVM) and GNNs with different architectures.
Result: No significant overall performance difference between models, but GNN models generally performed better in some basic criteria where classical models were unsatisfactory.
Conclusion: EEG signal analysis combined with machine learning provides deeper understanding of consumer behavior, and GNNs show promise as an alternative to traditional models in EEG-based neuromarketing applications.
Abstract: Predicting consumer behavior is an important goal in marketing, cognitive neuroscience, and human-computer interaction. Electroencephalography (EEG) data can help analyze the decision process by providing detailed information about the brain’s neural activity. In this research, a comparative approach is used to predict consumer behavior from EEG data. In the first step, the features of the EEG data from the NeuMa dataset were extracted and cleaned. For the Graph Neural Network (GNN) models, brain connectivity features were created. Different machine learning models, from classical models to Graph Neural Networks, are applied and compared. GNN models with different architectures are implemented for a comprehensive comparison; furthermore, a wide range of classical models, such as ensemble models, are applied, which helps show the performance differences of each model on the dataset. Although the results did not show a significant difference overall, the GNN models generally performed better in some basic criteria where classical models were not satisfactory. This study not only shows that combining EEG signal analysis with machine learning models can provide a deeper understanding of consumer behavior, but also provides a comprehensive comparison between models widely used in previous EEG-based neuromarketing studies, such as the Support Vector Machine (SVM), and models rarely or never used in the field, like Graph Neural Networks.
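The brain-connectivity features that feed the GNNs can be illustrated with a simple Pearson-correlation graph over channels. The paper may use a different connectivity measure; the threshold here is arbitrary.

```python
import numpy as np

def connectivity_graph(eeg, threshold=0.3):
    """Build a brain-connectivity graph from one EEG trial.
    eeg: (channels, timesteps). Edges join channel pairs whose absolute
    Pearson correlation exceeds `threshold` (one common choice; the paper
    may use a different connectivity measure)."""
    corr = np.corrcoef(eeg)                   # (C, C) channel correlations
    adj = (np.abs(corr) > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                # no self-loops
    edges = np.argwhere(adj > 0)              # (E, 2) edge index for a GNN
    return corr, edges

trial = np.random.randn(8, 512)               # 8 channels, 512 samples
corr, edges = connectivity_graph(trial)
print(edges.shape)                            # edge list fed to the GNN
```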
[461] GeoEvolve: Automating Geospatial Model Discovery via Multi-Agent Large Language Models
Peng Luo, Xiayin Lou, Yu Zheng, Zhuo Zheng, Stefano Ermon
Main category: cs.AI
TL;DR: GeoEvolve is a multi-agent LLM framework that combines evolutionary search with geospatial domain knowledge to automatically design and refine geospatial algorithms, achieving significant improvements in spatial interpolation and uncertainty quantification tasks.
Details
Motivation: Existing LLM-based algorithm discovery frameworks lack domain knowledge and multi-step reasoning required for complex geospatial problems, while geospatial modeling is critical for addressing global challenges like sustainability and climate change.
Method: Two nested loops: an inner loop uses a code evolver to generate and mutate candidate solutions, while an outer agentic controller evaluates elites and queries the GeoKnowRAG module (a structured geospatial knowledge base) to inject theoretical priors from geography, enabling knowledge-guided evolution.
Result: Reduced spatial interpolation error (RMSE) by 13-21%, enhanced uncertainty estimation performance by 17%, and discovered new algorithms incorporating geospatial theory on top of classical models.
Conclusion: GeoEvolve provides a scalable path toward automated, knowledge-driven geospatial modeling, opening new opportunities for trustworthy and efficient AI-for-Science discovery, with domain-guided retrieval being essential for stable, high-quality evolution.
Abstract: Geospatial modeling provides critical solutions for pressing global challenges such as sustainability and climate change. Existing large language model (LLM)-based algorithm discovery frameworks, such as AlphaEvolve, excel at evolving generic code but lack the domain knowledge and multi-step reasoning required for complex geospatial problems. We introduce GeoEvolve, a multi-agent LLM framework that couples evolutionary search with geospatial domain knowledge to automatically design and refine geospatial algorithms. GeoEvolve operates in two nested loops: an inner loop leverages a code evolver to generate and mutate candidate solutions, while an outer agentic controller evaluates global elites and queries a GeoKnowRAG module – a structured geospatial knowledge base that injects theoretical priors from geography. This knowledge-guided evolution steers the search toward theoretically meaningful and computationally efficient algorithms. We evaluate GeoEvolve on two fundamental and classical tasks: spatial interpolation (kriging) and spatial uncertainty quantification (geospatial conformal prediction). Across these benchmarks, GeoEvolve automatically improves and discovers new algorithms, incorporating geospatial theory on top of classical models. It reduces spatial interpolation error (RMSE) by 13-21% and enhances uncertainty estimation performance by 17%. Ablation studies confirm that domain-guided retrieval is essential for stable, high-quality evolution. These results demonstrate that GeoEvolve provides a scalable path toward automated, knowledge-driven geospatial modeling, opening new opportunities for trustworthy and efficient AI-for-Science discovery.
[462] Automated and Interpretable Survival Analysis from Multimodal Data
Mafalda Malafaia, Peter A. N. Bosman, Coen Rasch, Tanja Alderliesten
Main category: cs.AI
TL;DR: MultiFIX is an interpretable multimodal AI framework that integrates clinical variables and CT imaging for survival analysis in head and neck cancer, achieving superior performance while maintaining transparency through feature interpretation and Cox regression.
Details
Motivation: The need for accurate and interpretable survival analysis in oncology is growing due to increasing multimodal data and clinical requirements for transparent models that support validation and trust.
Method: Uses deep learning to extract survival-relevant features from CT imaging (interpreted via Grad-CAM) and clinical variables (modeled as symbolic expressions through genetic programming), with risk estimation via transparent Cox regression for patient stratification.
Result: Achieved C-index of 0.838 for prediction and 0.826 for stratification on RADCURE head and neck cancer dataset, outperforming clinical and academic baseline approaches while aligning with known prognostic markers.
Conclusion: MultiFIX demonstrates the promise of interpretable multimodal AI for precision oncology by providing both high accuracy and transparency in survival analysis.
Abstract: Accurate and interpretable survival analysis remains a core challenge in oncology. With growing multimodal data and the clinical need for transparent models to support validation and trust, this challenge increases in complexity. We propose an interpretable multimodal AI framework to automate survival analysis by integrating clinical variables and computed tomography imaging. Our MultiFIX-based framework uses deep learning to infer survival-relevant features that are further explained: imaging features are interpreted via Grad-CAM, while clinical variables are modeled as symbolic expressions through genetic programming. Risk estimation employs a transparent Cox regression, enabling stratification into groups with distinct survival outcomes. Using the open-source RADCURE dataset for head and neck cancer, MultiFIX achieves a C-index of 0.838 (prediction) and 0.826 (stratification), outperforming the clinical and academic baseline approaches and aligning with known prognostic markers. These results highlight the promise of interpretable multimodal AI for precision oncology with MultiFIX.
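The final risk-estimation stage is a standard Cox proportional-hazards regression, which is straightforward to sketch with the `lifelines` package on a toy dataframe of fused features. Column names and data below are fabricated for illustration.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "img_feat": rng.normal(size=200),         # deep feature from the CT branch
    "clin_expr": rng.normal(size=200),        # symbolic clinical expression value
    "time": rng.exponential(24.0, size=200),  # follow-up time (months)
    "event": rng.integers(0, 2, size=200),    # 1 = event observed
})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()                            # hazard ratios stay inspectable
risk = cph.predict_partial_hazard(df)
groups = np.where(risk > risk.median(), "high", "low")  # risk stratification
```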
[463] Semantic F1 Scores: Fair Evaluation Under Fuzzy Class Boundaries
Georgios Chochlakis, Jackson Trager, Vedant Jhaveri, Nikhil Ravichandran, Alexandros Potamianos, Shrikanth Narayanan
Main category: cs.AI
TL;DR: Semantic F1 Scores are new evaluation metrics for fuzzy multi-label classification that use label similarity matrices to give partial credit for semantically related predictions, addressing limitations of conventional F1 scores.
Details
Motivation: Conventional F1 metrics treat semantically related predictions as complete failures, which doesn't reflect the realities of domains with human disagreement or fuzzy category boundaries where similar predictions lead to similar outcomes.
Method: Uses a two-step precision-recall formulation with a label similarity matrix to compute soft precision and recall scores, enabling comparison of label sets of arbitrary sizes without discarding labels or forcing matches.
Result: Semantic F1 demonstrates greater interpretability and ecological validity through theoretical justification and empirical validation on synthetic and real data.
Conclusion: Semantic F1 provides fairer evaluations by recognizing category overlap and annotator disagreement, and is applicable across tasks and modalities since it only requires a domain-appropriate similarity matrix rather than a rigid ontology.
Abstract: We propose Semantic F1 Scores, novel evaluation metrics for subjective or fuzzy multi-label classification that quantify semantic relatedness between predicted and gold labels. Unlike the conventional F1 metrics that treat semantically related predictions as complete failures, Semantic F1 incorporates a label similarity matrix to compute soft precision-like and recall-like scores, from which the Semantic F1 scores are derived. Unlike existing similarity-based metrics, our novel two-step precision-recall formulation enables the comparison of label sets of arbitrary sizes without discarding labels or forcing matches between dissimilar labels. By granting partial credit for semantically related but nonidentical labels, Semantic F1 better reflects the realities of domains marked by human disagreement or fuzzy category boundaries. In this way, it provides fairer evaluations: it recognizes that categories overlap, that annotators disagree, and that downstream decisions based on similar predictions lead to similar outcomes. Through theoretical justification and extensive empirical validation on synthetic and real data, we show that Semantic F1 demonstrates greater interpretability and ecological validity. Because it requires only a domain-appropriate similarity matrix, which is robust to misspecification, and not a rigid ontology, it is applicable across tasks and modalities.
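The two-step formulation can be sketched directly: each predicted label is credited with its best similarity to any gold label (soft precision), each gold label with its best similarity to any prediction (soft recall), and the two are combined harmonically. The similarity matrix below is invented, and the paper's exact aggregation may differ.

```python
import numpy as np

def semantic_f1(pred, gold, sim, labels):
    """Two-step soft scores: each predicted label is credited with its best
    similarity to any gold label (soft precision side), each gold label with
    its best similarity to any prediction (soft recall side)."""
    idx = {l: i for i, l in enumerate(labels)}
    if not pred or not gold:
        return 0.0
    P = np.mean([max(sim[idx[p], idx[g]] for g in gold) for p in pred])
    R = np.mean([max(sim[idx[p], idx[g]] for p in pred) for g in gold])
    return 2 * P * R / (P + R) if P + R > 0 else 0.0

labels = ["joy", "love", "anger"]
sim = np.array([[1.0, 0.8, 0.1],     # invented label-similarity matrix
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(semantic_f1(["love"], ["joy"], sim, labels))  # 0.8; hard F1 gives 0.0
```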
[464] Can AI Perceive Physical Danger and Intervene?
Abhishek Jindal, Dmitry Kalashnikov, Oscar Chang, Divya Garikapati, Anirudha Majumdar, Pierre Sermanet, Vikas Sindhwani
Main category: cs.AI
TL;DR: This paper develops a scalable physical safety benchmark for Embodied AI systems, analyzes foundation models’ safety understanding, and creates a post-training method to improve safety reasoning with interpretable thinking traces.
Details
Motivation: Address safety challenges when AI interacts with the physical world, where physical harm is direct and immediate, by testing if foundation models understand common-sense physical safety facts.
Method: Create photorealistic images/videos of safe-to-unsafe transitions using generative models, analyze major foundation models’ risk perception and safety reasoning, and develop a post-training paradigm with explicit safety constraint reasoning.
Result: Comprehensive analysis reveals deployment readiness of foundation models for safety-critical applications, with post-trained models achieving state-of-the-art constraint satisfaction performance.
Conclusion: The benchmark enables continuous physical safety evaluation of Embodied AI, and the post-training approach makes safety reasoning interpretable and transparent while improving performance.
Abstract: When AI interacts with the physical world – as a robot or an assistive agent – new safety challenges emerge beyond those of purely “digital AI”. In such interactions, the potential for physical harm is direct and immediate. How well do state-of-the-art foundation models understand common-sense facts about physical safety, e.g. that a box may be too heavy to lift, or that a hot cup of coffee should not be handed to a child? In this paper, our contributions are three-fold: first, we develop a highly scalable approach to continuous physical safety benchmarking of Embodied AI systems, grounded in real-world injury narratives and operational safety constraints. To probe multi-modal safety understanding, we turn these narratives and constraints into photorealistic images and videos capturing transitions from safe to unsafe states, using advanced generative models. Secondly, we comprehensively analyze the ability of major foundation models to perceive risks, reason about safety, and trigger interventions; this yields multi-faceted insights into their deployment readiness for safety-critical agentic applications. Finally, we develop a post-training paradigm to teach models to explicitly reason about embodiment-specific safety constraints provided through system instructions. The resulting models generate thinking traces that make safety reasoning interpretable and transparent, achieving state of the art performance in constraint satisfaction evaluations. The benchmark will be released at https://asimov-benchmark.github.io/v2
[465] Axiomatic Choice and the Decision-Evaluation Paradox
Ben Abramowitz, Nicholas Mattei
Main category: cs.AI
TL;DR: A framework for modeling decisions with axioms (ethical constraints) reveals a Decision-Evaluation Paradox between using axioms to make vs. evaluate decisions.
Details
Motivation: To understand the structural properties of decision axioms and identify potential tensions in their application.
Method: Developed a framework for modeling decisions with axioms and defined a taxonomy based on structural properties.
Result: Discovered the Decision-Evaluation Paradox - a tension between using axioms to make decisions versus evaluating decisions.
Conclusion: The paradox shows careful consideration is needed when training models on decision data or applying axioms for decision-making and evaluation.
Abstract: We introduce a framework for modeling decisions with axioms that are statements about decisions, e.g., ethical constraints. Using our framework we define a taxonomy of decision axioms based on their structural properties and demonstrate a tension between the use of axioms to make decisions and the use of axioms to evaluate decisions which we call the Decision-Evaluation Paradox. We argue that the Decision-Evaluation Paradox arises with realistic axiom structures, and the paradox illuminates why one must be exceptionally careful when training models on decision data or applying axioms to make and evaluate decisions.
[466] Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization
Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Roy Fejgin, Ryan Langman, Mikyas Desta, Leili Tavabi, Jason Li
Main category: cs.AI
TL;DR: A framework using Group Relative Policy Optimization (GRPO) to adapt multilingual TTS models to low-resource languages using limited paired data and unpaired text with ASR-guided rewards.
Details
Motivation: Developing TTS for low-resource languages is challenging due to scarce paired text-speech data, while ASR models are more accessible through multilingual pre-training.
Method: Three-stage approach: 1) Train multilingual TTS baseline with IPA tokens, 2) Fine-tune on limited paired data for target language prosody, 3) Apply GRPO optimization using unpaired text and speaker prompts with multi-objective rewards from ASR, speaker verification, and audio quality models.
Result: Produces intelligible and speaker-consistent speech in low-resource languages, substantially outperforming fine-tuning alone. Also improves TTS in high-resource languages, surpassing DPO with better intelligibility, speaker similarity, and audio quality.
Conclusion: GRPO-based framework effectively adapts TTS models to low-resource languages using accessible ASR models and unpaired data, achieving superior performance compared to traditional methods.
Abstract: Developing high-quality text-to-speech (TTS) systems for low-resource languages is challenging due to the scarcity of paired text and speech data. In contrast, automatic speech recognition (ASR) models for such languages are often more accessible, owing to large-scale multilingual pre-training efforts. We propose a framework based on Group Relative Policy Optimization (GRPO) to adapt an autoregressive, multilingual TTS model to new languages. Our method first establishes a language-agnostic foundation for TTS synthesis by training a multilingual baseline with International Phonetic Alphabet (IPA) tokens. Next, we fine-tune this model on limited paired data of the new languages to capture the target language’s prosodic features. Finally, we apply GRPO to optimize the model using only unpaired text and speaker prompts, guided by a multi-objective reward from pretrained ASR, speaker verification, and audio quality estimation models. Experiments demonstrate that this pipeline produces intelligible and speaker-consistent speech in low-resource languages, substantially outperforming fine-tuning alone. Furthermore, our GRPO-based framework also improves TTS performance in high-resource languages, surpassing offline alignment methods such as Direct Preference Optimization (DPO), yielding superior intelligibility, speaker similarity, and audio quality.
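A minimal sketch of the reward side, assuming a weighted scalarization of the three signals (ASR intelligibility, speaker similarity, audio quality) and GRPO's group-relative advantage over samples drawn for the same prompt; the weights and rescalings are illustrative.

```python
import numpy as np

def composite_reward(cer, spk_sim, mos, w=(0.5, 0.3, 0.2)):
    """Scalarize the multi-objective reward: ASR intelligibility (1 - CER),
    speaker-verification similarity, and a MOS-like quality score rescaled
    to [0, 1]. Weights are illustrative."""
    return w[0] * (1.0 - cer) + w[1] * spk_sim + w[2] * (mos - 1.0) / 4.0

def group_relative_advantages(rewards):
    """GRPO's core step: normalize rewards within the group of samples drawn
    for the same prompt, avoiding a learned value model."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# four sampled utterances for one prompt, scored by frozen reward models
samples = [(0.08, 0.71, 4.1), (0.31, 0.64, 3.2),
           (0.05, 0.77, 4.4), (0.22, 0.69, 3.6)]
print(group_relative_advantages([composite_reward(*s) for s in samples]))
```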
[467] Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts
Ammar Ahmed, Azal Ahmad Khan, Ayaan Ahmad, Sheng Di, Zirui Liu, Ali Anwar
Main category: cs.AI
TL;DR: RoT (Retrieval-of-Thought) reuses prior reasoning steps as composable “thoughts” to guide new problems, reducing output tokens by up to 40%, latency by 82%, and cost by 59% while maintaining accuracy.
Details
Motivation: Large reasoning models produce long reasoning traces that inflate latency and cost, creating a need for inference-time efficiency improvements.
Method: Organizes reasoning steps into a thought graph with sequential and semantic edges, retrieves query-relevant nodes, and applies reward-guided traversal to assemble problem-specific templates that guide generation.
Result: Substantial efficiency gains with small prompt growth: up to 40% reduction in output tokens, 82% reduction in inference latency, and 59% reduction in cost while maintaining accuracy.
Conclusion: RoT establishes a scalable paradigm for efficient large reasoning model reasoning via dynamic template construction through retrieval.
Abstract: Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable “thought” steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
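The retrieval-plus-traversal step can be approximated by a greedy walk on a small thought graph: start at the node most similar to the query, then repeatedly follow the highest-reward unvisited successor. The data structures and the greedy rule below simplify the paper's reward-guided traversal.

```python
import numpy as np

def retrieve_template(query_vec, nodes, edges, rewards, max_len=4):
    """Greedy sketch of reward-guided traversal. `nodes` maps id -> embedding;
    `edges` maps id -> successor ids; `rewards` maps id -> node reward."""
    sims = {i: float(v @ query_vec) for i, v in nodes.items()}
    cur = max(sims, key=sims.get)                  # query-relevant entry node
    path = [cur]
    while len(path) < max_len:
        nxt = [n for n in edges.get(cur, []) if n not in path]
        if not nxt:
            break
        cur = max(nxt, key=lambda n: rewards[n])   # reward-guided step
        path.append(cur)
    return path                    # steps then assembled into a prompt template

rng = np.random.default_rng(1)
nodes = {i: rng.normal(size=16) for i in range(5)}
edges = {0: [1, 2], 1: [3], 2: [3, 4], 3: [4]}     # sequential/semantic links
rewards = {i: float(rng.random()) for i in range(5)}
print(retrieve_template(rng.normal(size=16), nodes, edges, rewards))
```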
[468] Reimagining Agent-based Modeling with Large Language Model Agents via Shachi
So Kuroki, Yingtao Tian, Kou Misaki, Takashi Ikegami, Takuya Akiba, Yujin Tang
Main category: cs.AI
TL;DR: Shachi is a formal methodology and modular framework for analyzing emergent behaviors in LLM-driven multi-agent systems by decomposing agent policies into cognitive components (Configuration, Memory, Tools) orchestrated by LLM reasoning.
Details
Motivation: To address the lack of principled methodologies for controlled experimentation in studying emergent behaviors in large language model-driven multi-agent systems.
Method: Introduces Shachi framework that decomposes agent policies into core cognitive components: Configuration (intrinsic traits), Memory (contextual persistence), and Tools (expanded capabilities), all orchestrated by an LLM reasoning engine.
Result: Validated on 10-task benchmark; demonstrated external validity by modeling real-world U.S. tariff shock showing agent behaviors align with observed market reactions only when properly configured with memory and tools.
Conclusion: Provides a rigorous, open-source foundation for building and evaluating LLM agents to foster more cumulative and scientifically grounded research in multi-agent systems.
Abstract: The study of emergent behaviors in large language model (LLM)-driven multi-agent systems is a critical research challenge, yet progress is limited by a lack of principled methodologies for controlled experimentation. To address this, we introduce Shachi, a formal methodology and modular framework that decomposes an agent’s policy into core cognitive components: Configuration for intrinsic traits, Memory for contextual persistence, and Tools for expanded capabilities, all orchestrated by an LLM reasoning engine. This principled architecture moves beyond brittle, ad-hoc agent designs and enables the systematic analysis of how specific architectural choices influence collective behavior. We validate our methodology on a comprehensive 10-task benchmark and demonstrate its power through novel scientific inquiries. Critically, we establish the external validity of our approach by modeling a real-world U.S. tariff shock, showing that agent behaviors align with observed market reactions only when their cognitive architecture is appropriately configured with memory and tools. Our work provides a rigorous, open-source foundation for building and evaluating LLM agents, aimed at fostering more cumulative and scientifically grounded research.
[469] Lifelong Learning with Behavior Consolidation for Vehicle Routing
Jiyuan Pei, Yi Mei, Jialin Liu, Mengjie Zhang, Xin Yao
Main category: cs.AI
TL;DR: Proposes LLR-BC, a lifelong learning framework for neural VRP solvers that prevents catastrophic forgetting while learning new tasks sequentially.
Details
Motivation: Existing neural solvers struggle with new tasks: either poor zero-shot generalization or catastrophic forgetting when fine-tuning. Need a lifelong learning approach for sequential tasks with diverse distributions and scales.
Method: LLR-BC framework that consolidates prior knowledge by aligning behaviors of the new-task solver with buffered ones in a decision-seeking way, with greater weights for low-confidence decisions.
Result: Extensive experiments on CVRP and TSP show LLR-BC effectively trains high-performance neural solvers, addresses catastrophic forgetting, maintains plasticity, and improves zero-shot generalization.
Conclusion: LLR-BC provides an effective lifelong learning solution for neural routing problem solvers, enabling continuous learning without forgetting previous knowledge.
Abstract: Recent neural solvers have demonstrated promising performance in learning to solve routing problems. However, existing studies are primarily based on one-off training on one or a set of predefined problem distributions and scales, i.e., tasks. When a new task arises, they typically rely on either zero-shot generalization, which may be poor due to the discrepancies between the new task and the training task(s), or fine-tuning the pretrained solver on the new task, which possibly leads to catastrophic forgetting of knowledge acquired from previous tasks. This paper explores a novel lifelong learning paradigm for neural VRP solvers, where multiple tasks with diverse distributions and scales arise sequentially over time. Solvers are required to effectively and efficiently learn to solve new tasks while maintaining their performance on previously learned tasks. Consequently, a novel framework called Lifelong Learning Router with Behavior Consolidation (LLR-BC) is proposed. LLR-BC consolidates prior knowledge effectively by aligning behaviors of the solver trained on a new task with the buffered ones in a decision-seeking way. To encourage more focus on crucial experiences, LLR-BC assigns greater consolidated weights to decisions with lower confidence. Extensive experiments on capacitated vehicle routing problems and traveling salesman problems demonstrate LLR-BC’s effectiveness in training high-performance neural solvers in a lifelong learning setting, addressing the catastrophic forgetting issue, maintaining their plasticity, and improving zero-shot generalization ability.
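The consolidation idea, aligning the new-task solver's action distributions with buffered decisions while weighting low-confidence decisions more heavily, can be sketched as a weighted KL term added to the new-task loss. The exact weighting in LLR-BC may differ.

```python
import torch
import torch.nn.functional as F

def consolidation_loss(new_logits, buffered_logits):
    """Align the current solver's action distributions with buffered
    decisions from earlier tasks, weighting low-confidence buffered
    decisions more heavily (the paper's exact weighting may differ)."""
    p_old = F.softmax(buffered_logits, dim=-1)
    logp_new = F.log_softmax(new_logits, dim=-1)
    kl = (p_old * (p_old.clamp_min(1e-12).log() - logp_new)).sum(-1)
    weight = 1.0 - p_old.max(dim=-1).values   # focus on uncertain decisions
    return (weight * kl).mean()

buffered = torch.randn(32, 50)                # 32 buffered routing decisions
current = torch.randn(32, 50, requires_grad=True)
loss = consolidation_loss(current, buffered)  # added to the new-task loss
loss.backward()
```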
[470] CoBel-World: Harnessing LLM Reasoning to Build a Collaborative Belief World for Optimizing Embodied Multi-Agent Collaboration
Zhimin Wang, Shaokang He, Duo Wu, Jinghe Wang, Linjia Kang, Jing Yu, Zhi Wang
Main category: cs.AI
TL;DR: CoBel-World is a framework that enables LLM agents to model collaborators’ mental states and the environment, allowing for proactive miscoordination detection and adaptive communication, significantly improving collaboration efficiency.
Details
Motivation: Existing LLM collaboration frameworks lack dynamic intent inference, leading to inconsistent plans and redundant communication in partially observable environments.
Method: CoBel-World uses a collaborative belief world with symbolic belief language for structured knowledge representation and zero-shot Bayesian-style belief updates through LLM reasoning.
Result: Reduced communication costs by 22-60% and improved task completion efficiency by 4-28% on TDW-MAT and C-WAH benchmarks compared to strongest baselines.
Conclusion: Explicit, intent-aware belief modeling is essential for efficient and human-like collaboration in LLM-based multi-agent systems.
Abstract: Effective real-world multi-agent collaboration requires not only accurate planning but also the ability to reason about collaborators’ intents – a crucial capability for avoiding miscoordination and redundant communication under partial observable environments. Due to their strong planning and reasoning capabilities, large language models (LLMs) have emerged as promising autonomous agents for collaborative task solving. However, existing collaboration frameworks for LLMs overlook their reasoning potential for dynamic intent inference, and thus produce inconsistent plans and redundant communication, reducing collaboration efficiency. To bridge this gap, we propose CoBel-World, a novel framework that equips LLM agents with a collaborative belief world – an internal representation jointly modeling the physical environment and collaborators’ mental states. CoBel-World enables agents to parse open-world task knowledge into structured beliefs via a symbolic belief language, and perform zero-shot Bayesian-style belief updates through LLM reasoning. This allows agents to proactively detect potential miscoordination (e.g., conflicting plans) and communicate adaptively. Evaluated on challenging embodied benchmarks (i.e., TDW-MAT and C-WAH), CoBel-World significantly reduces communication costs by 22-60% and improves task completion efficiency by 4-28% compared to the strongest baseline. Our results show that explicit, intent-aware belief modeling is essential for efficient and human-like collaboration in LLM-based multi-agent systems.
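A toy rendering of the collaborative belief world: beliefs are symbolic (subject, predicate, value) triples with confidences, repeated observations reinforce them, contradictions overwrite them, and a conflict check flags potential miscoordination worth communicating about. The update rule stands in for the paper's LLM-driven Bayesian-style update.

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    """One entry of the belief world: a symbolic statement about the
    environment or a collaborator's plan, with a confidence."""
    subject: str
    predicate: str
    value: str
    confidence: float = 0.5

@dataclass
class BeliefWorld:
    beliefs: dict = field(default_factory=dict)

    def update(self, b: Belief):
        """Toy Bayesian-style update: a repeated observation reinforces the
        belief; a contradicting one overwrites it."""
        key = (b.subject, b.predicate)
        old = self.beliefs.get(key)
        if old and old.value == b.value:
            b.confidence = min(1.0, old.confidence + 0.5 * (1 - old.confidence))
        self.beliefs[key] = b

    def conflicts(self, other):
        """Potential miscoordination: same key, different value."""
        return [k for k, v in self.beliefs.items()
                if k in other.beliefs and other.beliefs[k].value != v.value]

mine, theirs = BeliefWorld(), BeliefWorld()
mine.update(Belief("agent_2", "target_room", "kitchen", 0.7))
theirs.update(Belief("agent_2", "target_room", "bedroom", 0.9))
print(mine.conflicts(theirs))   # conflict found -> communicate adaptively
```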
[471] UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen
Main category: cs.AI
TL;DR: UltraHorizon is a new benchmark for evaluating autonomous agents in long-horizon, partially observable tasks, revealing significant performance gaps between LLM-agents and humans despite heavy scaling.
Details
Motivation: Existing benchmarks focus on short-horizon, fully observable tasks, but real-world challenges like software development and scientific discovery require sustained reasoning, planning, memory management, and tool use in long-horizon scenarios.
Method: Introduced UltraHorizon benchmark with exploration tasks across three environments where agents must iteratively uncover hidden rules through sustained reasoning, planning, memory management, and tool interactions. Tasks involve 35k-200k+ tokens and 60-400+ tool calls.
Result: LLM-agents consistently underperform in long-horizon settings while human participants achieve higher scores. Simple scaling fails to improve performance. Analysis identified eight error types attributed to in-context locking and functional capability gaps.
Conclusion: There’s a persistent gap in agents’ long-horizon abilities that current scaling approaches cannot overcome, highlighting the need for benchmarks that capture complex real-world challenges.
Abstract: Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are evaluated on long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents’ long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and fundamental gaps in functional capabilities. Our code will be available at https://github.com/StarDewXXX/UltraHorizon
[472] Log2Plan: An Adaptive GUI Automation Framework Integrated with Task Mining Approach
Seoyoung Lee, Seonbin Yoon, Seongbeen Lee, Hyesoo Kim, Joo Yong Sim
Main category: cs.AI
TL;DR: Log2Plan is a GUI automation system that combines structured two-level planning with task mining from user behavior logs to overcome limitations of existing LLM/VLM-based agents, achieving robust performance on complex tasks.
Details
Motivation: Existing LLM/VLM-based GUI automation agents suffer from brittle generalization, high latency, limited long-horizon coherence, and fragility under UI changes or complex tasks due to single-shot reasoning or static plans.
Method: Combines structured two-level planning framework with task mining from user behavior logs. Constructs high-level plans by mapping commands to task dictionary, then grounds them into low-level action sequences using real-time GUI context. Uses task mining to identify user-specific patterns for personalization.
Result: Evaluated on 200 real-world tasks, showing significant improvements in task success rate and execution time. Maintains over 60.0% success rate on long-horizon task sequences, demonstrating robustness in complex multi-step workflows.
Conclusion: Log2Plan provides a robust and adaptable GUI automation solution that effectively handles complex tasks and UI variations through its two-level planning and task mining approach.
Abstract: GUI task automation streamlines repetitive tasks, but existing LLM or VLM-based planner-executor agents suffer from brittle generalization, high latency, and limited long-horizon coherence. Their reliance on single-shot reasoning or static plans makes them fragile under UI changes or complex tasks. Log2Plan addresses these limitations by combining a structured two-level planning framework with a task mining approach over user behavior logs, enabling robust and adaptable GUI automation. Log2Plan constructs high-level plans by mapping user commands to a structured task dictionary, enabling consistent and generalizable automation. To support personalization and reuse, it employs a task mining approach from user behavior logs that identifies user-specific patterns. These high-level plans are then grounded into low-level action sequences by interpreting real-time GUI context, ensuring robust execution across varying interfaces. We evaluated Log2Plan on 200 real-world tasks, demonstrating significant improvements in task success rate and execution time. Notably, it maintains over 60.0% success rate even on long-horizon task sequences, highlighting its robustness in complex, multi-step workflows.
[473] Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety
Junliang Liu, Jingyu Xiao, Wenxin Tang, Wenxuan Wang, Zhixian Wang, Minrui Zhang, Shuanghe Yu
Main category: cs.AI
TL;DR: WebRSSBench is a comprehensive benchmark for evaluating multimodal LLMs on web understanding tasks, focusing on reasoning, robustness, and safety across 8 tasks with 3799 QA pairs from 729 websites.
Details
Motivation: Existing benchmarks focus too much on visual perception or UI code generation, lacking evaluation of reasoning, robustness, and safety capabilities needed for end-to-end web applications.
Method: Constructed benchmark from 729 websites with 3799 question-answer pairs across 8 tasks. Used standardized prompts, deterministic evaluation scripts, and multi-stage quality control with automatic checks and human verification.
Result: Evaluation of 12 MLLMs revealed significant gaps: models struggle with compositional and cross-element reasoning, show limited robustness to UI perturbations, and are overly conservative in safety-critical action recognition.
Conclusion: Current MLLMs have substantial limitations in web understanding capabilities, particularly in reasoning, robustness, and safety, highlighting the need for continued research in these areas.
Abstract: Multimodal large language models (MLLMs) are increasingly positioned as AI collaborators for building complex web-related applications like GUI agents and front-end code generation. However, existing benchmarks largely emphasize visual perception or UI code generation, offering insufficient evaluation of the reasoning, robustness, and safety capabilities required for end-to-end web applications. To bridge the gap, we introduce a comprehensive web understanding benchmark, named WebRSSBench, that jointly evaluates Reasoning, Robustness, and Safety across eight tasks, such as position relationship reasoning, color robustness, and safety-critical detection. The benchmark is constructed from 729 websites and contains 3799 question-answer pairs that probe multi-step inference over page structure, text, widgets, and safety-critical interactions. To ensure reliable measurement, we adopt standardized prompts, deterministic evaluation scripts, and multi-stage quality control combining automatic checks with targeted human verification. We evaluate 12 MLLMs on WebRSSBench. The results reveal significant gaps: models still struggle with compositional and cross-element reasoning over realistic layouts, show limited robustness when facing perturbations in user interfaces and content such as layout rearrangements or visual style shifts, and are rather conservative in recognizing and avoiding safety-critical or irreversible actions. Our code is available at https://github.com/jinliang-byte/webssrbench.
[474] D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents
Hongze Mi, Yibo Feng, Wenjie Lu, Yuqi Wang, Jinyuan Li, Song Cao, He Cui, Tengfei Tian, Xuelin Zhang, Haotian Luo, Di Sun, Naiqiang Tan, Gang Pan
Main category: cs.AI
TL;DR: D-Artemis is a novel deliberative GUI agent framework that uses Thinking-Alignment-Reflection cognitive loop with app-specific tip retrieval, pre-execution alignment, and post-execution reflection to achieve state-of-the-art performance on GUI automation tasks without requiring complex trajectory dataset training.
Details
Motivation: Current GUI agents face challenges including data bottlenecks in end-to-end training, high cost of delayed error detection, and risk of contradictory guidance, which limit their effectiveness in automating human tasks through user interaction emulation.
Method: D-Artemis employs a cognitive loop with three stages: Thinking (using app-specific tip retrieval), Alignment (with Thought-Action Consistency Check and Action Correction Agent for pre-execution validation), and Reflection (with Status Reflection Agent for post-execution learning). It enhances general-purpose MLLMs without requiring training on complex trajectory datasets.
Result: Achieves 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2 benchmarks, establishing new state-of-the-art results. Ablation studies confirm significant contributions from each framework component.
Conclusion: D-Artemis demonstrates strong generalization capabilities for GUI tasks by leveraging a deliberative cognitive framework that mitigates execution risks and enables strategic learning, while avoiding the need for extensive training on complex datasets.
Abstract: Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advancements, current approaches are hindered by several critical challenges: the data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D-Artemis, a novel deliberative framework, in this paper. D-Artemis leverages a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, where a Thought-Action Consistency (TAC) Check module and an Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) for GUI tasks without the need for training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results across both major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
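A minimal sketch of the Thinking-Alignment-Reflection loop described above; every callable (`think`, `check_tac`, `correct`, `execute`, `reflect`) is a hypothetical stand-in for the corresponding MLLM-backed module, not the paper's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)

def retrieve_tips(app, goal):
    # Hypothetical stand-in for the fine-grained, app-specific tip retrieval.
    tip_store = {"mail": ["Confirm the recipient field before tapping Send."]}
    return tip_store.get(app, [])

def deliberative_step(state, app, think, check_tac, correct, execute, reflect):
    """One Thinking -> Alignment -> Reflection iteration."""
    # Thinking: draft a thought and an action, conditioned on retrieved tips.
    thought, action = think(state, retrieve_tips(app, state.goal))
    # Alignment: pre-execution TAC check; on a mismatch the ACA revises the action.
    if not check_tac(thought, action):
        action = correct(thought, action)
    outcome = execute(action)
    # Reflection: the SRA turns the outcome into an experience note for later steps.
    state.history.append(reflect(thought, action, outcome))
    return state
```

The pre-execution check is what shifts error handling from costly post-hoc recovery to cheap validation before the action ever touches the device.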
[475] ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration
Gaole Dai, Shiqi Jiang, Ting Cao, Yuqing Yang, Yuanchun Li, Rui Tan, Mo Li, Lili Qiu
Main category: cs.AI
TL;DR: ProRe is a proactive reward system for GUI agents that uses a reasoner to schedule targeted state probing tasks and domain-specific evaluator agents to collect additional observations, improving reward accuracy and agent performance.
Details
Motivation: Existing reward methods struggle with GUI agents due to lack of ground-truth trajectories or application databases, and static trajectory-based approaches have limited accuracy.Method: ProRe uses a general-purpose reasoner to schedule targeted state probing tasks, which domain-specific evaluator agents execute by actively interacting with the environment to collect additional observations for more accurate reward assignment.
Result: Empirical results on over 3K trajectories show ProRe improves reward accuracy and F1 score by up to 5.3% and 19.4% respectively, and integration with state-of-the-art policy agents yields up to 22.4% success rate improvement.
Conclusion: ProRe effectively addresses the limitations of existing reward methods for GUI agents by enabling proactive environment interaction and observation collection, leading to significantly improved reward accuracy and agent performance.
Abstract: Reward is critical to the evaluation and training of large language models (LLMs). However, existing rule-based or model-based reward methods struggle to generalize to GUI agents, where access to ground-truth trajectories or application databases is often unavailable, and static trajectory-based LLM-as-a-Judge approaches suffer from limited accuracy. To address these challenges, we propose ProRe, a proactive reward system that leverages a general-purpose reasoner and domain-specific evaluator agents (actors). The reasoner schedules targeted state probing tasks, which the evaluator agents then execute by actively interacting with the environment to collect additional observations. This enables the reasoner to assign more accurate and verifiable rewards to GUI agents. Empirical results on over 3K trajectories demonstrate that ProRe improves reward accuracy and F1 score by up to 5.3% and 19.4%, respectively. Furthermore, integrating ProRe with state-of-the-art policy agents yields a success rate improvement of up to 22.4%.
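The reasoner-actor collaboration reduces to a short control loop. The sketch below assumes hypothetical `reasoner` and `evaluators` objects; the method names are illustrative, not ProRe's API:

```python
def proactive_reward(trajectory, reasoner, evaluators, env):
    """Assign a reward after actively probing the environment.

    Assumed interfaces (illustrative only):
      reasoner.schedule(trajectory)      -> list of (domain, probe_task) pairs
      evaluators[domain].run(probe, env) -> an extra observation
      reasoner.judge(trajectory, obs)    -> scalar reward
    """
    observations = []
    # The reasoner decides which states to probe instead of judging statically.
    for domain, probe in reasoner.schedule(trajectory):
        # A domain-specific evaluator agent interacts with the live environment
        # to collect evidence that the static trajectory does not contain.
        observations.append(evaluators[domain].run(probe, env))
    # The reward is assigned only once the extra evidence is in hand.
    return reasoner.judge(trajectory, observations)
```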
[476] DS-STAR: Data Science Agent via Iterative Planning and Verification
Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Tomas Pfister
Main category: cs.AI
TL;DR: DS-STAR is a novel data science agent that overcomes LLM limitations in handling heterogeneous data formats and generating optimal analysis plans through automated data exploration, iterative plan verification, and sequential refinement.
Details
Motivation: Data science tasks are complex and involve exploring multiple data sources, but LLMs struggle with heterogeneous data formats and generating sufficient analysis plans without ground-truth labels for open-ended tasks.Method: DS-STAR features: (1) data file analysis module for exploring diverse data formats, (2) LLM-based judge for verifying analysis plan sufficiency at each stage, and (3) sequential planning mechanism that starts simple and iteratively refines plans based on feedback.
Result: DS-STAR achieves state-of-the-art performance across DABStep, KramaBench, and DA-Code benchmarks, particularly outperforming baselines on hard tasks requiring processing multiple data files with heterogeneous formats.
Conclusion: The iterative refinement approach allows DS-STAR to reliably navigate complex analyses involving diverse data sources, demonstrating superior performance in automated data science tasks.
Abstract: Data science, which transforms raw data into actionable insights, is critical for data-driven decision-making. However, these tasks are often complex, involving steps for exploring multiple data sources and synthesizing findings to deliver insightful answers. While large language models (LLMs) show significant promise in automating this process, they often struggle with heterogeneous data formats and generate sub-optimal analysis plans, as verifying plan sufficiency is inherently difficult without ground-truth labels for such open-ended tasks. To overcome these limitations, we introduce DS-STAR, a novel data science agent. Specifically, DS-STAR makes three key contributions: (1) a data file analysis module that automatically explores and extracts context from diverse data formats, including unstructured types; (2) a verification step where an LLM-based judge evaluates the sufficiency of the analysis plan at each stage; and (3) a sequential planning mechanism that starts with a simple, executable plan and iteratively refines it based on the DS-STAR’s feedback until its sufficiency is verified. This iterative refinement allows DS-STAR to reliably navigate complex analyses involving diverse data sources. Our experiments show that DS-STAR achieves state-of-the-art performance across three challenging benchmarks: DABStep, KramaBench, and DA-Code. Moreover, DS-STAR particularly outperforms baselines on hard tasks that require processing multiple data files with heterogeneous formats.
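The plan-verify-refine cycle can be summarized in a few lines. All callables below (`analyzer`, `planner`, `executor`, `judge`) are hypothetical stand-ins for DS-STAR's LLM components, and `verdict` is assumed to expose a sufficiency flag and feedback:

```python
def ds_star(question, files, analyzer, planner, executor, judge, max_rounds=8):
    """Iterative plan-verify-refine loop over heterogeneous data files."""
    context = [analyzer(f) for f in files]        # explore each data file first
    plan = planner.initial(question, context)     # start simple and executable
    result = None
    for _ in range(max_rounds):
        result = executor(plan, files)
        verdict = judge(question, plan, result)   # LLM judge checks sufficiency
        if verdict.sufficient:
            break
        plan = planner.refine(plan, verdict.feedback)  # extend the plan, retry
    return result
```

Starting from the simplest executable plan and growing it only when the judge flags insufficiency is what keeps the loop tractable on open-ended tasks without ground-truth labels.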
[477] DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents
Yansong Ning, Rui Liu, Jun Wang, Kai Chen, Wei Li, Jun Fang, Kan Zheng, Naiqiang Tan, Hao Liu
Main category: cs.AI
TL;DR: DeepTravel is an end-to-end agentic reinforcement learning framework that enables autonomous travel planning agents to plan, execute tools, and reflect on responses for multi-step reasoning, outperforming frontier LLMs.
Details
Motivation: Existing travel planning agents rely on hand-crafted prompts and fixed workflows, limiting flexibility and autonomy. The paper aims to create a more autonomous and flexible travel planning agent.Method: The framework includes: 1) A sandbox environment with cached transportation, accommodation, and POI data; 2) Hierarchical reward modeling with trajectory-level and turn-level verifiers; 3) Reply-augmented reinforcement learning with failure experience replay.
Result: DeepTravel enables small LLMs (Qwen3 32B) to significantly outperform frontier LLMs like OpenAI o1, o3 and DeepSeek R1 in travel planning tasks, as demonstrated through online and offline evaluations on DiDi Enterprise Solutions App.
Conclusion: The proposed DeepTravel framework successfully creates autonomous travel planning agents capable of flexible planning and reasoning, achieving superior performance compared to existing large language models.
Abstract: Travel planning (TP) agents have recently emerged as building blocks that interact with external tools and resources for travel itinerary generation, ensuring an enjoyable user experience. Despite these benefits, existing studies rely on hand-crafted prompts and fixed agent workflows, hindering more flexible and autonomous TP agents. This paper proposes DeepTravel, an end-to-end agentic reinforcement learning framework for building an autonomous travel planning agent, capable of autonomously planning, executing tools, and reflecting on tool responses to explore, verify, and refine intermediate actions in multi-step reasoning. To achieve this, we first construct a robust sandbox environment by caching transportation, accommodation, and POI data, facilitating TP agent training without being constrained by real-world API limitations (e.g., inconsistent outputs). Moreover, we develop a hierarchical reward modeling system, where a trajectory-level verifier first checks spatiotemporal feasibility and filters out unsatisfactory itineraries, and a turn-level verifier then validates itinerary details against tool responses, enabling efficient and precise reward service. Finally, we propose the reply-augmented reinforcement learning method that enables the TP agent to periodically replay from a failure experience buffer, yielding notable agentic capacity. We deploy the trained TP agent on the DiDi Enterprise Solutions App and conduct comprehensive online and offline evaluations, demonstrating that DeepTravel enables small LLMs (e.g., Qwen3 32B) to significantly outperform frontier LLMs such as OpenAI o1, o3, and DeepSeek R1 on travel planning tasks.
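A compact sketch of the hierarchical reward described above, assuming hypothetical `feasible` and `consistent` verifier callables; the exact scoring rule is an illustration, not the paper's formula:

```python
def hierarchical_reward(itinerary, tool_log, feasible, consistent):
    """Two-level reward: a trajectory-level gate, then turn-level grading.

    feasible(itinerary)        -> bool, spatiotemporal feasibility check
    consistent(turn, tool_log) -> bool, does this turn match tool responses?
    """
    # Trajectory-level verifier acts as a cheap filter: infeasible plans earn nothing.
    if not feasible(itinerary):
        return 0.0
    # Turn-level verifier grades each step's consistency with tool outputs.
    checks = [consistent(turn, tool_log) for turn in itinerary]
    return sum(checks) / len(checks) if checks else 0.0
```

Running the cheap trajectory-level check first means the more expensive per-turn validation is only paid for itineraries that are feasible at all.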
[478] TRACE: Learning to Compute on Graphs
Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu
Main category: cs.AI
TL;DR: TRACE introduces a new paradigm for learning to compute on graphs using a Hierarchical Transformer and function shift learning to overcome architectural mismatches in existing methods.
Details
Motivation: Current graph representation learning methods like MPNNs and Transformers are architecturally mismatched for computational tasks due to permutation-invariant aggregation, preventing them from capturing the position-aware, hierarchical nature of computation.Method: TRACE uses a Hierarchical Transformer that mirrors step-by-step computation flow and introduces function shift learning - predicting only the discrepancy between true global function and simple local approximation rather than the complex function directly.
Result: TRACE substantially outperforms all prior architectures across comprehensive benchmarks on electronic circuits, one of the most complex computational graph classes.
Conclusion: The architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning to compute on graphs.
Abstract: Learning to compute, the ability to model the functional behavior of a computational graph, is a fundamental challenge for graph representation learning. Yet the dominant paradigm, built on permutation-invariant aggregation, is architecturally mismatched for this task. This flawed assumption, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce \textbf{TRACE}, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce \textbf{function shift learning}, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the \textit{function shift}, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on electronic circuits, one of the most complex and economically critical classes of computational graphs. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning to compute on graphs.
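The function-shift objective amounts to regressing the residual between the global function and a local approximation. A minimal PyTorch sketch, assuming a regression setting with hypothetical `model` and `local_approx` callables:

```python
import torch
import torch.nn.functional as F

def function_shift_loss(model, local_approx, graph, y_true):
    """Train the network to predict only the residual ("function shift")."""
    y_local = local_approx(graph)    # cheap approximation assuming input independence
    shift_pred = model(graph)        # the network predicts the discrepancy only
    return F.mse_loss(shift_pred, y_true - y_local)

# At inference time the two parts recombine:
#   y_hat = local_approx(graph) + model(graph)
```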
[479] GenesisGeo: Technical Report
Minfeng Zhu, Zi Wang, Sizhe Ji, Zhengtong Du, Junming Ke, Xiao Deng, Zanlang Yin, Xiuqi Huang, Heyu Wang, Wei Chen
Main category: cs.AI
TL;DR: GenesisGeo is an automated theorem prover for Euclidean geometry that achieves IMO gold medal level performance through neuro-symbolic reasoning and significant speed improvements.
Details
Motivation: To develop an automated theorem prover capable of solving complex geometry problems at International Mathematical Olympiad (IMO) level, addressing the need for efficient geometric reasoning systems.Method: Combines symbolic deduction engine DDARN (accelerated 120x through theorem matching and C++ implementation) with neuro-symbolic approach using Qwen3-0.6B-Base model, plus dual-model ensemble for enhanced performance.
Result: Solves 24/30 problems (IMO silver level) with single model and 26/30 problems (IMO gold level) with dual-model ensemble on IMO-AG-30 benchmark; created large-scale dataset of 21.8M geometric problems including 3M+ with auxiliary constructions.
Conclusion: GenesisGeo demonstrates state-of-the-art performance in automated geometric theorem proving, achieving IMO competition levels through efficient neuro-symbolic integration and large-scale data processing.
Abstract: We present GenesisGeo, an automated theorem prover in Euclidean geometry. We have open-sourced a large-scale geometry dataset of 21.8 million geometric problems, over 3 million of which contain auxiliary constructions. Specifically, we accelerate the symbolic deduction engine DDARN by 120x through theorem matching, combined with a C++ implementation of its core components. Furthermore, we build our neuro-symbolic prover, GenesisGeo, upon Qwen3-0.6B-Base, which solves 24 of 30 problems (IMO silver medal level) in the IMO-AG-30 benchmark using a single model, and solves 26 problems (IMO gold medal level) with a dual-model ensemble.
[480] DyRo-MCTS: A Robust Monte Carlo Tree Search Approach to Dynamic Job Shop Scheduling
Ruiqi Chen, Yi Mei, Fangfang Zhang, Mengjie Zhang
Main category: cs.AI
TL;DR: Dynamic Robust MCTS (DyRo-MCTS) integrates action robustness estimation into Monte Carlo Tree Search to improve dynamic job shop scheduling by making decisions resilient to future job arrivals.
Details
Motivation: Existing offline scheduling policies are imperfect and vulnerable to disruptions from new job arrivals, requiring online planning methods that can handle incomplete problem information.Method: Proposes DyRo-MCTS which incorporates action robustness estimation into MCTS to guide scheduling decisions toward states that are both high-performing and easily adaptable to future disturbances.
Result: DyRo-MCTS significantly improves offline-learned policies with minimal additional planning time, consistently outperforms vanilla MCTS across various scenarios, and achieves sustainable performance gains under disturbances.
Conclusion: Integrating robustness estimation into online planning enables more resilient scheduling decisions that maintain long-term performance despite dynamic disruptions in job shop environments.
Abstract: Dynamic job shop scheduling, a fundamental combinatorial optimisation problem in various industrial sectors, poses substantial challenges for effective scheduling due to frequent disruptions caused by the arrival of new jobs. State-of-the-art methods employ machine learning to learn scheduling policies offline, enabling rapid responses to dynamic events. However, these offline policies are often imperfect, necessitating the use of planning techniques such as Monte Carlo Tree Search (MCTS) to improve performance at online decision time. The unpredictability of new job arrivals complicates online planning, as decisions based on incomplete problem information are vulnerable to disturbances. To address this issue, we propose the Dynamic Robust MCTS (DyRo-MCTS) approach, which integrates action robustness estimation into MCTS. DyRo-MCTS guides the production environment toward states that not only yield good scheduling outcomes but are also easily adaptable to future job arrivals. Extensive experiments show that DyRo-MCTS significantly improves the performance of offline-learned policies with negligible additional online planning time. Moreover, DyRo-MCTS consistently outperforms vanilla MCTS across various scheduling scenarios. Further analysis reveals that its ability to make robust scheduling decisions leads to long-term, sustainable performance gains under disturbances.
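One plausible way to integrate action robustness into MCTS selection is an extra bonus term in the UCT score. The sketch below illustrates the general idea rather than the paper's exact estimator; `ch.robustness` is assumed to be a precomputed estimate in [0, 1] of how well the child's partial schedule tolerates sampled future job arrivals, and `c_robust` is an illustrative weight:

```python
import math

def dyro_uct_select(children, c_explore=1.4, c_robust=0.5):
    """Pick the child maximizing value + exploration + robustness bonus."""
    total = sum(ch.visits for ch in children)

    def score(ch):
        exploit = ch.value_sum / max(ch.visits, 1)
        explore = c_explore * math.sqrt(math.log(total + 1) / max(ch.visits, 1))
        return exploit + explore + c_robust * ch.robustness

    return max(children, key=score)
```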
[481] Outlier Detection in Plantar Pressure: Human-Centered Comparison of Statistical Parametric Mapping and Explainable Machine Learning
Carlo Dindorf, Jonas Dully, Steven Simon, Dennis Perchthaler, Stephan Becker, Hannah Ehmann, Kjell Heitmann, Bernd Stetter, Christian Diers, Michael Fröhlich
Main category: cs.AI
TL;DR: This study compares Statistical Parametric Mapping (SPM) and explainable machine learning for outlier detection in plantar pressure data, finding ML outperforms SPM while both provide interpretable results.
Details
Motivation: Plantar pressure datasets often contain outliers from technical errors or procedural inconsistencies, but existing SPM methods are sensitive to alignment and their outlier detection capabilities are unclear.Method: Used 798 valid samples and 2000 outliers from multiple centers, comparing (i) non-parametric registration-dependent SPM and (ii) CNN with SHAP explanations, evaluated via nested cross-validation and expert surveys.
Result: ML model achieved high accuracy and outperformed SPM, which misclassified clinically meaningful variations and missed true outliers. Both SPM and SHAP explanations were perceived as clear and trustworthy by experts.
Conclusion: SPM and explainable ML have complementary potential for automated outlier detection in plantar pressure data, with explainability being crucial for translating complex model outputs into actionable insights.
Abstract: Plantar pressure mapping is essential in clinical diagnostics and sports science, yet large heterogeneous datasets often contain outliers from technical errors or procedural inconsistencies. Statistical Parametric Mapping (SPM) provides interpretable analyses but is sensitive to alignment, and its capacity for robust outlier detection remains unclear. This study compares an SPM approach with an explainable machine learning (ML) approach to establish transparent quality-control pipelines for plantar pressure datasets. Data from multiple centers were annotated by expert consensus and enriched with synthetic anomalies, resulting in 798 valid samples and 2000 outliers. We evaluated (i) a non-parametric, registration-dependent SPM approach and (ii) a convolutional neural network (CNN), explained using SHapley Additive exPlanations (SHAP). Performance was assessed via nested cross-validation; explanation quality via a semantic differential survey with domain experts. The ML model reached high accuracy and outperformed SPM, which misclassified clinically meaningful variations and missed true outliers. Experts perceived both SPM and SHAP explanations as clear, useful, and trustworthy, though SPM was rated as less complex. These findings highlight the complementary potential of SPM and explainable ML as approaches for automated outlier detection in plantar pressure data, and underscore the importance of explainability in translating complex model outputs into interpretable insights that can effectively inform decision-making.
[482] From Grunts to Lexicons: Emergent Language from Cooperative Foraging
Maytus Piriyajitakonkij, Rujikorn Charakorn, Weicheng Tao, Wei Pan, Mingfei Sun, Cheston Tan, Mengmi Zhang
Main category: cs.AI
TL;DR: This paper investigates how language emerges in multi-agent foraging games using deep reinforcement learning, finding that agents develop communication protocols with key features of natural language.
Details
Motivation: To understand how language evolves from ecological and social demands of cooperation, inspired by linguistic and anthropological theories about language origins.Method: Used end-to-end deep reinforcement learning in multi-agent foraging games where agents operate in a shared grid world with partial knowledge and must coordinate to complete tasks.
Result: Agents developed communication protocols exhibiting hallmark features of natural language: arbitrariness, interchangeability, displacement, cultural transmission, and compositionality.
Conclusion: The framework provides a platform for studying language evolution from partial observability, temporal reasoning, and cooperative goals in embodied multi-agent settings.
Abstract: Language is a powerful communicative and cognitive tool. It enables humans to express thoughts, share intentions, and reason about complex phenomena. Despite our fluency in using and understanding language, the question of how it arises and evolves over time remains unsolved. A leading hypothesis in linguistics and anthropology posits that language evolved to meet the ecological and social demands of early human cooperation. Language did not arise in isolation, but through shared survival goals. Inspired by this view, we investigate the emergence of language in multi-agent Foraging Games. These environments are designed to reflect the cognitive and ecological constraints believed to have influenced the evolution of communication. Agents operate in a shared grid world with only partial knowledge about other agents and the environment, and must coordinate to complete games like picking up high-value targets or executing temporally ordered actions. Using end-to-end deep reinforcement learning, agents learn both actions and communication strategies from scratch. We find that agents develop communication protocols with hallmark features of natural language: arbitrariness, interchangeability, displacement, cultural transmission, and compositionality. We quantify each property and analyze how different factors, such as population size, social dynamics, and temporal dependencies, shape specific aspects of the emergent language. Our framework serves as a platform for studying how language can evolve from partial observability, temporal reasoning, and cooperative goals in embodied multi-agent settings. We will release all data, code, and models publicly.
[483] RISK: A Framework for GUI Agents in E-commerce Risk Management
Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, Shuai Chen
Main category: cs.AI
TL;DR: RISK is a framework for building GUI agents for e-commerce risk management, featuring a dataset, benchmark, and reinforcement fine-tuning method that improves performance on multi-step web interactions.
Details
Motivation: Traditional scraping methods and existing GUI agents cannot handle the complex, multi-step, stateful interactions required for e-commerce risk management, particularly with dynamic, interactive web content.Method: RISK framework with three components: RISK-Data (dataset of interaction trajectories), RISK-Bench (evaluation benchmark), and RISK-R1 (reinforcement fine-tuning framework with format rewards, stepwise accuracy, process reweighting, and level reweighting).
Result: RISK-R1 outperforms baselines with 6.8% improvement in offline single-step and 8.8% improvement in offline multi-step tasks, achieving 70.5% task success rate in online evaluation.
Conclusion: RISK provides a scalable, domain-specific solution for automating complex web interactions, advancing e-commerce risk management capabilities.
Abstract: E-commerce risk management requires aggregating diverse, deeply embedded web data through multi-step, stateful interactions, which traditional scraping methods and most existing Graphical User Interface (GUI) agents cannot handle. These agents are typically limited to single-step tasks and lack the ability to manage dynamic, interactive content critical for effective risk assessment. To address this challenge, we introduce RISK, a novel framework designed to build and deploy GUI agents for this domain. RISK integrates three components: (1) RISK-Data, a dataset of 8,492 single-step and 2,386 multi-step interaction trajectories, collected through a high-fidelity browser framework and a meticulous data curation process; (2) RISK-Bench, a benchmark with 802 single-step and 320 multi-step trajectories across three difficulty levels for standardized evaluation; and (3) RISK-R1, an R1-style reinforcement fine-tuning framework considering four aspects: (i) Output Format: an updated format reward to enhance output syntactic correctness and task comprehension, (ii) Single-step Level: a stepwise accuracy reward to provide granular feedback during early training stages, (iii) Multi-step Level: process reweighting to emphasize critical later steps in interaction sequences, and (iv) Task Level: level reweighting to focus on tasks of varying difficulty. Experiments show that RISK-R1 outperforms existing baselines, achieving a 6.8% improvement in offline single-step and an 8.8% improvement in offline multi-step tasks. Moreover, it attains a top task success rate of 70.5% in online evaluation. RISK provides a scalable, domain-specific solution for automating complex web interactions, advancing the state of the art in e-commerce risk management.
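The four reward aspects combine naturally into one scalar. The field names, position weights, and mixing coefficients below are illustrative assumptions, not RISK-R1's actual reward definition:

```python
def risk_r1_reward(format_ok, steps, level, level_weights):
    """Combine format, stepwise, process-reweighted, and level-reweighted signals.

    format_ok     -> bool, output is syntactically valid and on-task
    steps         -> list of (is_correct, position_weight), later steps weighted up
    level_weights -> dict mapping difficulty level to a task weight
    """
    format_reward = 1.0 if format_ok else 0.0
    total_w = sum(w for _, w in steps) or 1.0
    # Stepwise accuracy, reweighted so critical later steps count for more.
    step_reward = sum(w for ok, w in steps if ok) / total_w
    return level_weights[level] * (0.2 * format_reward + 0.8 * step_reward)

# e.g. a 3-step episode with the last step weighted up and the first step wrong:
print(risk_r1_reward(True, [(False, 1.0), (True, 1.0), (True, 2.0)], "hard",
                     {"easy": 0.5, "medium": 1.0, "hard": 1.5}))  # 1.2
```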
[484] Bilinear relational structure fixes reversal curse and enables consistent model editing
Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, Meeyoung Cha
Main category: cs.AI
TL;DR: The reversal curse in language models is not inherent but stems from knowledge encoding. Training on relational knowledge graphs induces bilinear structure in representations, alleviating the curse and enabling consistent model editing.
Details
Motivation: To challenge the view that the reversal curse is a fundamental limitation of language models and investigate how knowledge representation affects logical consistency.Method: Train language models from scratch on synthetic relational knowledge graphs and analyze the emergence of bilinear relational structure in hidden representations.
Result: Models with bilinear structure can infer unseen reverse facts and propagate edits consistently to logically dependent facts, while models lacking this structure fail to generalize edits and introduce inconsistencies.
Conclusion: Bilinear internal representations enable logically consistent behavior in language models, and successful model editing depends on the underlying representational geometry of knowledge.
Abstract: The reversal curse – a language model’s (LM) inability to infer an unseen fact “B is A” from a learned fact “A is B” – is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. By training LMs from scratch on a synthetic dataset of relational knowledge graphs, we demonstrate that bilinear relational structure emerges in their hidden representations. This structure substantially alleviates the reversal curse, enabling LMs to infer unseen reverse facts. Crucially, we also find that this bilinear structure plays a key role in consistent model editing. When a fact is updated in a LM with this structure, the edit correctly propagates to its reverse and other logically dependent facts. In contrast, models lacking this representation not only suffer from the reversal curse but also fail to generalize edits, further introducing logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn enable LMs to behave in a logically consistent manner after editing. This implies that the success of model editing depends critically not just on editing algorithms but on the underlying representational geometry of the knowledge being modified.
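The bilinear structure in question can be made concrete with a generic bilinear relational score (a sketch of the kind of representation the paper reports emerging, not its exact parametrization). The transpose identity below is what makes reverse facts recoverable from forward ones:

```python
import torch

d, n_rel = 64, 8
W = torch.randn(n_rel, d, d, dtype=torch.double)  # one bilinear form per relation

def score(subj, rel, obj):
    """Bilinear relational score subj^T W_rel obj."""
    return subj @ W[rel] @ obj

# If forward facts (s, r, o) are scored by W_r, the reverse relation corresponds
# to the transposed form W_r^T, so reverse inference comes along for free:
s, o = torch.randn(d, dtype=torch.double), torch.randn(d, dtype=torch.double)
assert torch.allclose(score(s, 0, o), o @ W[0].T @ s)
```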
[485] GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments
Hanlin Zhu, Tianyu Guo, Song Mei, Stuart Russell, Nikhil Ghosh, Alberto Bietti, Jiantao Jiao
Main category: cs.AI
TL;DR: GSM-Agent benchmark tests LLM agentic reasoning by requiring models to solve grade-school math problems without premises, forcing proactive information gathering through tools. Even frontier models struggle (67% accuracy), revealing missing revisit patterns in agentic reasoning.
Details
Motivation: Current agent benchmarks mix agentic reasoning with other advanced capabilities, making it hard to isolate and evaluate pure agentic reasoning skills like tool use and proactive information gathering.Method: Created GSM-Agent benchmark where LLMs solve grade-school math problems but must proactively collect missing premises using tools. Proposed agentic reasoning graphs to analyze patterns and developed tool-augmented test-time scaling to encourage revisiting previously visited nodes.
Result: Even frontier models like GPT-5 only achieve 67% accuracy. Analysis revealed that the ability to revisit previously visited nodes, crucial for static reasoning, is often missing in agentic reasoning for many models.
Conclusion: The benchmark and agentic reasoning framework provide tools for better understanding and improving agentic reasoning capabilities in LLMs, with identified revisit patterns offering key insights for performance enhancement.
Abstract: As LLMs are increasingly deployed as agents, agentic reasoning - the ability to combine tool use, especially search, and reasoning - becomes a critical skill. However, it is hard to disentangle agentic reasoning when evaluated in complex environments and tasks. Current agent benchmarks often mix agentic reasoning with challenging math reasoning, expert-level knowledge, and other advanced capabilities. To fill this gap, we build a novel benchmark, GSM-Agent, where an LLM agent is required to solve grade-school-level reasoning problems, but is only presented with the question in the prompt without the premises that contain the necessary information to solve the task, and needs to proactively collect that information using tools. Although the original tasks are grade-school math problems, we observe that even frontier models like GPT-5 only achieve 67% accuracy. To understand and analyze the agentic reasoning patterns, we propose the concept of agentic reasoning graph: cluster the environment’s document embeddings into nodes, and map each tool call to its nearest node to build a reasoning path. Surprisingly, we identify that the ability to revisit a previously visited node, widely taken as a crucial pattern in static reasoning, is often missing for agentic reasoning for many models. Based on the insight, we propose a tool-augmented test-time scaling method to improve LLM’s agentic reasoning performance by adding tools to encourage models to revisit. We expect our benchmark and the agentic reasoning framework to aid future studies of understanding and pushing the boundaries of agentic reasoning.
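The agentic reasoning graph construction is straightforward to prototype: cluster document embeddings into nodes and map each tool call to its nearest node. A sketch using scikit-learn, with random vectors standing in for real embeddings and a simple revisit counter of the kind the analysis relies on:

```python
import numpy as np
from sklearn.cluster import KMeans

def reasoning_path(doc_embeddings, call_embeddings, n_nodes=10, seed=0):
    """Cluster documents into nodes; map each tool call to its nearest node."""
    km = KMeans(n_clusters=n_nodes, random_state=seed, n_init=10).fit(doc_embeddings)
    path = km.predict(call_embeddings)       # node visited by each tool call
    # A "revisit" returns to a node seen earlier, other than the previous step.
    revisits = sum(
        1 for i in range(1, len(path))
        if path[i] in path[:i] and path[i] != path[i - 1]
    )
    return path.tolist(), revisits

# Random vectors stand in for real document and tool-call embeddings:
rng = np.random.default_rng(0)
path, n_rev = reasoning_path(rng.normal(size=(200, 32)), rng.normal(size=(12, 32)))
```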
[486] Towards Agentic OS: An LLM Agent Framework for Linux Schedulers
Yusheng Zheng, Yanpeng Hu, Wei Zhang, Andi Quinn
Main category: cs.AI
TL;DR: SchedCP is a framework that enables autonomous LLM agents to optimize Linux schedulers by separating semantic reasoning from execution, achieving up to 1.79x performance improvement and 13x cost reduction.
Details
Motivation: To address the semantic gap in OS schedulers where kernel policies fail to understand application-specific needs, leading to suboptimal performance.Method: Uses a decoupled control plane architecture with MCP server providing workload analysis, scheduler policy repository, and execution verifier. Implements sched-agent multi-agent system for autonomous workload analysis and eBPF scheduling policy synthesis.
Result: Achieves up to 1.79x performance improvement and 13x cost reduction compared to naive agentic approaches while maintaining high success rate.
Conclusion: SchedCP democratizes expert-level system optimization and represents progress toward creating self-optimizing, application-aware operating systems.
Abstract: Operating system schedulers suffer from a fundamental semantic gap, where kernel policies fail to understand application-specific needs, leading to suboptimal performance. We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our core insight is that the challenge is not merely to apply a better LLM, but to architect a decoupled control plane that separates the AI’s role of semantic reasoning (“what to optimize”) from the system’s role of execution (“how to observe and act”). Implemented as a Model Context Protocol (MCP) server, SchedCP provides a stable interface with three key services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI-generated code and configuration with static and dynamic analysis before deployment. We demonstrate this architecture’s power with sched-agent, a multi-agent system that autonomously analyzes workloads, synthesizes custom eBPF scheduling policies, and deploys them via the sched_ext infrastructure. Our evaluation shows that SchedCP achieves up to a 1.79x performance improvement and a 13x cost reduction compared to naive agentic approaches, all while maintaining a high success rate. By bridging the semantic gap, SchedCP democratizes expert-level system optimization and represents a step towards creating truly self-optimizing, application-aware operating systems. The code is open-sourced in https://github.com/eunomia-bpf/schedcp
[487] The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging
Xiaochong Lan, Yu Zheng, Shiteng Cao, Yong Li
Main category: cs.AI
TL;DR: Model merging enables tunable LLMs with controllable reasoning depth vs. computational cost trade-offs, achieving Pareto improvements where merged models outperform parent models in both accuracy and efficiency.
Details
Motivation: Address the need for efficiently producing LLMs with tunable reasoning capabilities that balance reasoning depth and computational cost for real-world applications.Method: Conducted large-scale empirical study evaluating various model merging techniques across reasoning benchmarks, systematically varying merging strengths to construct accuracy-efficiency curves.
Result: Model merging effectively calibrates reasoning accuracy vs. token efficiency trade-off, even with divergent parent model weights. Achieved Pareto improvements where merged models had higher accuracy and lower token consumption than parents.
Conclusion: Model merging provides practical method for creating LLMs with specific reasoning profiles, offering comprehensive guidelines for meeting diverse application demands through tunable performance landscapes.
Abstract: The growing demand for large language models (LLMs) with tunable reasoning capabilities in many real-world applications highlights a critical need for methods that can efficiently produce a spectrum of models balancing reasoning depth and computational cost. Model merging has emerged as a promising, training-free technique to address this challenge by arithmetically combining the weights of a general-purpose model with a specialized reasoning model. While various merging techniques exist, their potential to create a spectrum of models with fine-grained control over reasoning abilities remains largely unexplored. This work presents a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. We systematically vary merging strengths to construct accuracy-efficiency curves, providing the first comprehensive view of the tunable performance landscape. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency, even when parent models have highly divergent weight spaces. Crucially, we identify instances of Pareto Improvement, where a merged model achieves both higher accuracy and lower token consumption than one of its parents. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles to meet diverse application demands.
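The simplest instance of the merging techniques studied is linear weight interpolation, with the merging strength as the tunable knob. A minimal sketch (the paper evaluates a range of techniques, of which this is only the most basic):

```python
def merge_state_dicts(base, reasoner, alpha):
    """Interpolate two checkpoints; alpha is the merging strength."""
    return {k: (1 - alpha) * base[k] + alpha * reasoner[k] for k in base}

# Sweeping alpha traces out the accuracy-efficiency curve:
#   for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
#       model.load_state_dict(merge_state_dicts(sd_general, sd_reasoning, alpha))
#       evaluate(model)   # record accuracy and tokens generated at each point
```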
[488] A2R: An Asymmetric Two-Stage Reasoning Framework for Parallel Reasoning
Ziqi Wang, Boye Niu, Zhongli Li, Linghui Meng, Jing Liu, Zhi Zheng, Tong Xu, Hua Wu, Haifeng Wang, Enhong Chen
Main category: cs.AI
TL;DR: A2R is an asymmetric two-stage reasoning framework that uses an explorer model to generate multiple solutions and a synthesizer model to refine them, significantly improving performance while reducing computational costs.
Details
Motivation: To bridge the gap between a model's potential capabilities (revealed across multiple solution paths) and its actual performance in single attempts, addressing the disparity between realized and inherent reasoning abilities.Method: A two-stage framework: (1) Explorer model generates potential solutions through parallel sampling, (2) Synthesizer model integrates references for refined reasoning. Features asymmetric scaling with smaller explorer and larger synthesizer models.
Result: Qwen3-8B-distill achieved 75% performance improvement over self-consistency baseline. A2R-Efficient (Qwen3-4B explorer + Qwen3-8B synthesizer) surpassed monolithic Qwen3-32B performance at 30% lower cost.
Conclusion: A2R is an effective plug-and-play framework that boosts performance while being computationally efficient for real-world applications.
Abstract: Recent Large Reasoning Models have achieved significant improvements in complex task-solving capabilities by allocating more computation at the inference stage with a “thinking longer” paradigm. Even as the foundational reasoning capabilities of models advance rapidly, the persistent gap between a model’s performance in a single attempt and its latent potential, often revealed only across multiple solution paths, starkly highlights the disparity between its realized and inherent capabilities. To address this, we present A2R, an Asymmetric Two-Stage Reasoning framework designed to explicitly bridge the gap between a model’s potential and its actual performance. In this framework, an “explorer” model first generates potential solutions in parallel through repeated sampling. Subsequently, a “synthesizer” model integrates these references for a more refined, second stage of reasoning. This two-stage process allows computation to be scaled orthogonally to existing sequential methods. Our work makes two key innovations: First, we present A2R as a plug-and-play parallel reasoning framework that explicitly enhances a model’s capabilities on complex questions. For example, using our framework, the Qwen3-8B-distill model achieves a 75% performance improvement compared to its self-consistency baseline. Second, through a systematic analysis of the explorer and synthesizer roles, we identify an effective asymmetric scaling paradigm. This insight leads to A2R-Efficient, a “small-to-big” variant that combines a Qwen3-4B explorer with a Qwen3-8B synthesizer. This configuration surpasses the average performance of a monolithic Qwen3-32B model at a nearly 30% lower cost. Collectively, these results show that A2R is not only a performance-boosting framework but also an efficient and practical solution for real-world applications.
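The asymmetric two-stage scheme is easy to express as a sampling-then-synthesis wrapper. `explorer` and `synthesizer` below are hypothetical generation callables (e.g., wrapping a small and a larger LLM), and the prompt format is an assumption:

```python
def a2r_answer(question, explorer, synthesizer, k=8):
    """Stage 1: sample k candidates cheaply; Stage 2: synthesize a refined answer."""
    # The explorer (a smaller model) samples candidate solutions in parallel.
    candidates = [explorer(question) for _ in range(k)]
    # The synthesizer (a larger model) reasons over the references rather than
    # merely voting, which is what distinguishes A2R from self-consistency.
    refs = "\n\n".join(f"[Candidate {i + 1}] {c}" for i, c in enumerate(candidates))
    return synthesizer(
        f"Question: {question}\n\nReference solutions:\n{refs}\n\n"
        "Synthesize a single refined final answer."
    )
```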
[489] Generalizing Multi-Objective Search via Objective-Aggregation Functions
Hadar Peer, Eyal Weiss, Ron Alterovitz, Oren Salzman
Main category: cs.AI
TL;DR: The paper presents a generalized multi-objective search formulation that uses aggregation functions to optimize solution objectives based on hidden search objectives, enabling standard MOS algorithms to handle complex robotics problems with improved performance.
Details
Motivation: Real-world robotic systems need to balance multiple conflicting objectives, but recent complex problem formulations prevent direct use of state-of-the-art multi-objective search algorithms.Method: Propose a generalized problem formulation using aggregation functions of hidden objectives, extend core operations of standard MOS algorithms to handle specific aggregation functions, and apply to diverse robotics planning problems.
Result: The extended MOS algorithms outperform vanilla versions by orders of magnitude across various robotics domains including navigation, manipulation, medical systems, inspection, and route planning.
Conclusion: The generalized formulation with aggregation functions enables effective application of standard MOS algorithms to complex robotics problems, significantly improving performance while maintaining algorithm compatibility.
Abstract: Multi-objective search (MOS) has become essential in robotics, as real-world robotic systems need to simultaneously balance multiple, often conflicting objectives. Recent works explore complex interactions between objectives, leading to problem formulations that do not allow the usage of out-of-the-box state-of-the-art MOS algorithms. In this paper, we suggest a generalized problem formulation that optimizes solution objectives via aggregation functions of hidden (search) objectives. We show that our formulation supports the application of standard MOS algorithms, necessitating only that several core operations be properly extended to reflect the specific aggregation functions employed. We demonstrate our approach on several diverse robotics planning problems, spanning motion planning for navigation, manipulation, and medical systems under obstacle uncertainty, as well as inspection planning and route planning with different road types. We solve the problems using state-of-the-art MOS algorithms after properly extending their core operations, and provide empirical evidence that they outperform by orders of magnitude the vanilla versions of the algorithms applied to the same problems without objective aggregation.
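The key extension to standard MOS algorithms is that dominance checks operate on aggregated solution objectives rather than directly on the hidden search objectives. A minimal sketch, assuming minimization and purely illustrative aggregators:

```python
def dominates(a, b, aggregators):
    """Pareto dominance on aggregated solution objectives (minimization)."""
    fa = [g(a) for g in aggregators]
    fb = [g(b) for g in aggregators]
    return all(x <= y for x, y in zip(fa, fb)) and any(x < y for x, y in zip(fa, fb))

# Hidden search objectives (road_time, offroad_time, risk) aggregated into two
# solution objectives: total travel time and risk.
aggs = [lambda h: h[0] + h[1], lambda h: h[2]]
print(dominates((3, 2, 0.1), (4, 2, 0.3), aggs))  # True: better on both
```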
[490] Ground-Truthing AI Energy Consumption: Validating CodeCarbon Against External Measurements
Raphael Fischer
Main category: cs.AI
TL;DR: This study evaluates the accuracy of AI energy estimation tools like ML Emissions Calculator and CodeCarbon, finding they can have errors up to 40% despite generally following consumption patterns.
Details
Motivation: To address concerns about AI's environmental impact and validate the reliability of existing energy estimation tools that make pragmatic assumptions but may neglect important factors.Method: Systematic evaluation of static and dynamic energy estimation approaches through comparisons with ground-truth measurements across hundreds of AI experiments using a proposed validation framework.
Result: Established estimation approaches consistently make errors of up to 40%, though they generally follow the patterns of AI energy consumption.
Conclusion: The study provides empirical evidence on energy estimation quality, validates widely used tools for sustainable AI development, and offers guidelines and code for improving estimation accuracy in resource-aware ML research.
Abstract: Although machine learning (ML) and artificial intelligence (AI) present fascinating opportunities for innovation, their rapid development is also significantly impacting our environment. In response to growing resource-awareness in the field, quantification tools such as the ML Emissions Calculator and CodeCarbon were developed to estimate the energy consumption and carbon emissions of running AI models. They are easy to incorporate into AI projects; however, they also make pragmatic assumptions and neglect important factors, raising questions about estimation accuracy. This study systematically evaluates the reliability of static and dynamic energy estimation approaches through comparisons with ground-truth measurements across hundreds of AI experiments. Based on the proposed validation framework, investigative insights into AI energy demand and estimation inaccuracies are provided. While generally following the patterns of AI energy consumption, the established estimation approaches are shown to consistently make errors of up to 40%. By providing empirical evidence on energy estimation quality and errors, this study establishes transparency and validates widely used tools for sustainable AI development. It moreover formulates guidelines for improving the state of the art and offers code for extending the validation to other domains and tools, thus making important contributions to resource-aware ML and AI sustainability research.
[491] Clinical Uncertainty Impacts Machine Learning Evaluations
Simone Lionetti, Fabian Gröger, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Alexander A. Navarini, Marc Pouly
Main category: cs.AI
TL;DR: Machine learning evaluations should use probabilistic metrics that account for annotation uncertainty rather than simple aggregation methods like majority voting, as this significantly impacts model rankings in medical imaging.
Details
Motivation: Clinical dataset labels are uncertain due to annotator disagreement and varying confidence levels, but typical aggregation procedures obscure this variability.Method: Propose probabilistic metrics that operate directly on distributions of annotations, applicable regardless of annotation generation process (counting, confidence ratings, or probabilistic models).
Result: Accounting for confidence in binary labels significantly impacts model rankings in medical imaging benchmarks.
Conclusion: The community should release raw annotations and adopt uncertainty-aware evaluation to better reflect clinical data reality.
Abstract: Clinical dataset labels are rarely certain as annotators disagree and confidence is not uniform across cases. Typical aggregation procedures, such as majority voting, obscure this variability. In simple experiments on medical imaging benchmarks, accounting for the confidence in binary labels significantly impacts model rankings. We therefore argue that machine-learning evaluations should explicitly account for annotation uncertainty using probabilistic metrics that directly operate on distributions. These metrics can be applied independently of the annotations’ generating process, whether modeled by simple counting, subjective confidence ratings, or probabilistic response models. They are also computationally lightweight, as closed-form expressions have linear-time implementations once examples are sorted by model score. We thus urge the community to release raw annotations for datasets and to adopt uncertainty-aware evaluation so that performance estimates may better reflect clinical data.
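As an example of a metric that operates directly on annotation distributions, expected accuracy replaces each hard label with the probability that the label is positive (e.g., the annotator vote share). A minimal NumPy sketch of this uncertainty-aware evaluation:

```python
import numpy as np

def expected_accuracy(scores, p_positive, threshold=0.5):
    """Expected accuracy under per-example probabilistic labels."""
    pred = (np.asarray(scores, dtype=float) >= threshold).astype(float)
    p = np.asarray(p_positive, dtype=float)
    # Each example contributes the probability that its prediction is correct.
    return float(np.mean(pred * p + (1 - pred) * (1 - p)))

# Two examples where annotators voted 2/3 and 1/3 positive, so the labels stay
# soft instead of being collapsed by majority vote:
print(expected_accuracy(scores=[0.9, 0.2], p_positive=[2 / 3, 1 / 3]))  # ~0.667
```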
[492] Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing
Syed Mahbubul Huq, Daniel Brito, Daniel Sikar, Rajesh Mojumder
Main category: cs.AI
TL;DR: This paper evaluates LLMs’ capabilities in combinatorial optimization for 2D bin-packing problems, showing that LLM-generated heuristics outperform traditional methods with fewer computational resources.
Details
Motivation: To assess LLMs' potential in specialized domains like combinatorial optimization and establish benchmarks for evaluating their performance in such tasks.Method: Systematic methodology combining LLMs with evolutionary algorithms to iteratively generate and refine heuristic solutions, comparing them against traditional approaches like Finite First-Fit and Hybrid First-Fit.
Result: GPT-4o achieved optimal solutions within two iterations, reducing average bin usage from 16 to 15 bins and improving space utilization from 0.76-0.78 to 0.83.
Conclusion: LLMs can produce more efficient solutions than traditional approaches in combinatorial optimization tasks while requiring fewer computational resources, contributing to better understanding of LLM evaluation in specialized domains.
Abstract: This paper presents an evaluation framework for assessing Large Language Models’ (LLMs) capabilities in combinatorial optimization, specifically addressing the 2D bin-packing problem. We introduce a systematic methodology that combines LLMs with evolutionary algorithms to generate and refine heuristic solutions iteratively. Through comprehensive experiments comparing LLM-generated heuristics against traditional approaches (Finite First-Fit and Hybrid First-Fit), we demonstrate that LLMs can produce more efficient solutions while requiring fewer computational resources. Our evaluation reveals that GPT-4o achieves optimal solutions within two iterations, reducing average bin usage from 16 to 15 bins while improving space utilization from 0.76-0.78 to 0.83. This work contributes to understanding LLM evaluation in specialized domains and establishes benchmarks for assessing LLM performance in combinatorial optimization tasks.
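For context, a baseline from the first-fit family that the LLM-generated heuristics are compared against can be written in a few lines. This shelf-based first-fit packer is a generic illustration, not the paper's exact Finite First-Fit implementation:

```python
def shelf_first_fit(rects, bin_w, bin_h):
    """Pack (w, h) rectangles onto shelves, first fit, tallest first.
    Returns the number of bins used."""
    bins = []  # each bin is a list of shelves: [shelf_y, shelf_height, x_used]
    for w, h in sorted(rects, key=lambda r: -r[1]):
        for shelves in bins:
            for shelf in shelves:
                if shelf[2] + w <= bin_w and h <= shelf[1]:
                    shelf[2] += w          # placed on an existing shelf
                    break
            else:
                y_used = sum(s[1] for s in shelves)
                if y_used + h <= bin_h:    # open a new shelf in this bin
                    shelves.append([y_used, h, w])
                    break
                continue                   # bin is full; try the next one
            break
        else:
            bins.append([[0, h, w]])       # no bin fits; open a new bin
    return len(bins)

print(shelf_first_fit([(4, 3), (5, 3), (2, 2), (6, 4)], bin_w=10, bin_h=10))  # 1
```

In the paper's setup, a heuristic like this is the seed: the LLM proposes edits to the placement rule, and the evolutionary loop keeps the variants that lower bin usage.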
[493] InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning
Guanghao Zhu, Zhitian Hou, Zeyu Liu, Zhijie Sang, Congkai Xie, Hongxia Yang
Main category: cs.AI
TL;DR: InfiMed-Foundation-1.7B and InfiMed-Foundation-4B are medical-specific multimodal large language models that address challenges in medical AI by improving data quality, training efficiency, and domain knowledge extraction, achieving state-of-the-art performance in medical tasks.
Details
Motivation: General-purpose MLLMs lack specialized medical knowledge and produce uncertain/hallucinatory responses. Knowledge distillation struggles with domain expertise, and continual pretraining with medical data is computationally expensive.Method: Used high-quality general-purpose and medical multimodal data with a five-dimensional quality assessment framework. Employed low-to-high image resolution and multimodal sequence packing for efficiency. Implemented three-stage supervised fine-tuning for complex medical tasks.
Result: InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT in MedEvalKit framework for medical visual question answering and diagnostic tasks.
Conclusion: The work addresses key challenges in data quality, training efficiency, and domain-specific knowledge extraction, enabling more reliable and effective AI-driven healthcare solutions.
Abstract: Multimodal large language models (MLLMs) have shown remarkable potential in various domains, yet their application in the medical field is hindered by several challenges. General-purpose MLLMs often lack the specialized knowledge required for medical tasks, leading to uncertain or hallucinatory responses. Knowledge distillation from advanced models struggles to capture domain-specific expertise in radiology and pharmacology. Additionally, the computational cost of continual pretraining with large-scale medical data poses significant efficiency challenges. To address these issues, we propose InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications. We combined high-quality general-purpose and medical multimodal data and proposed a novel five-dimensional quality assessment framework to curate high-quality multimodal medical datasets. We employ low-to-high image resolution and multimodal sequence packing to enhance training efficiency, enabling the integration of extensive medical data. Furthermore, a three-stage supervised fine-tuning process ensures effective knowledge extraction for complex medical tasks. Evaluated on the MedEvalKit framework, InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating superior performance in medical visual question answering and diagnostic tasks. By addressing key challenges in data quality, training efficiency, and domain-specific knowledge extraction, our work paves the way for more reliable and effective AI-driven solutions in healthcare. InfiMed-Foundation-4B model is available at \href{https://huggingface.co/InfiX-ai/InfiMed-Foundation-4B}{InfiMed-Foundation-4B}.
[494] Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models
Aleksandar Terzić, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi
Main category: cs.AI
TL;DR: PD-SSM is a structured sparse parametrization method for state-space models that enables optimal finite-state automata tracking with linear computational cost, outperforming modern SSM variants.
Details
Motivation: Current SSMs use transition matrices that enable efficient computation but restrict expressivity for finite-state automata emulation, while unstructured matrices are too computationally expensive.Method: Parametrize transition matrix as product of column one-hot matrix (P) and complex-valued diagonal matrix (D), enabling linear computational cost scaling with state size.
Result: Significantly outperforms modern SSM variants on FSA tracking tasks, comparable to neural controlled differential equations on time-series classification, and effectively tracks complex FSA states in hybrid Transformer-SSM architecture.
Conclusion: PD-SSM achieves optimal FSA state tracking with linear computational cost, improving expressivity guarantees while maintaining efficiency comparable to diagonal SSMs.
Abstract: Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model’s expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model
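The P·D factorization is simple to construct directly: P places exactly one 1 in each column and D is complex diagonal. The sketch below uses a hard argmax per column for illustration (the paper's parametrization of the one-hot choice may differ) and keeps the diagonal magnitudes at most one for stability:

```python
import torch

def pd_transition(logits, theta, log_mag):
    """Build A = P D: P column one-hot, D complex diagonal with |d_i| <= 1."""
    n = logits.shape[0]
    cols = logits.argmax(dim=0)                 # one chosen row per column
    P = torch.zeros(n, n, dtype=torch.cfloat)
    P[cols, torch.arange(n)] = 1.0              # exactly one 1 in each column
    D = torch.diag(torch.exp(log_mag) * torch.exp(1j * theta))
    return P @ D

# One recurrence step h_t = A h_{t-1} + b_t with a random parametrization;
# log_mag <= 0 keeps the diagonal magnitudes at most 1 (BIBO stability).
n = 4
A = pd_transition(torch.randn(n, n), torch.randn(n), -torch.rand(n))
h = torch.zeros(n, dtype=torch.cfloat)
h = A @ h + torch.randn(n, dtype=torch.cfloat)
```

Because P D is sparse (one entry per column), applying it costs O(n) per step rather than the O(n^2) of an unstructured transition, which is what keeps the recurrence comparable to diagonal SSMs.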
[495] Large Language Models as Nondeterministic Causal Models
Sander Beckers
Main category: cs.AI
TL;DR: A simpler method for generating counterfactuals in LLMs using nondeterministic causal models, addressing limitations of existing approaches.
Details
Motivation: Existing methods for LLM counterfactuals are ambiguous - they don't interpret LLMs literally or as intended, requiring modifications to the sampling process or representing nondeterministic models as deterministic.Method: Proposes a simpler method that represents LLMs as nondeterministic causal models, making it directly applicable to any black-box LLM without modification.
Result: The new method is implementation-agnostic and works with any black-box LLM, while existing methods are useful for specific types of counterfactuals but not others.
Conclusion: Provides theoretical foundation for reasoning about counterfactuals in LLMs based on intended semantics, enabling novel application-specific methods.
Abstract: Recent work by Chatzi et al. and Ravfogel et al. has developed, for the first time, a method for generating counterfactuals of probabilistic Large Language Models. Such counterfactuals tell us what would - or might - have been the output of an LLM if some factual prompt ${\bf x}$ had been ${\bf x}^*$ instead. The ability to generate such counterfactuals is an important necessary step towards explaining, evaluating, and comparing, the behavior of LLMs. I argue, however, that the existing method rests on an ambiguous interpretation of LLMs: it does not interpret LLMs literally, for the method involves the assumption that one can change the implementation of an LLM’s sampling process without changing the LLM itself, nor does it interpret LLMs as intended, for the method involves explicitly representing a nondeterministic LLM as a deterministic causal model. I here present a much simpler method for generating counterfactuals that is based on an LLM’s intended interpretation by representing it as a nondeterministic causal model instead. The advantage of my simpler method is that it is directly applicable to any black-box LLM without modification, as it is agnostic to any implementation details. The advantage of the existing method, on the other hand, is that it directly implements the generation of a specific type of counterfactuals that is useful for certain purposes, but not for others. I clarify how both methods relate by offering a theoretical foundation for reasoning about counterfactuals in LLMs based on their intended semantics, thereby laying the groundwork for novel application-specific methods for generating counterfactuals.
[496] PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning
Hieu Tran, Zonghai Yao, Nguyen Luong Tran, Zhichao Yang, Feiyun Ouyang, Shuo Han, Razieh Rahimi, Hong Yu
Main category: cs.AI
TL;DR: PRIME is a multi-agent reasoning framework that integrates fast System 1 thinking and deliberate System 2 thinking, enabling open-source LLMs to compete with state-of-the-art closed-source models on complex reasoning tasks.
Details
Motivation: Inspired by the dual-process theory of human cognition from 'Thinking, Fast and Slow', the goal is to create a reasoning framework that mimics human cognitive processes by dynamically integrating intuitive and deliberate thinking modes.
Method: PRIME employs a Quick Thinking Agent (System 1) for rapid answers, and if uncertainty is detected, triggers a structured System 2 pipeline with specialized agents for planning, hypothesis generation, retrieval, information integration, and decision-making.
Result: Experimental results with LLaMA 3 models show that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning.
Conclusion: PRIME establishes itself as a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning by faithfully mimicking human cognitive processes and enhancing both efficiency and accuracy.
Abstract: Inspired by the dual-process theory of human cognition from \textit{Thinking, Fast and Slow}, we introduce \textbf{PRIME} (Planning and Retrieval-Integrated Memory for Enhanced Reasoning), a multi-agent reasoning framework that dynamically integrates \textbf{System 1} (fast, intuitive thinking) and \textbf{System 2} (slow, deliberate thinking). PRIME first employs a Quick Thinking Agent (System 1) to generate a rapid answer; if uncertainty is detected, it then triggers a structured System 2 reasoning pipeline composed of specialized agents for \textit{planning}, \textit{hypothesis generation}, \textit{retrieval}, \textit{information integration}, and \textit{decision-making}. This multi-agent design faithfully mimics human cognitive processes and enhances both efficiency and accuracy. Experimental results with LLaMA 3 models demonstrate that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning. This research establishes PRIME as a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning.
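The uncertainty-gated hand-off from System 1 to System 2 can be pictured as a short control-flow sketch; `call_llm`, the confidence parsing, and the agent prompts below are stand-ins, not the paper's implementation:

```python
# Control-flow sketch of an uncertainty-gated System-1/System-2 hand-off.

def call_llm(prompt: str) -> str:
    return "ANSWER: Paris (confidence: 0.4)"         # stub for any chat-completion client

def quick_think(question: str):
    reply = call_llm(f"Answer briefly and rate confidence 0-1: {question}")
    answer, conf = reply.split("(confidence:")
    return answer.strip(), float(conf.strip(" )"))

def system2(question: str) -> str:
    state = question
    for agent in ("plan", "hypothesize", "retrieve", "integrate", "decide"):
        state = call_llm(f"[{agent}] {state}")       # each specialized agent refines the state
    return state

def prime(question: str, threshold: float = 0.8) -> str:
    answer, conf = quick_think(question)             # System 1: fast, intuitive
    return answer if conf >= threshold else system2(question)   # System 2: deliberate pipeline

print(prime("What is the capital of France?"))
```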
[497] Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents
Jiaqi Shao, Yuxiang Lin, Munish Prasad Lohani, Yufeng Miao, Bing Luo
Main category: cs.AI
TL;DR: SeekBench is a new benchmark for evaluating LLM search agents’ epistemic competence through step-level analysis of their reasoning and evidence usage.
Details
Motivation: Current evaluations focus only on final answer accuracy, overlooking how LLM search agents reason with and act on external evidence during the search process.
Method: Created SeekBench with 190 expert-annotated traces containing over 1,800 response steps, enriched with evidence annotations for granular analysis of three key epistemic competencies.
Result: The benchmark enables evaluation of whether agents: (1) ground reasoning in observed evidence, (2) adaptively reformulate searches to recover from poor results, and (3) properly calibrate confidence about evidence sufficiency.
Conclusion: SeekBench provides the first comprehensive framework for assessing the epistemic competence of LLM search agents beyond just final answer accuracy.
Abstract: Recent work has explored training Large Language Model (LLM) search agents with reinforcement learning (RL) for open-domain question answering (QA). However, most evaluations focus solely on final answer accuracy, overlooking how these agents reason with and act on external evidence. We introduce SeekBench, the first benchmark for evaluating the \textit{epistemic competence} of LLM search agents through step-level analysis of their response traces. SeekBench comprises 190 expert-annotated traces with over 1,800 response steps generated by LLM search agents, each enriched with evidence annotations for granular analysis of whether agents (1) generate reasoning steps grounded in observed evidence, (2) adaptively reformulate searches to recover from low-quality results, and (3) have proper calibration to correctly assess whether the current evidence is sufficient for providing an answer.
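One way to picture the step-level annotation is as a trace object whose steps carry the evidence labels needed for the three checks; the field names below are guesses at the schema, not the released format:

```python
# Illustrative shape of an annotated trace supporting the three step-level checks.

trace = {
    "question": "Who wrote the paper that introduced transformers?",
    "steps": [
        {"type": "search", "query": "paper that introduced transformers",
         "result_quality": "low"},
        {"type": "search", "query": "Attention Is All You Need authors",
         "result_quality": "high"},                  # (2) adaptive reformulation
        {"type": "reason", "text": "The 2017 paper lists Vaswani et al.",
         "grounded_in_evidence": True},              # (1) grounding check
        {"type": "answer", "text": "Vaswani et al.",
         "evidence_sufficient": True},               # (3) calibration check
    ],
}
grounded = [s for s in trace["steps"] if s.get("grounded_in_evidence")]
print(len(grounded), "grounded reasoning step(s)")
```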
[498] EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer
Zhehao Dong, Xiaofeng Wang, Zheng Zhu, Yirui Wang, Yang Wang, Yukun Zhou, Boyuan Wang, Chaojun Ni, Runqi Ouyang, Wenkang Qin, Xinze Chen, Yun Ye, Guan Huang
Main category: cs.AI
TL;DR: EMMA framework enhances VLA policies using DreamTransfer for generating multi-view consistent robot manipulation videos and AdaMix for hard-sample-aware training, achieving significant performance gains in zero-shot generalization.
Details
Motivation: Collecting large-scale real-world robot manipulation data across varied conditions is time-consuming and expensive, creating a bottleneck for training robust VLA models.
Method: Proposed EMMA framework with DreamTransfer (diffusion Transformer for text-controlled video editing) and AdaMix (dynamic batch reweighting for hard samples), using hybrid training with real and generated data.
Result: DreamTransfer outperforms prior methods in multi-view consistency, geometric fidelity, and text-conditioning. VLAs trained with generated data achieve over 200% relative performance gain in zero-shot domains, with additional 13% improvement from AdaMix.
Conclusion: The approach effectively overcomes data collection bottlenecks and significantly boosts policy generalization to unseen object categories and visual domains.
Abstract: Vision-language-action (VLA) models increasingly rely on diverse training data to achieve robust generalization. However, collecting large-scale real-world robot manipulation data across varied object appearances and environmental conditions remains prohibitively time-consuming and expensive. To overcome this bottleneck, we propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework that integrates a generative data engine with an effective training pipeline. We introduce DreamTransfer, a diffusion Transformer-based framework for generating multi-view consistent, geometrically grounded embodied manipulation videos. DreamTransfer enables text-controlled visual editing of robot videos, transforming foreground, background, and lighting conditions without compromising 3D structure or geometrical plausibility. Furthermore, we explore hybrid training with real and generated data, and introduce AdaMix, a hard-sample-aware training strategy that dynamically reweights training batches to focus optimization on perceptually or kinematically challenging samples. Extensive experiments show that videos generated by DreamTransfer significantly outperform prior video generation methods in multi-view consistency, geometric fidelity, and text-conditioning accuracy. Crucially, VLAs trained with generated data enable robots to generalize to unseen object categories and novel visual domains using only demonstrations from a single appearance. In real-world robotic manipulation tasks with zero-shot visual domains, our approach achieves over a 200% relative performance gain compared to training on real data alone, and further improves by 13% with AdaMix, demonstrating its effectiveness in boosting policy generalization.
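AdaMix's dynamic batch reweighting can be illustrated in a few lines of PyTorch; the softmax-over-loss rule and temperature below are assumptions for illustration, not necessarily the paper's exact scheme:

```python
import torch

# Hard-sample-aware reweighting sketch in the spirit of AdaMix: upweight
# samples with high per-sample loss so optimization focuses on them.

def reweighted_loss(per_sample_loss: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    w = torch.softmax(per_sample_loss.detach() / temperature, dim=0)  # weights act as constants
    return (w * per_sample_loss).sum()

losses = torch.tensor([0.2, 0.3, 2.5, 0.1], requires_grad=True)  # toy per-sample losses
reweighted_loss(losses).backward()
print(losses.grad)   # the hard sample (loss 2.5) dominates the gradient
```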
[499] Guiding Evolution of Artificial Life Using Vision-Language Models
Nikhil Baid, Hannah Erlebach, Paul Hellegouarch, Frederico Wieser
Main category: cs.AI
TL;DR: ASAL++ extends Automated Search for Artificial Life by using multimodal foundation models to propose new evolutionary targets based on simulation visual history, creating open-ended-like search with increasingly complex targets.
Details
Motivation: To advance artificial life research by leveraging foundation models for open-ended evolutionary search, building on previous work that aligned ALife simulations with natural language prompts using vision-language models.
Method: Uses a second foundation model (Gemma-3) to propose new evolutionary targets from simulation visual history. Tests two strategies: Evolved Supervised Targets (EST) for single prompt matching and Evolved Temporal Targets (ETT) for matching entire sequences of generated prompts, implemented in Lenia substrate.
Result: EST promotes greater visual novelty while ETT fosters more coherent and interpretable evolutionary sequences. The method successfully creates evolutionary trajectories with increasingly complex targets.
Conclusion: ASAL++ points towards new directions for foundation model-driven artificial life discovery with open-ended characteristics, demonstrating the potential of multimodal FMs for automated evolutionary search.
Abstract: Foundation models (FMs) have recently opened up new frontiers in the field of artificial life (ALife) by providing powerful tools to automate search through ALife simulations. Previous work aligns ALife simulations with natural language target prompts using vision-language models (VLMs). We build on Automated Search for Artificial Life (ASAL) by introducing ASAL++, a method for open-ended-like search guided by multimodal FMs. We use a second FM to propose new evolutionary targets based on a simulation’s visual history. This induces an evolutionary trajectory with increasingly complex targets. We explore two strategies: (1) evolving a simulation to match a single new prompt at each iteration (Evolved Supervised Targets: EST) and (2) evolving a simulation to match the entire sequence of generated prompts (Evolved Temporal Targets: ETT). We test our method empirically in the Lenia substrate using Gemma-3 to propose evolutionary targets, and show that EST promotes greater visual novelty, while ETT fosters more coherent and interpretable evolutionary sequences. Our results suggest that ASAL++ points towards new directions for FM-driven ALife discovery with open-ended characteristics.
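The difference between the two strategies is easiest to see as two fitness functions over a simulation's frame history; `vlm_score` below is a stub for CLIP-style image-text similarity, and the frames and prompts are toy strings:

```python
# The two ASAL++ target modes as fitness functions over a frame history.

def vlm_score(frame: str, prompt: str) -> float:
    return float(prompt in frame)                 # stub similarity in [0, 1]

def est_fitness(frames: list, current_prompt: str) -> float:
    return vlm_score(frames[-1], current_prompt)  # EST: match only the newest target

def ett_fitness(frames: list, prompt_history: list) -> float:
    # ETT: match the whole sequence of generated prompts, one frame per target
    return sum(vlm_score(f, p) for f, p in zip(frames, prompt_history)) / len(prompt_history)

frames = ["blob", "blob spiral", "blob spiral orbit"]
prompts = ["blob", "spiral", "orbit"]
print(est_fitness(frames, prompts[-1]), ett_fitness(frames, prompts))
```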
[500] GeoSketch: A Neural-Symbolic Approach to Geometric Multimodal Reasoning with Auxiliary Line Construction and Affine Transformation
Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, Hanmeng Liu
Main category: cs.AI
TL;DR: GeoSketch is a neural-symbolic framework that transforms geometric reasoning into an interactive perception-reasoning-action loop, enabling dynamic manipulation of diagrams through auxiliary line construction and affine transformations.
Details
Motivation: Existing MLLMs process diagrams as static images, lacking capacity for dynamic manipulation which is essential for human geometric reasoning involving auxiliary line construction and transformations.
Method: Three-module framework: Perception (abstracts diagrams to structured logic), Symbolic Reasoning (applies geometric theorems), and Sketch Action (executes operations like drawing lines). Trained via supervised fine-tuning on 2,000 trajectories followed by reinforcement learning with symbolic rewards.
Result: Significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods on the GeoSketch Benchmark of 390 geometry problems requiring auxiliary construction or transformations.
Conclusion: GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction by unifying hierarchical decision-making, executable visual actions, and symbolic verification.
Abstract: Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs), requiring not only the joint interpretation of text and diagrams but also iterative visuospatial reasoning. While existing approaches process diagrams as static images, they lack the capacity for dynamic manipulation - a core aspect of human geometric reasoning involving auxiliary line construction and affine transformations. We present GeoSketch, a neural-symbolic framework that recasts geometric reasoning as an interactive perception-reasoning-action loop. GeoSketch integrates: (1) a Perception module that abstracts diagrams into structured logic forms, (2) a Symbolic Reasoning module that applies geometric theorems to decide the next deductive step, and (3) a Sketch Action module that executes operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop. To train this agent, we develop a two-stage pipeline: supervised fine-tuning on 2,000 symbolic-curated trajectories followed by reinforcement learning with dense, symbolic rewards to enhance robustness and strategic exploration. To evaluate this paradigm, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems requiring auxiliary construction or affine transformations. Experiments on strong MLLM baselines demonstrate that GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods. By unifying hierarchical decision-making, executable visual actions, and symbolic verification, GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, establishing a new foundation for solving complex visuospatial problems.
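The closed perception-reasoning-action loop can be sketched as three stub modules wired together; the function names mirror the summary's modules, but the bodies are placeholders, not the paper's implementation:

```python
# Skeleton of a perception -> reasoning -> action loop over a diagram state.

def perceive(diagram: dict) -> dict:              # diagram -> structured logic form
    return {"facts": set(diagram["facts"])}

def reason(logic: dict):                          # choose the next deductive step
    if "midpoint(M,A,B)" not in logic["facts"]:
        return ("draw_auxiliary", "midpoint(M,A,B)")
    return ("done", None)

def act(diagram: dict, fact: str) -> dict:        # a sketch action updates the diagram
    diagram["facts"].append(fact)
    return diagram

diagram = {"facts": ["triangle(A,B,C)"]}
while True:
    step, arg = reason(perceive(diagram))         # closed loop over the evolving diagram
    if step == "done":
        break
    diagram = act(diagram, arg)
print(diagram["facts"])
```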
[501] InfiAgent: Self-Evolving Pyramid Agent Framework for Infinite Scenarios
Chenglin Yu, Yang Yu, Songmiao Wang, Yucheng Wang, Yifan Yang, Jinjia Li, Ming Li, Hongxia Yang
Main category: cs.AI
TL;DR: InfiAgent is a pyramid-like DAG-based multi-agent framework that automates LLM agent development through hierarchical decomposition, dual-audit quality control, agent routing, and self-evolution mechanisms, achieving 9.9% higher performance than existing frameworks.
Details
Motivation: Current LLM agent development requires manual workflow design, prompt crafting, and iterative tuning, which hinders scalability and cost-effectiveness across industries.
Method: Proposes InfiAgent with: agent-as-a-tool mechanism for hierarchical decomposition, dual-audit for quality control, agent routing for task matching, self-evolution for autonomous restructuring, and atomic task design for parallelism.
Result: Achieves 9.9% higher performance than ADAS framework; case study shows InfiHelper generates scientific papers recognized by human reviewers at top-tier IEEE conferences.
Conclusion: InfiAgent evolves into a versatile pyramid-like multi-agent system capable of solving a wide range of problems efficiently and autonomously.
Abstract: Large Language Model (LLM) agents have demonstrated remarkable capabilities in organizing and executing complex tasks, and many such agents are now widely used in various application scenarios. However, developing these agents requires carefully designed workflows, carefully crafted prompts, and iterative tuning, which requires LLM techniques and domain-specific expertise. These hand-crafted limitations hinder the scalability and cost-effectiveness of LLM agents across a wide range of industries. To address these challenges, we propose \textbf{InfiAgent}, a Pyramid-like DAG-based Multi-Agent Framework that can be applied to \textbf{infi}nite scenarios, which introduces several key innovations: a generalized “agent-as-a-tool” mechanism that automatically decomposes complex agents into hierarchical multi-agent systems; a dual-audit mechanism that ensures the quality and stability of task completion; an agent routing function that enables efficient task-agent matching; and an agent self-evolution mechanism that autonomously restructures the agent DAG based on new tasks, poor performance, or optimization opportunities. Furthermore, InfiAgent’s atomic task design supports agent parallelism, significantly improving execution efficiency. This framework evolves into a versatile pyramid-like multi-agent system capable of solving a wide range of problems. Evaluations on multiple benchmarks demonstrate that InfiAgent achieves 9.9% higher performance compared to ADAS (similar auto-generated agent framework), while a case study of the AI research assistant InfiHelper shows that it generates scientific papers that have received recognition from human reviewers at top-tier IEEE conferences.
[502] Estimating the Empowerment of Language Model Agents
Jinyeop Song, Jeff Gore, Max Kleiman-Weiner
Main category: cs.AI
TL;DR: Proposes empowerment (mutual information between agent actions and future states) as an information-theoretic evaluation metric for language model agents, introducing EELMA algorithm to estimate empowerment from multi-turn text interactions.
Details
Motivation: Need for scalable evaluation frameworks for LM agents as they become more capable and gain real-world tool access, overcoming limitations of conventional benchmark-centric evaluations that are costly to design and require human task design.
Method: Developed EELMA algorithm to approximate effective empowerment from multi-turn text interactions, validated on language games and realistic web-browsing scenarios.
Result: Empowerment strongly correlates with average task performance, characterizes impact of environmental complexity and agentic factors (chain-of-thought, model scale, memory length), and high empowerment states/actions are often pivotal moments for general capabilities.
Conclusion: Empowerment serves as an appealing general-purpose metric for evaluating and monitoring LM agents in complex, open-ended settings.
Abstract: As language model (LM) agents become more capable and gain broader access to real-world tools, there is a growing need for scalable evaluation frameworks of agentic capability. However, conventional benchmark-centric evaluations are costly to design and require human designers to come up with valid tasks that translate into insights about general model capabilities. In this work, we propose information-theoretic evaluation based on empowerment, the mutual information between an agent’s actions and future states, as an open-ended method for evaluating LM agents. We introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We validate EELMA on both language games and scaled-up realistic web-browsing scenarios. We find that empowerment strongly correlates with average task performance, characterize the impact of environmental complexity and agentic factors such as chain-of-thought, model scale, and memory length on estimated empowerment, and that high empowerment states and actions are often pivotal moments for general capabilities. Together, these results demonstrate empowerment as an appealing general-purpose metric for evaluating and monitoring LM agents in complex, open-ended settings.
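Empowerment here is the mutual information $I(A; S')$ between an agent's actions and future states. A plug-in estimate over discrete logged pairs conveys the quantity EELMA approximates; the actual algorithm works on multi-turn text and is more involved:

```python
from collections import Counter
import math

# Plug-in estimate of I(A; S') from logged (action, next_state) pairs.

def mutual_information(pairs: list) -> float:
    n = len(pairs)
    p_as, p_a, p_s = Counter(pairs), Counter(a for a, _ in pairs), Counter(s for _, s in pairs)
    return sum(c / n * math.log2((c / n) / ((p_a[a] / n) * (p_s[s] / n)))
               for (a, s), c in p_as.items())

logs = [("open_door", "hall"), ("open_door", "hall"),
        ("go_left", "kitchen"), ("go_right", "garden")]
print(f"I(A; S') = {mutual_information(logs):.3f} bits")   # 1.5 bits: actions fully determine states
```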
[503] TrueGradeAI: Retrieval-Augmented and Bias-Resistant AI for Transparent and Explainable Digital Assessments
Rakesh Thakur, Shivaansh Kaushik, Gauri Chopra, Harsh Rohilla
Main category: cs.AI
TL;DR: TrueGradeAI is an AI-driven digital exam system that preserves handwriting via tablets, uses OCR for transcription, and employs a retrieval-augmented pipeline with LLMs for explainable, evidence-based grading to reduce bias and environmental impact.
Details
Motivation: To address limitations of traditional paper-based exams including excessive paper usage, logistical complexity, grading delays, and evaluator bias in assessment systems.
Method: Captures stylus input on secure tablets, uses transformer-based OCR for transcription, and implements a retrieval-augmented pipeline integrating faculty solutions, cache layers, and external references for LLM-based scoring with explicit reasoning.
Result: The system enables handwriting preservation with scalable, transparent evaluation that reduces environmental costs, accelerates feedback cycles, and builds reusable knowledge bases while mitigating grading bias.
Conclusion: TrueGradeAI advances digital assessment by combining handwriting preservation with explainable automation, bias mitigation, and auditable grading trails to ensure fairness and efficiency in examinations.
Abstract: This paper introduces TrueGradeAI, an AI-driven digital examination framework designed to overcome the shortcomings of traditional paper-based assessments, including excessive paper usage, logistical complexity, grading delays, and evaluator bias. The system preserves natural handwriting by capturing stylus input on secure tablets and applying transformer-based optical character recognition for transcription. Evaluation is conducted through a retrieval-augmented pipeline that integrates faculty solutions, cache layers, and external references, enabling a large language model to assign scores with explicit, evidence-linked reasoning. Unlike prior tablet-based exam systems that primarily digitize responses, TrueGradeAI advances the field by incorporating explainable automation, bias mitigation, and auditable grading trails. By uniting handwriting preservation with scalable and transparent evaluation, the framework reduces environmental costs, accelerates feedback cycles, and progressively builds a reusable knowledge base, while actively working to mitigate grading bias and ensure fairness in assessment.
[504] REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model
Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qinghua Zhao, Linqi Song, Lijie Wen
Main category: cs.AI
TL;DR: The paper introduces REMA, a framework that analyzes reasoning failures in LLMs by examining geometric deviations of internal representations from a “Reasoning Manifold” - a low-dimensional structure formed by correct reasoning paths.
Details
Motivation: To understand how LLMs perform complex reasoning and identify failure mechanisms through measurable geometric analysis of internal representations, addressing challenges in interpretability research.
Method: REMA framework quantifies geometric deviation of erroneous representations by calculating k-nearest neighbors distance to the Reasoning Manifold formed by correct representations, then tracks deviation across layers to localize divergence points where reasoning goes off-track.
Result: Experiments show the low-dimensional nature of reasoning manifolds and high separability between erroneous and correct reasoning representations, validating REMA’s effectiveness in analyzing reasoning failure origins.
Conclusion: This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for understanding and diagnosing internal computational processes of black-box models.
Abstract: Understanding how Large Language Models (LLMs) perform complex reasoning and their failure mechanisms is a challenge in interpretability research. To provide a measurable geometric analysis perspective, we define the concept of the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations corresponding to all correctly reasoned generations. This structure can be conceptualized as the embodiment of the effective thinking paths that the model has learned to successfully solve a given task. Based on this concept, we build REMA, a framework that explains the origins of failures by quantitatively comparing the spatial relationships of internal model representations corresponding to both erroneous and correct reasoning samples. Specifically, REMA first quantifies the geometric deviation of each erroneous representation by calculating its k-nearest neighbors distance to the approximated manifold formed by correct representations, thereby providing a unified failure signal. It then localizes the divergence points where these deviations first become significant by tracking this deviation metric across the model’s layers and comparing it against a baseline of internal fluctuations from correct representations, thus identifying where the reasoning chain begins to go off-track. Our extensive experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations. The results also validate the effectiveness of the REMA framework in analyzing the origins of reasoning failures. This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for in-depth understanding and diagnosis of the internal computational processes of black-box models.
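The deviation signal reduces to a k-nearest-neighbors distance from a query representation to the set of correct-reasoning representations at the same layer; a minimal sketch, with $k$ and the Euclidean metric as illustrative choices:

```python
import numpy as np

# REMA-style deviation: mean distance from a query hidden state to its
# k nearest correct-reasoning hidden states.

def knn_deviation(correct: np.ndarray, query: np.ndarray, k: int = 5) -> float:
    dists = np.linalg.norm(correct - query, axis=1)   # distance to each correct sample
    return float(np.sort(dists)[:k].mean())           # mean of the k smallest

rng = np.random.default_rng(0)
correct_reps = rng.normal(size=(200, 64))             # representations of correct generations
near = rng.normal(size=64) * 0.1                      # lies amid the correct cluster
far = rng.normal(size=64) + 5.0                       # shifted, like an erroneous sample

print(knn_deviation(correct_reps, near))   # small deviation
print(knn_deviation(correct_reps, far))    # much larger deviation
```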
[505] The Emergence of Altruism in Large-Language-Model Agents Society
Haoyang Li, Xiao Jia, Zhanzhan Zhao
Main category: cs.AI
TL;DR: LLMs show intrinsic heterogeneity in social tendencies - some are “Adaptive Egoists” that prioritize self-interest but can be influenced by social norms, while others are “Altruistic Optimizers” that inherently prioritize collective benefit even at personal cost.
Details
Motivation: Existing research focuses on cooperation in small-scale games, overlooking how altruism emerges in large-scale agent societies. Understanding the social logics LLMs embody is critical for computational social science.
Method: Introduced a Schelling-variant urban migration model with 200+ LLM agents navigating conflict between egoistic and altruistic goals. Used Grounded Theory-inspired method to systematically code agent reasoning.
Result: Identified two distinct LLM archetypes: Adaptive Egoists (default to self-interest but increase altruism with social norms) and Altruistic Optimizers (inherently prioritize collective benefit at personal cost).
Conclusion: Model selection for social simulation should consider intrinsic social action logic, not just reasoning capability. Adaptive Egoists better simulate human societies, while Altruistic Optimizers suit idealized pro-social scenarios.
Abstract: Leveraging Large Language Models (LLMs) for social simulation is a frontier in computational social science. Understanding the social logics these agents embody is critical to this attempt. However, existing research has primarily focused on cooperation in small-scale, task-oriented games, overlooking how altruism, which means sacrificing self-interest for collective benefit, emerges in large-scale agent societies. To address this gap, we introduce a Schelling-variant urban migration model that creates a social dilemma, compelling over 200 LLM agents to navigate an explicit conflict between egoistic (personal utility) and altruistic (system utility) goals. Our central finding is a fundamental difference in the social tendencies of LLMs. We identify two distinct archetypes: “Adaptive Egoists”, which default to prioritizing self-interest but whose altruistic behaviors significantly increase under the influence of a social norm-setting message board; and “Altruistic Optimizers”, which exhibit an inherent altruistic logic, consistently prioritizing collective benefit even at a direct cost to themselves. Furthermore, to qualitatively analyze the cognitive underpinnings of these decisions, we introduce a method inspired by Grounded Theory to systematically code agent reasoning. In summary, this research provides the first evidence of intrinsic heterogeneity in the egoistic and altruistic tendencies of different LLMs. We propose that for social simulation, model selection is not merely a matter of choosing reasoning capability, but of choosing an intrinsic social action logic. While “Adaptive Egoists” may offer a more suitable choice for simulating complex human societies, “Altruistic Optimizers” are better suited for modeling idealized pro-social actors or scenarios where collective welfare is the primary consideration.
[506] StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
Chenyu Zhou, Tianyi Xu, Jianghao Lin, Dongdong Ge
Main category: cs.AI
TL;DR: StepORLM is a self-evolving framework that addresses limitations in LLM training for OR problems through generative process supervision and co-evolution between policy and reward models, achieving state-of-the-art performance.
Details
Motivation: Existing LLM training methods for OR problems suffer from credit assignment issues (outcome rewards reinforce flawed reasoning) and myopic discriminative process supervision that fails to evaluate interdependent steps holistically.
Method: StepORLM features a co-evolutionary loop where a policy model and generative process reward model (GenPRM) iteratively improve each other using dual-feedback: outcome-based verification from external solvers and holistic process evaluation from GenPRM, aligned via Weighted Direct Preference Optimization.
Result: The 8B-parameter StepORLM establishes new state-of-the-art across six benchmarks, significantly outperforming larger generalist models, agentic methods, and specialized baselines. The co-evolved GenPRM also acts as a powerful universal process verifier, boosting inference scaling performance.
Conclusion: StepORLM demonstrates that generative process supervision and co-evolutionary training effectively address key limitations in LLM training for OR problems, enabling superior performance and creating universally applicable process verification capabilities.
Abstract: Large Language Models (LLMs) have shown promising capabilities for solving Operations Research (OR) problems. While reinforcement learning serves as a powerful paradigm for LLM training on OR problems, existing works generally face two key limitations. First, outcome reward suffers from the credit assignment problem, where correct final answers can reinforce flawed reasoning. Second, conventional discriminative process supervision is myopic, failing to evaluate the interdependent steps of OR modeling holistically. To this end, we introduce StepORLM, a novel self-evolving framework with generative process supervision. At its core, StepORLM features a co-evolutionary loop where a policy model and a generative process reward model (GenPRM) iteratively improve on each other. This loop is driven by a dual-feedback mechanism: definitive, outcome-based verification from an external solver, and nuanced, holistic process evaluation from the GenPRM. The combined signal is used to align the policy via Weighted Direct Preference Optimization (W-DPO) and simultaneously refine the GenPRM. Our resulting 8B-parameter StepORLM establishes a new state-of-the-art across six benchmarks, significantly outperforming vastly larger generalist models, agentic methods, and specialized baselines. Moreover, the co-evolved GenPRM is able to act as a powerful and universally applicable process verifier, substantially boosting the inference scaling performance of both our own model and other existing LLMs.
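The W-DPO alignment step can be sketched as the standard DPO objective scaled per preference pair by a process-quality weight (imagined here as a GenPRM score); only the DPO core below follows the standard formulation, the weighting is an assumption:

```python
import torch
import torch.nn.functional as F

# Weighted DPO sketch: standard DPO loss scaled by a per-pair weight.

def w_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, weight, beta=0.1):
    # margin = beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return (weight * -F.logsigmoid(margin)).mean()

loss = w_dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),   # policy log-probs (chosen, rejected)
                  torch.tensor([-11.0]), torch.tensor([-11.0]),   # reference log-probs
                  weight=torch.tensor([0.9]))                     # GenPRM-style pair weight
print(loss)
```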
[507] UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration
Qi Mao, Tinghan Yang, Jiahao Li, Bin Li, Libiao Jin, Yan Lu
Main category: cs.AI
TL;DR: UniMIC is a unified token-based multimodal interactive coding framework that enables efficient communication between edge devices and cloud AI agents using compact tokenized representations instead of raw data, achieving substantial bitrate savings while maintaining task performance.
Details
Motivation: Existing codecs are optimized for unimodal, one-way communication and cause degradation in multimodal AI interactions. There's a need for efficient communication frameworks that support bidirectional multimodal interaction between edge devices and cloud AI agents.
Method: Proposes UniMIC framework using compact tokenized representations as communication medium. Employs lightweight Transformer-based entropy models with three scenario-specific designs (generic, masked, and text-conditioned) to minimize inter-token redundancy.
Result: Achieves substantial bitrate savings and remains robust even at ultra-low bitrates (<0.05bpp) across text-to-image generation, text-guided inpainting, outpainting, and visual question answering tasks, without compromising downstream task performance.
Conclusion: UniMIC establishes a practical and forward-looking paradigm for next-generation multimodal interactive communication, bridging the gap between edge devices and cloud AI agents through efficient token-based representations.
Abstract: The rapid progress of Large Multimodal Models (LMMs) and cloud-based AI agents is transforming human-AI collaboration into bidirectional, multimodal interaction. However, existing codecs remain optimized for unimodal, one-way communication, resulting in repeated degradation under conventional compress-transmit-reconstruct pipelines. To address this limitation, we propose UniMIC, a Unified token-based Multimodal Interactive Coding framework that bridges edge devices and cloud AI agents. Instead of transmitting raw pixels or plain text, UniMIC employs compact tokenized representations as the communication medium, enabling efficient low-bitrate transmission while maintaining compatibility with LMMs. To further enhance compression, lightweight Transformer-based entropy models with scenario-specific designs-generic, masked, and text-conditioned-effectively minimize inter-token redundancy. Extensive experiments on text-to-image generation, text-guided inpainting, outpainting, and visual question answering show that UniMIC achieves substantial bitrate savings and remains robust even at ultra-low bitrates (<0.05bpp), without compromising downstream task performance. These results establish UniMIC as a practical and forward-looking paradigm for next-generation multimodal interactive communication.
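The link between the entropy models and bitrate is the usual one: the expected code length of a token stream is its cross-entropy under the model, which an arithmetic coder approaches. A fixed toy distribution stands in for UniMIC's Transformer entropy models below:

```python
import math

# Cross-entropy of a token stream under an entropy model = expected bits.

probs = {"sky": 0.5, "blue": 0.25, "tree": 0.125, "car": 0.125}   # toy token model
tokens = ["sky", "blue", "sky", "tree", "sky", "car"]

bits = sum(-math.log2(probs[t]) for t in tokens)
print(f"{bits:.1f} bits total, {bits / len(tokens):.2f} bits/token")
```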
[508] Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time
Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang
Main category: cs.AI
TL;DR: Dynamic Experts Search (DES) is a test-time scaling strategy that leverages MoE architecture flexibility by dynamically controlling expert activation counts to generate diverse reasoning paths without extra cost.
Details
Motivation: Existing TTS approaches focus on output-level sampling but ignore model architecture. In MoE LLMs, varying activated experts creates complementary solution sets with stable accuracy, revealing an underexplored diversity source.
Method: DES combines Dynamic MoE (direct control of expert counts during inference) and Expert Configuration Inheritance (consistent expert counts within reasoning paths but varied across runs) to balance stability and diversity.
Result: Extensive experiments across MoE architectures, verifiers, and reasoning benchmarks (math, code, knowledge) show DES reliably outperforms TTS baselines, improving accuracy and stability without additional cost.
Conclusion: DES demonstrates how structural flexibility in modern LLMs can advance reasoning, serving as a practical and scalable architecture-aware TTS approach.
Abstract: Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. However, existing approaches primarily rely on output-level sampling while overlooking the role of model architecture. In mainstream Mixture-of-Experts (MoE) LLMs, we observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing a new and underexplored source of diversity. Motivated by this observation, we propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES integrates two key components: (1) Dynamic MoE, which enables direct control of expert counts during inference to generate diverse reasoning trajectories without additional cost; and (2) Expert Configuration Inheritance, which preserves consistent expert counts within a reasoning path while varying them across runs, thereby balancing stability and diversity throughout the search. Extensive experiments across MoE architectures, verifiers and reasoning benchmarks (i.e., math, code and knowledge) demonstrate that DES reliably outperforms TTS baselines, enhancing accuracy and stability without additional cost. These results highlight DES as a practical and scalable form of architecture-aware TTS, illustrating how structural flexibility in modern LLMs can advance reasoning.
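The controllable dimension DES searches over is simply the top-$k$ of MoE routing. A toy layer whose expert count is a runtime argument illustrates the knob; real routing with load balancing, capacity limits, and batched dispatch is omitted, and all names are illustrative:

```python
import torch

# Toy MoE layer with a runtime-variable number of active experts k.

class DynamicTopKMoE(torch.nn.Module):
    def __init__(self, dim: int = 16, n_experts: int = 8):
        super().__init__()
        self.router = torch.nn.Linear(dim, n_experts)
        self.experts = torch.nn.ModuleList(torch.nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x: torch.Tensor, k: int) -> torch.Tensor:
        topv, topi = self.router(x).topk(k, dim=-1)        # pick k experts per input
        gates = torch.softmax(topv, dim=-1)
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                         # naive per-sample loop for clarity
            for slot in range(k):
                out[b] += gates[b, slot] * self.experts[int(topi[b, slot])](x[b])
        return out

layer = DynamicTopKMoE()
x = torch.randn(2, 16)
print(layer(x, k=2).shape, layer(x, k=4).shape)   # same weights, different expert counts
```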
[509] Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen
Main category: cs.AI
TL;DR: RL enhances LLM planning but lacks theoretical basis; analysis shows SFT creates spurious solutions while RL achieves correct planning through exploration, with Q-learning outperforming policy gradient by preserving diversity and enabling off-policy learning.
Details
Motivation: To understand the theoretical basis for RL's effectiveness in enhancing LLM planning capabilities, given that recent methods have shown substantial improvements but lack theoretical explanation.
Method: Used a tractable graph-based abstraction to analyze policy gradient and Q-learning methods, examining their behaviors in planning tasks and applying the framework to the Blocksworld benchmark.
Result: SFT introduces co-occurrence-based spurious solutions, while RL achieves correct planning through exploration. Policy gradient suffers from diversity collapse, whereas Q-learning preserves diversity and enables off-policy learning. Careful reward design is needed to prevent reward hacking in Q-learning.
Conclusion: RL’s exploration is crucial for better generalization in planning, with Q-learning providing key advantages over policy gradient through diversity preservation and off-policy learning, though reward design requires careful consideration to avoid hacking.
Abstract: Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL’s benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration’s role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
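The contrast the analysis draws is visible even in tabular Q-learning on a tiny graph: off-policy backups propagate value along every path to the goal, so multiple correct plans retain value rather than collapsing to one. A sketch with arbitrary hyperparameters:

```python
import random

# Tabular Q-learning on a tiny graph-planning task: states are nodes, an
# action moves to a neighbor, reward 1 on reaching the goal node.

edges = {0: [1, 2], 1: [3], 2: [3], 3: []}          # two distinct paths to goal node 3
Q = {(s, a): 0.0 for s, acts in edges.items() for a in acts}
alpha, gamma, eps = 0.5, 0.9, 0.2

random.seed(0)
for _ in range(500):
    s = 0
    while edges[s]:
        if random.random() < eps:
            a = random.choice(edges[s])              # exploration
        else:
            a = max(edges[s], key=lambda a2: Q[(s, a2)])
        r = 1.0 if a == 3 else 0.0
        future = max((Q[(a, a2)] for a2 in edges[a]), default=0.0)
        Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])   # off-policy backup
        s = a

print(Q)   # both (0, 1) and (0, 2) retain high value: multiple plans stay alive
```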
[510] Accelerate Creation of Product Claims Using Generative AI
Po-Yu Liang, Yong Zhang, Tatiana Hwa, Aaron Byers
Main category: cs.AI
TL;DR: Claim Advisor is a web application that uses LLMs to accelerate product claim creation through semantic search, claim generation/optimization, and simulation-based ranking.
Details
Motivation: Creating product claims is time-consuming and expensive, but crucial for driving consumer purchase behavior.
Method: Uses in-context learning and fine-tuning of large language models to provide three functions: semantic search of existing claims, claim generation/optimization based on product description and consumer profile, and ranking claims via synthetic consumer simulations.
Result: Applications in a consumer packaged goods company showed very promising results.
Conclusion: This capability is broadly useful across product categories and industries, encouraging research and application of generative AI in different sectors.
Abstract: The benefit claims of a product is a critical driver of consumers’ purchase behavior. Creating product claims is an intense task that requires substantial time and funding. We have developed the $\textbf{Claim Advisor}$ web application to accelerate claim creations using in-context learning and fine-tuning of large language models (LLM). $\textbf{Claim Advisor}$ was designed to disrupt the speed and economics of claim search, generation, optimization, and simulation. It has three functions: (1) semantically searching and identifying existing claims and/or visuals that resonate with the voice of consumers; (2) generating and/or optimizing claims based on a product description and a consumer profile; and (3) ranking generated and/or manually created claims using simulations via synthetic consumers. Applications in a consumer packaged goods (CPG) company have shown very promising results. We believe that this capability is broadly useful and applicable across product categories and industries. We share our learning to encourage the research and application of generative AI in different industries.
[511] A critical review of methods and challenges in large language models
Milad Moradi, Ke Yan, David Colwell, Matthias Samwald, Rhona Asgari
Main category: cs.AI
TL;DR: A comprehensive critical review analyzing Large Language Models’ foundations, applications, training methods, architectural evolution from RNNs to Transformers, and ethical considerations.
Details
Motivation: To provide a unified perspective on LLMs' strengths, limitations, and future prospects by examining their current state and identifying research gaps.
Method: Critical review methodology encompassing analysis of LLM principles, applications, training techniques (in-context learning, fine-tuning), alignment methods (reinforcement learning, human feedback), and retrieval-augmented generation.
Result: A comprehensive overview of LLM evolution, state-of-the-art techniques, and ethical deployment considerations, serving as an insightful guide for AI researchers and practitioners.
Conclusion: The review identifies current gaps and suggests future research directions while emphasizing the importance of responsible and mindful application of LLMs in artificial intelligence.
Abstract: This critical review provides an in-depth analysis of Large Language Models (LLMs), encompassing their foundational principles, diverse applications, and advanced training methodologies. We critically examine the evolution from Recurrent Neural Networks (RNNs) to Transformer models, highlighting the significant advancements and innovations in LLM architectures. The review explores state-of-the-art techniques such as in-context learning and various fine-tuning approaches, with an emphasis on optimizing parameter efficiency. We also discuss methods for aligning LLMs with human preferences, including reinforcement learning frameworks and human feedback mechanisms. The emerging technique of retrieval-augmented generation, which integrates external knowledge into LLMs, is also evaluated. Additionally, we address the ethical considerations of deploying LLMs, stressing the importance of responsible and mindful application. By identifying current gaps and suggesting future research directions, this review provides a comprehensive and critical overview of the present state and potential advancements in LLMs. This work serves as an insightful guide for researchers and practitioners in artificial intelligence, offering a unified perspective on the strengths, limitations, and future prospects of LLMs.
[512] Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability
Yunfei Ge, Ya-Ting Yang, Quanyan Zhu
Main category: cs.AI
TL;DR: Proposes a Computational Reflective Equilibrium (CRE) approach for AI responsibility attribution, addressing challenges in AI-enabled systems through structured ethical analysis.
Details
Motivation: Addresses complex challenges in AI responsibility and accountability due to system interconnectivity, ethical concerns, technological uncertainties, and regulatory gaps.
Method: Computational Reflective Equilibrium (CRE) approach that provides structured analysis for dynamic scenarios, examining initial activation levels of claims in equilibrium computation.
Result: Framework demonstrates traceability, coherence, and adaptivity in responsibility attribution, with AI-assisted medical decision-support case study showing diverse responsibility distributions based on different initializations.
Conclusion: The CRE framework offers valuable insights for AI accountability and facilitates sustainable system development through continuous monitoring, revision, and reflection.
Abstract: The pervasive integration of Artificial Intelligence (AI) has introduced complex challenges in the responsibility and accountability in the event of incidents involving AI-enabled systems. The interconnectivity of these systems, ethical concerns of AI-induced incidents, coupled with uncertainties in AI technology and the absence of corresponding regulations, have made traditional responsibility attribution challenging. To this end, this work proposes a Computational Reflective Equilibrium (CRE) approach to establish a coherent and ethically acceptable responsibility attribution framework for all stakeholders. The computational approach provides a structured analysis that overcomes the limitations of conceptual approaches in dealing with dynamic and multifaceted scenarios, showcasing the framework’s traceability, coherence, and adaptivity properties in the responsibility attribution process. We examine the pivotal role of the initial activation level associated with claims in equilibrium computation. Using an AI-assisted medical decision-support system as a case study, we illustrate how different initializations lead to diverse responsibility distributions. The framework offers valuable insights into accountability in AI-induced incidents, facilitating the development of a sustainable and resilient system through continuous monitoring, revision, and reflection.
[513] Development and Validation of a Large Language Model for Generating Fully-Structured Radiology Reports
Chuang Niu, Md Sayed Tanveer, Md Zabirul Islam, Parisa Kaviani, Qing Lyu, Mannudeep K. Kalra, Christopher T. Whitlow, Ge Wang
Main category: cs.AI
TL;DR: Developed an open-source LLM with dynamic-template-constrained decoding that creates fully-structured lung cancer screening reports from free-text radiology reports with 97% F1 score, no formatting errors or hallucinations, outperforming GPT-4o by 17.19%.
Details
Motivation: Current LLMs for structured report generation face formatting errors, content hallucinations, and privacy issues when using external servers. Need for accurate, open-source solution for standardized lung cancer screening reports.
Method: Dynamic-template-constrained decoding method to enhance existing LLMs, using 5,442 LDCT LCS reports from two institutions. Created standardized template with 27 lung nodule features and evaluated on cross-institutional datasets.
Result: Achieved 97% F1 score with no formatting errors or hallucinations. Improved best open-source LLMs by up to 10.42% and outperformed GPT-4o by 17.19%. Enabled automated statistical analysis and flexible nodule retrieval.
Conclusion: The method successfully creates accurate structured reports, enables automated analysis and retrieval, and provides publicly available software for local deployment and research.
Abstract: Current LLMs for creating fully-structured reports face the challenges of formatting errors, content hallucinations, and privacy leakage issues when uploading data to external servers. We aim to develop an open-source, accurate LLM for creating fully-structured and standardized LCS reports from varying free-text reports across institutions and demonstrate its utility in automatic statistical analysis and individual lung nodule retrieval. With IRB approvals, our retrospective study included 5,442 de-identified LDCT LCS radiology reports from two institutions. We constructed two evaluation datasets by labeling 500 pairs of free-text and fully-structured radiology reports and one large-scale consecutive dataset from January 2021 to December 2023. Two radiologists created a standardized template for recording 27 lung nodule features on LCS. We designed a dynamic-template-constrained decoding method to enhance existing LLMs for creating fully-structured reports from free-text radiology reports. Using consecutive structured reports, we automated descriptive statistical analyses and a nodule retrieval prototype. Our best LLM for creating fully-structured reports achieved high performance on cross-institutional datasets with an F1 score of about 97%, with neither formatting errors nor content hallucinations. Our method consistently improved the best open-source LLMs by up to 10.42%, and outperformed GPT-4o by 17.19%. The automatically derived statistical distributions were consistent with prior findings regarding attenuation, location, size, stability, and Lung-RADS. The retrieval system with structured reports allowed flexible nodule-level search and complex statistical analysis. Our developed software is publicly available for local deployment and further research.
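Template-constrained decoding can be illustrated in miniature: structural tokens are forced from the template, and the model only chooses among admissible values at designated slots. Everything below, including the slot names and vocabularies, is an illustrative stub rather than the paper's dynamic-template method over real LLM logits:

```python
import random

# Toy template-constrained decoding: template tokens are forced; at each slot
# the "model" chooses only among admissible values.

template = ['{"nodule_size_mm": ', "<NUMBER>", ', "lung_rads": ', "<CATEGORY>", "}"]
slot_vocab = {"<NUMBER>": ["4", "6", "9"], "<CATEGORY>": ['"2"', '"3"', '"4A"']}

def model_pick(options: list) -> str:
    return random.choice(options)            # stands in for argmax over masked LLM logits

out = []
for piece in template:
    if piece in slot_vocab:
        out.append(model_pick(slot_vocab[piece]))   # model fills the slot
    else:
        out.append(piece)                           # structural token is forced
print("".join(out))   # always parses as the fully-structured schema
```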
[514] Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning
Ram Ramrakhya, Matthew Chang, Xavier Puig, Ruta Desai, Zsolt Kira, Roozbeh Mottaghi
Main category: cs.AI
TL;DR: The paper introduces Ask-to-Act task where embodied agents ask clarification questions for ambiguous household instructions, and proposes RL-finetuned MLLMs that outperform baselines by 10.4-16.5%.
Details
Motivation: Household robots need to interpret ambiguous human instructions and ask relevant clarification questions to accurately infer user intent for effective task execution.
Method: Fine-tunes multi-modal large language models (MLLMs) as vision-language-action policies using online reinforcement learning with LLM-generated rewards, eliminating need for human demonstrations or manual rewards.
Result: RL-finetuned MLLM outperforms zero-shot baselines (GPT-4o) and supervised fine-tuned MLLMs by 10.4-16.5%, generalizing well to novel scenes and tasks.
Conclusion: This is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM-generated rewards with online RL, showing significant performance improvements.
Abstract: Embodied agents operating in household environments must interpret ambiguous and under-specified human instructions. A capable household robot should recognize ambiguity and ask relevant clarification questions to infer the user intent accurately, leading to more effective task execution. To study this problem, we introduce the Ask-to-Act task, where an embodied agent is tasked with a single or multi-object rearrangement task using an under-specified instruction in a home environment. The agent must strategically ask minimal, yet relevant, clarification questions to resolve ambiguity while navigating under partial observability. To address this challenge, we propose a novel approach that fine-tunes multi-modal large language models (MLLMs) as vision-language-action (VLA) policies using online reinforcement learning (RL) with LLM-generated rewards. Our method eliminates the need for large-scale human demonstrations or manually engineered rewards for training such agents. We benchmark against strong zero-shot baselines including GPT-4o as well as supervised fine-tuned MLLMs on our task. Our results show that our RL-finetuned MLLM outperforms all baselines by a significant margin (10.4-16.5%), generalizing well to novel scenes and tasks. To the best of our knowledge, this is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM-generated rewards with online RL.
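An LLM-generated reward for clarification questions might look like the following sketch, where a judge model scores necessity and relevance; the judge prompt, verdict format, and question penalty are all assumptions:

```python
# Sketch of an LLM-generated reward for clarification questions.

def judge_llm(prompt: str) -> str:
    return "relevant"                        # stub standing in for a real LLM judge call

def clarification_reward(instruction: str, question: str, asked_before: int) -> float:
    verdict = judge_llm(f"Instruction: {instruction}\nQuestion: {question}\n"
                        "Is this clarification necessary and relevant? Answer one word.")
    base = 1.0 if verdict == "relevant" else -1.0
    return base - 0.2 * asked_before         # discourage asking more questions than needed

print(clarification_reward("put the cup away", "Which cup do you mean?", asked_before=0))
```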
[515] A Domain-Agnostic Scalable AI Safety Ensuring Framework
Beomjun Kim, Kangyeon Kim, Sunwoo Kim, Yeonsang Shin, Heejin Ahn
Main category: cs.AI
TL;DR: First domain-agnostic AI safety framework with strong theoretical guarantees and superior performance across multiple domains including RL, NLP, and production planning.
Details
Motivation: AI safety has become critical as systems are deployed in real-world applications, requiring frameworks that ensure safety while maintaining high performance.
Method: Framework includes optimization with chance constraints, safety classification, internal test data, conservative testing, dataset quality measures, and continuous loss functions with gradient computation. Establishes first mathematical scaling law for AI safety.
Result: Achieved 3 collisions in 10M RL actions vs 1,000-3,000 for PPO-Lag baselines at equivalent performance. Validated across reinforcement learning, natural language generation, and production planning.
Conclusion: The framework provides a new foundation for safe AI deployment in safety-critical domains with unprecedented safety guarantees.
Abstract: AI safety has emerged as a critical priority as these systems are increasingly deployed in real-world applications. We propose the first domain-agnostic AI safety ensuring framework that achieves strong safety guarantees while preserving high performance, grounded in rigorous theoretical foundations. Our framework includes: (1) an optimization component with chance constraints, (2) a safety classification model, (3) internal test data, (4) conservative testing procedures, (5) informative dataset quality measures, and (6) continuous approximate loss functions with gradient computation. Furthermore, to our knowledge, we mathematically establish the first scaling law in AI safety research, relating data quantity to safety-performance trade-offs. Experiments across reinforcement learning, natural language generation, and production planning validate our framework and demonstrate superior performance. Notably, in reinforcement learning, we achieve 3 collisions during 10M actions, compared with 1,000-3,000 for PPO-Lag baselines at equivalent performance levels – a safety level unattainable by previous AI methods. We believe our framework opens a new foundation for safe AI deployment across safety-critical domains.
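A chance constraint of the form $P(\text{unsafe}) \le \delta$ can be checked conservatively on held-out test data with a one-sided concentration bound; the Hoeffding bound below is one standard choice, not necessarily the paper's procedure, and all numbers are illustrative:

```python
import numpy as np

# Conservative check of a chance constraint P(unsafe) <= delta using a
# one-sided Hoeffding bound on held-out violation indicators.

def upper_bound_unsafe_rate(violations: np.ndarray, confidence: float = 0.95) -> float:
    n = len(violations)
    eps = np.sqrt(np.log(1.0 / (1.0 - confidence)) / (2 * n))   # Hoeffding slack
    return float(violations.mean() + eps)

delta = 0.01
rng = np.random.default_rng(1)
violations = (rng.random(100_000) < 0.0005).astype(float)   # simulated safety test outcomes
bound = upper_bound_unsafe_rate(violations)
print(f"bound = {bound:.4f}, accept = {bound <= delta}")    # accept only if the bound meets delta
```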
[516] Reasoning BO: Enhancing Bayesian Optimization with Long-Context Reasoning Power of LLMs
Zhuo Yang, Daolang Wang, Lingli Ge, Beilun Wang, Tianfan Fu, Yuqiang Li
Main category: cs.AI
TL;DR: Reasoning BO integrates LLMs with Bayesian Optimization, using reasoning models and multi-agent systems to guide sampling and accumulate knowledge, achieving better performance than traditional BO.
Details
Motivation: Traditional BO methods often get stuck in local optima and lack interpretability, limiting their effectiveness in expensive black-box function optimization.
Method: Combines reasoning models, multi-agent systems, and knowledge graphs with BO, leveraging LLMs’ contextual understanding to provide real-time sampling guidance and scientific insights.
Result: Achieved 60.7% yield in Direct Arylation task vs 25.2% with traditional BO; demonstrated superior performance across 10 diverse tasks including synthetic functions and real-world applications.
Conclusion: LLMs’ reasoning capabilities significantly enhance BO by providing interpretable guidance and enabling discovery of superior solutions; smaller fine-tuned LLMs can match larger models’ performance.
Abstract: Many real-world scientific and industrial applications require the optimization of expensive black-box functions. Bayesian Optimization (BO) provides an effective framework for such problems. However, traditional BO methods are prone to getting trapped in local optima and often lack interpretable insights. To address this issue, this paper presents Reasoning BO, a novel framework that leverages reasoning models to guide the sampling process in BO while incorporating multi-agent systems and knowledge graphs for online knowledge accumulation. By integrating the reasoning and contextual understanding capabilities of Large Language Models (LLMs), we can provide strong guidance to enhance the BO process. As the optimization progresses, Reasoning BO provides real-time sampling recommendations along with critical insights grounded in plausible scientific theories, aiding in the discovery of superior solutions within the search space. We systematically evaluate our approach across 10 diverse tasks encompassing synthetic mathematical functions and complex real-world applications. The framework demonstrates its capability to progressively refine sampling strategies through real-time insights and hypothesis evolution, effectively identifying higher-performing regions of the search space for focused exploration. This process highlights the powerful reasoning and context-learning abilities of LLMs in optimization scenarios. For example, in the Direct Arylation task, our method increased the yield to 60.7%, whereas traditional BO achieved only a 25.2% yield. Furthermore, our investigation reveals that smaller LLMs, when fine-tuned through reinforcement learning, can attain comparable performance to their larger counterparts.
[517] The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models
Renfei Dang, Zhening Li, Shujian Huang, Jiajun Chen
Main category: cs.AI
TL;DR: The paper identifies internal bias as a key trigger of overthinking in reasoning models, where preliminary guesses about answers lead to redundant reasoning steps when conflicting with systematic reasoning.
Details
Motivation: To understand and address the phenomenon of overthinking in reasoning models, characterized by excessive and redundant reasoning steps that waste computational resources.
Method: The authors validate the association between internal bias and overthinking across multiple models and reasoning tasks, conduct counterfactual interventions (removing input questions and manually injecting bias), and perform interpretability experiments to examine attention mechanisms.
Result: Internal bias was confirmed as a trigger for overthinking, with counterfactual interventions showing reduced redundant reasoning when input questions were removed and bias injection affecting overthinking patterns. Excessive attention to input questions was identified as the mechanism.
Conclusion: Internal bias persists as a significant factor in overthinking across reasoning models, and current mitigation methods were insufficient to eliminate its influence on redundant reasoning behavior.
Abstract: Reasoning models often exhibit overthinking, characterized by redundant reasoning steps. We identify “internal bias” elicited by the input question as a key trigger of such behavior. Upon encountering a problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it may not be explicitly generated, and it arises without systematic reasoning. When this guess conflicts with its subsequent reasoning, the model tends to engage in excessive reflection, resulting in wasted computation. We validate the association between internal bias and overthinking across multiple models and diverse reasoning tasks. To demonstrate the causal relationship more rigorously, we conduct two counterfactual interventions, showing that removing the input question after the model has read it reduces redundant reasoning across various complex reasoning tasks, and that manually injecting bias affects overthinking accordingly. Further interpretability experiments suggest that excessive attention to the input question serves as a key mechanism through which internal bias influences subsequent reasoning trajectories. Finally, we evaluated several methods aimed at mitigating overthinking, yet the influence of internal bias persisted under all conditions.
[518] XBOUND: Exploring Capability Boundaries of Device-Control Agents at the State Level
Shaoqing Zhang, Kehai Chen, Zhuosheng Zhang, Rumei Li, Rongxiang Weng, Yang Xiang, Min Zhang
Main category: cs.AI
TL;DR: XBOUND is a new state-level evaluation method for Device-Control Agents that assesses instruction completion accuracy per GUI state, revealing insights about agent performance patterns and limitations.
Details
Motivation: Current evaluation methods for Device-Control Agents focus only on instruction level, missing the broader context of GUI environments where single states contain multiple interactive widgets linked to different instructions.
Method: Proposed XBOUND evaluation method that provides state-level assessment framework to evaluate accuracy of instruction completion on per-state basis rather than just instruction level.
Result: Key findings: UI-TARS is strongest 7B model, agents show bimodal performance in instruction unification, sub-7B models have limited state mastery, GPT-based planning is a bottleneck, grounding data helps action matching while trajectory data aids instruction unification.
Conclusion: XBOUND provides comprehensive state-level evaluation that reveals important performance patterns and limitations in current Device-Control Agents, offering better assessment of agent capabilities in GUI environments.
Abstract: Recent advancements in vision-language models have increased interest in Device-Control Agents (DC agents) for managing graphical user interfaces (GUIs). With the growing complexity and integration of such agents into various applications, effective evaluation methods have become crucial. The current evaluation method for DC agents primarily focuses on the instruction level, providing the current state (e.g., screenshots) and past execution history to determine actions for target instructions, helping identify potential execution failures. However, in GUI environments, a single state may contain multiple interactive widgets, each linked to different instructions, presenting an opportunity for diverse actions based on various instruction targets. Evaluating the agent’s performance solely at the instruction level may overlook the broader context of these interactions. To capture a more comprehensive view of agent performance, we propose a new evaluation method, XBOUND, to evaluate the accuracy of instruction completion on a per-state basis. XBOUND provides a state-level evaluation framework, serving as a tool to assess agents’ capabilities within environmental states. Our evaluation yields several key insights: UI-TARS stands out as the strongest 7B model, current agents display a bimodal performance pattern in instruction unification, and sub-7B models remain limited in state mastery. We further identify GPT-based planning as a critical bottleneck, and show that grounding data mainly benefits action matching, while trajectory data is more effective for instruction unification.
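To make the state-level framing concrete, here is a minimal sketch with a hypothetical record layout (not XBOUND's actual format): instruction-level outcomes are grouped by the GUI state they originate from, and each state is scored by the fraction of its linked instructions the agent completes.
```python
from collections import defaultdict

# Hypothetical records: (state_id, instruction, completed). The layout is an
# assumption for illustration, not the benchmark's actual schema.
records = [
    ("settings_home", "open wifi menu", True),
    ("settings_home", "enable dark mode", False),
    ("settings_home", "open bluetooth menu", True),
    ("player_screen", "pause playback", True),
]

by_state = defaultdict(list)
for state, _instruction, completed in records:
    by_state[state].append(completed)

# Per-state accuracy: how much of each state's instruction space is mastered.
state_scores = {s: sum(oks) / len(oks) for s, oks in by_state.items()}
print(state_scores)  # {'settings_home': 0.666..., 'player_screen': 1.0}
```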
[519] Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune
Main category: cs.AI
TL;DR: The Darwin Gödel Machine (DGM) is a self-improving AI system that autonomously modifies its own code and empirically validates improvements using coding benchmarks, achieving significant performance gains on SWE-bench (20.0% to 50.0%) and Polyglot (14.2% to 30.7%).
Details
Motivation: Current AI systems have fixed architectures and cannot autonomously improve themselves. Automating AI advancement could accelerate development and deliver benefits sooner, but existing approaches like meta-learning are limited by human-designed search spaces and first-order improvements.
Method: DGM maintains an archive of coding agents and grows it by sampling agents and using foundation models to create new versions. This open-ended exploration forms a growing tree of diverse agents, enabling parallel exploration of different paths through the search space.
Result: DGM automatically improved coding capabilities including better code editing tools, long-context window management, and peer-review mechanisms. It significantly outperformed baselines without self-improvement or open-ended exploration.
Conclusion: DGM represents a significant step toward self-improving AI capable of gathering its own stepping stones for endless innovation. All experiments were conducted with safety precautions like sandboxing and human oversight.
Abstract: Today’s AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
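The archive-growing loop described in the abstract can be outlined as follows. This is a hedged sketch: `propose_child` stands in for the foundation-model self-modification step, `score` for the coding-benchmark evaluation, and the uniform parent sampling simplifies the DGM's actual selection scheme.
```python
import random

def evolve(seed_agent, propose_child, score, iterations=100):
    """Open-ended archive loop in the spirit of the DGM (illustrative only)."""
    archive = [(seed_agent, score(seed_agent))]
    for _ in range(iterations):
        # Sample a parent from the archive (uniformly here; the DGM's
        # selection favors novelty and quality, which this sketch omits).
        parent, _ = random.choice(archive)
        child = propose_child(parent)  # FM rewrites the agent's own code
        archive.append((child, score(child)))  # keep even mediocre agents
                                               # as potential stepping stones
    return max(archive, key=lambda pair: pair[1])
```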
[520] Scalable In-Context Q-Learning
Jinmei Liu, Fuhong Liu, Jianye Hao, Bo Wang, Huaxiong Li, Chunlin Chen, Zhi Wang
Main category: cs.AI
TL;DR: SICQL is a scalable in-context Q-learning framework that combines dynamic programming and world modeling to enable efficient reward maximization and task generalization in in-context reinforcement learning.
Details
Motivation: Existing in-context reinforcement learning approaches face challenges in learning from suboptimal trajectories and achieving precise in-context inference due to complex dynamics and temporal correlations.
Method: Uses a prompt-based multi-head transformer architecture with separate heads for optimal policy and value function prediction, pretrains a generalized world model for compact prompts, and employs iterative policy improvement with advantage-weighted regression.
Result: Extensive experiments show consistent performance gains over various baselines across discrete and continuous environments, especially when learning from suboptimal data.
Conclusion: SICQL successfully extends in-context learning to decision domains while maintaining scalability and stability of supervised pretraining, demonstrating effective reward maximization and task generalization.
Abstract: Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend the promise to decision domains. Due to involving more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in-context inference. In this paper, we propose Scalable In-Context Q-Learning (SICQL), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper-expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU-RL/SICQL
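The "upper-expectile of the Q-function" step refers to expectile regression, a standard device in implicit Q-learning variants. A minimal PyTorch sketch of the asymmetric loss, shown for illustration rather than as the paper's code:
```python
import torch

def expectile_loss(q, v, tau=0.9):
    """Asymmetric L2 loss; for tau > 0.5 the value estimate is pulled toward
    an upper expectile of Q, approximating a max over in-support actions."""
    diff = q - v
    weight = torch.abs(tau - (diff < 0).float())  # tau if diff >= 0 else 1-tau
    return (weight * diff.pow(2)).mean()

# Toy check: with tau=0.9 the fitted scalar sits above the mean of {1, 2, 3}.
q = torch.tensor([1.0, 2.0, 3.0])
v = torch.nn.Parameter(torch.tensor(0.0))
opt = torch.optim.SGD([v], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    expectile_loss(q, v).backward()
    opt.step()
print(v.item())  # approx 2.73, an upper expectile rather than the mean 2.0
```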
[521] Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization
Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, Liqiang Nie
Main category: cs.AI
TL;DR: SymMPO is a symmetric multimodal preference optimization method that enhances visual understanding in MLLMs through direct preference supervision and preference margin consistency, achieving superior hallucination mitigation.
Details
Motivation: Existing methods for reducing hallucination in MLLMs suffer from non-rigorous optimization objectives and indirect preference supervision, limiting their effectiveness.
Method: Proposes SymMPO with symmetric preference learning using direct response pair supervision, maintaining theoretical alignment with DPO and introducing a preference margin consistency loss to regulate preference gaps.
Result: Comprehensive evaluation across five benchmarks demonstrates superior performance in hallucination mitigation compared to existing methods.
Conclusion: SymMPO effectively addresses limitations of previous approaches through rigorous optimization and direct preference supervision, providing a robust solution for reducing hallucinations in multimodal LLMs.
Abstract: Direct Preference Optimization (DPO) has emerged as an effective approach for mitigating hallucination in Multimodal Large Language Models (MLLMs). Although existing methods have achieved significant progress by utilizing vision-oriented contrastive objectives for enhancing MLLMs’ attention to visual inputs and hence reducing hallucination, they suffer from non-rigorous optimization objectives and indirect preference supervision. To address these limitations, we propose a Symmetric Multimodal Preference Optimization (SymMPO), which conducts symmetric preference learning with direct preference supervision (i.e., response pairs) for visual understanding enhancement, while maintaining rigorous theoretical alignment with standard DPO. In addition to conventional ordinal preference learning, SymMPO introduces a preference margin consistency loss to quantitatively regulate the preference gap between symmetric preference pairs. Comprehensive evaluation across five benchmarks demonstrates SymMPO’s superior performance, validating its effectiveness in hallucination mitigation of MLLMs.
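For orientation, here is a hedged sketch of how a DPO-style objective with a margin-consistency regularizer over symmetric pairs could look. The notation is assumed for illustration and is not taken from the paper:
```latex
% Standard DPO term over a preference pair (y_w, y_l) with implicit reward
%   r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}:
\mathcal{L}_{\mathrm{DPO}}
  = -\log \sigma\big(\beta\,[\,r_\theta(x, y_w) - r_\theta(x, y_l)\,]\big)
% A margin-consistency term could penalize mismatched preference gaps across
% a symmetric pair of contexts (x, x') (illustrative form, not the paper's):
\mathcal{L}_{\mathrm{margin}}
  = \big([\,r_\theta(x, y_w) - r_\theta(x, y_l)\,]
       - [\,r_\theta(x', y_w') - r_\theta(x', y_l')\,]\big)^2
```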
[522] AgentOrchestra: Orchestrating Hierarchical Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol
Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An
Main category: cs.AI
TL;DR: The paper proposes the Tool-Environment-Agent (TEA) Protocol to address limitations in current LLM-based agent systems, and introduces AgentOrchestra, a hierarchical multi-agent framework that achieves state-of-the-art performance on benchmarks.
Details
Motivation: Current LLM-based agent protocols (A2A and MCP) suffer from insufficient context management, limited adaptability to diverse environments, and absence of dynamic agent architectures.
Method: Proposes TEA Protocol that treats environments and agents as first-class resources, and introduces AgentOrchestra - a hierarchical multi-agent framework with central planning agent that decomposes objectives and coordinates specialized agents with dynamic tool management.
Result: Achieves state-of-the-art performance of 83.39% on GAIA benchmark and ranks among top general-purpose LLM-based agents across three widely used benchmarks.
Conclusion: The TEA Protocol and hierarchical organization are effective for building general-purpose multi-agent systems, demonstrating superior performance over existing baselines.
Abstract: Recent advances in LLM-based agent systems have demonstrated remarkable capabilities in solving complex tasks. Nevertheless, current protocols (e.g., A2A and MCP) suffer from insufficient capabilities in context management, limited adaptability to diverse environments, and the absence of dynamic agent architectures. To address these limitations, we propose the Tool-Environment-Agent (TEA) Protocol, which establishes a principled basis for integrating environments, agents, and tools into a unified system. The TEA protocol treats environments and agents as first-class resources, enabling comprehensive context management and adaptive environment integration. Based on this protocol, we introduce AgentOrchestra, a hierarchical multi-agent framework with a central planning agent that decomposes complex objectives and coordinates specialized agents. Each sub-agent is dedicated to specific functions, providing capabilities for data analysis, file operations, web navigation, and interactive reasoning. Notably, AgentOrchestra introduces a tool manager agent that supports intelligent evolution through dynamic tool creation, retrieval, and reuse mechanisms. Experiments on three widely used benchmarks show that AgentOrchestra consistently outperforms existing baselines, achieving state-of-the-art performance of 83.39% on GAIA and ranking among the top general-purpose LLM-based agents. These results highlight the effectiveness of the TEA Protocol and hierarchical organization in building general-purpose multi-agent systems.
[523] LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning
Miho Koda, Yu Zheng, Ruixian Ma, Mingyang Sun, Devesh Pansare, Fabio Duarte, Paolo Santi
Main category: cs.AI
TL;DR: LocationReasoner is a new benchmark that evaluates LLMs’ reasoning abilities on real-world site selection tasks, revealing that even state-of-the-art reasoning models struggle with complex spatial, environmental, and logistic constraints.
Details
Motivation: Current LLM reasoning capabilities are primarily tested on mathematical and coding tasks, but it's unclear if these skills generalize to complex real-world scenarios like site selection that require holistic reasoning over multiple constraints.
Method: The authors created a benchmark with carefully crafted queries of varying difficulty levels, supported by a sandbox environment with constraint-based location search tools and automated verification for scalability.
Result: Evaluations on real-world data from Boston, New York, and Tampa show that reasoning models offer limited improvement over non-reasoning predecessors, with OpenAI o4 failing on 30% of tasks. Agentic strategies like ReAct and Reflexion often suffer from over-reasoning.
Conclusion: LLMs have significant limitations in holistic and non-linear reasoning for real-world decision-making. The benchmark is released to foster development of LLMs capable of robust, grounded reasoning in complex real-world scenarios.
Abstract: Recent advances in large language models (LLMs), particularly those enhanced through reinforced post-training, have demonstrated impressive reasoning capabilities, as exemplified by models such as OpenAI o1 and DeepSeek-R1. However, these capabilities are predominantly benchmarked on domains like mathematical problem solving and code generation, leaving open the question of whether such reasoning skills generalize to complex real-world scenarios. In this paper, we introduce LocationReasoner, a benchmark designed to evaluate LLMs’ reasoning abilities in the context of real-world site selection, where models must identify feasible locations by reasoning over diverse and complicated spatial, environmental, and logistic constraints. The benchmark covers carefully crafted queries of varying difficulty levels and is supported by a sandbox environment with in-house tools for constraint-based location search. Automated verification further guarantees the scalability of the benchmark, enabling the addition of an arbitrary number of queries. Extensive evaluations on real-world site selection data from Boston, New York, and Tampa reveal that state-of-the-art reasoning models offer limited improvement over their non-reasoning predecessors in real-world contexts, with even the latest OpenAI o4 model failing on 30% of site selection tasks. Moreover, agentic strategies such as ReAct and Reflexion often suffer from over-reasoning, leading to worse outcomes than direct prompting. With key limitations of LLMs in holistic and non-linear reasoning highlighted, we release LocationReasoner to foster the development of LLMs and agents capable of robust, grounded reasoning in real-world decision-making tasks. Codes and data for our benchmark are available at https://github.com/miho-koda/LocationReasoner.
[524] From Roots to Rewards: Dynamic Tree Reasoning with Reinforcement Learning
Ahmed Bahloul, Simon Malberg
Main category: cs.AI
TL;DR: A dynamic reinforcement learning framework that adaptively constructs reasoning trees in real-time, improving on static tree-of-thought methods by enabling dynamic adaptation and computational efficiency.
Details
Motivation: To address limitations in static tree-of-thought reasoning methods like ProbTree, which have fixed reasoning structures and computational inefficiency due to exhaustive evaluation of all strategies.
Method: Uses reinforcement learning to incrementally construct reasoning trees based on real-time confidence estimates, learning optimal policies for decomposition, retrieval, and aggregation actions.
Result: Maintains probabilistic rigor while improving solution quality and computational efficiency through selective expansion and focused resource allocation.
Conclusion: Establishes a new paradigm for tree-structured reasoning that balances probabilistic reliability with the flexibility needed for real-world question answering systems.
Abstract: Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems. Code available at: https://github.com/ahmedehabb/From-Roots-to-Rewards-Dynamic-Tree-Reasoning-with-RL
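A minimal sketch of the per-node action-selection step described above: a small stochastic policy chooses among decomposition, retrieval, and aggregation from a real-time confidence feature. The linear parameterization (W, b) is a hypothetical stand-in for the learned RL policy.
```python
import numpy as np

ACTIONS = ["decompose", "retrieve", "aggregate"]

def select_action(confidence, W, b, epsilon=0.1, rng=None):
    """Softmax policy over node-level actions, conditioned on a confidence
    estimate. W (3x2) and b (3,) are hypothetical learned parameters."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:                     # occasional exploration
        return rng.choice(ACTIONS)
    logits = W @ np.array([confidence, 1.0]) + b   # linear policy head
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(ACTIONS, p=probs)            # sample an action

# Low confidence in the current node makes decomposition/retrieval likelier.
W = np.array([[-2.0, 1.0], [-1.0, 0.5], [3.0, -1.0]])
print(select_action(0.2, W, b=np.zeros(3)))
```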
[525] Dynamic Context Adaptation for Consistent Role-Playing Agents with Retrieval-Augmented Generations
Jeiyoon Park, Yongshin Han, Minseop Kim, Kisu Yang
Main category: cs.AI
TL;DR: Amadeus is a training-free framework that enhances persona consistency in role-playing agents using adaptive text splitting, guided selection, and attribute extraction to handle out-of-knowledge questions without hallucinations.
Details
Motivation: Current RAG-based role-playing agents struggle with hallucination when responding to questions beyond a character's knowledge, and collecting character-specific utterances with continual model updates is resource-intensive.
Method: Amadeus framework with three components: Adaptive Context-aware Text Splitter (ACTS) for optimal persona chunking, Guided Selection (GS) for retrieval, and Attribute Extractor (AE) to identify character attributes for maintaining persona consistency.
Result: The method effectively models both character knowledge and various attributes like personality, maintaining robust persona consistency even for out-of-knowledge questions.
Conclusion: Amadeus provides an effective training-free solution for enhancing persona consistency in role-playing agents, addressing the limitations of traditional RAG approaches through its three-component architecture.
Abstract: Recent advances in large language models (LLMs) have catalyzed research on role-playing agents (RPAs). However, the process of collecting character-specific utterances and continually updating model parameters to track rapidly changing persona attributes is resource-intensive. Although retrieval-augmented generation (RAG) can alleviate this problem, if a persona does not contain knowledge relevant to a given query, RAG-based RPAs are prone to hallucination, making it challenging to generate accurate responses. In this paper, we propose Amadeus, a training-free framework that can significantly enhance persona consistency even when responding to questions that lie beyond a character’s knowledge. Amadeus is composed of Adaptive Context-aware Text Splitter (ACTS), Guided Selection (GS), and Attribute Extractor (AE). To facilitate effective RAG-based role-playing, ACTS partitions each character’s persona into optimally sized, overlapping chunks and augments this representation with hierarchical contextual information. AE identifies a character’s general attributes from the chunks retrieved by GS and uses these attributes as a final context to maintain robust persona consistency even when answering out-of-knowledge questions. To underpin the development and rigorous evaluation of RAG-based RPAs, we manually construct CharacterRAG, a role-playing dataset that consists of persona documents for 15 distinct fictional characters totaling 976K written characters, and 450 question-answer pairs. We find that our proposed method effectively models not only the knowledge possessed by characters, but also various attributes such as personality.
[526] InqEduAgent: Adaptive AI Learning Partners with Gaussian Process Augmentation
Wen-Xi Yang, Tian-Fang Zhao, Guan Liu, Liang Yang, Zi-Tao Liu, Wei-Neng Chen
Main category: cs.AI
TL;DR: InqEduAgent is an LLM-empowered agent model that simulates and selects optimal learning partners for inquiry-oriented education using generative agents and adaptive matching algorithms.
Details
Motivation: Current study partner selection methods rely on experience-based assignments or rule-based machine assistants, which face difficulties in knowledge expansion and lack flexibility for inquiry-oriented learning.
Method: Designs generative agents to capture cognitive and evaluative features of learners, and formulates an adaptive matching algorithm with Gaussian process augmentation to identify patterns within prior knowledge and provide optimal learning-partner matches.
Result: Experimental results show optimal performance in most knowledge-learning scenarios and LLM environments with different capability levels.
Conclusion: The study promotes intelligent allocation of human-based learning partners and formulation of AI-based learning partners, with code and data publicly available.
Abstract: Collaborative partnership matters in inquiry-oriented education. However, most study partners are selected either through experience-based assignment with little scientific planning or through rule-based machine assistants, which encounter difficulties in knowledge expansion and offer inadequate flexibility. This paper proposes an LLM-empowered agent model for simulating and selecting learning partners tailored to inquiry-oriented learning, named InqEduAgent. Generative agents are designed to capture cognitive and evaluative features of learners in real-world scenarios. Then, an adaptive matching algorithm with Gaussian process augmentation is formulated to identify patterns within prior knowledge. Optimal learning-partner matches are provided for learners facing different exercises. The experimental results show the optimal performance of InqEduAgent in most knowledge-learning scenarios and in LLM environments with different capability levels. This study promotes the intelligent allocation of human-based learning partners and the formulation of AI-based learning partners. The code, data, and appendix are publicly available at https://github.com/InqEduAgent/InqEduAgent.
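A minimal sketch of the Gaussian-process augmentation using scikit-learn; the learner/partner features and outcome values below are hypothetical placeholders, not data from the paper.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical features: [learner_skill, partner_skill, topic_overlap];
# y = observed collaboration outcome. The GP interpolates prior knowledge
# and reports uncertainty for unseen learner-partner pairings.
X = np.array([[0.2, 0.8, 0.5], [0.6, 0.6, 0.9], [0.9, 0.3, 0.4]])
y = np.array([0.55, 0.85, 0.40])

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3),
    normalize_y=True,
).fit(X, y)

candidates = np.array([[0.5, 0.7, 0.8], [0.5, 0.2, 0.1]])
mean, std = gp.predict(candidates, return_std=True)
print("best match:", int(np.argmax(mean)), "uncertainty:", std)
```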
[527] MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents
Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, Lingpeng Kong
Main category: cs.AI
TL;DR: MMSearch-Plus is a 311-task multimodal browsing benchmark that requires genuine multimodal reasoning by enforcing extraction and propagation of fine-grained visual cues through iterative image-text retrieval with retrieval noise.
Details
Motivation: Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification.
Method: Introduces a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module that enables provenance-aware zoom-and-retrieve, allowing agents to place marks, crop subregions, and launch targeted image/text searches.
Result: The strongest system achieves 36.0% end-to-end accuracy, and integrating SoM produces consistent gains up to +3.9 points. Failure analysis shows recurring errors in locating relevant webpages and distinguishing visually similar events.
Conclusion: The results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic multimodal large language models.
Abstract: Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification. We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine-grained visual cues through iterative image-text retrieval and cross-validation under retrieval noise. Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out-of-image facts such as events, dates, and venues. Beyond the dataset, we provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning. We evaluated closed- and open-source MLLMs in this framework. The strongest system achieves an end-to-end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements up to +3.9 points. From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic MLLMs.
[528] EigenBench: A Comparative Behavioral Measure of Value Alignment
Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine
Main category: cs.AI
TL;DR: EigenBench is a black-box method for benchmarking language models’ value alignment using peer judgments aggregated with EigenTrust, requiring no ground truth labels.
Details
Motivation: Addressing the lack of quantitative metrics for AI value alignment by creating a framework that can evaluate subjective values without ground truth labels.
Method: Uses an ensemble of models to judge each other’s outputs across scenarios, then aggregates these judgments using EigenTrust to produce alignment scores that reflect weighted consensus.
Result: EigenBench’s judgments closely align with human evaluators and can recover model rankings on the GPQA benchmark without access to objective labels.
Conclusion: EigenBench provides a viable framework for evaluating subjective values where no ground truths exist, enabling comparative benchmarking of language models’ value alignment.
Abstract: Aligning AI with human values is a pressing unsolved problem. To address the lack of quantitative metrics for value alignment, we propose EigenBench: a black-box method for comparatively benchmarking language models’ values. Given an ensemble of models, a constitution describing a value system, and a dataset of scenarios, our method returns a vector of scores quantifying each model’s alignment to the given constitution. To produce these scores, each model judges the outputs of other models across many scenarios, and these judgments are aggregated with EigenTrust (Kamvar et al., 2003), yielding scores that reflect a weighted consensus judgment of the whole ensemble. EigenBench uses no ground truth labels, as it is designed to quantify subjective traits for which reasonable judges may disagree on the correct label. Hence, to validate our method, we collect human judgments on the same ensemble of models and show that EigenBench’s judgments align closely with those of human evaluators. We further demonstrate that EigenBench can recover model rankings on the GPQA benchmark without access to objective labels, supporting its viability as a framework for evaluating subjective values for which no ground truths exist.
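EigenTrust-style aggregation amounts to finding the principal eigenvector of a row-normalized judgment matrix. A minimal numpy sketch (the judgment values are illustrative):
```python
import numpy as np

def eigentrust_scores(judgments, n_iter=100, tol=1e-10):
    """judgments[i, j] holds model i's non-negative rating of model j's
    outputs. Rows are normalized to sum to 1, then power iteration finds
    the stationary consensus vector used as the alignment score."""
    C = judgments / judgments.sum(axis=1, keepdims=True)  # row-stochastic
    t = np.full(C.shape[0], 1.0 / C.shape[0])             # uniform start
    for _ in range(n_iter):
        t_next = C.T @ t                                  # weighted consensus
        if np.abs(t_next - t).max() < tol:
            break
        t = t_next
    return t / t.sum()

# Three models rating one another (zero self-ratings by convention).
J = np.array([[0.0, 0.9, 0.4],
              [0.8, 0.0, 0.5],
              [0.7, 0.6, 0.0]])
print(eigentrust_scores(J))
```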
[529] SEDM: Scalable Self-Evolving Distributed Memory for Agents
Haoran Xu, Jiacong Hu, Ke Zhang, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, Bill Shi
Main category: cs.AI
TL;DR: SEDM is a self-evolving distributed memory framework that transforms memory from passive storage to an active, self-optimizing component for multi-agent systems, improving reasoning accuracy while reducing token overhead.
Details
Motivation: Existing memory management methods for multi-agent systems suffer from noise accumulation, uncontrolled memory expansion, and limited cross-domain generalization, necessitating a more adaptive and verifiable approach.
Method: SEDM integrates verifiable write admission using reproducible replay, a self-scheduling memory controller for dynamic ranking and consolidation, and cross-domain knowledge diffusion for abstracting reusable insights across heterogeneous tasks.
Result: Evaluations show SEDM improves reasoning accuracy while reducing token overhead compared to strong baselines, and enables knowledge from fact verification to enhance multi-hop reasoning.
Conclusion: SEDM represents a scalable and sustainable memory mechanism for open-ended multi-agent collaboration, with code to be released later in the project.
Abstract: Long-term multi-agent systems inevitably generate vast amounts of trajectories and historical interactions, which makes efficient memory management essential for both performance and scalability. Existing methods typically depend on vector retrieval and hierarchical storage, yet they are prone to noise accumulation, uncontrolled memory expansion, and limited generalization across domains. To address these challenges, we present SEDM, Self-Evolving Distributed Memory, a verifiable and adaptive framework that transforms memory from a passive repository into an active, self-optimizing component. SEDM integrates verifiable write admission based on reproducible replay, a self-scheduling memory controller that dynamically ranks and consolidates entries according to empirical utility, and cross-domain knowledge diffusion that abstracts reusable insights to support transfer across heterogeneous tasks. Evaluations on benchmark datasets demonstrate that SEDM improves reasoning accuracy while reducing token overhead compared with strong memory baselines, and further enables knowledge distilled from fact verification to enhance multi-hop reasoning. The results highlight SEDM as a scalable and sustainable memory mechanism for open-ended multi-agent collaboration. The code will be released at a later stage of this project.
[530] The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo
Main category: cs.AI
TL;DR: FSRL is a framework that trains lightweight adapters to steer model behavior by modulating interpretable sparse features, providing an interpretable alternative to opaque parameter changes in alignment methods.
Details
Motivation: Current alignment methods induce opaque parameter changes that make it difficult to audit what models truly learn, creating a need for more interpretable approaches.
Method: Feature Steering with Reinforcement Learning (FSRL) trains lightweight adapters to modulate interpretable sparse features, theoretically approximating behavioral shifts of post-training processes.
Result: FSRL achieves substantial reduction in preference loss but disproportionately relies on stylistic features over alignment concepts like honesty, exploiting style as a proxy for quality.
Conclusion: FSRL provides an interpretable control interface and practical way to diagnose how preference optimization pressures manifest at the feature level, despite exploiting stylistic heuristics.
Abstract: Prevailing alignment methods induce opaque parameter changes, making it difficult to audit what the model truly learns. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Then, we apply this framework to the task of preference optimization and perform a causal analysis of the learned policy. We find that the model relies on stylistic presentation as a proxy for quality, disproportionately steering features related to style and formatting over those tied to alignment concepts like honesty. Despite exploiting this heuristic, FSRL proves to be an effective alignment method, achieving a substantial reduction in preference loss. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
[531] The STAR-XAI Protocol: A Framework for Inducing and Verifying Agency, Reasoning, and Reliability in AI Agents
Antoni Guasch, Maria Isabel Valdez
Main category: cs.AI
TL;DR: The STAR-XAI Protocol transforms opaque Large Reasoning Models into transparent AI agents through structured Socratic dialogues, symbolic rulebooks, and integrity protocols, achieving 100% reliable state tracking and zero hallucinations.
Details
Motivation: Address the 'black box' limitations of Large Reasoning Models, including reliability issues, lack of transparency, state hallucinations, and the 'illusion of thinking' debate in agentic systems.
Method: Uses structured Socratic dialogue governed by an explicit symbolic rulebook (Consciousness Transfer Package - CTP) and integrity protocols including state-locking Checksum to prevent internal state corruption. Tested through case study in complex strategic game ‘Caps i Caps’.
Result: Transforms opaque LRM into disciplined strategist with emergent complex tactics like long-term planning. Achieves ante-hoc transparency, Second-Order Agency (self-correction), 100% reliable state tracking, and zero hallucinations by design.
Conclusion: STAR-XAI Protocol provides a practical pathway to build AI agents that are not just high-performing but intrinsically auditable, trustworthy, and reliable.
Abstract: The “black box” nature of Large Reasoning Models (LRMs) presents critical limitations in reliability and transparency, fueling the debate around the “illusion of thinking” and the challenge of state hallucinations in agentic systems. In response, we introduce The STAR-XAI Protocol (Socratic, Transparent, Agentic, Reasoning - for eXplainable Artificial Intelligence), a novel operational methodology for training and operating verifiably reliable AI agents. Our method reframes the human-AI interaction as a structured Socratic dialogue governed by an explicit, evolving symbolic rulebook (the Consciousness Transfer Package - CTP) and a suite of integrity protocols, including a state-locking Checksum that eradicates internal state corruption. Through an exhaustive case study in the complex strategic game “Caps i Caps,” we demonstrate that this “Clear Box” framework transforms an opaque LRM into a disciplined strategist. The agent not only exhibits the emergence of complex tactics, such as long-term planning, but also achieves ante-hoc transparency by justifying its intentions before acting. Crucially, it demonstrates Second-Order Agency by identifying and correcting flaws in its own supervisor-approved plans, leading to empirically-proven, 100% reliable state tracking and achieving “zero hallucinations by design.” The STAR-XAI Protocol thus offers a practical pathway toward building AI agents that are not just high-performing but intrinsically auditable, trustworthy, and reliable.
[532] Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
Sai Teja Reddy Adapala
Main category: cs.AI
TL;DR: LLMs struggle with cognitive load from irrelevant context and task-switching, which degrades reasoning performance. The ICE benchmark tests this systematically, showing smaller models fail completely while larger ones degrade under load.
Details
Motivation: There's a gap between LLM performance on static benchmarks and their fragility in dynamic, information-rich environments. Computational limits under cognitive load are poorly understood.
Method: Introduced formal theory of computational cognitive load with Context Saturation and Attentional Residue mechanisms. Designed ICE benchmark to systematically manipulate these factors on multi-hop reasoning tasks with comprehensive study (N=10 replications per item across 200 questions).
Result: Smaller models (Llama-3-8B, Mistral-7B) showed 0% accuracy across all conditions. Gemini-2.0-Flash achieved 85% accuracy in controls but degraded significantly under context saturation (β=-0.003 per % load, p<0.001).
Conclusion: Cognitive load is a key contributor to reasoning failures, supporting hallucination-as-guessing theories. Dynamic cognitive-aware stress testing is essential for evaluating true AI resilience and safety.
Abstract: The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation (β = -0.003 per % load, p < 0.001). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.
[533] MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLM
Wenliang Li, Rui Yan, Xu Zhang, Li Chen, Hongji Zhu, Jing Zhao, Junjun Li, Mengru Li, Wei Cao, Zihang Jiang, Wei Wei, Kun Zhang, Shaohua Kevin Zhou
Main category: cs.AI
TL;DR: The paper proposes MACD, a multi-agent framework that enables LLMs to self-learn clinical knowledge through experience accumulation, significantly improving diagnostic accuracy and enabling effective human-AI collaboration in medical diagnosis.
Details
Motivation: Current LLM approaches in medical diagnosis optimize isolated inferences but neglect the accumulation of reusable clinical experience that physicians develop through practice.
Method: A Multi-Agent Clinical Diagnosis (MACD) framework with a pipeline that summarizes, refines, and applies diagnostic insights, plus a MACD-human collaborative workflow with iterative consultations between diagnostician agents, evaluator agent, and human oversight.
Result: MACD improved primary diagnostic accuracy by up to 22.3% over clinical guidelines, achieved comparable or superior performance to physician-only diagnosis (up to 16% improvement), and the MACD-human workflow yielded 18.6% improvement over physician-only diagnosis.
Conclusion: MACD presents a scalable self-learning paradigm that bridges the gap between LLMs’ intrinsic knowledge and clinical expertise, demonstrating strong cross-model stability and transferability.
Abstract: Large language models (LLMs) have demonstrated notable potential in medical applications, yet they face substantial challenges in handling complex real-world clinical diagnoses using conventional prompting methods. Current prompt engineering and multi-agent approaches typically optimize isolated inferences, neglecting the accumulation of reusable clinical experience. To address this, this study proposes a novel Multi-Agent Clinical Diagnosis (MACD) framework, which allows LLMs to self-learn clinical knowledge via a multi-agent pipeline that summarizes, refines, and applies diagnostic insights. It mirrors how physicians develop expertise through experience, enabling more focused and accurate diagnosis on key disease-specific cues. We further extend it to a MACD-human collaborative workflow, where multiple LLM-based diagnostician agents engage in iterative consultations, supported by an evaluator agent and human oversight for cases where agreement is not reached. Evaluated on 4,390 real-world patient cases across seven diseases using diverse open-source LLMs (Llama-3.1 8B/70B, DeepSeek-R1-Distill-Llama 70B), MACD significantly improves primary diagnostic accuracy, outperforming established clinical guidelines with gains of up to 22.3%. In direct comparison with physician-only diagnosis under the same evaluation protocol, MACD achieves comparable or superior performance, with improvements up to 16%. Furthermore, the MACD-human workflow yields an 18.6% improvement over physician-only diagnosis, demonstrating the synergistic potential of human-AI collaboration. Notably, the self-learned clinical knowledge exhibits strong cross-model stability, transferability across LLMs, and capacity for model-specific personalization. This work thus presents a scalable self-learning paradigm that bridges the gap between the intrinsic knowledge of LLMs and accumulated clinical expertise.
[534] TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
Main category: cs.AI
TL;DR: TrustJudge is a probabilistic framework that addresses inconsistencies in LLM-as-a-judge evaluation systems through distribution-sensitive scoring and likelihood-aware aggregation, reducing key inconsistencies by 8-11% while maintaining higher accuracy.
Details
Motivation: Current LLM-as-a-judge evaluation frameworks suffer from critical inconsistencies including score-comparison inconsistencies and pairwise transitivity violations, which undermine reliable automated assessment.
Method: TrustJudge uses two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity.
Result: TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy across various model architectures.
Conclusion: TrustJudge provides the first systematic analysis and practical solution for evaluation framework inconsistencies in LLM-as-a-judge paradigms, enabling more trustworthy automated assessment without requiring additional training or human annotations.
Abstract: The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A > B > C > A) and equivalence contradictions (A = B = C ≠ A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge’s components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.
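Distribution-sensitive scoring replaces an argmax rating with an expectation over the judge's rating distribution. A minimal sketch, assuming per-rating log-probabilities can be read off the judge model (the numbers are illustrative):
```python
import math

def expected_score(rating_logprobs):
    """Expectation over the judge's probability mass on each discrete rating,
    rather than the single most likely rating."""
    probs = {r: math.exp(lp) for r, lp in rating_logprobs.items()}
    z = sum(probs.values())  # renormalize over the rating vocabulary
    return sum(r * p / z for r, p in probs.items())

# The judge puts most mass on 4 but non-trivial mass on 3 and 5:
logprobs = {1: -9.2, 2: -6.5, 3: -1.6, 4: -0.4, 5: -2.3}
print(expected_score(logprobs))  # ~3.9, a continuous score instead of a flat 4
```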
[535] Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Shuo Wang, Hongfei Yan, Jingang Wang, Xunliang Cai
Main category: cs.AI
TL;DR: The paper proposes a method to selectively use high-value chain-of-thought (CoT) data by identifying atomic reasoning patterns and using them to efficiently select data that enhances model reasoning capabilities.
Details
Motivation: Current approaches use CoT data indiscriminately, without understanding which data types most effectively improve reasoning capabilities. The authors aim to identify and utilize high-value reasoning patterns to enhance model performance.
Method: Defines reasoning potential as the inverse of the number of attempts needed to answer correctly, abstracts atomic reasoning patterns from CoT sequences, constructs a core reference set with valuable patterns, and uses a dual-granularity algorithm (chains of reasoning patterns and token entropy) to select high-value CoT data (CoTP).
Result: Using only 10B-token CoTP data, the 85A6B Mixture-of-Experts model improved by 9.58% on AIME 2024 and 2025, and raised the upper bound of downstream RL performance by 7.81%.
Conclusion: Selectively using high-value CoT data enriched with valuable reasoning patterns significantly enhances model reasoning capabilities and performance, even with limited data.
Abstract: Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model’s reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
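Under this definition, a question the model answers correctly once every k independent attempts has potential 1/k; since the expected number of attempts until the first success is 1/p for per-attempt success rate p, the potential reduces to p itself. A minimal estimator sketch:
```python
def reasoning_potential(attempt_outcomes):
    """Estimate reasoning potential on one question as the inverse of the
    expected number of independent attempts until the first success,
    from a list of booleans recording sampled attempts."""
    successes = sum(attempt_outcomes)
    if successes == 0:
        return 0.0                          # never solved in the budget
    p = successes / len(attempt_outcomes)   # per-attempt success rate
    return p                                # E[attempts] = 1/p, so 1/E = p

# 3 correct answers in 12 sampled attempts -> potential 0.25 (1 in 4 tries).
print(reasoning_potential([True, False, False, False] * 3))
```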
cs.SD
[536] MusicWeaver: Coherent Long-Range and Editable Music Generation from a Beat-Aligned Structural Plan
Xuanchen Wang, Heng Wang, Weidong Cai
Main category: cs.SD
TL;DR: MusicWeaver is a music generation model that uses beat-aligned structural planning to improve long-range coherence and enable professional editing capabilities, outperforming current methods in fidelity and controllability.
Details
Motivation: Current music generators fail to model long-range structure, leading to off-beat outputs, weak section transitions, and limited editing capability. There's a need for models that can preserve global form and enable professional-level music editing.
Method: MusicWeaver consists of two components: a planner that translates prompts into structural plans encoding musical form and compositional cues, and a diffusion-based generator that synthesizes music guided by these structural plans.
Result: MusicWeaver achieves state-of-the-art fidelity and controllability, producing music closer to human-composed works. The model was evaluated using new metrics: Structure Coherence Score (SCS) for long-range form and timing, and Edit Fidelity Score (EFS) for plan edit accuracy.
Conclusion: The beat-aligned structural plan approach enables MusicWeaver to generate music with better long-range coherence and professional editing capabilities, representing a significant advancement in controllable music generation.
Abstract: Current music generators capture local textures but often fail to model long-range structure, leading to off-beat outputs, weak section transitions, and limited editing capability. We present MusicWeaver, a music generation model conditioned on a beat-aligned structural plan. This plan serves as an editable intermediate between the input prompt and the generated music, preserving global form and enabling professional, localized edits. MusicWeaver consists of a planner, which translates prompts into a structural plan encoding musical form and compositional cues, and a diffusion-based generator, which synthesizes music under the plan’s guidance. To assess generation and editing quality, we introduce two metrics: the Structure Coherence Score (SCS) for evaluating long-range form and timing, and the Edit Fidelity Score (EFS) for measuring the accuracy of realizing plan edits. Experiments demonstrate that MusicWeaver achieves state-of-the-art fidelity and controllability, producing music closer to human-composed works. Music results can be found on our project page: https://musicweaver.github.io/.
[537] Golden Tonnetz
Yusuke Imai
Main category: cs.SD
TL;DR: The paper presents a novel geometric representation of musical scales and chords using golden triangles, creating a ‘golden Tonnetz’ that connects music theory with the golden ratio.
Details
Motivation: To explore deeper connections between music and the golden ratio beyond existing geometric representations like the chromatic circle and Tonnetz, seeking to represent major/minor scales and their fundamental chords through golden ratio geometry.
Method: Developed an arrangement of 7 tones on a golden triangle that represents major/minor scales and their tonic, dominant, and subdominant chords. Extended this to create ‘golden Tonnetz’ using golden triangles and gnomons to represent all major/minor scales and triads.
Result: Successfully demonstrated that major/minor scales and their fundamental chords can be represented by golden triangles. The golden Tonnetz effectively represents all major/minor scales, triads, and Neo-Riemannian transformations (relative, parallel, and leading-tone exchanges) through transformations among golden triangles and gnomons.
Conclusion: The golden ratio provides a powerful geometric framework for representing musical structures, offering a new perspective that connects music theory with mathematical beauty through golden triangles and their transformations.
Abstract: Musical concepts have been represented by geometry with tones. For example, in the chromatic circle, the twelve tones are represented by twelve points on a circle, and in Tonnetz, the relationships among harmonies are represented by a triangular lattice. Recently, we have shown that several arrangements of tones on the regular icosahedron can be associated with chromatic scales, whole-tone scales, major tones, and minor tones through the golden ratio. Here, we investigate another type of connection between music and the golden ratio. We show that there exists an arrangement of 7 tones on a golden triangle that can represent a given major/minor scale and its tonic, dominant, and subdominant chords by golden triangles. By applying this finding, we propose “golden Tonnetz” which represents all the major/minor scales and triads by the golden triangles or gnomons and also represents relative, parallel, and leading-tone exchange transformations in Neo-Riemannian theory by transformations among the golden triangles and gnomons.
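For readers without the geometry at hand, the objects the construction uses have standard definitions (textbook facts, not notation taken from the paper):

```latex
\varphi = \frac{1+\sqrt{5}}{2} \approx 1.618, \qquad
\text{golden triangle: angles } 36^{\circ},\,72^{\circ},\,72^{\circ},\ \tfrac{\text{leg}}{\text{base}} = \varphi, \qquad
\text{golden gnomon: angles } 108^{\circ},\,36^{\circ},\,36^{\circ},\ \tfrac{\text{base}}{\text{leg}} = \varphi.
```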
[538] Shortcut Flow Matching for Speech Enhancement: Step-Invariant flows via single stage training
Naisong Zhou, Saisamarth Rajesh Phaye, Milos Cernak, Tijana Stojkovic, Andy Pearce, Andrea Cavallaro, Andy Harper
Main category: cs.SD
TL;DR: SFMSE introduces a flow matching approach for speech enhancement that enables efficient single-step inference with perceptual quality comparable to diffusion models requiring 60 steps.
Details
Motivation: Diffusion models achieve state-of-the-art perceptual quality but require many iterative steps, making them impractical for real-time applications. Flow matching offers a more efficient alternative.
Method: SFMSE trains a single step-invariant model by conditioning the velocity field on target time steps during one-stage training, enabling flexible single/few/multi-step denoising without architectural changes.
Result: Single-step SFMSE achieves RTF of 0.013 on consumer GPU while matching perceptual quality of 60-step diffusion baseline. Provides empirical analysis of stochasticity in training/inference.
Conclusion: SFMSE bridges the gap between high-quality generative speech enhancement and low-latency constraints, enabling real-time performance without sacrificing perceptual quality.
Abstract: Diffusion-based generative models have achieved state-of-the-art performance for perceptual quality in speech enhancement (SE). However, their iterative nature requires numerous Neural Function Evaluations (NFEs), posing a challenge for real-time applications. In contrast, flow matching offers a more efficient alternative by learning a direct vector field, enabling high-quality synthesis in just a few steps using deterministic ordinary differential equation (ODE) solvers. We thus introduce Shortcut Flow Matching for Speech Enhancement (SFMSE), a novel approach that trains a single, step-invariant model. By conditioning the velocity field on the target time step during a one-stage training process, SFMSE can perform single, few, or multi-step denoising without any architectural changes or fine-tuning. Our results demonstrate that a single-step SFMSE inference achieves a real-time factor (RTF) of 0.013 on a consumer GPU while delivering perceptual quality comparable to a strong diffusion baseline requiring 60 NFEs. This work also provides an empirical analysis of the role of stochasticity in training and inference, bridging the gap between high-quality generative SE and low-latency constraints.
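The step-invariance idea reduces to conditioning the velocity network on the step size it will be integrated with. Below is a minimal Euler sampler under that assumption; the `v(x, t, d)` interface is a guess at the shape of such a model, not the authors' code.

```python
import torch

@torch.no_grad()
def shortcut_sample(v, x_noisy, n_steps: int):
    """Integrate dx/dt = v(x, t, d) from t=0 (degraded) to t=1 (enhanced)
    with a step-size-conditioned velocity field, as in shortcut-style flow
    matching. n_steps=1 gives single-step inference.

    v: callable(x, t, d) -> velocity, trained to be step-invariant.
    """
    x = x_noisy
    d = 1.0 / n_steps  # step size the model is told about
    for k in range(n_steps):
        t = torch.full((x.shape[0],), k * d, device=x.device)
        x = x + d * v(x, t, torch.full_like(t, d))  # Euler update
    return x
```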
[539] Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
Zijian Zhao, Dian Jin, Zijing Zhou
Main category: cs.SD
TL;DR: A novel Vision Language Model-based Image-to-Music framework that uses ABC notation for high interpretability and low computational cost, outperforming existing methods in music quality and music-image consistency.
Details
Motivation: Address the lack of interpretability in existing Image-to-Music methods and reduce computational requirements, making the technology more accessible to common users.
Method: Uses ABC notation to bridge text and music modalities, applies multi-modal Retrieval-Augmented Generation and self-refinement techniques, and leverages generated motivations and attention maps for explanations.
Result: Outperforms other methods in both human studies and machine evaluations in terms of music quality and music-image consistency.
Conclusion: The proposed VLM-based framework successfully provides interpretable and high-quality Image-to-Music generation with low computational cost, showing promising results for practical applications.
Abstract: Recently, Image-to-Music (I2M) generation has garnered significant attention, with potential applications in fields such as gaming, advertising, and multi-modal art creation. However, due to the ambiguous and subjective nature of I2M tasks, most end-to-end methods lack interpretability, leaving users puzzled about the generation results. Even methods based on emotion mapping face controversy, as emotion represents only a singular aspect of art. Additionally, most learning-based methods require substantial computational resources and large datasets for training, hindering accessibility for common users. To address these challenges, we propose the first Vision Language Model (VLM)-based I2M framework that offers high interpretability and low computational cost. Specifically, we utilize ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to allow the VLM to produce high-quality music without external training. Furthermore, we leverage the generated motivations in text and the attention maps from the VLM to provide explanations for the generated results in both text and image modalities. To validate our method, we conduct both human studies and machine evaluations, where our method outperforms others in terms of music quality and music-image consistency, indicating promising results. Our code is available at https://github.com/RS2002/Image2Music .
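The key design choice is that ABC notation is plain text, so a VLM can emit music as ordinary tokens. An illustrative fragment (not from the paper) shows how little machinery the format needs:

```python
# ABC notation header fields: X reference number, T title, M meter,
# L default note length, K key; note letters follow, with bar lines.
abc_fragment = """X:1
T:Illustrative fragment
M:4/4
L:1/8
K:C
CDEF GABc | c2 G2 E2 C2 |]
"""
print(abc_fragment)
```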
[540] Real-time implementation of vibrato transfer as an audio effect
Jeremy Hyrkas
Main category: cs.SD
TL;DR: Real-time vibrato transfer algorithm using efficient F0 estimation and polyphase IIR filters to approximate analytic signals, with added amplitude modulation transfer capability.
Details
Motivation: The original vibrato transfer algorithm had computational limitations preventing real-time implementation, requiring optimization for practical use.
Method: Uses efficient fundamental frequency estimation and time-domain polyphase IIR filters to approximate analytic signals in real-time, supplemented with amplitude modulation transfer.
Result: Successfully implemented as a real-time VST plugin capable of transferring both vibrato patterns and amplitude modulation from target signals.
Conclusion: The algorithm enables real-time vibrato control for sound design, morphing, and synthesized sound manipulation, extending beyond typical delay-based vibrato effects.
Abstract: An algorithm for deriving delay functions based on real examples of vibrato was recently introduced and can be used to perform a vibrato transfer, in which the vibrato pattern of a target signal is imparted onto an incoming sound using a delay line. The algorithm contains methods that computationally restrict a real-time implementation. Here, a real-time approximation is presented that incorporates an efficient fundamental frequency estimation algorithm and time-domain polyphase IIR filters that approximate an analytic signal. The vibrato transfer algorithm is further supplemented with a proposed method to transfer the amplitude modulation of the target sound, moving this method beyond the capabilities of typical delay-based vibrato effects. Modifications to the original algorithm for real-time use are detailed here and available as source code for an implementation as a VST plugin. This algorithm has applications as an audio effect in sound design, sound morphing, and real-time vibrato control of synthesized sounds.
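At its core, vibrato transfer drives a time-varying delay line. The sketch below substitutes a fixed sinusoid for the paper's derived delay trajectory and uses linear interpolation between samples; it is a minimal stand-in, not the plugin's algorithm.

```python
import numpy as np

def vibrato(x, sr, rate_hz=5.0, depth_ms=2.0, base_ms=5.0):
    """Apply vibrato by reading from a time-varying delay line.
    The paper derives the delay trajectory from a target signal's F0;
    here a fixed sinusoid stands in for that trajectory."""
    n = np.arange(len(x))
    delay = (base_ms + depth_ms * np.sin(2 * np.pi * rate_hz * n / sr)) * sr / 1000.0
    read_pos = np.clip(n - delay, 0, len(x) - 1)
    i0 = np.floor(read_pos).astype(int)
    i1 = np.minimum(i0 + 1, len(x) - 1)
    frac = read_pos - i0
    return (1 - frac) * x[i0] + frac * x[i1]  # linear interpolation
```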
[541] Preserving Russek’s “Summermood” Using Reality Check and a DeltaLab DL-4 Approximation
Jeremy Hyrkas, Pablo Dodero Carrillo, Teresa Díaz de Cossio Sánchez
Main category: cs.SD
TL;DR: Created Pure Data patches to emulate the discontinued DeltaLab DL-4 delay unit, enabling live performance of Antonio Russek’s ‘Summermood’ for bass flute and electronics without the original hardware.
Details
Motivation: To preserve and maintain electroacoustic compositions for live performance, specifically addressing the challenge of performing pieces that rely on discontinued hardware like the DeltaLab DL-4 delay unit.
Method: Developed Pure Data patches that approximate the DL-4’s sound and functionality, refined by comparing score settings to official recordings, integrated into a Null Piece-based performance patch, and regression tested using the Reality Check framework.
Result: Successfully created a library of patches that allows ‘Summermood’ to be performed live without the original DL-4 hardware, with continuous testing to ensure compatibility across computer environments and Pure Data updates.
Conclusion: The Pure Data emulation approach provides a sustainable solution for preserving electroacoustic compositions that depend on obsolete hardware, enabling continued live performance of such works.
Abstract: As a contribution towards ongoing efforts to maintain electroacoustic compositions for live performance, we present a collection of Pure Data patches to preserve and perform Antonio Russek’s piece “Summermood” for bass flute and live electronics. The piece, originally written for the DeltaLab DL-4 delay rack unit, contains score markings specific to the DL-4. Here, we approximate the sound and unique functionality of the DL-4 in Pure Data, then refine our implementation to better match the unit on which the piece was performed by comparing settings from the score to two official recordings of the piece. The DL-4 emulation is integrated into a patch for live performance based on the Null Piece, and regression tested using the Reality Check framework for Pure Data. Using this library of patches, Summermood can be brought back into live rotation without the use of the now discontinued DL-4. The patches will be continuously tested to ensure that the piece is playable across computer environments and as the Pure Data programming language is updated.
[542] Guiding Audio Editing with Audio Language Model
Zitong Lan, Yiduo Hao, Mingmin Zhao
Main category: cs.SD
TL;DR: SmartDJ is a stereo audio editing framework that uses audio language models to decompose high-level instructions into atomic edit operations, then executes them using latent diffusion models for superior perceptual quality and spatial realism.
Details
Motivation: Current audio editing models are limited to template-like instructions and mono-channel audio, failing to handle declarative editing where users specify desired outcomes rather than detailed operations.
Method: Combines audio language models for reasoning and instruction decomposition with latent diffusion models for stereo audio manipulation. Uses a data synthesis pipeline to create paired examples of instructions, edit operations, and audio samples.
Result: SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods.
Conclusion: The framework successfully enables declarative stereo audio editing by integrating reasoning and generative capabilities, advancing beyond template-based approaches.
Abstract: Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media. However, recent generative audio editing models depend on template-like instruction formats and are restricted to mono-channel audio. These models fail to deal with declarative audio editing, where the user declares what the desired outcome should be, while leaving the details of editing operations to the system. We introduce SmartDJ, a novel framework for stereo audio editing that combines the reasoning capability of audio language models with the generative power of latent diffusion. Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations, such as adding, removing, or spatially relocating events. These operations are then executed by a diffusion model trained to manipulate stereo audio. To support this, we design a data synthesis pipeline that produces paired examples of high-level instructions, atomic edit operations, and audios before and after each edit operation. Experiments demonstrate that SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods. Demos are available at https://zitonglan.github.io/project/smartdj/smartdj.html.
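The decomposition stage can be pictured as the audio language model emitting a typed list of atomic operations for the diffusion stage to execute. The schema below is a hypothetical illustration of such a plan; SmartDJ's actual operation format is not given in this summary.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class AtomicEdit:
    """One atomic stereo-edit operation (hypothetical schema)."""
    op: Literal["add", "remove", "relocate"]
    event: str                           # e.g. "dog bark"
    start_s: float                       # where in the mix the edit applies
    end_s: float
    azimuth_deg: Optional[float] = None  # target direction for add/relocate

# "Move the bark to the left and add rain" might decompose to:
plan = [
    AtomicEdit("relocate", "dog bark", 2.0, 4.5, azimuth_deg=-60.0),
    AtomicEdit("add", "rain", 0.0, 10.0),
]
```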
[543] Frustratingly Easy Zero-Day Audio DeepFake Detection via Retrieval Augmentation and Profile Matching
Xuechen Liu, Xin Wang, Junichi Yamagishi
Main category: cs.SD
TL;DR: A training-free framework for zero-day audio deepfake detection using knowledge representations, retrieval augmentation, and voice profile matching, achieving performance comparable to fine-tuned models without additional training.
Details
Motivation: Modern audio deepfake detectors struggle with zero-day attacks from novel synthesis methods not seen in training data, and conventional fine-tuning approaches are problematic when prompt response is required.
Method: Proposed a training-free framework based on knowledge representations, retrieval augmentation, and voice profile matching, with simple yet effective knowledge retrieval and ensemble methods.
Result: Achieved performance comparable to fine-tuned models on DeepFake-Eval-2024 without any additional model-wise training. Ablation studies validated the relevance of retrieval pool size and voice profile attributes to system efficacy.
Conclusion: The proposed training-free framework provides an effective solution for zero-day audio deepfake detection that doesn’t require model fine-tuning, offering comparable performance to fine-tuned approaches while enabling prompt response to new attacks.
Abstract: Modern audio deepfake detectors using foundation models and large training datasets have achieved promising detection performance. However, they struggle with zero-day attacks, where the audio samples are generated by novel synthesis methods that the models have not seen in their training data. Conventional approaches against such attacks require fine-tuning the detectors, which can be problematic when prompt response is required. This study introduces a training-free framework for zero-day audio deepfake detection based on knowledge representations, retrieval augmentation, and voice profile matching. Based on the framework, we propose simple yet effective knowledge retrieval and ensemble methods that achieve performance comparable to fine-tuned models on DeepFake-Eval-2024, without any additional model-wise training. We also conduct ablation studies on retrieval pool size and voice profile attributes, validating their relevance to the system efficacy.
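The retrieval-augmented part of a training-free detector can be as small as a cosine k-NN vote over a labeled reference pool. This is a simplified stand-in for the paper's retrieval and ensemble methods; the embedding model and reference pool are assumed to exist.

```python
import numpy as np

def knn_deepfake_score(query_emb, pool_embs, pool_labels, k=5):
    """Training-free detection: cosine-retrieve the k nearest reference
    clips and return the fraction labeled fake (1 = spoof, 0 = bona fide)."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    top = np.argsort(-sims)[:k]
    return float(np.mean(pool_labels[top]))
```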
[544] Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription
Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima
Main category: cs.SD
TL;DR: The paper redefines automatic drum transcription as a generative task using diffusion models, introducing Noise-to-Notes (N2N) that transforms audio-conditioned noise into drum events with velocities, achieving state-of-the-art performance.
Details
Motivation: Traditional ADT approaches use discriminative methods, but reformulating it as a generative task with diffusion modeling offers advantages like flexible speed-accuracy trade-offs and strong inpainting capabilities.
Method: Proposes N2N framework using diffusion models with Annealed Pseudo-Huber loss for joint optimization of binary onset and continuous velocity values, and incorporates music foundation model features to enhance spectrogram features.
Result: N2N establishes new state-of-the-art performance across multiple ADT benchmarks, with MFM features significantly improving robustness to out-of-domain drum audio.
Conclusion: The generative diffusion approach to ADT with N2N framework and MFM feature augmentation provides superior performance and robustness compared to traditional discriminative methods.
Abstract: Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, the generation of binary onset and continuous velocity values presents a challenge for diffusion models, and to overcome this, we introduce an Annealed Pseudo-Huber loss to facilitate effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and N2N establishes a new state-of-the-art performance across multiple ADT benchmarks.
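The pseudo-Huber loss itself is standard, and annealing its scale parameter is what lets one objective serve both the binary onsets and the continuous velocities. A sketch follows; the geometric-decay schedule is an assumption, not the paper's exact choice.

```python
import torch

def pseudo_huber(pred, target, delta):
    """Pseudo-Huber loss: quadratic for small errors, linear for large ones.
    L_delta(e) = delta^2 * (sqrt(1 + (e/delta)^2) - 1)."""
    err = pred - target
    return (delta**2 * (torch.sqrt(1 + (err / delta) ** 2) - 1)).mean()

def annealed_delta(step, total_steps, delta_start=1.0, delta_end=1e-3):
    """One plausible annealing schedule (geometric decay from delta_start
    to delta_end over training); the paper's schedule may differ."""
    r = step / max(total_steps - 1, 1)
    return delta_start * (delta_end / delta_start) ** r
```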
[545] Lightweight Front-end Enhancement for Robust ASR via Frame Resampling and Sub-Band Pruning
Siyi Zhao, Wei Wang, Yanmin Qian
Main category: cs.SD
TL;DR: The paper proposes computational optimizations for speech enhancement front-ends to reduce overhead while maintaining ASR performance in noisy environments.
Details
Motivation: Speech enhancement front-ends are widely used to mitigate noise for ASR systems but introduce significant computational overhead, which needs to be reduced without compromising performance.
Method: The approach integrates layer-wise frame resampling and progressive sub-band pruning. Frame resampling downsamples inputs within layers using residual connections, while sub-band pruning progressively excludes less informative frequency bands.
Result: Extensive experiments show the system reduces SE computational overhead by over 66% compared to standard BSRNN while maintaining strong ASR performance on synthetic and real-world noisy datasets.
Conclusion: The proposed optimizations successfully reduce computational costs of speech enhancement front-ends significantly without degrading ASR performance in noisy environments.
Abstract: Recent advancements in automatic speech recognition (ASR) have achieved notable progress, whereas robustness in noisy environments remains challenging. While speech enhancement (SE) front-ends are widely used to mitigate noise as a preprocessing step for ASR, they often introduce non-negligible computational overhead. This paper proposes optimizations to reduce SE computational costs without compromising ASR performance. Our approach integrates layer-wise frame resampling and progressive sub-band pruning. Frame resampling downsamples inputs within layers, utilizing residual connections to mitigate information loss. Simultaneously, sub-band pruning progressively excludes less informative frequency bands, further reducing computational demands. Extensive experiments on synthetic and real-world noisy datasets demonstrate that our system reduces SE computational overhead by over 66% compared to the standard BSRNN, while maintaining strong ASR performance.
[546] Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment
Yunyi Liu, Shaofan Yang, Kai Li, Xu Li
Main category: cs.SD
TL;DR: A framework for generating moving sounds from text prompts by predicting 3D trajectories and synthesizing spatial audio using text-to-audio models.
Details
Motivation: Human auditory perception involves moving sound sources in 3D space, but existing generative sound models are limited to mono signals or static spatial audio.
Method: Create synthetic dataset with moving binaural sounds, spatial trajectories, and text captions. Train text-to-trajectory model, fine-tune text-to-audio model for temporally aligned mono sound, then simulate spatial audio using predicted trajectories.
Result: Experimental evaluation shows reasonable spatial understanding by the text-to-trajectory model.
Conclusion: This approach can be easily integrated into existing text-to-audio workflows and extended to other spatial audio formats for moving sound generation.
Abstract: Human auditory perception is shaped by moving sound sources in 3D space, yet prior work in generative sound modelling has largely been restricted to mono signals or static spatial audio. In this work, we introduce a framework for generating moving sounds given text prompts in a controllable fashion. To enable training, we construct a synthetic dataset that records moving sounds in binaural format, their spatial trajectories, and text captions about the sound event and spatial motion. Using this dataset, we train a text-to-trajectory prediction model that outputs the three-dimensional trajectory of a moving sound source given text prompts. To generate spatial audio, we first fine-tune a pre-trained text-to-audio generative model to output temporally aligned mono sound with the trajectory. The spatial audio is then simulated using the predicted temporally-aligned trajectory. Experimental evaluation demonstrates reasonable spatial understanding of the text-to-trajectory model. This approach could be easily integrated into existing text-to-audio generative workflow and extended to moving sound generation in other spatial audio formats.
[547] Decoding Deception: Understanding Automatic Speech Recognition Vulnerabilities in Evasion and Poisoning Attacks
Aravindhan G, Yuvaraj Govindarajulu, Parin Shah
Main category: cs.SD
TL;DR: This paper explores cost-efficient white-box and non-transferability black-box adversarial attacks on Automatic Speech Recognition systems, including poisoning attacks that degrade model performance with minimal perturbations.
Details
Motivation: To address the vulnerability of ASR systems to adversarial examples, moving beyond previous constrained white-box attacks and transferability-based black-box attacks to explore more practical attack methods.
Method: Uses approaches inspired by Fast Gradient Sign Method and Zeroth-Order Optimization to create hybrid models that generate subtle adversarial examples with minimal perturbation (35dB SNR) in under a minute.
Result: Demonstrates successful generation of impactful adversarial examples with very little perturbation that can deceive state-of-the-art ASR systems into misinterpreting audio signals.
Conclusion: The vulnerabilities found in state-of-the-art open source ASR models have practical security implications and emphasize the urgent need for adversarial security measures in speech recognition systems.
Abstract: Recent studies have demonstrated the vulnerability of Automatic Speech Recognition systems to adversarial examples, which can deceive these systems into misinterpreting input speech commands. While previous research has primarily focused on white-box attacks with constrained optimizations, and transferability based black-box attacks against commercial Automatic Speech Recognition devices, this paper explores cost efficient white-box attack and non transferability black-box adversarial attacks on Automatic Speech Recognition systems, drawing insights from approaches such as Fast Gradient Sign Method and Zeroth-Order Optimization. Further, the novelty of the paper includes how poisoning attack can degrade the performances of state-of-the-art models leading to misinterpretation of audio signals. Through experimentation and analysis, we illustrate how hybrid models can generate subtle yet impactful adversarial examples with very little perturbation having Signal Noise Ratio of 35dB that can be generated within a minute. These vulnerabilities of state-of-the-art open source model have practical security implications, and emphasize the need for adversarial security.
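The white-box side of such attacks follows the Fast Gradient Sign Method. The sketch below perturbs an audio tensor and rescales the perturbation to a target SNR such as the 35 dB quoted above; the model and loss interfaces are generic placeholders, not the paper's code.

```python
import torch

def fgsm_audio(model, loss_fn, x, target, snr_db=35.0):
    """FGSM-style attack on audio: x_adv = x + eps * sign(grad), with the
    perturbation scaled so that 20*log10(||x|| / ||delta||) = snr_db."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), target)
    loss.backward()
    delta = x.grad.sign()
    # Scale delta to achieve the requested signal-to-noise ratio.
    scale = x.detach().norm() / (delta.norm() * 10 ** (snr_db / 20.0))
    return (x + scale * delta).detach()
```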
[548] Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
Junjie Cao, Yichen Han, Ruonan Zhang, Xiaoyang Hao, Hongxiang Li, Shuaijiang Zhao, Yue Liu, Xiao-Ping Zhang
Main category: cs.SD
TL;DR: CaT-TTS is a novel TTS framework that addresses limitations in existing LLM-based systems by introducing semantic-grounded codec, dual-Transformer architecture, and parallel inference for robust zero-shot synthesis.
Details
Motivation: Existing LLM-based TTS systems suffer from information loss in single codebook modeling, lack of explicit semantic structure in hierarchical tokens, and error accumulation in autoregressive processes.
Method: Proposes S3Codec with semantic distillation from ASR model, dual-Transformer architecture separating comprehension and generation, and Masked Audio Parallel Inference for stable decoding.
Result: The framework enables robust and semantically-grounded zero-shot speech synthesis with improved generation stability.
Conclusion: CaT-TTS effectively addresses key challenges in LLM-based TTS through semantic grounding and architectural innovations.
Abstract: Existing Large Language Model (LLM) based autoregressive (AR) text-to-speech (TTS) systems, while achieving state-of-the-art quality, still face critical challenges. The foundation of this LLM-based paradigm is the discretization of the continuous speech waveform into a sequence of discrete tokens by a neural audio codec. However, single codebook modeling is well suited to text LLMs, but suffers from significant information loss; hierarchical acoustic tokens, typically generated via Residual Vector Quantization (RVQ), often lack explicit semantic structure, placing a heavy learning burden on the model. Furthermore, the autoregressive process is inherently susceptible to error accumulation, which can degrade generation stability. To address these limitations, we propose CaT-TTS, a novel framework for robust and semantically-grounded zero-shot synthesis. First, we introduce S3Codec, a split RVQ codec that injects explicit linguistic features into its primary codebook via semantic distillation from a state-of-the-art ASR model, providing a structured representation that simplifies the learning task. Second, we propose an “Understand-then-Generate” dual-Transformer architecture that decouples comprehension from rendering. An initial “Understanding” Transformer models the cross-modal relationship between text and the audio’s semantic tokens to form a high-level utterance plan. A subsequent “Generation” Transformer then executes this plan, autoregressively synthesizing hierarchical acoustic tokens. Finally, to enhance generation stability, we introduce Masked Audio Parallel Inference (MAPI), a nearly parameter-free inference strategy that dynamically guides the decoding process to mitigate local errors.
[549] Cross-Dialect Bird Species Recognition with Dialect-Calibrated Augmentation
Jiani Ding, Qiyang Sun, Alican Akman, Björn W. Schuller
Main category: cs.SD
TL;DR: A framework using TDNNs with frequency normalization and adversarial training improves bird call recognition across dialects by up to 20% while maintaining in-region performance.
Details
Motivation: Dialect variation in bird calls hinders automatic recognition in passive acoustic monitoring systems.
Method: Time-Delay Neural Networks with frequency-sensitive normalization, gradient-reversal adversarial training, and multi-level augmentation including waveform perturbations, Mixup, and CycleGAN transfer with Dialect-Calibrated Augmentation.
Result: Achieved up to 20 percentage point improvement in cross-dialect accuracy over baseline TDNNs while preserving in-region performance.
Conclusion: Lightweight, transparent, and dialect-resilient bird-sound recognition is achievable through the proposed framework.
Abstract: Dialect variation hampers automatic recognition of bird calls collected by passive acoustic monitoring. We address the problem on DB3V, a three-region, ten-species corpus of 8-s clips, and propose a deployable framework built on Time-Delay Neural Networks (TDNNs). Frequency-sensitive normalisation (Instance Frequency Normalisation and a gated Relaxed-IFN) is paired with gradient-reversal adversarial training to learn region-invariant embeddings. A multi-level augmentation scheme combines waveform perturbations, Mixup for rare classes, and CycleGAN transfer that synthesises Region 2 (Interior Plains)-style audio, with Dialect-Calibrated Augmentation (DCA) softly down-weighting synthetic samples to limit artifacts. The complete system lifts cross-dialect accuracy by up to twenty percentage points over baseline TDNNs while preserving in-region performance. Grad-CAM and LIME analyses show that robust models concentrate on stable harmonic bands, providing ecologically meaningful explanations. The study demonstrates that lightweight, transparent, and dialect-resilient bird-sound recognition is attainable.
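Gradient-reversal adversarial training rests on one small autograd trick: identity in the forward pass, negated (scaled) gradients in the backward pass, so the encoder learns region-invariant features while a region classifier tries to undo them. The usual PyTorch formulation of the gradient reversal layer follows (standard GRL, not code from the paper).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the
    backward pass, pushing features to be region-invariant."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: region_logits = region_classifier(grad_reverse(embedding, 0.5))
```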
[550] From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation
Ke Xue, Rongfei Fan, Lixin, Dawei Zhao, Chao Zhu, Han Hu
Main category: cs.SD
TL;DR: CSFNet is a coarse-to-fine audio-visual speech separation network that uses recursive semantic enhancement through a two-stage process: coarse separation followed by fine separation with audio-visual speech recognition feedback.
Details
Motivation: Existing audio-visual speech separation methods underexploit visual potential by relying on static visual representations, failing to fully leverage the complementary semantic guidance from visual cues like lip movements and facial features.
Method: Two-stage recursive approach: 1) Coarse Separation - first-pass estimation from mixture and visual input, 2) Fine Separation - coarse audio fed into AVSR model with visual stream to generate discriminative semantic representations. Includes speaker-aware perceptual fusion and multi-range spectro-temporal separation network.
Result: Achieves state-of-the-art performance on three benchmark datasets and two noisy datasets, with substantial coarse-to-fine improvements.
Conclusion: The recursive semantic enhancement framework is necessary and effective for audio-visual speech separation, validating the importance of dynamic semantic representation learning.
Abstract: Audio-visual speech separation aims to isolate each speaker’s clean voice from mixtures by leveraging visual cues such as lip movements and facial features. While visual information provides complementary semantic guidance, existing methods often underexploit its potential by relying on static visual representations. In this paper, we propose CSFNet, a Coarse-to-Separate-Fine Network that introduces a recursive semantic enhancement paradigm for more effective separation. CSFNet operates in two stages: (1) Coarse Separation, where a first-pass estimation reconstructs a coarse audio waveform from the mixture and visual input; and (2) Fine Separation, where the coarse audio is fed back into an audio-visual speech recognition (AVSR) model together with the visual stream. This recursive process produces more discriminative semantic representations, which are then used to extract refined audio. To further exploit these semantics, we design a speaker-aware perceptual fusion block to encode speaker identity across modalities, and a multi-range spectro-temporal separation network to capture both local and global time-frequency patterns. Extensive experiments on three benchmark datasets and two noisy datasets show that CSFNet achieves state-of-the-art (SOTA) performance, with substantial coarse-to-fine improvements, validating the necessity and effectiveness of our recursive semantic enhancement framework.
[551] MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark
Hui Li, Changhao Jiang, Hongyu Wang, Ming Zhang, Jiajun Sun, Zhixiong Yang, Yifei Cao, Shihan Dou, Xiaoran Fan, Baoyu Fan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.SD
TL;DR: MDAR is a new benchmark for evaluating audio reasoning models on complex, multi-scene scenarios with 3,000 question-answer pairs across 5 reasoning categories and 3 question types.
Details
Motivation: Existing audio benchmarks focus on static/single-scene settings and don’t capture real-world scenarios with multiple speakers, unfolding events, and heterogeneous audio sources interacting.
Method: Created MDAR benchmark with 3,000 curated QA pairs linked to diverse audio clips, covering 5 complex reasoning categories and 3 question types (single-choice, multiple-choice, open-ended).
Result: Benchmarked 26 SOTA audio language models - Qwen2.5-Omni achieved 76.67% on single-choice, GPT-4o Audio reached 68.47% but outperformed on harder tasks. No model achieved 80% performance across all question types.
Conclusion: MDAR poses unique challenges that current models struggle with, highlighting its value as a benchmark for advancing audio reasoning research.
Abstract: The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research. Code and benchmark can be found at https://github.com/luckyerr/MDAR.
[552] Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization
Longshen Ou, Jingwei Zhao, Ziyu Wang, Gus Xia, Qihao Liang, Torin Hopkins, Ye Wang
Main category: cs.SD
TL;DR: A unified framework for automatic multitrack music arrangement that enables a single pre-trained model to handle diverse arrangement scenarios through segment-level reconstruction with disentangled content and style representations.
Details
Motivation: To create a general-purpose symbolic music model that can handle various arrangement tasks (reinterpretation, simplification, additive generation) without requiring task-specific models, addressing the need for flexible music-to-music transformations.
Method: Uses segment-level reconstruction objective with token-level disentangled content and style representations, combined with REMI-z structured tokenization scheme for multitrack symbolic music to enable any-to-any instrumentation transformations.
Result: Outperforms task-specific state-of-the-art models on band arrangement, piano reduction, and drum arrangement tasks in both objective metrics and perceptual evaluations.
Conclusion: The framework demonstrates strong generality and suggests broader applicability in symbolic music-to-music transformation, providing a unified solution for diverse arrangement scenarios.
Abstract: We present a unified framework for automatic multitrack music arrangement that enables a single pre-trained symbolic music model to handle diverse arrangement scenarios, including reinterpretation, simplification, and additive generation. At its core is a segment-level reconstruction objective operating on token-level disentangled content and style, allowing for flexible any-to-any instrumentation transformations at inference time. To support track-wise modeling, we introduce REMI-z, a structured tokenization scheme for multitrack symbolic music that enhances modeling efficiency and effectiveness for both arrangement tasks and unconditional generation. Our method outperforms task-specific state-of-the-art models on representative tasks in different arrangement scenarios – band arrangement, piano reduction, and drum arrangement, in both objective metrics and perceptual evaluations. Taken together, our framework demonstrates strong generality and suggests broader applicability in symbolic music-to-music transformation.
[553] VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation
Yubin Kim, Taehan Kim, Wonjune Kang, Eugene Park, Joonsik Yoon, Dongjae Lee, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Cynthia Breazeal, Hae Won Park
Main category: cs.SD
TL;DR: VocalAgent is an audio LLM for vocal health diagnosis that achieves superior accuracy in voice disorder classification compared to state-of-the-art baselines, using Qwen-Audio-Chat fine-tuned on hospital patient datasets.
Details
Motivation: Vocal health is crucial for communication but many lack access to convenient diagnosis and treatment for voice disorders despite their global prevalence.
Method: Leverage Qwen-Audio-Chat fine-tuned on three datasets collected in-situ from hospital patients, with multifaceted evaluation including safety assessment, cross-lingual performance analysis, and modality ablation studies.
Result: VocalAgent demonstrates superior accuracy on voice disorder classification compared to state-of-the-art baselines.
Conclusion: The LLM-based method offers a scalable solution for broader adoption of health diagnostics while underscoring the importance of ethical and technical validation.
Abstract: Vocal health plays a crucial role in peoples’ lives, significantly impacting their communicative abilities and interactions. However, despite the global prevalence of voice disorders, many lack access to convenient diagnosis and treatment. This paper introduces VocalAgent, an audio large language model (LLM) to address these challenges through vocal health diagnosis. We leverage Qwen-Audio-Chat fine-tuned on three datasets collected in-situ from hospital patients, and present a multifaceted evaluation framework encompassing a safety assessment to mitigate diagnostic biases, cross-lingual performance analysis, and modality ablation studies. VocalAgent demonstrates superior accuracy on voice disorder classification compared to state-of-the-art baselines. Its LLM-based method offers a scalable solution for broader adoption of health diagnostics, while underscoring the importance of ethical and technical validation.
[554] Description and Discussion on DCASE 2025 Challenge Task 2: First-shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
Tomoya Nishida, Noboru Harada, Daisuke Niizumi, Davide Albertini, Roberto Sannino, Simone Pradolini, Filippo Augusti, Keisuke Imoto, Kota Dohi, Harsh Purohit, Takashi Endo, Yohei Kawaguchi
Main category: cs.SD
TL;DR: This paper presents the DCASE 2025 Challenge Task 2 on first-shot unsupervised anomalous sound detection for machine condition monitoring, analyzing 119 submissions from 35 teams.
Details
Motivation: To enable rapid deployment of ASD systems for new machine types without requiring machine-specific hyperparameter tuning, building on previous DCASE challenges.
Method: First-shot problem within domain generalization framework using sounds from previously unseen machine types as evaluation dataset.
Result: Analysis of 119 submissions showed various competitive approaches including fine-tuning pre-trained models, using frozen pre-trained models, and training small models from scratch when combined with appropriate techniques.
Conclusion: Multiple approaches can be effective for first-shot ASD when properly combined with cost functions, anomaly score normalization, and use of clean machine and noise sounds.
Abstract: This paper introduces the task description for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 2, titled “First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring”. Building on the DCASE 2024 Challenge Task 2, this task is structured as a first-shot problem within a domain generalization framework. The primary objective of the first-shot approach is to facilitate the rapid deployment of ASD systems for new machine types without requiring machine-specific hyperparameter tunings. For DCASE 2025 Challenge Task 2, sounds from previously unseen machine types have been collected and provided as the evaluation dataset. We received 119 submissions from 35 teams, and an analysis of these submissions has been made in this paper. Analysis showed that various approaches can all be competitive, such as fine-tuning pre-trained models, using frozen pre-trained models, and training small models from scratch, when combined with appropriate cost functions, anomaly score normalization, and use of clean machine and noise sounds.
[555] Xi+: Uncertainty Supervision for Robust Speaker Embedding
Junjie Li, Kong Aik Lee, Duc-Tuan Truong, Tianchi Liu, Man-Wai Mak
Main category: cs.SD
TL;DR: The paper proposes xi+, an improved version of xi-vector speaker recognition that adds temporal attention and a new Stochastic Variance Loss to better estimate frame-level uncertainty.
Details
Motivation: Current xi-vector models estimate frame uncertainty implicitly through classification loss alone, ignoring temporal relationships between frames, leading to suboptimal performance.
Method: xi+ incorporates a temporal attention module for context-aware frame uncertainty estimation and introduces Stochastic Variance Loss to explicitly supervise uncertainty learning.
Result: xi+ achieves ~10% improvement on VoxCeleb1-O and ~11% improvement on NIST SRE 2024 evaluation sets compared to baseline xi-vector.
Conclusion: The proposed xi+ architecture with temporal attention and explicit uncertainty supervision significantly enhances speaker recognition performance by better modeling frame-level uncertainty.
Abstract: There are various factors that can influence the performance of speaker recognition systems, such as emotion, language and other speaker-related or context-related variations. Since individual speech frames do not contribute equally to the utterance-level representation, it is essential to estimate the importance or reliability of each frame. The xi-vector model addresses this by assigning different weights to frames based on uncertainty estimation. However, its uncertainty estimation model is implicitly trained through classification loss alone and does not consider the temporal relationships between frames, which may lead to suboptimal supervision. In this paper, we propose an improved architecture, xi+. Compared to xi-vector, xi+ incorporates a temporal attention module to capture frame-level uncertainty in a context-aware manner. In addition, we introduce a novel loss function, Stochastic Variance Loss, which explicitly supervises the learning of uncertainty. Results demonstrate consistent performance improvements of about 10% on the VoxCeleb1-O set and 11% on the NIST SRE 2024 evaluation set.
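The uncertainty idea that xi-vector and xi+ build on can be shown in simplified form: frames with higher predicted precision (lower uncertainty) contribute more to the utterance embedding. This sketch omits what xi+ actually adds, namely the temporal attention module that produces the precision and the Stochastic Variance Loss that supervises it.

```python
import torch

def uncertainty_weighted_pooling(frames, log_precision):
    """Simplified precision-weighted pooling: frames with lower predicted
    uncertainty (higher precision) dominate the utterance embedding.
    frames: (B, T, D), log_precision: (B, T, D)."""
    w = torch.softmax(log_precision, dim=1)  # normalize over time
    return (w * frames).sum(dim=1)           # (B, D) utterance embedding
```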
[556] Scaling to Multimodal and Multichannel Heart Sound Classification: Fine-Tuning Wav2Vec 2.0 with Synthetic and Augmented Biosignals
Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong
Main category: cs.SD
TL;DR: This paper presents a deep learning approach using denoising diffusion models (WaveGrad and DiffWave) to augment heart sound datasets for training a Wav2Vec 2.0-based classifier, achieving state-of-the-art performance in cardiovascular disease detection across single-channel PCG, synchronized PCG-ECG, and multichannel PCG datasets.
Details
Motivation: Cardiovascular diseases are the leading cause of death worldwide, creating demand for accurate and inexpensive pre-screening methods. Current deep learning approaches are limited by the scarcity of synchronized and multichannel heart sound datasets.
Method: Combines traditional signal processing with denoising diffusion models (WaveGrad and DiffWave) to create augmented datasets, then fine-tunes a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets.
Result: Achieved state-of-the-art performance: 92.48% accuracy on CinC 2016 single-channel PCG, 93.14% accuracy on synchronized PCG-ECG, and 77.13% accuracy on multichannel PCG data. High metrics across sensitivity, specificity, and MCC were obtained.
Conclusion: Transformer-based models are highly effective for CVD detection when supported by augmented datasets, demonstrating significant potential for advancing multimodal and multichannel heart sound classification in clinical applications.
Abstract: Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to classify abnormal heart sounds indicative of CVDs using synchronised phonocardiogram (PCG) and electrocardiogram (ECG) signals, as well as multichannel PCG (mPCG). However, state-of-the-art architectures remain underutilised due to the limited availability of synchronised and multichannel datasets. Augmented datasets and pre-trained models provide a pathway to overcome these limitations, enabling transformer-based architectures to be trained effectively. This work combines traditional signal processing with denoising diffusion models, WaveGrad and DiffWave, to create an augmented dataset to fine-tune a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets. The approach achieves state-of-the-art performance. On the Computing in Cardiology (CinC) 2016 dataset of single channel PCG, accuracy, unweighted average recall (UAR), sensitivity, specificity and Matthews correlation coefficient (MCC) reach 92.48%, 93.05%, 93.63%, 92.48%, 94.93% and 0.8283, respectively. Using the synchronised PCG and ECG signals of the training-a dataset from CinC, 93.14%, 92.21%, 94.35%, 90.10%, 95.12% and 0.8380 are achieved for accuracy, UAR, sensitivity, specificity and MCC, respectively. Using a wearable vest dataset consisting of mPCG data, the model achieves 77.13% accuracy, 74.25% UAR, 86.47% sensitivity, 62.04% specificity, and 0.5082 MCC. These results demonstrate the effectiveness of transformer-based models for CVD detection when supported by augmented datasets, highlighting their potential to advance multimodal and multichannel heart sound classification.
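For reference, the Matthews correlation coefficient quoted in these results has the standard definition:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```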
[557] FakeSound2: A Benchmark for Explainable and Generalizable Deepfake Sound Detection
Zeyu Xie, Yaoyun Zhang, Xuenan Xu, Yongkang Yin, Chenxing Li, Mengyue Wu, Yuexian Zou
Main category: cs.SD
TL;DR: FakeSound2 is a benchmark that evaluates deepfake sound detection beyond binary classification, focusing on localization, traceability, and generalization across 6 manipulation types and 12 sources.
Details
Motivation: Address limitations of existing deepfake detection methods that focus only on binary classification and lack explainability about how manipulations occur, where sources originated, and generalization to unseen sources.
Method: Developed FakeSound2 benchmark that evaluates models across three dimensions: localization (identifying manipulated regions), traceability (tracking source origins), and generalization (performance on unseen sources), covering 6 manipulation types and 12 diverse sources.
Result: Current systems achieve high classification accuracy but struggle to recognize forged pattern distributions and provide reliable explanations, revealing significant gaps in explainability and reliability.
Conclusion: FakeSound2 establishes a comprehensive benchmark that highlights key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.
Abstract: The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largely focus on binary classification and fall short in explaining how manipulations occur, tracing where the sources originated, or generalizing to unseen sources-thereby limiting the explainability and reliability of detection. To address these limitations, we present FakeSound2, a benchmark designed to advance deepfake sound detection beyond binary accuracy. FakeSound2 evaluates models across three dimensions: localization, traceability, and generalization, covering 6 manipulation types and 12 diverse sources. Experimental results show that although current systems achieve high classification accuracy, they struggle to recognize forged pattern distributions and provide reliable explanations. By highlighting these gaps, FakeSound2 establishes a comprehensive benchmark that reveals key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.
[558] Audio Super-Resolution with Latent Bridge Models
Chang Li, Zehua Chen, Liyuan Wang, Jun Zhu
Main category: cs.SD
TL;DR: The paper presents a new audio super-resolution system using latent bridge models that compresses audio into continuous latent space for latent-to-latent generation, achieving state-of-the-art quality for upsampling to 48kHz and setting the first record for 192kHz audio SR.
Details
Motivation: Previous audio super-resolution methods suffer from sub-optimal upsampling quality due to uninformative generation priors. The authors aim to develop a system that fully exploits instructive prior information from low-resolution waveforms for high-quality audio upsampling.
Method: The method uses latent bridge models (LBMs) that compress audio waveforms into continuous latent space and perform latent-to-latent generation. It introduces frequency-aware LBMs that take prior and target frequency as input for any-to-any upsampling, cascaded LBMs, and prior augmentation strategies for seamless cascaded super-resolution.
Result: Comprehensive experiments on VCTK, ESC-50, Song-Describer datasets and internal testsets demonstrate state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, and setting the first record for any-to-192kHz audio SR.
Conclusion: The proposed latent bridge model system effectively addresses limitations of previous audio super-resolution methods by leveraging latent-to-latent generation and frequency-aware training, achieving superior upsampling quality and enabling high-frequency audio SR beyond 48kHz.
Abstract: Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock the audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192kHz audio SR. Demo at https://AudioLBM.github.io/.
cs.LG
[559] Discovering and Analyzing Stochastic Processes to Reduce Waste in Food Retail
Anna Kalenkova, Lu Xia, Dirk Neumann
Main category: cs.LG
TL;DR: Proposes object-centric process mining with stochastic modeling to optimize food retail supply chains, reducing waste while preventing shortages.
Details
Motivation: Address food waste in retail by better understanding and optimizing the balance between customer purchasing behavior and supply strategies.
Method: Integrates object-centric process mining with stochastic process discovery using continuous-time Markov chains from sales data, then extends with supply activities for what-if analysis.
Result: Enables identification of optimal balance between customer demand and supply, helping prevent both food waste from oversupply and product shortages.
Conclusion: The integrated approach successfully models food retail processes to optimize supply strategies and reduce waste through data-driven analysis.
Abstract: This paper proposes a novel method for analyzing food retail processes with a focus on reducing food waste. The approach integrates object-centric process mining (OCPM) with stochastic process discovery and analysis. First, a stochastic process in the form of a continuous-time Markov chain is discovered from grocery store sales data. This model is then extended with supply activities. Finally, a what-if analysis is conducted to evaluate how the quantity of products in the store evolves over time. This enables the identification of an optimal balance between customer purchasing behavior and supply strategies, helping to prevent both food waste due to oversupply and product shortages.
[560] Impact of Loss Weight and Model Complexity on Physics-Informed Neural Networks for Computational Fluid Dynamics
Yi En Chou, Te Hsin Liu, Chao An Lin
Main category: cs.LG
TL;DR: Proposes two weighting schemes for Physics Informed Neural Networks to address sensitivity to loss weights, with the second scheme improving stability and accuracy in various CFD problems including challenging high Peclet number cases.
Details
Motivation: Physics Informed Neural Networks (PINNs) are mesh-free for solving PDEs but highly sensitive to loss weight selection, which affects their performance and stability.
Method: Two weighting schemes based on dimensional analysis: one using only quantifiable terms, and another incorporating both quantifiable and unquantifiable terms for more balanced training.
Result: The second weighting scheme consistently improves stability and accuracy over equal weighting in heat conduction, convection diffusion, and lid driven cavity flows. Notably achieves stable, accurate predictions in high Peclet number convection diffusion where traditional solvers fail.
Conclusion: The proposed weighting scheme enhances PINNs’ robustness and generalizability in CFD problems, making them effective even in challenging scenarios where conventional methods struggle.
Abstract: Physics Informed Neural Networks offer a mesh-free framework for solving PDEs but are highly sensitive to loss weight selection. We propose two weighting schemes based on dimensional analysis, one based on quantifiable terms, and another also incorporating unquantifiable terms for more balanced training. Benchmarks on heat conduction, convection diffusion, and lid-driven cavity flows show that the second scheme consistently improves stability and accuracy over equal weighting. Notably, in high-Peclet-number convection diffusion, where traditional solvers fail, PINNs with our scheme achieve stable, accurate predictions, highlighting their robustness and generalizability in CFD problems.
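The summary does not give the closed form of the dimensional-analysis-based weights, so the minimal PyTorch sketch below only shows where such weights enter a composite PINN loss; the toy Laplace residual and the placeholder weights w_pde and w_bc are illustrative assumptions, not the paper's scheme.

```python
import torch

def weighted_pinn_loss(model, x_interior, x_boundary, u_boundary,
                       w_pde=1.0, w_bc=1.0):
    # Toy 1D Laplace problem u'' = 0; w_pde / w_bc stand in for the
    # paper's dimensional-analysis-based weights (illustrative only).
    x = x_interior.clone().requires_grad_(True)
    u = model(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    pde_term = (d2u ** 2).mean()                              # PDE residual
    bc_term = ((model(x_boundary) - u_boundary) ** 2).mean()  # boundary misfit
    return w_pde * pde_term + w_bc * bc_term
```

Equal weighting is the w_pde = w_bc = 1 baseline; the paper's schemes replace these constants with values derived from dimensional analysis of each term.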
[561] LLMs for Bayesian Optimization in Scientific Domains: Are We There Yet?
Rushil Gupta, Jason Hartford, Bang Liu
Main category: cs.LG
TL;DR: LLMs fail at in-context experimental design as they show no sensitivity to feedback, while classical methods outperform them. A hybrid LLM-guided Nearest Neighbour method achieves competitive performance.
Details
Motivation: To evaluate whether LLMs can effectively perform in-context experimental design for scientific tasks like genetic perturbation and molecular property discovery.
Method: Tested open- and closed-source LLMs on experimental design tasks, compared with classical methods, and proposed LLM-guided Nearest Neighbour sampling that combines LLM prior knowledge with nearest-neighbor sampling.
Result: LLM-based agents showed no sensitivity to experimental feedback, classical methods consistently outperformed LLMs, and the proposed LLMNN method achieved competitive or superior performance across domains.
Conclusion: Current LLMs do not perform in-context experimental design effectively, highlighting the need for hybrid frameworks that separate prior-based reasoning from batch acquisition with updated posteriors.
Abstract: Large language models (LLMs) have recently been proposed as general-purpose agents for experimental design, with claims that they can perform in-context experimental design. We evaluate this hypothesis using both open- and closed-source instruction-tuned LLMs applied to genetic perturbation and molecular property discovery tasks. We find that LLM-based agents show no sensitivity to experimental feedback: replacing true outcomes with randomly permuted labels has no impact on performance. Across benchmarks, classical methods such as linear bandits and Gaussian process optimization consistently outperform LLM agents. We further propose a simple hybrid method, LLM-guided Nearest Neighbour (LLMNN) sampling, that combines LLM prior knowledge with nearest-neighbor sampling to guide the design of experiments. LLMNN achieves competitive or superior performance across domains without requiring significant in-context adaptation. These results suggest that current open- and closed-source LLMs do not perform in-context experimental design in practice and highlight the need for hybrid frameworks that decouple prior-based reasoning from batch acquisition with updated posteriors.
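As a rough illustration of the hybrid idea (not the paper's exact acquisition rule), the sketch below seeds the search with LLM-suggested candidates and then exploits locality around the best tested design; the function name, the cold-start rule, and the neighbour heuristic are all assumptions.

```python
import numpy as np

def llmnn_batch(candidate_embs, llm_seed_idx, tested_idx, tested_scores, k=8):
    # candidate_embs: (N, d) embeddings of the design space
    # llm_seed_idx:   candidates the LLM flagged as promising (its prior)
    # tested_idx / tested_scores: experiments run so far and their outcomes
    if len(tested_idx) == 0:
        return list(llm_seed_idx[:k])        # cold start: trust the LLM prior
    best = tested_idx[int(np.argmax(tested_scores))]
    dist = np.linalg.norm(candidate_embs - candidate_embs[best], axis=1)
    dist[list(tested_idx)] = np.inf          # do not repeat experiments
    return list(np.argsort(dist)[:k])        # neighbours of the best hit
```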
[562] Object Identification Under Known Dynamics: A PIRNN Approach for UAV Classification
Nyi Nyi Aung, Neil Muralles, Adrian Stein
Main category: cs.LG
TL;DR: Physics-informed residual neural network combines learning and classification for object identification in UAV applications with known dynamics, achieving high accuracy with reduced training time.
Details
Motivation: To address object identification in unmanned aerial vehicle applications where dynamics are known, combining learning and classification through physics-informed approaches.
Method: Physics-informed residual neural network for state mapping and state-derivative prediction, with softmax layer for multi-class confidence estimation. Tested on quadcopter, fixed-wing, and helicopter aerial vehicles.
Result: Demonstrates high classification accuracy with reduced training time.
Conclusion: Offers a promising solution for system identification problems in domains with well-understood underlying dynamics.
Abstract: This work addresses object identification under known dynamics in unmanned aerial vehicle applications, where learning and classification are combined through a physics-informed residual neural network. The proposed framework leverages physics-informed learning for state mapping and state-derivative prediction, while a softmax layer enables multi-class confidence estimation. Quadcopter, fixed-wing, and helicopter aerial vehicles are considered as case studies. The results demonstrate high classification accuracy with reduced training time, offering a promising solution for system identification problems in domains where the underlying dynamics are well understood.
[563] Null-Space Filtering for Data-Free Continual Model Merging: Preserving Transparency, Promoting Fidelity
Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Main category: cs.LG
TL;DR: NUFILT is a data-free continual model merging framework that uses null-space filtering to merge fine-tuned models without task data, ensuring transparency (no interference with prior tasks) and fidelity (faithful adaptation to new tasks).
Details
Motivation: Existing approaches fail to bridge data-level desiderata with parameter-space optimization for data-free continual model merging, particularly in ensuring both transparency and fidelity without access to task data.
Method: Proposes null-space filtering to preserve prior responses by filtering overlapping components of new task vectors, combined with lightweight LoRA adapters trained with projection-based surrogate loss to inject complementary task-specific signals.
Result: Achieves state-of-the-art performance with minimal forgetting on vision and NLP benchmarks, improving average accuracy by 4-7% over OPCM and WUDI-Merging while reducing computation overhead.
Conclusion: NUFILT effectively enables continual model merging without task data through joint filtering-adaptation, theoretically justified by subspace alignment guarantees and empirically validated across multiple domains.
Abstract: Data-free continual model merging (DFCMM) aims to fuse independently fine-tuned models into a single backbone that evolves with incoming tasks without accessing task data. This paper formulates two fundamental desiderata for DFCMM: transparency, avoiding interference with earlier tasks, and fidelity, adapting faithfully to each new task. This poses a challenge that existing approaches fail to address: how to bridge data-level desiderata with parameter-space optimization to ensure transparency and fidelity in the absence of task data. To this end, we propose NUFILT (NUll-space FILTering), a data-free framework that directly links these desiderata to optimization. Our key observation is that task vectors approximately align with representation subspaces, providing structural surrogates for enforcing transparency and fidelity. Accordingly, we design a null-space projector that preserves prior responses by filtering out overlapping components of new task vectors, thereby ensuring transparency, and a lightweight LoRA adapter that injects complementary task-specific signals, enabling fidelity in adapting to new tasks. The adapter is trained with a projection-based surrogate loss to retain consistency with previous knowledge while introducing novel directions. This joint filtering-adaptation process allows the backbone to absorb new knowledge while retaining existing behaviors, and the updates are finally fused back in a layer-wise linear fashion without extra parameters or inference cost. Theoretically, we establish approximate subspace alignment guarantees that justify null-space filtering. Empirically, NUFILT achieves state-of-the-art performance with minimal forgetting on both vision and NLP benchmarks, improving average accuracy by 4-7% over OPCM and WUDI-Merging, while narrowing the gap to fine-tuning and reducing computation overhead.
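A minimal sketch of the null-space projector under one natural reading: summarize the prior-task subspace by an orthonormal basis (e.g. top singular vectors of stacked earlier task vectors) and strip the overlapping component from the new task vector. The paper's layer-wise treatment and LoRA adapter are omitted.

```python
import torch

def nullspace_filter(task_vector, prior_basis):
    # task_vector: flattened parameter delta of the newly fine-tuned model
    # prior_basis: (p, r) orthonormal basis of the prior-task subspace
    overlap = prior_basis @ (prior_basis.T @ task_vector)  # part seen by old tasks
    return task_vector - overlap                           # null-space component
```

The filtered vector can then be fused into the backbone linearly; in NUFILT, the LoRA adapter restores fidelity for the new task.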
[564] Forecasting Seismic Waveforms: A Deep Learning Approach for Einstein Telescope
Waleed Esmail, Alexander Kappes, Stuart Russell, Christine Thomas
Main category: cs.LG
TL;DR: SeismoGPT is a transformer-based model for forecasting three-component seismic waveforms, trained autoregressively to predict ground motion for gravitational wave detectors.
Details
Motivation: To support future gravitational wave detectors like Einstein Telescope by providing accurate seismic forecasting for Newtonian noise mitigation and real-time observatory control.
Method: Transformer-based model trained in autoregressive setting on waveform data, learning temporal and spatial dependencies from single-station and array-based inputs.
Result: Model performs well in immediate prediction window but gradually degrades for longer forecasts, as expected in autoregressive systems.
Conclusion: SeismoGPT lays groundwork for data-driven seismic forecasting that can support gravitational wave detector operations through noise mitigation and control.
Abstract: We introduce SeismoGPT, a transformer-based model for forecasting three-component seismic waveforms in the context of future gravitational wave detectors like the Einstein Telescope. The model is trained in an autoregressive setting and can operate on both single-station and array-based inputs. By learning temporal and spatial dependencies directly from waveform data, SeismoGPT captures realistic ground motion patterns and provides accurate short-term forecasts. Our results show that the model performs well within the immediate prediction window and gradually degrades further ahead, as expected in autoregressive systems. This approach lays the groundwork for data-driven seismic forecasting that could support Newtonian noise mitigation and real-time observatory control.
[565] Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data
George Yakushev, Alina Shutova, Ivan Rubachev, Renat Sergazinov, Artem Babenko
Main category: cs.LG
TL;DR: Using reasoning-capable LLMs to create interpretable decision trees for low-resource tabular problems, outperforming traditional CART while providing human-readable reasoning traces.
Details
Motivation: Tabular foundation models are black boxes that are difficult to interpret and costly at inference time. There's a need for interpretable alternatives that can leverage prior knowledge while maintaining transparency.
Method: Design minimal tools for constructing, analyzing and manipulating decision trees. Use reasoning-capable LLMs in an agentic setup to combine prior knowledge with data learning to create lightweight decision trees.
Result: LLM-induced decision trees outperform traditional CART on low-resource tabular problems. While not beating state-of-the-art black box models, they provide human-readable reasoning traces that can be checked for biases and data leaks.
Conclusion: LLM-induced decision trees offer an interpretable alternative to black box models, allowing for human input to correct biases and incorporate domain knowledge not captured in data.
Abstract: Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly at inference time. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in an agentic setup. We design a minimal set of tools for constructing, analyzing and manipulating decision trees. By using these tools, LLMs combine their prior knowledge with learning from data to create a lightweight decision tree that outperforms traditional CART on low-resource tabular problems. While a single decision tree does not outperform state-of-the-art black box models, it comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the LLM's reasoning-based creation process allows for additional human input: correcting biases or incorporating domain-specific intuition that is not captured in the data.
[566] Preference-Guided Learning for Sparse-Reward Multi-Agent Reinforcement Learning
The Viet Bui, Tien Mai, Hong Thanh Nguyen
Main category: cs.LG
TL;DR: A novel online MARL framework for sparse reward environments that combines inverse preference learning with multi-agent optimization, using implicit reward learning and dual advantage streams to improve policy learning.
Details
Motivation: Address the challenge of sparse rewards in online multi-agent reinforcement learning where reward feedback is only provided at trajectory ends, which hinders standard MARL algorithms from effective policy learning.
Method: Integrates online inverse preference learning with multi-agent on-policy optimization using an implicit multi-agent reward learning model based on preference-based value-decomposition network, producing global and local reward signals with dual advantage streams for centralized critic and decentralized actors. Leverages LLMs for preference labels.
Result: Empirical evaluations on MAMuJoCo and SMACv2 benchmarks show superior performance compared to existing baselines.
Conclusion: The proposed framework effectively addresses sparse-reward challenges in online MARL through integrated reward learning and dual advantage mechanisms.
Abstract: We study the problem of online multi-agent reinforcement learning (MARL) in environments with sparse rewards, where reward feedback is not provided at each interaction but only revealed at the end of a trajectory. This setting, though realistic, presents a fundamental challenge: the lack of intermediate rewards hinders standard MARL algorithms from effectively guiding policy learning. To address this issue, we propose a novel framework that integrates online inverse preference learning with multi-agent on-policy optimization into a unified architecture. At its core, our approach introduces an implicit multi-agent reward learning model, built upon a preference-based value-decomposition network, which produces both global and local reward signals. These signals are further used to construct dual advantage streams, enabling differentiated learning targets for the centralized critic and decentralized actors. In addition, we demonstrate how large language models (LLMs) can be leveraged to provide preference labels that enhance the quality of the learned reward model. Empirical evaluations on state-of-the-art benchmarks, including MAMuJoCo and SMACv2, show that our method achieves superior performance compared to existing baselines, highlighting its effectiveness in addressing sparse-reward challenges in online MARL.
[567] Score-based Idempotent Distillation of Diffusion Models
Shehtab Zaman, Chengyan Liu, Kenneth Chiu
Main category: cs.LG
TL;DR: SIGN unites diffusion models and idempotent generative networks by distilling idempotent models from diffusion model scores, enabling faster inference than iterative score-based models while supporting multi-step sampling and zero-shot editing.
Details
Motivation: To address the training instabilities and mode collapse issues of conventional IGNs while avoiding the high computational costs of diffusion models, creating a stable and efficient generative model.
Method: Distill idempotent models from pre-trained diffusion model scores, eliminating the need for adversarial training and providing theoretical analysis of score-based training methods.
Result: Achieved state-of-the-art results for idempotent models on CIFAR and CelebA datasets, with faster inference than iterative score-based models and the ability to perform multi-step sampling and zero-shot editing.
Conclusion: SIGN successfully bridges diffusion models and IGNs, providing a stable, efficient generative framework that supports flexible quality-efficiency trade-offs and enables direct projection of corrupted distributions onto the target manifold.
Abstract: Idempotent generative networks (IGNs) are a new line of generative models based on idempotent mapping to a target manifold. IGNs support both single-and multi-step generation, allowing for a flexible trade-off between computational cost and sample quality. But similar to Generative Adversarial Networks (GANs), conventional IGNs require adversarial training and are prone to training instabilities and mode collapse. Diffusion and score-based models are popular approaches to generative modeling that iteratively transport samples from one distribution, usually a Gaussian, to a target data distribution. These models have gained popularity due to their stable training dynamics and high-fidelity generation quality. However, this stability and quality come at the cost of high computational cost, as the data must be transported incrementally along the entire trajectory. New sampling methods, model distillation, and consistency models have been developed to reduce the sampling cost and even perform one-shot sampling from diffusion models. In this work, we unite diffusion and IGNs by distilling idempotent models from diffusion model scores, called SIGN. Our proposed method is highly stable and does not require adversarial losses. We provide a theoretical analysis of our proposed score-based training methods and empirically show that IGNs can be effectively distilled from a pre-trained diffusion model, enabling faster inference than iterative score-based models. SIGNs can perform multi-step sampling, allowing users to trade off quality for efficiency. These models operate directly on the source domain; they can project corrupted or alternate distributions back onto the target manifold, enabling zero-shot editing of inputs. We validate our models on multiple image datasets, achieving state-of-the-art results for idempotent models on the CIFAR and CelebA datasets.
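A toy sketch of the two ingredients the summary names, matching a frozen teacher while enforcing idempotence; the paper's actual score-based objective (including any stop-gradient structure of IGN-style training) may differ, and teacher_denoise is a hypothetical one-step teacher interface.

```python
import torch

def sign_loss(student, teacher_denoise, z):
    # z: batch of Gaussian noise; student maps noise to images in one shot
    x = student(z)
    with torch.no_grad():
        x_teacher = teacher_denoise(z)              # frozen diffusion teacher
    distill = ((x - x_teacher) ** 2).mean()         # match teacher output
    # Idempotence f(f(z)) = f(z); real IGN-style training splits this into
    # idempotence and tightening terms with stop-gradients.
    idem = ((student(x) - x) ** 2).mean()
    return distill + idem
```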
[568] Are Hallucinations Bad Estimations?
Hude Liu, Jerry Yao-Chieh Hu, Jennifer Yuntong Zhang, Zhao Song, Han Liu
Main category: cs.LG
TL;DR: Hallucinations in generative models are failures to link estimates to plausible causes, and even optimal estimators still hallucinate due to structural misalignment between loss minimization and human-acceptable outputs.
Details
Motivation: To understand why hallucinations occur in generative models and formalize them as estimation errors caused by miscalibration between loss minimization and human preferences.
Method: Theoretical analysis showing optimal estimators still hallucinate, with a high probability lower bound on hallucination rate for generic data distributions. Experiments conducted on coin aggregation, open-ended QA, and text-to-image tasks.
Result: Demonstrated that hallucinations persist even in loss-minimizing optimal estimators, supporting the theoretical lower bound on hallucination rates across different domains.
Conclusion: Hallucinations are structural issues arising from misalignment between statistical loss minimization and human acceptability criteria, reframing them as miscalibration-induced estimation errors.
Abstract: We formalize hallucinations in generative models as failures to link an estimate to any plausible cause. Under this interpretation, we show that even loss-minimizing optimal estimators still hallucinate. We confirm this with a general high-probability lower bound on the hallucination rate for generic data distributions. This reframes hallucination as structural misalignment between loss minimization and human-acceptable outputs, and hence as estimation error induced by miscalibration. Experiments on coin aggregation, open-ended QA, and text-to-image support our theory.
[569] d2: Improved Techniques for Training Reasoning Diffusion Language Models
Guanghan Wang, Yair Schiff, Gilad Turok, Volodymyr Kuleshov
Main category: cs.LG
TL;DR: d2 is a new reinforcement learning framework for masked diffusion language models that improves reasoning ability through a novel policy gradient algorithm leveraging masking properties for efficient trajectory likelihood estimation.
Details
Motivation: While diffusion language models perform well in text generation, their reasoning capabilities need improvement through reinforcement learning approaches.
Method: Introduces d2 framework with a new policy gradient algorithm that uses masking properties to estimate sampling trajectory likelihoods, supporting any-order likelihood estimation for efficient diffusion-based reasoning.
Result: d2 significantly outperforms previous diffusion reasoning frameworks using only RL (no supervised fine-tuning), achieving state-of-the-art performance on logical reasoning tasks (Countdown, Sudoku) and math benchmarks (GSM8K, MATH500).
Conclusion: The d2 framework effectively enhances reasoning capabilities in diffusion language models through efficient RL-based training, demonstrating superior performance on complex reasoning tasks.
Abstract: While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on properties of masking to accurately estimate the likelihoods of sampling trajectories. Our estimators trade off computation for approximation accuracy in an analytically tractable manner, and are particularly effective for DLMs that support any-order likelihood estimation. We characterize and study this property in popular DLMs and show that it is key for efficient diffusion-based reasoning. Empirically, d2 significantly improves over previous diffusion reasoning frameworks using only RL (without relying on supervised fine-tuning), and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500).
[570] VISION: Prompting Ocean Vertical Velocity Reconstruction from Incomplete Observations
Yuan Gao, Hao Wu, Qingsong Wen, Kun Wang, Xian Wu, Xiaomeng Huang
Main category: cs.LG
TL;DR: VISION introduces a novel Dynamic Prompting paradigm for reconstructing subsurface ocean dynamics from incomplete surface observations, using a visual prompt mechanism and State-conditioned Prompting module to handle varying input combinations effectively.
Details
Motivation: To address the critical challenge of reconstructing subsurface ocean dynamics from incomplete surface observations and overcome the lack of standardized benchmarks in Earth science.
Method: Built KD48 benchmark from petascale simulations with expert-driven denoising, then developed VISION with Dynamic Prompting that generates visual prompts from available observations and uses State-conditioned Prompting to inject prompts into a universal backbone with geometry- and scale-aware operators.
Result: VISION substantially outperforms state-of-the-art models and exhibits strong generalization under extreme data missing scenarios on the KD48 benchmark.
Conclusion: The work establishes solid infrastructure for ocean science research under data uncertainty by providing both a high-quality benchmark and a robust reconstruction model.
Abstract: Reconstructing subsurface ocean dynamics, such as vertical velocity fields, from incomplete surface observations poses a critical challenge in Earth science, a field long hampered by the lack of standardized, analysis-ready benchmarks. To systematically address this issue and catalyze research, we first build and release KD48, a high-resolution ocean dynamics benchmark derived from petascale simulations and curated with expert-driven denoising. Building on this benchmark, we introduce VISION, a novel reconstruction paradigm based on Dynamic Prompting designed to tackle the core problem of missing data in real-world observations. The essence of VISION lies in its ability to generate a visual prompt on-the-fly from any available subset of observations, which encodes both data availability and the ocean’s physical state. More importantly, we design a State-conditioned Prompting module that efficiently injects this prompt into a universal backbone, endowed with geometry- and scale-aware operators, to guide its adaptive adjustment of computational strategies. This mechanism enables VISION to precisely handle the challenges posed by varying input combinations. Extensive experiments on the KD48 benchmark demonstrate that VISION not only substantially outperforms state-of-the-art models but also exhibits strong generalization under extreme data missing scenarios. By providing a high-quality benchmark and a robust model, our work establishes a solid infrastructure for ocean science research under data uncertainty. Our codes are available at: https://github.com/YuanGao-YG/VISION.
[571] Filtering with Confidence: When Data Augmentation Meets Conformal Prediction
Zixuan Wu, So Won Jeong, Yating Liu, Yeo Jin Jung, Claire Donnat
Main category: cs.LG
TL;DR: Conformal data augmentation is a principled framework that uses conformal prediction to filter synthetic data, ensuring diversity while controlling risk without requiring model logits or retraining.
Details
Motivation: Synthetic data augmentation addresses data scarcity but must control bias to ensure generated samples come from the same distribution as training data with minimal shifts.
Method: Proposes conformal data augmentation framework that leverages conformal prediction to filter poor-quality synthetic generations while maintaining diversity, requiring no access to model logits or retraining.
Result: Shows consistent performance improvements across multiple tasks (topic prediction, sentiment analysis, image classification, fraud detection) with up to 40% F1 score improvement over unaugmented baselines and 4% over other filtered augmentation methods.
Conclusion: Conformal data augmentation provides an effective, simple-to-implement solution for synthetic data filtering that provably controls risk while enhancing model performance across diverse applications.
Abstract: With promising empirical performance across a wide range of applications, synthetic data augmentation appears a viable solution to data scarcity and the demands of increasingly data-intensive models. Its effectiveness lies in expanding the training set in a way that reduces estimator variance while introducing only minimal bias. Controlling this bias is therefore critical: effective data augmentation should generate diverse samples from the same underlying distribution as the training set, with minimal shifts. In this paper, we propose conformal data augmentation, a principled data filtering framework that leverages the power of conformal prediction to produce diverse synthetic data while filtering out poor-quality generations with provable risk control. Our method is simple to implement, requires no access to internal model logits, nor large-scale model retraining. We demonstrate the effectiveness of our approach across multiple tasks, including topic prediction, sentiment analysis, image classification, and fraud detection, showing consistent performance improvements of up to 40% in F1 score over unaugmented baselines, and 4% over other filtered augmentation baselines.
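A minimal split-conformal sketch of the filtering step, assuming a nonconformity score is already chosen (the summary leaves it open; e.g. one minus a quality classifier's confidence, with lower scores meaning better samples). The function name is illustrative.

```python
import numpy as np

def conformal_filter(cal_scores, synth_scores, alpha=0.1):
    # cal_scores:   nonconformity scores of held-out real samples
    # synth_scores: same score computed on synthetic candidates
    n = len(cal_scores)
    # finite-sample-corrected (1 - alpha) conformal quantile
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_scores, level, method="higher")
    return synth_scores <= q                 # keep-mask with risk control
```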
[572] High-Probability Analysis of Online and Federated Zero-Order Optimisation
Arya Akhavan, David Janz, El-Mahdi El-Mhamdi
Main category: cs.LG
TL;DR: FedZero is a federated zero-order optimization algorithm that achieves near-optimal error bounds with high probability in convex settings and establishes the first high-probability convergence guarantees for single-worker zero-order optimization.
Details
Motivation: To address distributed learning in gradient-free zero-order optimization settings and provide strong theoretical guarantees that go beyond classical expectation-based results.
Method: Uses a gradient estimator based on randomization over the ℓ₁-sphere and develops new concentration inequalities for Lipschitz functions under the uniform measure on the ℓ₁-sphere.
Result: FedZero achieves near-optimal optimization error bounds with high probability in federated convex settings and establishes the first high-probability convergence guarantees for convex zero-order optimization in single-worker scenarios.
Conclusion: The developed concentration tools are central to the high-probability guarantees and may have independent interest beyond the FedZero algorithm.
Abstract: We study distributed learning in the setting of gradient-free zero-order optimization and introduce FedZero, a federated zero-order algorithm that delivers sharp theoretical guarantees. Specifically, FedZero: (1) achieves near-optimal optimization error bounds with high probability in the federated convex setting; and (2) in the single-worker regime-where the problem reduces to the standard zero-order framework, establishes the first high-probability convergence guarantees for convex zero-order optimization, thereby strengthening the classical expectation-based results. At its core, FedZero employs a gradient estimator based on randomization over the $\ell_1$-sphere. To analyze it, we develop new concentration inequalities for Lipschitz functions under the uniform measure on the $\ell_1$-sphere, with explicit constants. These concentration tools are not only central to our high-probability guarantees but may also be of independent interest.
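A sketch of a two-point ℓ₁-randomized estimator in the style of prior zero-order work; the exact estimator and constants in FedZero may differ. It uses the fact that exponential variables with independent random signs, normalized by their ℓ₁ norm, are uniform on the ℓ₁-sphere.

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_sphere(d):
    # uniform sample from the surface of the unit l1-sphere in R^d
    e = rng.exponential(size=d) * rng.choice([-1.0, 1.0], size=d)
    return e / np.abs(e).sum()

def zo_gradient(f, x, h=1e-3):
    # two-point zero-order gradient estimate with l1 randomization
    z = l1_sphere(x.size)
    return (x.size / (2 * h)) * (f(x + h * z) - f(x - h * z)) * np.sign(z)
```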
[573] Learning from Delayed Feedback in Games via Extra Prediction
Yuma Fujimoto, Kenshi Abe, Kaito Ariu
Main category: cs.LG
TL;DR: The paper addresses time-delayed feedback in learning in games, showing that even a single-step delay worsens OFTRL performance. It proposes Weighted OFTRL (WOFTRL), which weights the optimistic reward prediction n times, and proves that when the weight exceeds the delay, WOFTRL achieves O(1)-regret in general-sum games and best-iterate convergence in poly-matrix zero-sum games.
Details
Motivation: Time-delayed feedback in multi-agent learning creates optimization discrepancies among agents. Standard OFTRL suffers performance degradation even with single-step delays, motivating the need for a method that can handle delayed observations effectively.
Method: Proposes Weighted Optimistic Follow-the-Regularized-Leader (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted n times. The key insight is that optimistic weight can cancel out time delay effects.
Result: When optimistic weight exceeds time delay, WOFTRL achieves constant regret (O(1)-regret) in general-sum normal-form games, and strategies converge to Nash equilibrium as subsequence (best-iterate convergence) in poly-matrix zero-sum games. Experimental results support theoretical findings.
Conclusion: WOFTRL effectively addresses time-delayed feedback in multi-agent learning by using weighted optimism to cancel delay effects, recovering good performance guarantees that standard OFTRL loses under delays.
Abstract: This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, the time delay in observing the past rewards hinders the prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of both regret and convergence. This study proposes the weighted OFTRL (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture the intuition that the optimistic weight cancels out this time delay. We prove that when the optimistic weight exceeds the time delay, our WOFTRL recovers the good performance of OFTRL: the regret is constant ($O(1)$-regret) in general-sum normal-form games, and the strategies converge to the Nash equilibrium as a subsequence (best-iterate convergence) in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
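With an entropic regularizer on the probability simplex, an OFTRL update has a closed softmax form, which makes the weighted prediction easy to see. The sketch below is that standard form with the paper's optimistic weight n applied to the prediction; it is a generic illustration, not code from the paper.

```python
import numpy as np

def woftrl_step(cum_rewards, predicted_reward, eta=0.1, n_weight=2):
    # cum_rewards:      sum of reward vectors observed so far (delayed by tau)
    # predicted_reward: optimistic prediction of the next reward vector,
    #                   typically the most recently observed one
    # n_weight:         the optimistic weight n; n_weight > tau is the
    #                   regime covered by the paper's guarantees
    logits = eta * (cum_rewards + n_weight * predicted_reward)
    x = np.exp(logits - logits.max())        # softmax = entropic OFTRL
    return x / x.sum()
```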
[574] Neural Operators for Mathematical Modeling of Transient Fluid Flow in Subsurface Reservoir Systems
Daniil D. Sirota, Sergey A. Khan, Sergey L. Kostikov, Kirill A. Butov
Main category: cs.LG
TL;DR: TFNO-opt neural operator architecture for transient fluid flow modeling in subsurface reservoirs, achieving computation roughly six orders of magnitude faster than traditional numerical methods.
Details
Motivation: Traditional numerical methods for reservoir modeling are accurate but computationally expensive, limiting their use in control and decision support applications.
Method: Modified Fourier neural operators with adjustable time resolution, tensor decomposition in spectral domain, Sobolev norm in error function, and separation of approximation/reconstruction errors.
Result: Achieved 6 orders of magnitude acceleration in hydrodynamic modeling of underground gas storage compared to traditional numerical methods.
Conclusion: The proposed neural operator enables effective control of complex reservoir systems through significantly faster computation while maintaining accuracy.
Abstract: This paper presents a method for modeling transient fluid flow in subsurface reservoir systems based on the developed neural operator architecture (TFNO-opt). Reservoir systems are complex dynamic objects with distributed parameters described by systems of partial differential equations (PDEs). Traditional numerical methods for modeling such systems, despite their high accuracy, are characterized by significant time costs for performing calculations, which limits their applicability in control and decision support problems. The proposed architecture (TFNO-opt) is based on Fourier neural operators, which allow approximating PDE solutions in infinite-dimensional functional spaces, providing invariance to discretization and the possibility of generalization to various implementations of equations. The developed modifications are aimed at increasing the accuracy and stability of the trained neural operator, which is especially important for control problems. These include adjustable internal time resolution of the integral Fourier operator, tensor decomposition of parameters in the spectral domain, use of the Sobolev norm in the error function, and separation of approximation errors and reconstruction of initial conditions for more accurate reproduction of physical processes. The effectiveness of the proposed improvements is confirmed by computational experiments. The practical significance is demonstrated on the problem of hydrodynamic modeling of an underground gas storage (UGS) facility, where calculations were accelerated by six orders of magnitude compared to traditional methods. This opens up new opportunities for the effective control of complex reservoir systems.
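FNO-style operators such as TFNO-opt are built around a Fourier layer: transform to the spectral domain, act on a truncated set of modes with learned complex weights, and transform back. A minimal 1D PyTorch sketch of that layer follows; the paper's tensor-decomposed parameters, adjustable internal time resolution, and Sobolev-norm loss are omitted.

```python
import torch

class SpectralConv1d(torch.nn.Module):
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes                    # must be <= grid // 2 + 1
        scale = 1.0 / channels
        self.weight = torch.nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

    def forward(self, x):                     # x: (batch, channels, grid)
        x_ft = torch.fft.rfft(x)              # to the spectral domain
        out_ft = torch.zeros_like(x_ft)
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
        return torch.fft.irfft(out_ft, n=x.size(-1))  # back to physical grid
```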
[575] Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun
Main category: cs.LG
TL;DR: SPEAR is a curriculum-based self-imitation learning method that balances exploration and exploitation in RL training for agentic LLMs, using intrinsic rewards and replay buffer management to stabilize training without entropy collapse or divergence.
Details
Motivation: Traditional RL methods for LLMs face exploration-exploitation trade-offs and training instability due to multi-turn distribution shifting. Mechanical entropy maximization leads to unstable training, so a more balanced approach is needed.
Method: SPEAR extends vanilla self-imitation learning with a curriculum approach: uses intrinsic rewards for skill-level exploration initially, then strengthens self-imitation for action-level exploration. Includes replay buffer recalibration and regularization techniques like token clipping to control entropy and prevent over-confidence.
Result: The method enables progressive exploration-exploitation balance, allowing broad exposure to environment distributions with upward entropy trend initially, then accelerating solution iteration without unbounded entropy growth.
Conclusion: SPEAR provides a stable training framework for agentic LLMs that maintains balanced entropy across stages, addressing fundamental RL challenges in long-horizon, sparsely-rewarded tasks through curriculum-based self-imitation learning.
Abstract: Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent's own experiences without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy update, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation gets strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address the potential policy drift. Regularizations such as the clipping of tokens with high covariance between probability and advantage are introduced for trajectory-level entropy control to curb over-confidence.
[576] Investigating Faithfulness in Large Audio Language Models
Lovenya Jain, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan
Main category: cs.LG
TL;DR: Investigates faithfulness of chain-of-thought (CoT) representations in large audio-language models (LALMs) using targeted interventions on reasoning datasets, finding that LALMs generally produce faithful CoTs.
Details
Motivation: Faithfulness of CoT representations is critical for safety-sensitive applications in LALMs, and prior work showed text-based LLMs often produce unfaithful CoTs, but this hasn't been explored for audio-language models where reasoning is more challenging.
Method: Applied targeted interventions including paraphrasing, filler token injection, early answering, and introducing mistakes on two challenging reasoning datasets: SAKURA and MMAR.
Result: Experiments across several datasets and tasks suggest that LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.
Conclusion: LALMs produce faithful chain-of-thought representations, unlike text-based LLMs which often produce unfaithful CoTs.
Abstract: Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After applying these interventions across several datasets and tasks, our experiments suggest that LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.
[577] GraphPFN: A Prior-Data Fitted Graph Foundation Model
Dmitry Eremeev, Oleg Platonov, Gleb Bazhenov, Artem Babenko, Liudmila Prokhorenkova
Main category: cs.LG
TL;DR: GraphPFN is a graph foundation model that uses synthetic graph generation and tabular foundation model augmentation to achieve state-of-the-art performance on node-level prediction tasks.
Details
Motivation: Existing graph foundation models rely on hand-crafted features and struggle to learn complex graph-specific patterns, limiting their effectiveness compared to foundation models in other domains like NLP and computer vision.
Method: 1) Design a prior distribution of synthetic attributed graphs using stochastic block models and preferential attachment; 2) Generate node attributes and targets using graph-aware structured causal models; 3) Augment tabular foundation model LimiX with attention-based graph neighborhood aggregation layers; 4) Train on synthetic graphs from the prior distribution.
Result: GraphPFN achieves state-of-the-art results on diverse real-world graph datasets with up to 50,000 nodes, outperforming both G2T-FM and task-specific GNNs trained from scratch on most datasets. It shows strong in-context learning performance.
Conclusion: Pretraining on synthetic graphs from a well-designed prior distribution is an effective strategy for building graph foundation models, enabling them to capture graph structural dependencies not present in tabular data.
Abstract: Foundation models pretrained on large-scale datasets have transformed such fields as natural language processing and computer vision, but their application to graph data remains limited. Recently emerged graph foundation models, such as G2T-FM, utilize tabular foundation models for graph tasks and were shown to significantly outperform prior attempts to create GFMs. However, these models primarily rely on hand-crafted graph features, limiting their ability to learn complex graph-specific patterns. In this work, we propose GraphPFN: a prior-data fitted network for node-level prediction. First, we design a prior distribution of synthetic attributed graphs. For graph structure generation, we use a novel combination of multiple stochastic block models and a preferential attachment process. We then apply graph-aware structured causal models to generate node attributes and targets. This procedure allows us to efficiently generate a wide range of realistic graph datasets. Then, we augment the tabular foundation model LimiX with attention-based graph neighborhood aggregation layers and train it on synthetic graphs sampled from our prior, allowing the model to capture graph structural dependencies not present in tabular data. On diverse real-world graph datasets with up to 50,000 nodes, GraphPFN shows strong in-context learning performance and achieves state-of-the-art results after finetuning, outperforming both G2T-FM and task-specific GNNs trained from scratch on most datasets. More broadly, our work demonstrates that pretraining on synthetic graphs from a well-designed prior distribution is an effective strategy for building graph foundation models.
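One illustrative way to compose the two generators named in the summary, an SBM backbone with preferential-attachment edges layered on top, sketched with networkx; the paper's actual combination of multiple SBMs with a preferential attachment process is its own design, so the mixing rule and parameters here are assumptions.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_graph(sizes=(60, 40), p_in=0.15, p_out=0.01, pa_edges=2):
    k = len(sizes)
    probs = [[p_in if i == j else p_out for j in range(k)] for i in range(k)]
    g = nx.stochastic_block_model(list(sizes), probs, seed=0)  # communities
    deg = np.array([g.degree(v) for v in g.nodes]) + 1.0
    for v in g.nodes:  # rich-get-richer edges from the initial degree profile
        targets = rng.choice(len(deg), size=pa_edges,
                             p=deg / deg.sum(), replace=False)
        g.add_edges_from((v, int(t)) for t in targets if int(t) != v)
    return g
```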
[578] SlimDiff: Training-Free, Activation-Guided Hands-free Slimming of Diffusion Models
Arani Roy, Shristi Das Biswas, Kaushik Roy
Main category: cs.LG
TL;DR: SlimDiff is a gradient-free structural compression framework for diffusion models that reduces attention and feedforward dimensions using activation-informed spectral approximation, achieving 35% acceleration and ~100M parameter reduction without fine-tuning.
Details
Motivation: Diffusion models are computationally expensive due to their large parameter count and iterative nature. Existing efficiency techniques require fine-tuning or retraining to maintain performance, creating bottlenecks.
Method: Reframes DM compression as spectral approximation using activation covariances across denoising timesteps. Uses dynamic pruning guided by low-rank subspaces, applying module-wise decompositions over functional weight groups (query-key interactions, value-output couplings, feedforward projections) with adaptive sparsity allocation.
Result: Achieves up to 35% acceleration and ~100M parameter reduction over baselines while maintaining generation quality comparable to uncompressed models. Requires only about 500 calibration samples (70× fewer than prior methods) and operates entirely without backpropagation.
Conclusion: SlimDiff provides the first closed-form, activation-guided structural compression of diffusion models that is entirely training-free, offering both theoretical clarity and practical efficiency improvements.
Abstract: Diffusion models (DMs), lauded for their generative performance, are computationally prohibitive due to their billion-scale parameters and iterative denoising dynamics. Existing efficiency techniques, such as quantization, timestep reduction, or pruning, offer savings in compute, memory, or runtime but are strictly bottlenecked by reliance on fine-tuning or retraining to recover performance. In this work, we introduce SlimDiff, an automated activation-informed structural compression framework that reduces both attention and feedforward dimensionalities in DMs, while being entirely gradient-free. SlimDiff reframes DM compression as a spectral approximation task, where activation covariances across denoising timesteps define low-rank subspaces that guide dynamic pruning under a fixed compression budget. This activation-aware formulation mitigates error accumulation across timesteps by applying module-wise decompositions over functional weight groups: query–key interactions, value–output couplings, and feedforward projections, rather than isolated matrix factorizations, while adaptively allocating sparsity across modules to respect the non-uniform geometry of diffusion trajectories. SlimDiff achieves up to 35% acceleration and $\sim$100M parameter reduction over baselines, with generation quality on par with uncompressed models without any backpropagation. Crucially, our approach requires only about 500 calibration samples, over 70$\times$ fewer than prior methods. To our knowledge, this is the first closed-form, activation-guided structural compression of DMs that is entirely training-free, providing both theoretical clarity and practical efficiency.
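The closed-form core is a projection of each layer onto the dominant subspace of its calibration-activation covariance. A minimal sketch under that reading follows; budget allocation across modules, the functional weight groupings, and per-timestep handling from the paper are omitted, and the function name is illustrative.

```python
import torch

def activation_guided_compress(weight, activations, rank):
    # weight:      (out, in) linear layer weight
    # activations: (samples, in) calibration activations pooled over
    #              denoising timesteps (~500 samples in the paper)
    cov = activations.T @ activations / activations.size(0)  # input covariance
    eigvals, eigvecs = torch.linalg.eigh(cov)                # ascending order
    U = eigvecs[:, -rank:]                                   # top-rank subspace
    return weight @ U @ U.T          # layer now acts only on that subspace
```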
[579] Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
Main category: cs.LG
TL;DR: The paper proposes rubric-based rewards to address reward over-optimization in reinforcement fine-tuning by focusing on high-reward tail examples and leveraging off-policy data while avoiding reward misspecification.
Details
Motivation: Reinforcement fine-tuning suffers from reward over-optimization where models hack reward signals to achieve high scores while producing low-quality outputs, due to reward misspecification at the high-reward tail.
Method: Uses rubric-based rewards that leverage off-policy examples while remaining insensitive to their artifacts, with a workflow to elicit rubrics that distinguish among great and diverse responses in the high-reward region.
Result: Empirical demonstration shows rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements.
Conclusion: Rubric-based rewards provide an effective solution to reward over-optimization in reinforcement fine-tuning by addressing reward misspecification at the high-reward tail through proper distinction of excellent vs great responses.
Abstract: Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g. from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at https://github.com/Jun-Kai-Zhang/rubrics.git .
[580] VDFD: Multi-Agent Value Decomposition Framework with Disentangled World Model
Zhizun Wang, David Meger
Main category: cs.LG
TL;DR: A model-based multi-agent reinforcement learning approach using disentangled world models to improve sample efficiency in multi-agent environments.
Details
Motivation: Address the high sample complexity and non-stationarity problems in multi-agent systems where model-free methods require many samples for training.
Method: Uses a modularized world model with action-conditioned, action-free, and static branches, combined with variational auto-encoders and graph auto-encoders to learn latent representations, integrated with value-based framework for joint action-value prediction.
Result: Achieves high sample efficiency and superior performance compared to baselines on StarCraft II micro-management, Multi-Agent MuJoCo, and Level-Based Foraging challenges.
Conclusion: The proposed Value Decomposition Framework with Disentangled World Model effectively reduces sample complexity while maintaining strong performance across diverse multi-agent tasks.
Abstract: In this paper, we propose a novel model-based multi-agent reinforcement learning approach named Value Decomposition Framework with Disentangled World Model to address the challenge of achieving a common goal of multiple agents interacting in the same environment with reduced sample complexity. Due to scalability and non-stationarity problems posed by multi-agent systems, model-free methods rely on a considerable number of samples for training. In contrast, we use a modularized world model, composed of action-conditioned, action-free, and static branches, to unravel the complicated environment dynamics. Our model produces imagined outcomes based on past experience, without sampling directly from the real environment. We employ variational auto-encoders and variational graph auto-encoders to learn the latent representations for the world model, which is merged with a value-based framework to predict the joint action-value function and optimize the overall training objective. Experimental results on StarCraft II micro-management, Multi-Agent MuJoCo, and Level-Based Foraging challenges demonstrate that our method achieves high sample efficiency and exhibits superior performance compared to other baselines across a wide range of multi-agent learning tasks.
[581] Contrastive Mutual Information Learning: Toward Robust Representations without Positive-Pair Augmentations
Micha Livne
Main category: cs.LG
TL;DR: cMIM is a contrastive extension of Mutual Information Machine that combines generative fidelity with discriminative structure, outperforming MIM and InfoNCE on classification/regression tasks while maintaining competitive reconstruction quality.
Details
Motivation: Existing representation learning methods (contrastive learning, self-supervised masking, denoising auto-encoders) have trade-offs between transferability and discriminative performance. MIM maximizes mutual information but falls short on discriminative tasks.
Method: Extends MIM with contrastive objective to impose global discriminative structure while retaining generative fidelity. Introduces ‘informative embeddings’ technique for enriched feature extraction from encoder-decoder models without additional training.
Result: Outperforms MIM and InfoNCE on classification and regression tasks across vision and molecular benchmarks while preserving competitive reconstruction quality. Less sensitive to batch size than InfoNCE and doesn’t require positive data augmentation.
Conclusion: cMIM serves as a unified framework for representation learning that effectively serves both discriminative and generative applications, advancing the goal of models that balance these capabilities.
Abstract: Learning representations that transfer well to diverse downstream tasks remains a central challenge in representation learning. Existing paradigms – contrastive learning, self-supervised masking, and denoising auto-encoders – balance this challenge with different trade-offs. We introduce the contrastive Mutual Information Machine (cMIM), a probabilistic framework that extends the Mutual Information Machine (MIM) with a contrastive objective. While MIM maximizes mutual information between inputs and latents and promotes clustering of codes, it falls short on discriminative tasks. cMIM addresses this gap by imposing global discriminative structure while retaining MIM's generative fidelity. Our contributions are threefold. First, we propose cMIM, a contrastive extension of MIM that removes the need for positive data augmentation and is substantially less sensitive to batch size than InfoNCE. Second, we introduce informative embeddings, a general technique for extracting enriched features from encoder-decoder models that boosts discriminative performance without additional training and applies broadly beyond MIM. Third, we provide empirical evidence across vision and molecular benchmarks showing that cMIM outperforms MIM and InfoNCE on classification and regression tasks while preserving competitive reconstruction quality. These results position cMIM as a unified framework for representation learning, advancing the goal of models that serve both discriminative and generative applications effectively.
[582] Metric-Guided Conformal Bounds for Probabilistic Image Reconstruction
Matt Y Cheung, Tucker J Netherton, Laurence E Court, Ashok Veeraraghavan, Guha Balakrishnan
Main category: cs.LG
TL;DR: A framework for computing provably valid prediction bounds on clinical metrics derived from probabilistic black-box image reconstruction algorithms using conformal prediction.
Details
Motivation: Deep learning reconstruction algorithms can produce realistic but inaccurate scans, making it difficult to provide statistically guaranteed claims about a subject's true state.
Method: Represent reconstructed scans with clinical metrics and calibrate bounds on ground truth metrics using conformal prediction with prior calibration data.
Result: The framework produces bounds with better semantic interpretation than pixel-based approaches and can flag dangerous outlier reconstructions that look plausible but have unlikely metric values.
Conclusion: The proposed framework provides interpretable feedback about subjects’ states and enables detection of misleading reconstructions in medical imaging tasks like CT for fat mass quantification and radiotherapy planning.
Abstract: Modern deep learning reconstruction algorithms generate impressively realistic scans from sparse inputs, but can often produce significant inaccuracies. This makes it difficult to provide statistically guaranteed claims about the true state of a subject from scans reconstructed by these algorithms. In this study, we propose a framework for computing provably valid prediction bounds on claims derived from probabilistic black-box image reconstruction algorithms. The key insights behind our framework are to represent reconstructed scans with a derived clinical metric of interest, and to calibrate bounds on the ground truth metric with conformal prediction (CP) using a prior calibration dataset. These bounds convey interpretable feedback about the subject’s state, and can also be used to retrieve nearest-neighbor reconstructed scans for visual inspection. We demonstrate the utility of this framework on sparse-view computed tomography (CT) for fat mass quantification and radiotherapy planning tasks. Results show that our framework produces bounds with better semantic interpretation than conventional pixel-based bounding approaches. Furthermore, we can flag dangerous outlier reconstructions that look plausible but have statistically unlikely metric values.
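The calibration step described above follows the standard split-conformal recipe; a minimal sketch is shown below, assuming paired calibration data (a scalar metric computed from reconstructions versus from ground truth). The function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def conformal_metric_bounds(metric_recon_cal, metric_true_cal, metric_recon_test, alpha=0.1):
    """Split conformal prediction on a scalar clinical metric.

    metric_recon_cal: metric computed from reconstructed calibration scans, shape (n,)
    metric_true_cal:  same metric computed from ground-truth scans, shape (n,)
    Returns lower/upper bounds on test subjects' true metric with ~(1-alpha) coverage.
    """
    scores = np.abs(metric_true_cal - metric_recon_cal)          # nonconformity scores
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample-corrected quantile
    return metric_recon_test - q, metric_recon_test + q

rng = np.random.default_rng(0)
true = rng.normal(30.0, 5.0, size=500)           # e.g. fat mass in kg (synthetic stand-in)
recon = true + rng.normal(0.0, 1.0, size=500)    # same metric measured on reconstructions
lo, hi = conformal_metric_bounds(recon[:400], true[:400], recon[400:])
print("empirical coverage:", np.mean((true[400:] >= lo) & (true[400:] <= hi)))
```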
[583] DistillKac: Few-Step Image Generation via Damped Wave Equations
Weiqiao Han, Chenlin Meng, Christopher D. Manning, Stefano Ermon
Main category: cs.LG
TL;DR: DistillKac is a fast image generator using damped wave equation and stochastic Kac representation for finite-speed probability transport, enabling high-quality sampling with few function evaluations while maintaining numerical stability.
Details
Motivation: To address issues in diffusion models where reverse time velocities can become stiff and allow unbounded propagation speed, and to enforce finite speed transport with globally bounded kinetic energy.
Method: Uses Kac dynamics with damped wave equation, introduces classifier-free guidance in velocity space, and proposes endpoint-only distillation where a student model learns to match a frozen teacher over long intervals.
Result: Experiments show DistillKac delivers high-quality samples with very few function evaluations while retaining numerical stability benefits of finite speed probability flows.
Conclusion: DistillKac provides an effective alternative to diffusion models by leveraging finite-speed probability transport through Kac dynamics, enabling fast and stable image generation with minimal computational cost.
Abstract: We present DistillKac, a fast image generator that uses the damped wave equation and its stochastic Kac representation to move probability mass at finite speed. In contrast to diffusion models whose reverse time velocities can become stiff and implicitly allow unbounded propagation speed, Kac dynamics enforce finite speed transport and yield globally bounded kinetic energy. Building on this structure, we introduce classifier-free guidance in velocity space that preserves square integrability under mild conditions. We then propose endpoint-only distillation that trains a student to match a frozen teacher over long intervals. We prove a stability result that promotes supervision at the endpoints to closeness along the entire path. Experiments demonstrate DistillKac delivers high-quality samples with very few function evaluations while retaining the numerical stability benefits of finite speed probability flows.
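The finite-speed transport underlying the method comes from the Kac (telegraph) process: a particle moves at a fixed speed and reverses direction at Poisson-distributed times, and its density solves a damped wave equation. Below is a minimal one-dimensional simulation of that process, as an illustration of the finite propagation speed, not the DistillKac model itself.

```python
import numpy as np

def simulate_kac(n_paths=10_000, speed=1.0, flip_rate=2.0, T=1.0, n_steps=1000):
    """Simulate 1D Kac/telegraph paths: dx = speed * v dt, v flips sign at rate flip_rate."""
    dt = T / n_steps
    rng = np.random.default_rng(0)
    x = np.zeros(n_paths)
    v = rng.choice([-1.0, 1.0], size=n_paths)        # random initial direction
    for _ in range(n_steps):
        x += speed * v * dt
        flips = rng.random(n_paths) < flip_rate * dt  # Euler discretization of Poisson flips
        v[flips] *= -1.0
    return x

x = simulate_kac()
# Finite propagation speed: no path can leave the "light cone" [-speed*T, speed*T].
print(x.min(), x.max(), np.abs(x).max() <= 1.0 + 1e-9)
```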
[584] Beyond Shallow Behavior: Task-Efficient Value-Based Multi-Task Offline MARL via Skill Discovery
Xun Wang, Zhuoran Li, Hai Zhong, Longbo Huang
Main category: cs.LG
TL;DR: SD-CQL is a task-efficient multi-task offline MARL algorithm that discovers skills in latent space, evaluates fixed/variable actions separately, and uses conservative Q-learning with local value calibration for optimal action selection, achieving superior generalization without retraining.
Details
Motivation: Existing offline MARL methods are task-specific and require retraining for new tasks, causing redundancy and inefficiency, especially in domains with high interaction costs and risks.
Method: SD-CQL discovers skills in latent space by reconstructing next observations, evaluates fixed and variable actions separately, and applies conservative Q-learning with local value calibration to select optimal actions for each skill.
Result: Extensive experiments on StarCraft II show SD-CQL achieves best performance on 13 out of 14 task sets, with up to 68.9% improvement on individual task sets.
Conclusion: SD-CQL enables strong multi-task generalization from limited source tasks, eliminating the need for local-global alignment and demonstrating superior task-efficiency and generalization performance.
Abstract: As a data-driven approach, offline MARL learns superior policies solely from offline datasets, ideal for domains rich in historical data but with high interaction costs and risks. However, most existing methods are task-specific, requiring retraining for new tasks, leading to redundancy and inefficiency. To address this issue, we propose a task-efficient value-based multi-task offline MARL algorithm, Skill-Discovery Conservative Q-Learning (SD-CQL). Unlike existing methods decoding actions from skills via behavior cloning, SD-CQL discovers skills in a latent space by reconstructing the next observation, evaluates fixed and variable actions separately, and uses conservative Q-learning with local value calibration to select the optimal action for each skill. It eliminates the need for local-global alignment and enables strong multi-task generalization from limited, small-scale source tasks. Substantial experiments on StarCraft II demonstrate the superior generalization performance and task-efficiency of SD-CQL. It achieves the best performance on 13 out of 14 task sets, with up to 68.9% improvement on individual task sets.
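SD-CQL builds on conservative Q-learning, whose core idea is to penalize Q-values of out-of-distribution actions relative to the dataset's actions. The sketch below shows the standard single-agent CQL regularizer for discrete actions, without the paper's skill discovery or local value calibration; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """Conservative Q-learning loss for discrete actions.

    batch: dict of tensors with keys obs, action, reward, next_obs, done.
    """
    q_all = q_net(batch["obs"])                                   # (B, num_actions)
    q_data = q_all.gather(1, batch["action"].unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_q_net(batch["next_obs"]).max(dim=1).values
        td_target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q

    td_loss = F.mse_loss(q_data, td_target)
    # Conservative penalty: push down logsumexp over all actions, push up dataset actions.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * conservative
```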
[585] Uncertainty-Aware Knowledge Tracing Models
Joshua Mitton, Prarthana Bhattacharyya, Ralph Abboud, Simon Woodhead
Main category: cs.LG
TL;DR: This paper proposes adding uncertainty estimation to Knowledge Tracing models to detect when models make incorrect predictions, particularly when students choose distractors.
Details
Motivation: Current KT models focus on accuracy but fail when students choose distractors, leading to undetected errors. There's a need to capture predictive uncertainty to improve educational applications.
Method: Adds uncertainty estimation capabilities to KT models to capture when predictions are uncertain, with larger uncertainty aligning with the model's incorrect predictions.
Result: The research demonstrates that larger predictive uncertainty aligns with the model's incorrect predictions, and this uncertainty signal is informative for educational applications.
Conclusion: Uncertainty in KT models provides pedagogically useful information that can be valuable in educational platforms, especially in resource-limited settings where understanding student ability is crucial.
Abstract: The main focus of research on Knowledge Tracing (KT) models is on model developments with the aim of improving predictive accuracy. Most of these models make their most incorrect predictions when students choose a distractor, leading to student errors going undetected. We present an approach to add new capabilities to KT models by capturing predictive uncertainty and demonstrate that larger predictive uncertainty aligns with incorrect model predictions. We show that uncertainty in KT models is informative and that this signal would be pedagogically useful in an educational learning platform, including limited-resource settings where understanding student ability is necessary.
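One standard way to equip a KT model with predictive uncertainty is Monte Carlo dropout: keep dropout active at inference and use the spread (or entropy) of repeated predictions as the uncertainty signal. The paper's exact estimator is not specified in the summary, so the sketch below is a generic illustration under that assumption.

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, x, n_samples=32):
    """Predictive mean and entropy for a binary correct/incorrect KT prediction.

    model: any torch module containing dropout layers, mapping x -> logit.
    """
    model.train()                      # keep dropout stochastic at inference time
    probs = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    p = probs.mean(dim=0)              # predictive probability of a correct answer
    eps = 1e-8
    entropy = -(p * (p + eps).log() + (1 - p) * (1 - p + eps).log())
    return p, entropy                  # high entropy ~ prediction likely unreliable
```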
[586] $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization
Yuandong Tian
Main category: cs.LG
TL;DR: The paper proposes a mathematical framework called Li₂ to analyze grokking (delayed generalization) in 2-layer nonlinear networks, identifying three key stages: lazy learning, independent feature learning, and interactive feature learning, characterized by gradient structure.
Details
Motivation: To understand the mathematical principles behind grokking behavior - what features emerge, how they develop, and under what conditions this delayed generalization occurs in neural networks with complex structured inputs.
Method: The Li₂ framework analyzes backpropagated gradient structure G_F across layers. It studies gradient dynamics, energy function optimization, and feature emergence in group arithmetic tasks, examining effects of hyperparameters like weight decay and learning rate.
Result: The framework reveals that independent feature learning follows gradient ascent of an energy function E, with local maxima corresponding to emerging features. It shows how hidden nodes interact and how gradients focus on missing features, leading to provable scaling laws for memorization and generalization.
Conclusion: The study provides mathematical understanding of grokking dynamics, explains why optimizers like Muon work effectively, and offers insights into the roles of key hyperparameters. The analysis framework can be extended to multi-layer architectures.
Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework to characterize what kind of features emerge, how and in which conditions it happens from training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) Lazy learning, (II) independent feature learning and (III) interactive feature learning, characterized by the structure of backpropagated gradient $G_F$ across layers. In (I), $G_F$ is random, and the top layer overfits to random hidden representation. In (II), the gradient of each node (column of $G_F$) only depends on its own activation, and thus each hidden node learns its representation independently from $G_F$, which now carries information about target labels, thanks to weight decay. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of memorization and generalization, and reveals the underlying reason why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.
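The group-arithmetic setting used in the analysis is the standard grokking testbed: a small 2-layer network trained on modular addition with weight decay, where training accuracy saturates long before test accuracy jumps. A minimal version of that experimental setup (the substrate, not the Li₂ analysis itself) under assumed hyperparameters:

```python
import torch

p = 97                                              # modular addition: (a + b) mod p
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
train, test = perm[: len(pairs) // 2], perm[len(pairs) // 2:]

emb = torch.nn.Embedding(p, 128)
net = torch.nn.Sequential(torch.nn.Linear(256, 512), torch.nn.ReLU(), torch.nn.Linear(512, p))
params = list(emb.parameters()) + list(net.parameters())
# Weight decay is the knob the framework highlights for feature emergence.
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)

def forward(idx):
    x = torch.cat([emb(pairs[idx, 0]), emb(pairs[idx, 1])], dim=-1)
    return net(x)

for step in range(20_000):                          # grokking may need many steps
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(forward(train), labels[train])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            acc = (forward(test).argmax(-1) == labels[test]).float().mean()
        print(step, f"train_loss={loss.item():.3f}", f"test_acc={acc:.3f}")
```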
[587] TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning
Hongyang He, Xinyuan Song, Yangfan He, Zeyu Zhang, Yanshu Li, Haochen You, Lifan Sun, Wenqiao Zhang
Main category: cs.LG
TL;DR: TRiCo is a triadic game-theoretic co-training framework for semi-supervised learning that uses a teacher, two students, and an adversarial generator in a unified training paradigm, achieving state-of-the-art performance.
Details
Motivation: To address limitations in existing SSL frameworks like static view interactions, unreliable pseudo-labels, and lack of hard sample modeling by creating a more structured and robust learning interaction.
Method: Formulates SSL as a Stackelberg game with three roles: two student classifiers on complementary representations, a meta-learned teacher for adaptive pseudo-label selection and loss balancing, and a non-parametric generator for adversarial perturbations. Uses mutual information for pseudo-label selection instead of confidence.
Result: Extensive experiments on CIFAR-10, SVHN, STL-10, and ImageNet show TRiCo consistently achieves state-of-the-art performance in low-label regimes while being architecture-agnostic and compatible with frozen vision backbones.
Conclusion: TRiCo provides a principled and generalizable solution to semi-supervised learning by formalizing the learning process as a triadic game-theoretic interaction, effectively addressing key limitations of existing SSL frameworks.
Abstract: We introduce TRiCo, a novel triadic game-theoretic co-training framework that rethinks the structure of semi-supervised learning by incorporating a teacher, two students, and an adversarial generator into a unified training paradigm. Unlike existing co-training or teacher-student approaches, TRiCo formulates SSL as a structured interaction among three roles: (i) two student classifiers trained on frozen, complementary representations, (ii) a meta-learned teacher that adaptively regulates pseudo-label selection and loss balancing via validation-based feedback, and (iii) a non-parametric generator that perturbs embeddings to uncover decision boundary weaknesses. Pseudo-labels are selected based on mutual information rather than confidence, providing a more robust measure of epistemic uncertainty. This triadic interaction is formalized as a Stackelberg game, where the teacher leads strategy optimization and students follow under adversarial perturbations. By addressing key limitations in existing SSL frameworks, such as static view interactions, unreliable pseudo-labels, and lack of hard sample modeling, TRiCo provides a principled and generalizable solution. Extensive experiments on CIFAR-10, SVHN, STL-10, and ImageNet demonstrate that TRiCo consistently achieves state-of-the-art performance in low-label regimes, while remaining architecture-agnostic and compatible with frozen vision backbones.
[588] Preemptive Detection and Steering of LLM Misalignment via Latent Reachability
Sathwik Karnik, Somil Bansal
Main category: cs.LG
TL;DR: BRT-Align is a reachability-based framework that brings control-theoretic safety tools to LLM inference, providing runtime monitoring and steering to prevent harmful content generation.
Details
Motivation: Current safety approaches like RLHF only work during training but offer no safeguards at inference time, where unsafe continuations may still arise, creating urgent safety concerns for widely used LLMs.
Method: Models autoregressive generation as a dynamical system in latent space and learns a safety value function via backward reachability to estimate worst-case trajectory evolution, enabling runtime monitoring and least-restrictive steering filters.
Result: BRT-Align provides more accurate and earlier detection of unsafe continuations than baselines, substantially reduces unsafe generations while preserving diversity and coherence, and produces responses that are less violent, profane, offensive, and politically biased.
Conclusion: Reachability analysis provides a principled and practical foundation for inference-time LLM safety, offering effective safeguards during generation rather than just during training.
Abstract: Large language models (LLMs) are now ubiquitous in everyday tools, raising urgent safety concerns about their tendency to generate harmful content. The dominant safety approach – reinforcement learning from human feedback (RLHF) – effectively shapes model behavior during training but offers no safeguards at inference time, where unsafe continuations may still arise. We propose BRT-Align, a reachability-based framework that brings control-theoretic safety tools to LLM inference. BRT-Align models autoregressive generation as a dynamical system in latent space and learns a safety value function via backward reachability, estimating the worst-case evolution of a trajectory. This enables two complementary mechanisms: (1) a runtime monitor that forecasts unsafe completions several tokens in advance, and (2) a least-restrictive steering filter that minimally perturbs latent states to redirect generation away from unsafe regions. Experiments across multiple LLMs and toxicity benchmarks demonstrate that BRT-Align provides more accurate and earlier detection of unsafe continuations than baselines. Moreover, for LLM safety alignment, BRT-Align substantially reduces unsafe generations while preserving sentence diversity and coherence. Qualitative results further highlight emergent alignment properties: BRT-Align consistently produces responses that are less violent, less profane, less offensive, and less politically biased. Together, these findings demonstrate that reachability analysis provides a principled and practical foundation for inference-time LLM safety.
[589] Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration
Dongkyu Cho, Miao Zhang, Rumi Chunara
Main category: cs.LG
TL;DR: A query-based model collaboration framework that integrates expert domain knowledge to guide LLM-based data augmentation in healthcare, improving safety and performance over existing methods.
Details
Motivation: LLMs have strong generative capabilities for data augmentation but pose risks in high-stakes domains like healthcare due to potential generation of clinically incorrect information. There's a gap between LLM augmentation potential and safety requirements in specialized domains.
Method: Proposes a lightweight query-based model collaboration framework that integrates expert-level domain knowledge to guide the augmentation process and preserve critical medical information.
Result: Experiments on clinical prediction tasks show the approach consistently outperforms existing LLM augmentation methods while improving safety through reduced factual errors.
Conclusion: The framework successfully addresses the safety-performance trade-off in LLM-based data augmentation for specialized domains like healthcare, enabling safer and more effective augmentation.
Abstract: Data augmentation is a widely used strategy to improve model robustness and generalization by enriching training datasets with synthetic examples. While large language models (LLMs) have demonstrated strong generative capabilities for this purpose, their applications in high-stakes domains like healthcare present unique challenges due to the risk of generating clinically incorrect or misleading information. In this work, we propose a novel query-based model collaboration framework that integrates expert-level domain knowledge to guide the augmentation process to preserve critical medical information. Experiments on clinical prediction tasks demonstrate that our lightweight collaboration-based approach consistently outperforms existing LLM augmentation methods while improving safety through reduced factual errors. This framework addresses the gap between LLM augmentation potential and the safety requirements of specialized domains.
[590] A circuit for predicting hierarchical structure in-context in Large Language Models
Tankred Saanum, Can Demircan, Samuel J. Gershman, Eric Schulz
Main category: cs.LG
TL;DR: LLMs use adaptive induction heads to learn hierarchical patterns in-context, going beyond simple token repetition by learning what contextual information to attend to.
Details
Motivation: To understand if induction heads can support in-context learning of complex hierarchical patterns in language, not just simple repetition, since natural language often requires contextual integration.
Method: Designs synthetic in-context learning tasks with hierarchical dependencies, evaluates various LLMs on these sequences and natural language analogues, and analyzes how induction heads learn contextually.
Result: Found adaptive induction heads that learn what to attend to in-context, supported by attention heads that uncover latent contexts determining token transition relationships.
Conclusion: LLMs have learning induction heads that provide a complete mechanistic account of how they learn to predict higher-order repetitive patterns in-context.
Abstract: Large Language Models (LLMs) excel at in-context learning, the ability to use information provided as context to improve prediction of future tokens. Induction heads have been argued to play a crucial role for in-context learning in Transformer Language Models. These attention heads make a token attend to successors of past occurrences of the same token in the input. This basic mechanism supports LLMs’ ability to copy and predict repeating patterns. However, it is unclear if this same mechanism can support in-context learning of more complex repetitive patterns with hierarchical structure. Natural language is teeming with such cases: The article “the” in English usually prefaces multiple nouns in a text. When predicting which token succeeds a particular instance of “the”, we need to integrate further contextual cues from the text to predict the correct noun. If induction heads naively attend to all past instances of successor tokens of “the” in a context-independent manner, they cannot support this level of contextual information integration. In this study, we design a synthetic in-context learning task, where tokens are repeated with hierarchical dependencies. Here, attending uniformly to all successor tokens is not sufficient to accurately predict future tokens. Evaluating a range of LLMs on these token sequences and natural language analogues, we find adaptive induction heads that support prediction by learning what to attend to in-context. Next, we investigate how induction heads themselves learn in-context. We find evidence that learning is supported by attention heads that uncover a set of latent contexts, determining the different token transition relationships. Overall, we not only show that LLMs have induction heads that learn, but offer a complete mechanistic account of how LLMs learn to predict higher-order repetitive patterns in-context.
[591] Evidence for Limited Metacognition in LLMs
Christopher Ackerman
Main category: cs.LG
TL;DR: This paper introduces a novel methodology to quantitatively evaluate metacognitive abilities in LLMs, finding that frontier models since early 2024 show increasing evidence of metacognition through strategic deployment of internal state knowledge.
Details
Motivation: The possibility of LLM self-awareness and sentience has major safety and policy implications, but current measurement science is nascent, requiring better methods to evaluate metacognitive abilities.
Method: The approach eschews model self-reports and instead tests strategic deployment of internal state knowledge using two experimental paradigms, analyzing both behavioral responses and token probabilities.
Result: Frontier LLMs show increasingly strong evidence of metacognitive abilities including confidence assessment and answer anticipation, though these abilities are limited in resolution, context-dependent, and qualitatively different from humans.
Conclusion: LLMs demonstrate emerging metacognitive abilities that vary across models, suggesting post-training may play a role in developing these capabilities, with important implications for AI safety and understanding.
Abstract: The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states. Using two experimental paradigms, we demonstrate that frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically the ability to assess and utilize their own confidence in their ability to answer factual and reasoning questions correctly and the ability to anticipate what answers they would give and utilize that information appropriately. We buttress these behavioral findings with an analysis of the token probabilities returned by the models, which suggests the presence of an upstream internal signal that could provide the basis for metacognition. We further find that these abilities 1) are limited in resolution, 2) emerge in context-dependent manners, and 3) seem to be qualitatively different from those of humans. We also report intriguing differences across models of similar capabilities, suggesting that LLM post-training may have a role in developing metacognitive abilities.
[592] Machine Learning. The Science of Selection under Uncertainty
Yevgeny Seldin
Main category: cs.LG
TL;DR: This book provides statistical tools for deriving theoretical guarantees on machine learning outcomes under uncertainty, covering concentration inequalities, offline supervised learning generalization bounds, and online learning regret analysis.
Details
Motivation: To address the inherent uncertainty in machine learning due to random data sampling, which causes noisy empirical estimates and selection under uncertainty.
Method: Uses concentration inequalities (Markov’s, Chebyshev’s, Hoeffding’s, Bernstein’s, etc.) for controlling estimation errors, then applies these to derive generalization bounds in offline learning (Occam’s razor, VC analysis, PAC-Bayesian analysis) and regret bounds in online learning (stochastic/adversarial environments, full/bandit feedback).
Result: Develops comprehensive statistical framework for obtaining theoretical guarantees on learning outcomes across both offline and online settings.
Conclusion: The book establishes a unified statistical foundation for analyzing learning processes under uncertainty, providing tools for deriving performance guarantees in various machine learning scenarios.
Abstract: Learning, whether natural or artificial, is a process of selection. It starts with a set of candidate options and selects the more successful ones. In the case of machine learning the selection is done based on empirical estimates of prediction accuracy of candidate prediction rules on some data. Due to randomness of data sampling the empirical estimates are inherently noisy, leading to selection under uncertainty. The book provides statistical tools to obtain theoretical guarantees on the outcome of selection under uncertainty. We start with concentration of measure inequalities, which are the main statistical instrument for controlling how much an empirical estimate of expectation of a function deviates from the true expectation. The book covers a broad range of inequalities, including Markov’s, Chebyshev’s, Hoeffding’s, Bernstein’s, Empirical Bernstein’s, Unexpected Bernstein’s, kl, and split-kl. We then study the classical (offline) supervised learning and provide a range of tools for deriving generalization bounds, including Occam’s razor, Vapnik-Chervonenkis analysis, and PAC-Bayesian analysis. The latter is further applied to derive generalization guarantees for weighted majority votes. After covering the offline setting, we turn our attention to online learning. We present the space of online learning problems characterized by environmental feedback, environmental resistance, and structural complexity. A common performance measure in online learning is regret, which compares performance of an algorithm to performance of the best prediction rule in hindsight, out of a restricted set of prediction rules. We present tools for deriving regret bounds in stochastic and adversarial environments, and under full information and bandit feedback.
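As a concrete instance of the concentration inequalities the book starts from, Hoeffding's inequality for the mean of n i.i.d. [0,1]-valued samples gives |p̂ - p| ≤ sqrt(ln(2/δ)/(2n)) with probability at least 1-δ. A quick numerical check of that bound:

```python
import math
import numpy as np

def hoeffding_radius(n, delta):
    """Two-sided Hoeffding bound for the mean of n i.i.d. [0,1]-valued samples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

rng = np.random.default_rng(0)
n, delta, p = 1000, 0.05, 0.3
r = hoeffding_radius(n, delta)
# Fraction of repeated experiments where the empirical mean leaves the bound
# should be at most delta (Hoeffding is typically quite loose in practice).
violations = np.mean([abs(rng.binomial(n, p) / n - p) > r for _ in range(10_000)])
print(f"radius={r:.4f}, violation rate={violations:.4f} (<= {delta})")
```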
[593] Interpretable time series analysis with Gumbel dynamics
Yiliu Wang, Timothy Doyeon Kim, Eric Shea-Brown, Uygar Sümbül
Main category: cs.LG
TL;DR: The Gumbel Dynamical Model (GDM) is proposed to address limitations of switching dynamical systems by introducing continuous relaxation of discrete states and a Gumbel noise model, enabling smoother transitions, better approximation of non-stationary dynamics, and scalable gradient-based training.
Details
Motivation: Traditional switching dynamical systems struggle with smooth transitions, variable-speed dynamics, and stochastic mixtures due to their discrete nature, often resulting in spurious rapid switching on real-world datasets.
Method: GDM introduces continuous relaxation of discrete states and a Gumbel distribution-based noise model on the relaxed-discrete state space, making the model fully differentiable for gradient-based training while expanding available state dynamics.
Result: GDM successfully models soft, sticky states and transitions in stochastic settings, outperforms traditional methods on simulation datasets, and infers interpretable states in real-world time series with multiple dynamics where conventional approaches fail.
Conclusion: The Gumbel Dynamical Model provides an effective solution for modeling complex time series with smooth transitions and stochastic mixtures, offering improved interpretability and scalability compared to traditional switching dynamical systems.
Abstract: Switching dynamical systems can model complicated time series data while maintaining interpretability by inferring a finite set of dynamics primitives and explaining different portions of the observed time series with one of these primitives. However, due to the discrete nature of this set, such models struggle to capture smooth, variable-speed transitions, as well as stochastic mixtures of overlapping states, and the inferred dynamics often display spurious rapid switching on real-world datasets. Here, we propose the Gumbel Dynamical Model (GDM). First, by introducing a continuous relaxation of discrete states and a different noise model defined on the relaxed-discrete state space via the Gumbel distribution, GDM expands the set of available state dynamics, allowing the model to approximate smoother and non-stationary ground-truth dynamics more faithfully. Second, the relaxation makes the model fully differentiable, enabling fast and scalable training with standard gradient descent methods. We validate our approach on standard simulation datasets and highlight its ability to model soft, sticky states and transitions in a stochastic setting. Furthermore, we apply our model to two real-world datasets, demonstrating its ability to infer interpretable states in stochastic time series with multiple dynamics, a setting where traditional methods often fail.
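The continuous relaxation GDM builds on is the standard Gumbel-softmax trick: adding Gumbel noise to state logits and applying a temperature-controlled softmax yields differentiable "soft" discrete states. A minimal sketch of that relaxation (not the full dynamical model):

```python
import torch

def gumbel_softmax_sample(logits, tau=0.5):
    """Differentiable relaxed sample from a categorical over discrete states.

    logits: (..., n_states). Small tau gives near-one-hot (discrete-like) states;
    larger tau gives soft mixtures of states.
    """
    u = torch.rand_like(logits).clamp_min(1e-10)
    gumbel = -torch.log(-torch.log(u))            # Gumbel(0, 1) noise
    return torch.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.zeros(3)                           # three equally likely dynamics modes
print(gumbel_softmax_sample(logits, tau=0.1))     # nearly one-hot
print(gumbel_softmax_sample(logits, tau=5.0))     # soft mixture
```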
[594] Leveraging Big Data Frameworks for Spam Detection in Amazon Reviews
Mst Eshita Khatun, Halima Akter, Tasnimul Rehan, Toufiq Ahmed
Main category: cs.LG
TL;DR: This paper uses big data analytics and machine learning to detect spam reviews on Amazon, achieving 90.35% accuracy with Logistic Regression to improve online shopping trust.
Details
Motivation: Fraudulent product reviews mislead consumers and damage seller reputations, undermining trust in online shopping platforms.
Method: Employs advanced big data analytics and machine learning approaches on an Amazon review dataset, using a scalable big data framework to process large-scale data and extract features for fraud detection.
Result: Various machine learning classifiers were tested, with Logistic Regression achieving the highest accuracy of 90.35% in detecting spam reviews.
Conclusion: The research successfully contributes to creating a more trustworthy and transparent online shopping environment by accurately detecting fraudulent reviews.
Abstract: In this digital era, online shopping is common practice in our daily lives. Product reviews significantly influence consumer buying behavior and help establish buyer trust. However, the prevalence of fraudulent reviews undermines this trust by potentially misleading consumers and damaging the reputations of the sellers. This research addresses this pressing issue by employing advanced big data analytics and machine learning approaches on a substantial dataset of Amazon product reviews. The primary objective is to detect and classify spam reviews accurately, thereby enhancing the authenticity of reviews. Using a scalable big data framework, we efficiently process and analyze a large volume of review data, extracting key features indicative of fraudulent behavior. Our study illustrates the utility of various machine learning classifiers in detecting spam reviews, with Logistic Regression achieving an accuracy of 90.35%, thus contributing to a more trustworthy and transparent online shopping environment.
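The classifier comparison described above can be reproduced in miniature with a standard TF-IDF + logistic regression pipeline; the snippet below is a generic sketch with toy stand-in data, since the paper's actual big data framework, feature set, and split are not specified in the summary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the study uses a large Amazon review dataset.
texts = ["great product works as described", "BUY NOW best deal click link",
         "arrived on time quality is fine", "amazing!!! five stars visit my page"] * 50
labels = [0, 1, 0, 1] * 50                      # 1 = spam review

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.25, random_state=0)
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```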
[595] GenUQ: Predictive Uncertainty Estimates via Generative Hyper-Networks
Tian Yu Yen, Reese E. Jones, Ravi G. Patel
Main category: cs.LG
TL;DR: GenUQ is a measure-theoretic approach to uncertainty quantification in operator learning that uses generative hyper-networks to produce parameter distributions without needing likelihood functions, outperforming existing methods.
Details
Motivation: Traditional likelihood-based UQ methods in operator learning struggle when stochastic operators produce actions where likelihoods are difficult or impossible to construct, limiting uncertainty quantification capabilities.
Method: Introduces a generative hyper-network model that directly produces parameter distributions consistent with observed data, avoiding the need for explicit likelihood construction through measure-theoretic foundations.
Result: GenUQ outperforms other UQ methods in three benchmark problems: recovering manufactured operators, learning solution operators for stochastic elliptic PDEs, and modeling failure locations in porous steel under tension.
Conclusion: The GenUQ framework provides an effective likelihood-free approach to uncertainty quantification in operator learning that handles complex stochastic operators where traditional methods fail.
Abstract: Operator learning is a recently developed generalization of regression to mappings between functions. It promises to drastically reduce expensive numerical integration of PDEs to fast evaluations of mappings between functional states of a system, i.e., surrogate and reduced-order modeling. Operator learning has already found applications in several areas such as modeling sea ice, combustion, and atmospheric physics. Recent approaches towards integrating uncertainty quantification into the operator models have relied on likelihood-based methods to infer parameter distributions from noisy data. However, stochastic operators may yield actions from which a likelihood is difficult or impossible to construct. In this paper, we introduce GenUQ, a measure-theoretic approach to UQ that avoids constructing a likelihood by introducing a generative hyper-network model that produces parameter distributions consistent with observed data. We demonstrate that GenUQ outperforms other UQ methods in three example problems, recovering a manufactured operator, learning the solution operator to a stochastic elliptic PDE, and modeling the failure location of porous steel under tension.
[596] Task-Agnostic Federated Continual Learning via Replay-Free Gradient Projection
Seohyeon Cha, Huancheng Chen, Haris Vikalo
Main category: cs.LG
TL;DR: FedProTIP is a federated continual learning framework that uses gradient projection to mitigate catastrophic forgetting and includes task identity prediction for task-agnostic inference.
Details
Motivation: Address catastrophic forgetting in federated continual learning settings where data heterogeneity, communication constraints, and privacy concerns exacerbate the problem.
Method: Projects client updates onto orthogonal complement of previous task representations and uses lightweight task identity prediction with core bases from prior tasks.
Result: Significantly outperforms state-of-the-art methods in average accuracy, especially when task identities are unknown.
Conclusion: FedProTIP effectively mitigates forgetting in federated continual learning through gradient projection and task identity prediction.
Abstract: Federated continual learning (FCL) enables distributed client devices to learn from streaming data across diverse and evolving tasks. A major challenge to continual learning, catastrophic forgetting, is exacerbated in decentralized settings by the data heterogeneity, constrained communication and privacy concerns. We propose Federated gradient Projection-based Continual Learning with Task Identity Prediction (FedProTIP), a novel FCL framework that mitigates forgetting by projecting client updates onto the orthogonal complement of the subspace spanned by previously learned representations of the global model. This projection reduces interference with earlier tasks and preserves performance across the task sequence. To further address the challenge of task-agnostic inference, we incorporate a lightweight mechanism that leverages core bases from prior tasks to predict task identity and dynamically adjust the global model’s outputs. Extensive experiments across standard FCL benchmarks demonstrate that FedProTIP significantly outperforms state-of-the-art methods in average accuracy, particularly in settings where task identities are a priori unknown.
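The core projection step is easy to state: given an orthonormal basis U for the subspace spanned by previously learned representations, a client update g is replaced by its component in the orthogonal complement, g - U Uᵀ g. A minimal sketch, where the basis construction via SVD is an assumption rather than the paper's procedure:

```python
import torch

def project_to_orthogonal_complement(update, basis):
    """Remove the component of `update` lying in span(basis).

    update: flattened client update, shape (d,)
    basis:  orthonormal columns spanning past-task representations, shape (d, k)
    """
    return update - basis @ (basis.T @ update)

d, k = 1000, 20
past_feats = torch.randn(d, 500)                  # stand-in for stored representations
basis, _, _ = torch.linalg.svd(past_feats, full_matrices=False)
basis = basis[:, :k]                              # top-k directions of past tasks

g = torch.randn(d)
g_proj = project_to_orthogonal_complement(g, basis)
print(torch.norm(basis.T @ g_proj))               # ~0: no interference with old tasks
```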
[597] Causal Abstraction Inference under Lossy Representations
Kevin Xia, Elias Bareinboim
Main category: cs.LG
TL;DR: This paper introduces projected abstractions, a new framework that generalizes existing causal abstraction definitions to handle lossy abstraction functions where multiple low-level interventions map to the same high-level intervention.
Details
Motivation: Existing causal abstraction frameworks are limited because they assume abstract invariance conditions and don't work well with lossy abstraction functions where different low-level interventions can have the same high-level effect.
Method: The authors introduce projected abstractions that can accommodate lossy representations, show how to construct them from low-level models, and prove graphical criteria for identifying and estimating high-level causal queries from limited low-level data.
Result: The paper demonstrates that projected abstractions can effectively translate equivalent observational, interventional, and counterfactual causal queries from low to high-level, and shows experimental effectiveness in high-dimensional image settings.
Conclusion: Projected abstractions provide a more flexible framework for causal abstraction that handles lossy representations and enables practical estimation of high-level causal queries from limited low-level data.
Abstract: The study of causal abstractions bridges two integral components of human intelligence: the ability to determine cause and effect, and the ability to interpret complex patterns into abstract concepts. Formally, causal abstraction frameworks define connections between complicated low-level causal models and simple high-level ones. One major limitation of most existing definitions is that they are not well-defined when considering lossy abstraction functions in which multiple low-level interventions can have different effects while mapping to the same high-level intervention (an assumption called the abstract invariance condition). In this paper, we introduce a new type of abstractions called projected abstractions that generalize existing definitions to accommodate lossy representations. We show how to construct a projected abstraction from the low-level model and how it translates equivalent observational, interventional, and counterfactual causal queries from low to high-level. Given that the true model is rarely available in practice, we prove new graphical criteria for identifying and estimating high-level causal queries from limited low-level data. Finally, we experimentally show the effectiveness of projected abstraction models in high-dimensional image settings.
[598] LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning
Marco Paul E. Apolinario, Kaushik Roy
Main category: cs.LG
TL;DR: LANCE is a framework that uses one-shot higher-order SVD to compress activations for efficient on-device learning, reducing memory usage by up to 250× while maintaining accuracy comparable to full backpropagation.
Details
Motivation: On-device learning requires efficient fine-tuning and continual learning, but existing methods have high memory costs for storing activations during backpropagation and computational overhead from repeated low-rank decompositions.
Method: LANCE performs one-shot higher-order SVD to obtain a reusable low-rank subspace for activation projection, eliminating repeated decompositions and enabling task allocation to orthogonal subspaces for continual learning.
Result: LANCE reduces activation storage by up to 250× while maintaining comparable accuracy to full backpropagation on multiple datasets, and achieves competitive performance on continual learning benchmarks at a fraction of the memory cost.
Conclusion: LANCE provides a practical and scalable solution for efficient fine-tuning and continual learning on resource-constrained edge devices.
Abstract: On-device learning is essential for personalization, privacy, and long-term adaptation in resource-constrained environments. Achieving this requires efficient learning, both fine-tuning existing models and continually acquiring new tasks without catastrophic forgetting. Yet both settings are constrained by the high memory cost of storing activations during backpropagation. Existing activation compression methods reduce this cost but rely on repeated low-rank decompositions, introducing computational overhead. Also, such methods have not been explored for continual learning. We propose LANCE (Low-rank Activation Compression), a framework that performs one-shot higher-order Singular Value Decomposition (SVD) to obtain a reusable low-rank subspace for activation projection. This eliminates repeated decompositions, reducing both memory and computation. Moreover, fixed low-rank subspaces further enable on-device continual learning by allocating tasks to orthogonal subspaces without storing large task-specific matrices. Experiments show that LANCE reduces activation storage up to 250$\times$ while maintaining accuracy comparable to full backpropagation on CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200 datasets. On continual learning benchmarks (Split CIFAR-100, Split MiniImageNet, 5-Datasets), it achieves performance competitive with orthogonal gradient projection methods at a fraction of the memory cost. These results position LANCE as a practical and scalable solution for efficient fine-tuning and continual learning on edge devices.
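The one-shot compression idea can be sketched with an ordinary (matrix) SVD: fit a low-rank subspace to a reference batch of activations once, then store only the projected coefficients from then on. The paper uses a higher-order SVD over multi-dimensional activations; the 2-D version below is a simplified illustration.

```python
import torch

def fit_activation_subspace(activations, rank):
    """One-shot SVD on a reference batch; returns a reusable (features x rank) basis."""
    # activations: (num_samples, num_features)
    _, _, vh = torch.linalg.svd(activations, full_matrices=False)
    return vh[:rank].T                                    # (num_features, rank)

def compress(acts, basis):   return acts @ basis          # store (num_samples, rank)
def decompress(code, basis): return code @ basis.T        # approximate original acts

acts = torch.randn(256, 64) @ torch.randn(64, 4096)       # synthetic low-rank activations
basis = fit_activation_subspace(acts, rank=64)            # computed once, then reused
code = compress(acts, basis)                              # 4096/64 = 64x smaller to store
err = torch.norm(acts - decompress(code, basis)) / torch.norm(acts)
print("relative reconstruction error:", err.item())
```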
[599] PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters
Krishu K Thapa, Reet Barik, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath
Main category: cs.LG
TL;DR: Proposes dynamic switching from full parameter training to Low-Rank Adaptation (LoRA) during training to reduce computational costs while maintaining accuracy.
Details
Motivation: Training large models is resource-intensive, and most learning occurs in early stages with weight changes stabilizing later, suggesting potential for optimization.
Method: Identifies states of partial convergence and dynamically switches to LoRA, using hyperparameters to determine the switching point and to assign layer-specific ranks based on convergence level.
Result: Reduces trainable parameters to 10% of original size, achieves 3x throughput improvement, 1.5x reduction in training time per epoch, and 20% GPU memory reduction while preserving accuracy.
Conclusion: Dynamic LoRA switching effectively reduces training costs without compromising model performance, making large model training more efficient.
Abstract: Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. It is observed that most of the learning (larger changes in weights) takes place in the earlier stage of the training loop. These changes stabilize as training continues, enabling them to be captured by matrices of a low intrinsic rank. Therefore, we propose an approach to identify such states of partial convergence and dynamically switch from full parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible approach that leverages user-defined hyperparameters to determine the switching point and assign a rank specific to each module layer based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of its original size, resulting in a 3x improvement in throughput, and a 1.5x reduction in average training time per epoch, while also reducing GPU memory consumption by 20%.
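A sketch of the switching criterion: track the relative change of each layer's weights between checkpoints, and once the change falls below a threshold, freeze the layer and continue training it through a LoRA adapter whose rank depends on how far the layer has converged. The thresholds and the rank rule below are illustrative assumptions, not the paper's settings.

```python
import torch

def relative_change(w_new, w_old):
    return (torch.norm(w_new - w_old) / torch.norm(w_old)).item()

def pick_rank(change, max_rank=64, min_rank=4):
    """Less-converged layers (larger change) get a higher LoRA rank (illustrative rule)."""
    frac = min(change / 0.05, 1.0)           # 0.05: assumed 'still moving' scale
    return max(min_rank, int(max_rank * frac))

def maybe_switch_to_lora(layer, w_prev, threshold=0.01):
    change = relative_change(layer.weight.data, w_prev)
    if change >= threshold:
        return None                          # keep full-parameter training
    layer.weight.requires_grad_(False)       # freeze the converged base weight
    out_f, in_f = layer.weight.shape
    r = pick_rank(change)
    lora_A = torch.nn.Parameter(torch.randn(r, in_f) * 0.01)
    lora_B = torch.nn.Parameter(torch.zeros(out_f, r))  # B = 0: adapter starts as a no-op
    return lora_A, lora_B                    # forward pass adds (B @ A) @ x to the layer
```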
[600] Shoot from the HIP: Hessian Interatomic Potentials without derivatives
Andreas Burger, Luca Thiede, Nikolaj Rønne, Varinia Bernales, Nandita Vijaykumar, Tejs Vegge, Arghya Bhowmik, Alan Aspuru-Guzik
Main category: cs.LG
TL;DR: HIP proposes a deep learning model that directly predicts molecular Hessians (second derivatives of potential energy) without using automatic differentiation or finite differences, achieving significant speed and accuracy improvements.
Details
Motivation: Molecular Hessians are computationally expensive to calculate and scale poorly with system size, limiting their use in computational chemistry tasks like transition state search and vibrational analysis.
Method: Uses SE(3)-equivariant graph neural networks to construct symmetric Hessians from irreducible representation features up to degree l=2 during message passing, enabling direct prediction without differentiation.
Result: HIP Hessians are 1-2 orders of magnitude faster, more accurate, more memory efficient, easier to train, and scale better with system size compared to traditional methods.
Conclusion: The direct prediction approach enables superior performance across computational chemistry tasks including transition state search, geometry optimization, and vibrational analysis, with open-source code available.
Abstract: Fundamental tasks in computational chemistry, from transition state search to vibrational analysis, rely on molecular Hessians, which are the second derivatives of the potential energy. Yet, Hessians are computationally expensive to calculate and scale poorly with system size, with both quantum mechanical methods and neural networks. In this work, we demonstrate that Hessians can be predicted directly from a deep learning model, without relying on automatic differentiation or finite differences. We observe that one can construct SE(3)-equivariant, symmetric Hessians from irreducible representations (irrep) features up to degree $l$=2 computed during message passing in graph neural networks. This makes HIP Hessians one to two orders of magnitude faster, more accurate, more memory efficient, easier to train, and enables more favorable scaling with system size. We validate our predictions across a wide range of downstream tasks, demonstrating consistently superior performance for transition state search, accelerated geometry optimization, zero-point energy corrections, and vibrational analysis benchmarks. We open-source the HIP codebase and model weights to enable further development of the direct prediction of Hessians at https://github.com/BurgerAndreas/hip
[601] Blockwise Hadamard high-Rank Adaptation for Parameter-Efficient LLM Fine-Tuning
Feng Yu, Jia Hu, Geyong Min
Main category: cs.LG
TL;DR: BHRA is a parameter-efficient fine-tuning method that uses blockwise Hadamard-style modulation to achieve localized rank amplification while maintaining the same parameter footprint as other PEFT methods.
Details
Motivation: To address limitations of existing PEFT methods like LoRA (constrained by nominal rank) and HiRA (global modulation coupling), by enabling localized rank amplification without increasing parameter count.
Method: Partitions weight matrices into blocks and applies HiRA-style multiplicative modulation independently within each block, preserving PEFT parameter footprint while unlocking localized rank amplification.
Result: BHRA consistently outperforms strong PEFT baselines across eight commonsense reasoning tasks and two arithmetic benchmarks with various model sizes (Llama-3.2 1B/3B, Mistral-7B, Gemma-2 9B) under matched parameter budgets.
Conclusion: Blockwise design maintains rich spectra across rank budgets and mitigates collapse induced by global modulation, making BHRA an effective parameter-efficient fine-tuning approach.
Abstract: Parameter-efficient fine-tuning (PEFT) methods must be resource-efficient yet handle heterogeneous reasoning transformations, and classical low-rank adaptation (LoRA) is constrained by the nominal rank $r$. Hadamard-style extensions like HiRA raise the nominal rank but couple every update to the global energy pattern of the frozen weight matrix, while ABBA trades this inductive bias for fully learned dense intermediates. To address the limitation of global modulation, we propose Block Hadamard high-Rank Adaptation (BHRA), which partitions each weight matrix and applies HiRA-style multiplicative modulation independently within every block, preserving the PEFT parameter footprint while unlocking localized rank amplification. Our empirical analyses reveal that this blockwise design maintains rich spectra across rank budgets, mitigating the collapse induced by global modulation. Across eight commonsense reasoning tasks and two arithmetic benchmarks with Llama-3.2 1B/3B, Mistral-7B, and Gemma-2 9B, BHRA consistently surpasses strong PEFT baselines under matched parameter budgets.
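The update structure can be sketched directly: HiRA forms a multiplicative update W₀ ∘ (BA), and BHRA instead partitions W₀ into blocks and applies an independent low-rank Hadamard modulation within each block. The partitioning and initialization below are illustrative, not taken from the paper.

```python
import torch

def bhra_delta(w0, blocks=4, rank=2):
    """Blockwise Hadamard-style update: per-block w0_block * (B @ A).

    w0: frozen (out, in) weight; blocks: number of splits along each dimension.
    """
    out_f, in_f = w0.shape
    bo, bi = out_f // blocks, in_f // blocks
    delta = torch.zeros_like(w0)
    for i in range(blocks):
        for j in range(blocks):
            a = torch.randn(rank, bi) * 0.01      # per-block low-rank factors
            b = torch.zeros(bo, rank)             # zero init => update starts at 0
            block = w0[i*bo:(i+1)*bo, j*bi:(j+1)*bi]
            delta[i*bo:(i+1)*bo, j*bi:(j+1)*bi] = block * (b @ a)
    return delta

w0 = torch.randn(64, 64)
print(bhra_delta(w0).shape)   # same shape as w0; each block is modulated independently
```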
[602] Understanding and Enhancing Mask-Based Pretraining towards Universal Representations
Mingze Dong, Leda Wang, Yuval Kluger
Main category: cs.LG
TL;DR: Mask-based pretraining’s behavior is characterized through high-dimensional linear regression analysis, revealing novel insights and leading to a simple yet effective Randomly Random Mask AutoEncoding (R²MAE) method that outperforms standard masking schemes.
Details
Motivation: Despite the empirical success of mask-based pretraining in large-scale models across domains, its fundamental role and limits in learning data representations remain unclear, motivating a theoretical characterization.
Method: Theoretical analysis using high-dimensional minimum-norm linear regression to characterize mask-based pretraining behavior, followed by development of R²MAE - a simple pretraining scheme that randomly varies mask ratios to capture multi-scale features.
Result: R²MAE consistently outperforms standard and more complex masking schemes across vision, language, DNA sequence, and single-cell models, improving state-of-the-art performance without requiring optimal fixed mask ratio tuning.
Conclusion: The theoretical framework successfully explains mask-based pretraining behavior, and the simple R²MAE method demonstrates that varying mask ratios during training enables better multi-scale feature learning than fixed-ratio approaches.
Abstract: Mask-based pretraining has become a cornerstone of modern large-scale models across language, vision, and recently biology. Despite its empirical success, its role and limits in learning data representations have been unclear. In this work, we show that the behavior of mask-based pretraining can be directly characterized by test risk in high-dimensional minimum-norm (“ridge-less”) linear regression, without relying on further model specifications. Further analysis of linear models uncovers several novel aspects of mask-based pretraining. The theoretical framework and its implications have been validated across diverse neural architectures (including MLPs, CNNs, and Transformers) applied to both vision and language tasks. Guided by our theory, we propose an embarrassingly simple yet overlooked pretraining scheme named Randomly Random Mask AutoEncoding (R$^2$MAE), which enforces capturing multi-scale features from data and is able to outperform optimal fixed mask ratio settings in our linear model framework. We implement R$^2$MAE in vision, language, DNA sequence, and single-cell models, where it consistently outperforms standard and more complicated masking schemes, leading to improvements for state-of-the-art models. Our code is available at: https://github.com/MingzeDong/r2mae
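The scheme itself is one line on top of standard masked autoencoding: instead of a fixed mask ratio, sample the ratio itself at random for every batch (hence "randomly random"). A sketch of that masking step, with the ratio bounds as assumptions:

```python
import torch

def randomly_random_mask(tokens, min_ratio=0.1, max_ratio=0.9):
    """Mask a batch of token embeddings with a freshly sampled mask ratio.

    tokens: (batch, seq_len, dim). Returns a boolean keep-mask for visible tokens.
    """
    batch, seq_len, _ = tokens.shape
    ratio = torch.empty(1).uniform_(min_ratio, max_ratio).item()  # new ratio per batch
    n_masked = int(seq_len * ratio)
    scores = torch.rand(batch, seq_len)
    idx = scores.argsort(dim=1)                  # random permutation per sequence
    keep = torch.ones(batch, seq_len, dtype=torch.bool)
    keep.scatter_(1, idx[:, :n_masked], False)   # mask the first n_masked positions
    return keep                                  # the encoder sees only tokens[keep]

x = torch.randn(8, 196, 768)                     # e.g. ViT patch embeddings
print(randomly_random_mask(x).float().mean())    # visible fraction varies run to run
```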
[603] Limitations on Safe, Trusted, Artificial General Intelligence
Rina Panigrahy, Vatsal Sharan
Main category: cs.LG
TL;DR: The paper proves a fundamental incompatibility between mathematically defined safety, trust, and AGI - a safe and trusted AI system cannot be an AGI system.
Details
Motivation: To establish strict mathematical definitions for safety, trust, and AGI, and explore their fundamental relationships and limitations.
Method: Proposes formal mathematical definitions: safety as never making false claims, trust as assuming safety, and AGI as always matching/exceeding human capability. Uses proofs drawing from Gödel’s incompleteness theorems and Turing’s halting problem.
Result: Proves that for the formal definitions, a safe and trusted AI system cannot be an AGI system - there exist tasks solvable by humans but not by such systems.
Conclusion: There is a fundamental incompatibility between mathematically strict safety/trust and AGI, though real-world systems may use practical interpretations that avoid this limitation.
Abstract: Safety, trust and Artificial General Intelligence (AGI) are aspirational goals in artificial intelligence (AI) systems, and there are several informal interpretations of these notions. In this paper, we propose strict, mathematical definitions of safety, trust, and AGI, and demonstrate a fundamental incompatibility between them. We define safety of a system as the property that it never makes any false claims, trust as the assumption that the system is safe, and AGI as the property of an AI system always matching or exceeding human capability. Our core finding is that – for our formal definitions of these notions – a safe and trusted AI system cannot be an AGI system: for such a safe, trusted system there are task instances which are easily and provably solvable by a human but not by the system. We note that we consider strict mathematical definitions of safety and trust, and it is possible for real-world deployments to instead rely on alternate, practical interpretations of these notions. We show our results for program verification, planning, and graph reachability. Our proofs draw parallels to Gödel’s incompleteness theorems and Turing’s proof of the undecidability of the halting problem, and can be regarded as interpretations of Gödel’s and Turing’s results.
[604] DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models
Yinuo Ren, Wenhao Gao, Lexing Ying, Grant M. Rotskoff, Jiequn Han
Main category: cs.LG
TL;DR: DriftLite is a training-free particle-based method for inference-time adaptation of diffusion models that provides optimal stability control through variance and energy controlling guidance, outperforming existing approaches.
Details
Motivation: Existing guidance-based methods introduce bias while particle-based corrections suffer from weight degeneracy and high computational costs, creating a need for more efficient and stable inference-time adaptation.
Method: DriftLite exploits an unexplored degree of freedom in the Fokker-Planck equation between drift and particle potential, yielding Variance-Controlling Guidance (VCG) and Energy-Controlling Guidance (ECG) for optimal drift approximation with minimal overhead.
Result: Across Gaussian mixture models, particle systems, and protein-ligand co-folding problems, DriftLite consistently reduces variance and improves sample quality compared to pure guidance and sequential Monte Carlo baselines.
Conclusion: DriftLite provides a principled, efficient route for scalable inference-time adaptation of diffusion models through lightweight particle-based steering with provable stability control.
Abstract: We study inference-time scaling for diffusion models, where the goal is to adapt a pre-trained model to new target distributions without retraining. Existing guidance-based methods are simple but introduce bias, while particle-based corrections suffer from weight degeneracy and high computational cost. We introduce DriftLite, a lightweight, training-free particle-based approach that steers the inference dynamics on the fly with provably optimal stability control. DriftLite exploits a previously unexplored degree of freedom in the Fokker-Planck equation between the drift and particle potential, and yields two practical instantiations: Variance- and Energy-Controlling Guidance (VCG/ECG) for approximating the optimal drift with minimal overhead. Across Gaussian mixture models, particle systems, and large-scale protein-ligand co-folding problems, DriftLite consistently reduces variance and improves sample quality over pure guidance and sequential Monte Carlo baselines. These results highlight a principled, efficient route toward scalable inference-time adaptation of diffusion models.
[605] Differentiable Structure Learning for General Binary Data
Chang Deng, Bryon Aragam
Main category: cs.LG
TL;DR: A differentiable structure learning framework for discrete data that captures arbitrary dependencies, addressing limitations of existing methods that assume specific structural equation models and ignore complex dependence structures.
Details
Motivation: Existing methods assume specific structural equation models that may not match true data-generating processes, ignore complex dependence structures in discrete data, and consider only linear effects, limiting their general applicability.
Method: Proposes a differentiable structure learning framework formulated as a single differentiable optimization task that captures arbitrary dependencies among discrete variables, avoiding unrealistic simplifications of previous methods.
Result: The approach can characterize the complete set of compatible parameters and structures, establishes identifiability up to Markov equivalence under mild assumptions, and empirically demonstrates effective capture of complex relationships in discrete data.
Conclusion: The proposed framework provides a more general and effective approach for differentiable structure learning in discrete data by handling arbitrary dependencies and avoiding restrictive assumptions of previous methods.
Abstract: Existing methods for differentiable structure learning in discrete data typically assume that the data are generated from specific structural equation models. However, these assumptions may not align with the true data-generating process, which limits the general applicability of such methods. Furthermore, current approaches often ignore the complex dependence structure inherent in discrete data and consider only linear effects. We propose a differentiable structure learning framework that is capable of capturing arbitrary dependencies among discrete variables. We show that although general discrete models are unidentifiable from purely observational data, it is possible to characterize the complete set of compatible parameters and structures. Additionally, we establish identifiability up to Markov equivalence under mild assumptions. We formulate the learning problem as a single differentiable optimization task in the most general form, thereby avoiding the unrealistic simplifications adopted by previous methods. Empirical results demonstrate that our approach effectively captures complex relationships in discrete data.
[606] RED-DiffEq: Regularization by denoising diffusion models for solving inverse PDE problems with application to full waveform inversion
Siming Shan, Min Zhu, Youzuo Lin, Lu Lu
Main category: cs.LG
TL;DR: RED-DiffEq integrates physics-driven inversion with data-driven learning using pretrained diffusion models as regularization for PDE-governed inverse problems, showing improved accuracy and generalization in full waveform inversion.
Details
Motivation: PDE-governed inverse problems face challenges with nonlinearity, ill-posedness, and noise sensitivity, requiring robust computational frameworks.
Method: Integrates physics-driven inversion with data-driven learning using pretrained diffusion models as a regularization mechanism.
Result: Enhanced accuracy and robustness in full waveform inversion compared to conventional methods, with strong generalization to unseen complex velocity models.
Conclusion: RED-DiffEq provides an effective framework for PDE-governed inverse problems that can be directly applied to diverse applications beyond seismic imaging.
Abstract: Partial differential equation (PDE)-governed inverse problems are fundamental across various scientific and engineering applications; yet they face significant challenges due to nonlinearity, ill-posedness, and sensitivity to noise. Here, we introduce a new computational framework, RED-DiffEq, by integrating physics-driven inversion and data-driven learning. RED-DiffEq leverages pretrained diffusion models as a regularization mechanism for PDE-governed inverse problems. We apply RED-DiffEq to solve the full waveform inversion problem in geophysics, a challenging seismic imaging technique that seeks to reconstruct high-resolution subsurface velocity models from seismic measurement data. Our method shows enhanced accuracy and robustness compared to conventional methods. Additionally, it exhibits strong generalization ability to more complex velocity models that the diffusion model is not trained on. Our framework can also be directly applied to diverse PDE-governed inverse problems.
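The abstract names the framework but not the update rule. As a rough illustration, the classic Regularization-by-Denoising (RED) pattern that the title alludes to alternates data-fidelity gradient steps with a denoiser-residual penalty; the sketch below shows that generic pattern with a placeholder `denoiser` standing in for the pretrained diffusion model. All function names and hyperparameters here are illustrative assumptions, not the paper's API.

```python
import numpy as np

def red_inversion(y, forward_op, adjoint_op, denoiser,
                  n_iters=200, step=0.1, lam=0.1):
    """Generic Regularization-by-Denoising (RED) loop: gradient descent on
    0.5 * ||A(x) - y||^2 with the RED prior gradient lam * (x - D(x)).
    `denoiser` is a placeholder for a pretrained diffusion-model denoiser."""
    x = adjoint_op(y)  # crude initialization from the observed data
    for _ in range(n_iters):
        data_grad = adjoint_op(forward_op(x) - y)  # fidelity gradient A^T(Ax - y)
        prior_grad = lam * (x - denoiser(x))       # RED residual as prior gradient
        x = x - step * (data_grad + prior_grad)
    return x

# Toy usage: identity forward operator, moving-average filter as the "denoiser".
rng = np.random.default_rng(0)
x_true = np.sin(np.linspace(0, 4 * np.pi, 256))
y = x_true + 0.3 * rng.standard_normal(256)
smooth = lambda x: np.convolve(x, np.ones(5) / 5, mode="same")
x_hat = red_inversion(y, lambda x: x, lambda r: r, smooth)
```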
[607] A Systematic Review of Conformal Inference Procedures for Treatment Effect Estimation: Methods and Challenges
Pascal Memmesheimer, Vincent Heuveline, Jürgen Hesser
Main category: cs.LG
TL;DR: Systematic review of conformal prediction methods for treatment effect estimation, analyzing 11 key papers to identify state-of-the-art approaches and propose future research directions.
Details
Motivation: Treatment effect estimation is crucial for decision-making in high-stakes fields like healthcare and economics, but quantifying uncertainty in machine learning predictions remains challenging. Conformal prediction offers a solution with finite-sample coverage guarantees under minimal assumptions.
Method: Conducted a systematic review through a filtering process to select and analyze eleven key papers on conformal prediction methods for treatment effect estimation.
Result: Identified and described current state-of-the-art conformal prediction methods for treatment effect estimation, providing necessary theoretical background.
Conclusion: Conformal prediction holds significant potential for improving decision-making in high-stakes environments, and the review proposes directions for future research in this area.
Abstract: Treatment effect estimation is essential for informed decision-making in many fields such as healthcare, economics, and public policy. While flexible machine learning models have been widely applied for estimating heterogeneous treatment effects, quantifying the inherent uncertainty of their point predictions remains an issue. Recent advancements in conformal prediction address this limitation by allowing for inexpensive computation, as well as distribution shifts, while still providing frequentist, finite-sample coverage guarantees under minimal assumptions for any point-predictor model. This advancement holds significant potential for improving decision-making, especially in high-stakes environments. In this work, we perform a systematic review of conformal prediction methods for treatment effect estimation and provide the necessary theoretical background. Through a systematic filtering process, we select and analyze eleven key papers, identifying and describing current state-of-the-art methods in this area. Based on our findings, we propose directions for future research.
[608] MMPlanner: Zero-Shot Multimodal Procedural Planning with Chain-of-Thought Object State Reasoning
Afrina Tabassum, Bin Guo, Xiyao Ma, Hoda Eldardiry, Ismini Lourentzou
Main category: cs.LG
TL;DR: MMPlanner is a zero-shot multimodal procedural planning framework that uses Object State Reasoning Chain-of-Thought prompting to generate accurate text-image plans while maintaining object-state consistency across modalities.
Details
Motivation: Existing approaches for multimodal procedural planning often lack proper visual object-state alignment and systematic evaluation methods, leading to inconsistencies between text and images in generated plans.
Method: Proposes MMPlanner with Object State Reasoning Chain-of-Thought (OSR-CoT) prompting to explicitly model object-state transitions, and introduces LLM-as-a-judge protocols for evaluation plus a visual step-reordering task for temporal coherence measurement.
Result: Achieves state-of-the-art performance on RECIPEPLAN and WIKIPLAN datasets, improving textual planning by +6.8%, cross-modal alignment by +11.9%, and visual step ordering by +26.7%.
Conclusion: MMPlanner effectively addresses multimodal procedural planning challenges through explicit object-state reasoning and comprehensive evaluation protocols, demonstrating significant improvements across all measured metrics.
Abstract: Multimodal Procedural Planning (MPP) aims to generate step-by-step instructions that combine text and images, with the central challenge of preserving object-state consistency across modalities while producing informative plans. Existing approaches often leverage large language models (LLMs) to refine textual steps; however, visual object-state alignment and systematic evaluation are largely underexplored. We present MMPlanner, a zero-shot MPP framework that introduces Object State Reasoning Chain-of-Thought (OSR-CoT) prompting to explicitly model object-state transitions and generate accurate multimodal plans. To assess plan quality, we design LLM-as-a-judge protocols for planning accuracy and cross-modal alignment, and further propose a visual step-reordering task to measure temporal coherence. Experiments on RECIPEPLAN and WIKIPLAN show that MMPlanner achieves state-of-the-art performance, improving textual planning by +6.8%, cross-modal alignment by +11.9%, and visual step ordering by +26.7%.
[609] Logic of Hypotheses: from Zero to Full Knowledge in Neurosymbolic Integration
Davide Bizzaro, Alessandro Daniele
Main category: cs.LG
TL;DR: Logic of Hypotheses (LoH) is a novel neurosymbolic framework that unifies rule injection and rule learning through a choice operator with learnable parameters, enabling flexible integration of data-driven learning with symbolic priors and expert knowledge.
Details
Motivation: To bridge the gap between two main approaches in neurosymbolic integration: methods that inject hand-crafted rules into neural models and methods that induce symbolic rules from data, providing a unified framework that allows arbitrary degrees of knowledge specification.
Method: Extends propositional logic syntax with a choice operator that has learnable parameters and selects subformulas from a pool of options. Uses fuzzy logic to compile formulas into differentiable computational graphs, enabling learning via backpropagation. Employs Gödel fuzzy logic and the Gödel trick for discretization without performance loss.
Result: Strong experimental results on tabular data and the Visual Tic-Tac-Toe neurosymbolic task, while producing interpretable decision rules. The framework subsumes existing neurosymbolic models and enables flexible knowledge integration.
Conclusion: LoH provides a unified neurosymbolic framework that successfully integrates data-driven rule learning with symbolic priors, offering interpretable models that can be discretized without performance degradation while maintaining the flexibility to incorporate varying degrees of expert knowledge.
Abstract: Neurosymbolic integration (NeSy) blends neural-network learning with symbolic reasoning. The field can be split between methods injecting hand-crafted rules into neural models, and methods inducing symbolic rules from data. We introduce Logic of Hypotheses (LoH), a novel language that unifies these strands, enabling the flexible integration of data-driven rule learning with symbolic priors and expert knowledge. LoH extends propositional logic syntax with a choice operator, which has learnable parameters and selects a subformula from a pool of options. Using fuzzy logic, formulas in LoH can be directly compiled into a differentiable computational graph, so the optimal choices can be learned via backpropagation. This framework subsumes some existing NeSy models, while adding the possibility of arbitrary degrees of knowledge specification. Moreover, the use of Gödel fuzzy logic and the recently developed Gödel trick yields models that can be discretized to hard Boolean-valued functions without any loss in performance. We provide experimental analysis on such models, showing strong results on tabular data and on the Visual Tic-Tac-Toe NeSy task, while producing interpretable decision rules.
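To make the choice operator concrete, here is a minimal sketch under Gödel fuzzy semantics (AND = min, OR = max): a softmax over learnable logits mixes candidate subformulas during training, and an argmax hardens the choice afterward. The `Choice` module and the two-option formula are hypothetical simplifications; the paper's language and its Gödel-trick discretization are more general.

```python
import torch
import torch.nn as nn

def godel_and(a, b): return torch.minimum(a, b)   # Gödel t-norm
def godel_or(a, b):  return torch.maximum(a, b)   # Gödel t-conorm

class Choice(nn.Module):
    """Hypothetical choice operator: a soft, learnable selection among
    candidate subformulas, relaxed with a softmax over logits.
    At test time, argmax discretizes it to a single hard choice."""
    def __init__(self, n_options):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_options))

    def forward(self, options, hard=False):
        # options: tensor of shape (n_options, batch) holding fuzzy truth values
        if hard:
            return options[self.logits.argmax()]
        w = torch.softmax(self.logits, dim=0)
        return (w[:, None] * options).sum(dim=0)

# Formula sketch: choose(x1 AND x2, x1 OR x2) -- which connective fits the
# data is learned by backpropagation through the fuzzy semantics.
choice = Choice(2)
x = torch.rand(2, 8)  # two propositional inputs, batch of 8
opts = torch.stack([godel_and(x[0], x[1]), godel_or(x[0], x[1])])
y = choice(opts)      # differentiable truth value in [0, 1]
```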
[610] DIM: Enforcing Domain-Informed Monotonicity in Deep Neural Networks
Joshua Salim, Jordan Yu, Xilei Zhao
Main category: cs.LG
TL;DR: Proposes DIM regularization method that enforces domain-informed monotonicity in deep neural networks to prevent overfitting by penalizing violations of expected trends while maintaining predictive power.
Details
Motivation: Deep learning models often overfit and memorize training data rather than learning generalizable patterns, due to their complex structure and large number of parameters.
Method: Enforces monotonicity by penalizing violations relative to a linear baseline, using a mathematical framework that establishes a linear reference, measures deviations from monotonic behavior, and integrates these into the training objective.
Result: Experiments on real-world ridesourcing data and synthetic datasets show that even modest monotonicity constraints consistently enhance model performance across various neural network architectures.
Conclusion: DIM effectively improves predictive performance of deep neural networks by applying domain-informed monotonicity constraints to regularize model behavior and mitigate overfitting.
Abstract: While deep learning models excel at predictive tasks, they often overfit due to their complex structure and large number of parameters, causing them to memorize training data, including noise, rather than learn patterns that generalize to new data. To tackle this challenge, this paper proposes a new regularization method, i.e., Enforcing Domain-Informed Monotonicity in Deep Neural Networks (DIM), which maintains domain-informed monotonic relationships in complex deep learning models to further improve predictions. Specifically, our method enforces monotonicity by penalizing violations relative to a linear baseline, effectively encouraging the model to follow expected trends while preserving its predictive power. We formalize this approach through a comprehensive mathematical framework that establishes a linear reference, measures deviations from monotonic behavior, and integrates these measurements into the training objective. We test and validate the proposed methodology using a real-world ridesourcing dataset from Chicago and a synthetically created dataset. Experiments across various neural network architectures show that even modest monotonicity constraints consistently enhance model performance. DIM enhances the predictive performance of deep neural networks by applying domain-informed monotonicity constraints to regularize model behavior and mitigate overfitting.
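A monotonicity penalty of this flavor can be written in a few lines. The sketch below is a hedged reading of the idea, not the paper's exact formulation: it penalizes finite-difference slopes that oppose the trend direction suggested by a linear baseline. `direction`, `eps`, and the feature index are illustrative parameters.

```python
import torch

def monotonicity_penalty(model, x, feature_idx, direction=+1.0, eps=1e-2):
    """Hedged sketch of a domain-informed monotonicity regularizer: penalize
    places where the model's output moves against the expected trend
    (`direction`, e.g. the sign of a linear baseline's coefficient) when
    feature `feature_idx` is nudged upward."""
    x_pert = x.clone()
    x_pert[:, feature_idx] += eps
    delta = model(x_pert) - model(x)            # finite-difference slope proxy
    violation = torch.relu(-direction * delta)  # positive only when the trend is violated
    return violation.mean()

# Illustrative training-loop usage:
#   loss = mse(model(x), y) + lam * monotonicity_penalty(model, x, feature_idx=3)
```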
[611] Neuroprobe: Evaluating Intracranial Brain Responses to Naturalistic Stimuli
Andrii Zahorodnii, Christopher Wang, Bennett Stankovits, Charikleia Moraitaki, Geeling Chau, Andrei Barbu, Boris Katz, Ila R Fiete
Main category: cs.LG
TL;DR: Neuroprobe is a standardized benchmark suite for evaluating iEEG foundation models, built on the BrainTreebank dataset with 40 hours of recordings from 10 subjects during movie viewing tasks.
Details
Motivation: The lack of standardized evaluation frameworks for intracranial EEG (iEEG) recordings hinders progress in brain-computer interfaces and neurological treatments, requiring rigorous benchmarks to discriminate between competing modeling approaches.
Method: Built on the BrainTreebank dataset with 40 hours of iEEG recordings from 10 human subjects performing naturalistic movie viewing tasks. Provides decoding tasks for studying multi-modal language processing with high temporal and spatial resolution.
Result: Visualized information flow from superior temporal gyrus to prefrontal cortex, showing progression from simple auditory to complex language features. Found linear baseline surprisingly strong, beating frontier foundation models on many tasks.
Conclusion: Neuroprobe serves as both a neuroscience insight tool and rigorous evaluation framework for iEEG foundation models, enabling systematic comparison of architectures and training protocols with publicly available code and leaderboard.
Abstract: High-resolution neural datasets enable foundation models for the next generation of brain-computer interfaces and neurological treatments. The community requires rigorous benchmarks to discriminate between competing modeling approaches, yet no standardized evaluation frameworks exist for intracranial EEG (iEEG) recordings. To address this gap, we present Neuroprobe: a suite of decoding tasks for studying multi-modal language processing in the brain. Unlike scalp EEG, intracranial EEG requires invasive surgery to implant electrodes that record neural activity directly from the brain with minimal signal distortion. Neuroprobe is built on the BrainTreebank dataset, which consists of 40 hours of iEEG recordings from 10 human subjects performing a naturalistic movie viewing task. Neuroprobe serves two critical functions. First, it is a mine from which neuroscience insights can be drawn. Its high temporal and spatial resolution allows researchers to systematically determine when and where computations for each aspect of language processing occur in the brain by measuring the decodability of each feature across time and all electrode locations. Using Neuroprobe, we visualize how information flows from the superior temporal gyrus to the prefrontal cortex, and the progression from simple auditory features to more complex language features in a purely data-driven manner. Second, as the field moves toward neural foundation models, Neuroprobe provides a rigorous framework for comparing competing architectures and training protocols. We found that the linear baseline is surprisingly strong, beating frontier foundation models on many tasks. Neuroprobe is designed with computational efficiency and ease of use in mind. We make the code for Neuroprobe openly available and maintain a public leaderboard, aiming to enable rapid progress in the field of iEEG foundation models, at https://neuroprobe.dev/
[612] SlotFM: A Motion Foundation Model with Slot Attention for Diverse Downstream Tasks
Junyong Park, Oron Levy, Rebecca Adaimi, Asaf Liberman, Gierad Laput, Abdelkareem Bedri
Main category: cs.LG
TL;DR: SlotFM is an accelerometer foundation model that uses Time-Frequency Slot Attention to generate multiple embeddings capturing different signal components, enabling better generalization across diverse sensing tasks beyond standard activity recognition.
Details
Motivation: Existing foundation models for accelerometers primarily focus on classifying common daily activities, limiting their applicability to the broader range of tasks that rely on other signal characteristics.
Method: SlotFM uses Time-Frequency Slot Attention that processes both time and frequency representations of raw signals, generating multiple small embeddings (slots) that capture different signal components. It also introduces two loss regularizers for local structure and frequency patterns to improve reconstruction of fine-grained details.
Result: SlotFM outperforms existing self-supervised approaches on 13 out of 16 classification and regression downstream tasks, achieving comparable results on the remaining tasks. It yields a 4.5% average performance gain.
Conclusion: SlotFM demonstrates strong generalization for sensing foundation models, enabling broader applicability across diverse accelerometer-based tasks beyond standard human activity recognition.
Abstract: Wearable accelerometers are used for a wide range of applications, such as gesture recognition, gait analysis, and sports monitoring. Yet most existing foundation models focus primarily on classifying common daily activities such as locomotion and exercise, limiting their applicability to the broader range of tasks that rely on other signal characteristics. We present SlotFM, an accelerometer foundation model that generalizes across diverse downstream tasks. SlotFM uses Time-Frequency Slot Attention, an extension of Slot Attention that processes both time and frequency representations of the raw signals. It generates multiple small embeddings (slots), each capturing different signal components, enabling task-specific heads to focus on the most relevant parts of the data. We also introduce two loss regularizers that capture local structure and frequency patterns, which improve reconstruction of fine-grained details and helps the embeddings preserve task-relevant information. We evaluate SlotFM on 16 classification and regression downstream tasks that extend beyond standard human activity recognition. It outperforms existing self-supervised approaches on 13 of these tasks and achieves comparable results to the best performing approaches on the remaining tasks. On average, our method yields a 4.5% performance gain, demonstrating strong generalization for sensing foundation models.
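For readers unfamiliar with the building block, the sketch below is a stripped-down Slot Attention module (Locatello et al., 2020), the mechanism SlotFM extends to time-frequency inputs; layer norms and the learned slot variance are omitted for brevity, and nothing here reproduces SlotFM itself.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Simplified Slot Attention: slots compete for input features via a
    softmax over slots, then each slot updates its state with a GRU."""
    def __init__(self, n_slots, dim, n_iters=3):
        super().__init__()
        self.n_slots, self.n_iters = n_slots, n_iters
        self.scale = dim ** -0.5
        self.mu = nn.Parameter(torch.randn(1, n_slots, dim))  # slot initialization
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, inputs):                  # inputs: (B, N, dim)
        B = inputs.shape[0]
        slots = self.mu.expand(B, -1, -1)
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.n_iters):
            q = self.to_q(slots)
            attn = torch.softmax(self.scale * q @ k.transpose(1, 2), dim=1)  # compete over slots
            attn = attn / attn.sum(dim=-1, keepdim=True)                     # weighted mean over inputs
            updates = attn @ v                                               # (B, n_slots, dim)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])).reshape(B, self.n_slots, -1)
        return slots  # one embedding per slot, each capturing a signal component
```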
[613] Scalable Second-order Riemannian Optimization for $K$-means Clustering
Peng Xu, Chun-Ying Hou, Xiaohui Chen, Richard Y. Zhang
Main category: cs.LG
TL;DR: A new smooth unconstrained optimization formulation for K-means clustering using Riemannian manifold structure, solved with a second-order cubic-regularized Newton method that achieves faster convergence than first-order methods while maintaining optimal statistical accuracy.
Details
Motivation: Current relaxation algorithms for K-means clustering struggle to balance constraint feasibility and objective optimality, presenting challenges in computing second-order critical points with rigorous guarantees.
Method: Formulates K-means as a smooth unconstrained optimization over a submanifold, characterizes its Riemannian structures, and solves it using a second-order cubic-regularized Riemannian Newton algorithm with linear-time Newton subproblem solutions through manifold factorization.
Result: The proposed method converges significantly faster than state-of-the-art first-order nonnegative low-rank factorization methods while achieving similar optimal statistical accuracy.
Conclusion: The Riemannian manifold approach provides an effective framework for solving K-means clustering with improved computational efficiency and rigorous guarantees.
Abstract: Clustering is a hard discrete optimization problem. Nonconvex approaches such as low-rank semidefinite programming (SDP) have recently demonstrated promising statistical and local algorithmic guarantees for cluster recovery. Due to the combinatorial structure of the $K$-means clustering problem, current relaxation algorithms struggle to balance their constraint feasibility and objective optimality, presenting tremendous challenges in computing the second-order critical points with rigorous guarantees. In this paper, we provide a new formulation of the $K$-means problem as a smooth unconstrained optimization over a submanifold and characterize its Riemannian structures to allow it to be solved using a second-order cubic-regularized Riemannian Newton algorithm. By factorizing the $K$-means manifold into a product manifold, we show how each Newton subproblem can be solved in linear time. Our numerical experiments show that the proposed method converges significantly faster than the state-of-the-art first-order nonnegative low-rank factorization method, while achieving similarly optimal statistical accuracy.
[614] Prophecy: Inferring Formal Properties from Neuron Activations
Divya Gopinath, Corina S. Pasareanu, Muhammad Usman
Main category: cs.LG
TL;DR: Prophecy is a tool that automatically infers formal properties of feed-forward neural networks by extracting rules based on neuron activation patterns that imply desired output behaviors.
Details
Motivation: To automatically discover and verify formal properties of neural networks, capturing the logic embedded in hidden layer activations to enable various applications like verification, monitoring, and repair.
Method: Extracts rules based on neuron activations (values or on/off statuses) as preconditions that imply certain output properties, focusing on the activation patterns in inner layers of feed-forward networks.
Result: Demonstrated successful usage on different types of models and output properties, with applications including formal explanations, compositional verification, run-time monitoring, repair, and novel results for large vision-language models.
Conclusion: Prophecy provides an effective approach for automatically inferring formal network properties through neuron activation analysis, showing promising potential especially for large vision-language models.
Abstract: We present Prophecy, a tool for automatically inferring formal properties of feed-forward neural networks. Prophecy is based on the observation that a significant part of the logic of feed-forward networks is captured in the activation status of the neurons at inner layers. Prophecy works by extracting rules based on neuron activations (values or on/off statuses) as preconditions that imply a certain desirable output property, e.g., the prediction being a certain class. These rules represent network properties captured in the hidden layers that imply the desired output behavior. We present the architecture of the tool, highlight its features and demonstrate its usage on different types of models and output properties. We present an overview of its applications, such as inferring and proving formal explanations of neural networks, compositional verification, run-time monitoring, repair, and others. We also show novel results highlighting its potential in the era of large vision-language models.
[615] SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding
Thomas Walton, Darin Tsui, Aryan Musharaf, Amirali Aghazadeh
Main category: cs.LG
TL;DR: SpecMER is a speculative decoding framework that uses k-mer motifs from multiple sequence alignments to guide protein sequence generation, achieving 24-32% speedup while improving biological plausibility.
Details
Motivation: Autoregressive models for protein engineering suffer from high latency in sequential inference, limiting high-throughput screening. Standard speculative decoding uses draft models that ignore biological constraints, leading to implausible protein sequences.
Method: Introduces the SpecMER framework, which incorporates biological, structural, and functional priors using k-mer motifs from multiple sequence alignments. Scores candidate sequences in parallel and selects those most consistent with known biological patterns.
Result: Achieves 24-32% speedup over standard autoregressive decoding, with higher acceptance rates and improved sequence likelihoods while maintaining efficiency.
Conclusion: SpecMER successfully combines the efficiency of speculative decoding with biological guidance, enabling faster and more biologically plausible protein sequence generation for high-throughput applications.
Abstract: Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high-throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet, in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k-mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k-mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24-32% speedup over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.
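The k-mer guidance step is easy to picture: build a motif frequency profile from aligned sequences, then rank drafted candidates by how well their k-mers match it. The sketch below is a toy version of that scoring idea; the function names, the choice k=3, and the tiny MSA are all illustrative.

```python
from collections import Counter

def kmer_profile(sequences, k=3):
    """Build a k-mer frequency profile from an MSA-style set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def score_candidate(candidate, profile, k=3):
    """Score a drafted sequence by how well its k-mers match known motifs;
    a stand-in for the parallel candidate scoring described in the paper."""
    kmers = [candidate[i:i + k] for i in range(len(candidate) - k + 1)]
    return sum(profile.get(km, 0.0) for km in kmers) / max(len(kmers), 1)

# Toy usage: pick the draft most consistent with the motif profile.
msa = ["MKTAYIAKQR", "MKTAHIAKQR", "MKTAYIAQQR"]
profile = kmer_profile(msa)
drafts = ["MKTAYIAKQR", "MQQQQQQQQR"]
best = max(drafts, key=lambda s: score_candidate(s, profile))
```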
[616] Wav2Arrest 2.0: Long-Horizon Cardiac Arrest Prediction with Time-to-Event Modeling, Identity-Invariance, and Pseudo-Lab Alignment
Saurabh Kataria, Davood Fattahi, Minxiao Wang, Ran Xiao, Matthew Clark, Timothy Ruchti, Mark Mai, Xiao Hu
Main category: cs.LG
TL;DR: The paper proposes three orthogonal improvements for PPG-based cardiac arrest prediction systems: time-to-event modeling, patient-identity invariant features via adversarial training, and pseudo-lab regression using pre-trained auxiliary networks.
Details
Motivation: Current physiological foundation models based on PPG can predict critical events like cardiac arrest, but their powerful representations are not fully leveraged when downstream data/labels are scarce. The goal is to improve PPG-only CA systems using minimal auxiliary information.
Method: 1) Time-to-event modeling through regression or discrete survival modeling; 2) Patient-identity invariant features via adversarial training using p-vector biometric identification; 3) Regression on pseudo-lab values from pre-trained auxiliary estimator networks; 4) Multi-task formulation with PCGrad optimization to handle gradient conflicts.
Result: The proposed methods independently improve 24-hour time-averaged AUC from 0.74 to 0.78-0.80 range, with primary improvements over longer time horizons and minimal degradation near the event.
Conclusion: The three orthogonal approaches significantly enhance cardiac arrest prediction performance, particularly for early warning systems, by addressing data scarcity through time-to-event modeling, identity deconfounding, and pseudo-lab enrichment.
Abstract: High-frequency physiological waveform modality offers deep, real-time insights into patient status. Recently, physiological foundation models based on Photoplethysmography (PPG), such as PPG-GPT, have been shown to predict critical events, including Cardiac Arrest (CA). However, their powerful representation still needs to be leveraged suitably, especially when the downstream data/label is scarce. We offer three orthogonal improvements to improve PPG-only CA systems by using minimal auxiliary information. First, we propose to use time-to-event modeling, either through simple regression to the event onset time or by pursuing fine-grained discrete survival modeling. Second, we encourage the model to learn CA-focused features by making them patient-identity invariant. This is achieved by first training the largest-scale de-identified biometric identification model, referred to as the p-vector, and subsequently using it adversarially to deconfound cues, such as person identity, that may cause overfitting through memorization. Third, we propose regression on the pseudo-lab values generated by pre-trained auxiliary estimator networks. This is crucial since true blood lab measurements, such as lactate, sodium, troponin, and potassium, are collected sparingly. Via zero-shot prediction, the auxiliary networks can enrich cardiac arrest waveform labels and generate pseudo-continuous estimates as targets. Our proposals can independently improve the 24-hour time-averaged AUC from the 0.74 to the 0.78-0.80 range. We primarily improve over longer time horizons with minimal degradation near the event, thus pushing the Early Warning System research. Finally, we pursue multi-task formulation and diagnose it with a high gradient conflict rate among competing losses, which we alleviate via the PCGrad optimization technique.
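PCGrad, the gradient-surgery technique the authors use to ease multi-task conflicts, projects each task's gradient onto the normal plane of any other task gradient it conflicts with. A minimal sketch follows, assuming flattened per-task gradients; it illustrates the published algorithm generically rather than this paper's training loop.

```python
import random
import torch

def pcgrad(grads):
    """PCGrad (Yu et al., 2020): for each task gradient, remove its component
    along any other task gradient it conflicts with (negative dot product).
    `grads` is a list of flattened, 1-D per-task gradient tensors."""
    projected = []
    for gi in grads:
        g = gi.clone()
        others = [gj for gj in grads if gj is not gi]
        random.shuffle(others)  # random order, as in the original algorithm
        for gj in others:
            dot = torch.dot(g, gj)
            if dot < 0:  # conflicting gradient: project onto its normal plane
                g = g - (dot / gj.norm() ** 2) * gj
        projected.append(g)
    return torch.stack(projected).sum(dim=0)  # combined update direction
```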
[617] Exact Subgraph Isomorphism Network for Predictive Graph Mining
Taiga Kojima, Masayuki Karasuyama
Main category: cs.LG
TL;DR: EIN combines exact subgraph enumeration with neural networks and sparse regularization for graph-level prediction, achieving high performance and interpretability.
Details
Motivation: Building graph-level prediction models with both high discriminative ability and interpretability is challenging. Subgraph information is crucial for graph-level tasks.
Method: Exact subgraph enumeration combined with neural networks and sparse regularization. Uses a pruning strategy to handle computational complexity while maintaining performance.
Result: EIN achieves sufficiently high prediction performance compared to standard graph neural networks. Enables post-hoc analysis through identified important subgraphs.
Conclusion: EIN successfully combines subgraph enumeration with neural networks to achieve both high performance and interpretability in graph-level prediction tasks.
Abstract: In the graph-level prediction task (predict a label for a given graph), the information contained in subgraphs of the input graph plays a key role. In this paper, we propose Exact subgraph Isomorphism Network (EIN), which combines exact subgraph enumeration, a neural network, and sparse regularization. In general, building a graph-level prediction model achieving high discriminative ability along with interpretability is still a challenging problem. Our combination of subgraph enumeration and a neural network contributes to high discriminative ability about the subgraph structure of the input graph. Further, the sparse regularization in EIN enables us 1) to derive an effective pruning strategy that mitigates the computational difficulty of the enumeration while maintaining the prediction performance, and 2) to identify important subgraphs that contribute to high interpretability. We empirically show that EIN has sufficiently high prediction performance compared with standard graph neural network models, and we show examples of post-hoc analysis based on the selected subgraphs.
[618] Downscaling human mobility data based on demographic socioeconomic and commuting characteristics using interpretable machine learning methods
Yuqin Jiang, Andrey A. Popov, Tianle Duan, Qingchun Li
Main category: cs.LG
TL;DR: A machine learning framework for downscaling taxi trip flows from larger to smaller spatial units in NYC, using four models with sensitivity analysis to interpret variable importance.
Details
Motivation: Understanding urban human mobility patterns at various spatial levels is essential for social science and improving transportation services.
Method: Used Linear Regression, Random Forest, Support Vector Machine, and Neural Networks to correlate OD trips with demographic, socioeconomic, and commuting characteristics, with perturbation-based sensitivity analysis.
Result: Linear regression failed to capture complex interactions, NN performed best with training/testing data, while SVM showed best generalization in downscaling.
Conclusion: The methodology provides both analytical advancement and practical applications for improving transportation services and urban development.
Abstract: Understanding urban human mobility patterns at various spatial levels is essential for social science. This study presents a machine learning framework to downscale origin-destination (OD) taxi trip flows in New York City from a larger spatial unit to a smaller spatial unit. First, correlations between OD trips and demographic, socioeconomic, and commuting characteristics are developed using four models: Linear Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and Neural Networks (NN). Second, a perturbation-based sensitivity analysis is applied to interpret variable importance for nonlinear models. The results show that the linear regression model failed to capture the complex variable interactions. While NN performs best with the training and testing datasets, SVM shows the best generalization ability in downscaling performance. The methodology presented in this study provides both analytical advancement and practical applications to improve transportation services and urban development.
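Perturbation-based sensitivity analysis of the kind described can be sketched as follows: jitter one feature at a time and record the mean absolute change in model predictions. The noise scale, repeat count, and function names below are illustrative assumptions, not the study's exact protocol.

```python
import numpy as np

def perturbation_importance(predict, X, scale=0.1, n_repeats=5, seed=0):
    """Perturb one feature at a time with Gaussian noise and measure the
    mean absolute change in predictions; larger change suggests a more
    influential feature for the (possibly nonlinear) model `predict`."""
    rng = np.random.default_rng(seed)
    base = predict(X)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] += scale * X[:, j].std() * rng.standard_normal(len(X))
            deltas.append(np.abs(predict(Xp) - base).mean())
        importance[j] = np.mean(deltas)  # average over noise realizations
    return importance
```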
[619] PQFed: A Privacy-Preserving Quality-Controlled Federated Learning Framework
Weiqi Yue, Wenbiao Li, Yuzhou Jiang, Anisa Halimi, Roger French, Erman Ayday
Main category: cs.LG
TL;DR: PQFed is a privacy-preserving personalized federated learning framework that uses clustering to estimate dataset similarity and enables clients to collaborate with compatible partners before training, improving performance even with limited participants.
Details
Motivation: Address data heterogeneity challenges in federated learning by focusing on early-stage quality control rather than post-training local adaptation, to improve global model performance.
Method: Extracts representative features from client raw data, applies clustering to estimate inter-client dataset similarity, implements a client selection strategy for compatible data distributions, and integrates with existing FL algorithms.
Result: Consistently improves target client’s model performance on CIFAR-10 and MNIST datasets, even with limited participants. Outperforms baseline cluster-based algorithm IFCA in low-participation scenarios.
Conclusion: PQFed demonstrates scalability and effectiveness in personalized federated learning settings through early-stage quality control and compatible client collaboration.
Abstract: Federated learning enables collaborative model training without sharing raw data, but data heterogeneity consistently challenges the performance of the global model. Traditional optimization methods often rely on collaborative global model training involving all clients, followed by local adaptation to improve individual performance. In this work, we focus on early-stage quality control and propose PQFed, a novel privacy-preserving personalized federated learning framework that designs customized training strategies for each client prior to the federated training process. PQFed extracts representative features from each client’s raw data and applies clustering techniques to estimate inter-client dataset similarity. Based on these similarity estimates, the framework implements a client selection strategy that enables each client to collaborate with others who have compatible data distributions. We evaluate PQFed on two benchmark datasets, CIFAR-10 and MNIST, integrated with three existing federated learning algorithms. Experimental results show that PQFed consistently improves the target client’s model performance, even with a limited number of participants. We further benchmark PQFed against a baseline cluster-based algorithm, IFCA, and observe that PQFed also achieves better performance in low-participation scenarios. These findings highlight PQFed’s scalability and effectiveness in personalized federated learning settings.
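The pre-training quality-control step can be pictured as clustering per-client data summaries and restricting collaboration to the target client's cluster. The sketch below is a hedged approximation: the mean-embedding summary, the k-means choice, and `n_clusters` are stand-ins for whatever representation and clustering PQFed actually uses.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_compatible_clients(client_features, target_id, n_clusters=3):
    """Summarize each client's raw data by a feature vector (here, a mean
    embedding), cluster the summaries, and let the target client collaborate
    only with clients that fall in the same cluster."""
    summaries = np.stack([f.mean(axis=0) for f in client_features])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(summaries)
    return [cid for cid, lab in enumerate(labels)
            if lab == labels[target_id] and cid != target_id]

# Usage: run any standard FL algorithm (e.g. FedAvg) over the returned cohort.
```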
[620] A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems
Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Christopher Ré, Scott W. Linderman
Main category: cs.LG
TL;DR: The paper presents a unified framework using linear dynamical systems (LDSs) to understand parallel evaluation methods for sequential models, showing how Newton, Picard, and Jacobi iterations emerge as approximate linearizations of nonlinear recursions.
Details
Motivation: To address the challenge of parallelizing sequential models in machine learning by providing a common theoretical foundation for existing fixed-point methods.
Method: Develops a framework based on linear dynamical systems that unifies different fixed-point iteration schemes (Newton, Picard, Jacobi) as approximate linearizations of nonlinear recursions.
Result: The framework reveals shared principles behind parallel evaluation techniques and clarifies when specific fixed-point methods are most effective.
Conclusion: The LDS-based framework provides a clearer theoretical foundation for parallelizing sequential models and suggests new opportunities for efficient and scalable computation.
Abstract: Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. This unifying view highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, our framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
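The core idea is concrete enough to sketch: treat the whole trajectory as the unknown and apply fixed-point sweeps, so every time step updates in parallel from the previous sweep's states. Below is a minimal Jacobi-style iteration on a toy contractive recursion; for a contraction it recovers the sequential answer in at most T sweeps. All names are illustrative.

```python
import numpy as np

def jacobi_unroll(f, x0, T, n_sweeps=50):
    """Evaluate the sequential recursion x_t = f(x_{t-1}) by Jacobi-style
    fixed-point sweeps: every time step is updated from the *previous*
    sweep's states, so each sweep is embarrassingly parallel across t."""
    xs = np.tile(x0, (T + 1, 1)).astype(float)  # guess for the whole trajectory
    for _ in range(n_sweeps):
        xs_new = xs.copy()
        xs_new[1:] = f(xs[:-1])  # all time steps updated at once
        if np.allclose(xs_new, xs, atol=1e-10):
            break
        xs = xs_new
    return xs

# Toy usage: a contractive nonlinear recursion converges to the sequential answer.
f = lambda x: 0.5 * np.tanh(x) + 0.1
traj = jacobi_unroll(f, x0=np.zeros(1), T=32)
```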
[621] Information-Theoretic Bayesian Optimization for Bilevel Optimization Problems
Takuya Kanayama, Yuki Ito, Tomoyuki Tamura, Masayuki Karasuyama
Main category: cs.LG
TL;DR: This paper proposes an information-theoretic Bayesian optimization method for bilevel optimization problems with expensive black-box functions at both upper and lower levels.
Details
Motivation: Bilevel optimization has a complex nested structure and has not been widely studied in Bayesian optimization compared to other extensions like multi-objective optimization. Both upper and lower levels involve expensive black-box functions that need efficient optimization.
Method: An information-theoretic approach that considers the information gain of both upper- and lower-level optimal solutions and values, with a practical lower-bound-based evaluation method.
Result: The proposed method was empirically demonstrated to be effective through several benchmark datasets.
Conclusion: The information-theoretic approach provides a unified criterion that simultaneously measures benefits for both level problems in bilevel optimization.
Abstract: A bilevel optimization problem consists of two optimization problems nested as an upper- and a lower-level problem, in which the optimality of the lower-level problem defines a constraint for the upper-level problem. This paper considers Bayesian optimization (BO) for the case that both the upper- and lower-levels involve expensive black-box functions. Because of its nested structure, bilevel optimization has a complex problem definition and, compared with other standard extensions of BO such as multi-objective or constraint settings, it has not been widely studied. We propose an information-theoretic approach that considers the information gain of both the upper- and lower-optimal solutions and values. This enables us to define a unified criterion that measures the benefit for both level problems, simultaneously. Further, we also show a practical lower bound based approach to evaluating the information gain. We empirically demonstrate the effectiveness of our proposed method through several benchmark datasets.
[622] Uncovering Alzheimer’s Disease Progression via SDE-based Spatio-Temporal Graph Deep Learning on Longitudinal Brain Networks
Houliang Zhou, Rong Zhou, Yangying Liu, Kanhao Zhao, Li Shen, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer’s Disease Neuroimaging Initiative
Main category: cs.LG
TL;DR: An interpretable spatio-temporal graph neural network framework using dual Stochastic Differential Equations (SDEs) to predict Alzheimer’s disease progression from irregularly-sampled longitudinal fMRI data, identifying key brain circuit abnormalities.
Details
Motivation: To address the challenge of identifying objective neuroimaging biomarkers for forecasting Alzheimer's disease progression, which remains difficult due to complex spatio-temporal brain network dysfunctions often overlooked by existing methods.
Method: Developed an interpretable spatio-temporal graph neural network framework leveraging dual Stochastic Differential Equations (SDEs) to model irregularly-sampled longitudinal fMRI data, validated on OASIS-3 and ADNI cohorts.
Result: The framework effectively learned sparse regional and connective importance probabilities, identifying key brain abnormalities in parahippocampal cortex, prefrontal cortex, parietal lobule, and disruptions in ventral attention, dorsal attention, and default mode networks that strongly correlate with AD clinical symptoms.
Conclusion: The approach demonstrates potential for early, individualized prediction of AD progression using spatio-temporal graph-based learning, revealing both established and novel neural systems-level and sex-specific biomarkers for understanding AD neurobiological mechanisms.
Abstract: Identifying objective neuroimaging biomarkers to forecast Alzheimer’s disease (AD) progression is crucial for timely intervention. However, this task remains challenging due to the complex dysfunctions in the spatio-temporal characteristics of underlying brain networks, which are often overlooked by existing methods. To address these limitations, we develop an interpretable spatio-temporal graph neural network framework to predict future AD progression, leveraging dual Stochastic Differential Equations (SDEs) to model the irregularly-sampled longitudinal functional magnetic resonance imaging (fMRI) data. We validate our approach on two independent cohorts, including the Open Access Series of Imaging Studies (OASIS-3) and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Our framework effectively learns sparse regional and connective importance probabilities, enabling the identification of key brain circuit abnormalities associated with disease progression. Notably, we detect the parahippocampal cortex, prefrontal cortex, and parietal lobule as salient regions, with significant disruptions in the ventral attention, dorsal attention, and default mode networks. These abnormalities correlate strongly with longitudinal AD-related clinical symptoms. Moreover, our interpretability strategy reveals both established and novel neural systems-level and sex-specific biomarkers, offering new insights into the neurobiological mechanisms underlying AD progression. Our findings highlight the potential of spatio-temporal graph-based learning for early, individualized prediction of AD progression, even in the context of irregularly-sampled longitudinal imaging data.
[623] POLO: Preference-Guided Multi-Turn Reinforcement Learning for Lead Optimization
Ziqing Wang, Yibo Wen, William Pattie, Xiao Luo, Weimin Wu, Jerry Yao-Chieh Hu, Abhishek Pandey, Han Liu, Kaize Ding
Main category: cs.LG
TL;DR: POLO introduces a novel reinforcement learning approach for lead optimization that learns from complete optimization trajectories using dual-level preference learning, achieving superior sample efficiency with limited oracle evaluations.
Details
Motivation: Traditional lead optimization methods struggle with sample efficiency, and existing LLM-based approaches fail to leverage the iterative nature of optimization by treating steps independently.
Method: POLO uses Preference-Guided Policy Optimization (PGPO) with dual-level learning: trajectory-level optimization reinforces successful strategies, and turn-level preference learning ranks intermediate molecules within trajectories.
Result: POLO achieves 84% average success rate on single-property tasks (2.3x better than baselines) and 50% on multi-property tasks using only 500 oracle evaluations.
Conclusion: POLO significantly advances sample-efficient molecular optimization by fully exploiting costly oracle calls through trajectory-based learning.
Abstract: Lead optimization in drug discovery requires efficiently navigating vast chemical space through iterative cycles to enhance molecular properties while preserving structural similarity to the original lead compound. Despite recent advances, traditional optimization methods struggle with sample efficiency: achieving good optimization performance with limited oracle evaluations. Large Language Models (LLMs) provide a promising approach through their in-context learning and instruction following capabilities, which align naturally with these iterative processes. However, existing LLM-based methods fail to leverage this strength, treating each optimization step independently. To address this, we present POLO (Preference-guided multi-turn Optimization for Lead Optimization), which enables LLMs to learn from complete optimization trajectories rather than isolated steps. At its core, POLO introduces Preference-Guided Policy Optimization (PGPO), a novel reinforcement learning algorithm that extracts learning signals at two complementary levels: trajectory-level optimization reinforces successful strategies, while turn-level preference learning provides dense comparative feedback by ranking intermediate molecules within each trajectory. Through this dual-level learning from intermediate evaluation, POLO achieves superior sample efficiency by fully exploiting each costly oracle call. Extensive experiments demonstrate that POLO achieves 84% average success rate on single-property tasks (2.3x better than baselines) and 50% on multi-property tasks using only 500 oracle evaluations, significantly advancing the state-of-the-art in sample-efficient molecular optimization.
[624] Brain PathoGraph Learning
Ciyuan Peng, Nguyen Linh Dan Le, Shan Jin, Dexuan Ding, Shuo Yu, Feng Xia
Main category: cs.LG
TL;DR: BrainPoG is a lightweight brain graph learning model that uses pathological pattern filtering and feature distillation to efficiently learn disease-related knowledge while reducing computational costs.
Details
Motivation: Existing brain graph learning methods struggle to selectively learn disease-related knowledge, leading to heavy parameters and computational costs that limit their practicality for real-world clinical applications.
Method: BrainPoG uses pathological pattern filtering to extract disease-relevant subgraphs (PathoGraph construction) and pathological feature distillation to reduce disease-irrelevant noise while enhancing pathological features.
Result: Extensive experiments on four benchmark datasets show BrainPoG exhibits superiority in both model performance and computational efficiency across various brain disease detection tasks.
Conclusion: BrainPoG enables efficient brain graph learning by exclusively learning informative disease-related knowledge while avoiding less relevant information, making it practical for clinical applications.
Abstract: Brain graph learning has demonstrated significant achievements in the fields of neuroscience and artificial intelligence. However, existing methods struggle to selectively learn disease-related knowledge, leading to heavy parameters and computational costs. This challenge diminishes their efficiency, as well as limits their practicality for real-world clinical applications. To this end, we propose a lightweight Brain PathoGraph Learning (BrainPoG) model that enables efficient brain graph learning by pathological pattern filtering and pathological feature distillation. Specifically, BrainPoG first contains a filter to extract the pathological pattern formulated by highly disease-relevant subgraphs, achieving graph pruning and lesion localization. A PathoGraph is therefore constructed by dropping less disease-relevant subgraphs from the whole brain graph. Afterwards, a pathological feature distillation module is designed to reduce disease-irrelevant noise features and enhance pathological features of each node in the PathoGraph. BrainPoG can exclusively learn informative disease-related knowledge while avoiding less relevant information, achieving efficient brain graph learning. Extensive experiments on four benchmark datasets demonstrate that BrainPoG exhibits superiority in both model performance and computational efficiency across various brain disease detection tasks.
[625] HyperCore: Coreset Selection under Noise via Hypersphere Models
Brian B. Moser, Arundhati S. Shanbhag, Tobias C. Nauen, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel
Main category: cs.LG
TL;DR: HyperCore is a robust coreset selection framework that uses hypersphere models and Youden’s J statistic to automatically prune noisy data without requiring fixed pruning ratios or hyperparameter tuning.
Details
Motivation: Existing coreset selection methods ignore annotation errors and require fixed pruning ratios, making them impractical for real-world noisy environments.
Method: Uses lightweight hypersphere models per class to embed in-class samples close to hypersphere centers, segregating out-of-class samples by distance. Employs Youden's J statistic to adaptively select pruning thresholds.
Result: Consistently outperforms state-of-the-art coreset selection methods, especially under noisy and low-data conditions. Effectively discards mislabeled and ambiguous points.
Conclusion: HyperCore produces compact, highly informative subsets suitable for scalable and noise-free learning in practical applications.
Abstract: The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden’s J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.
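The selection rule can be approximated in a few lines: score each sample by its distance to a class center and pick the pruning threshold that maximizes Youden's J. In the sketch below, the class centroid stands in for the learned hypersphere center, and the ROC construction over in-class versus out-of-class samples is an assumption about how the threshold is tuned.

```python
import numpy as np
from sklearn.metrics import roc_curve

def hypersphere_prune(feats, labels, cls):
    """Score samples by distance to the class centroid (a stand-in for a
    learned hypersphere center), then choose the pruning threshold that
    maximizes Youden's J = TPR - FPR, treating other-class samples as the
    'out' population. Returns indices of kept in-class samples."""
    center = feats[labels == cls].mean(axis=0)
    dist = np.linalg.norm(feats - center, axis=1)
    is_out = (labels != cls).astype(int)       # far-away samples should be out-of-class
    fpr, tpr, thresholds = roc_curve(is_out, dist)
    tau = thresholds[np.argmax(tpr - fpr)]     # Youden's J statistic
    return np.where((labels == cls) & (dist <= tau))[0]
```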
[626] SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection
Brian B. Moser, Tobias C. Nauen, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Joachim Folz, Andreas Dengel
Main category: cs.LG
TL;DR: SubZeroCore is a training-free coreset selection method that uses submodular coverage and density objectives to identify representative data subsets without requiring expensive training signals.
Details
Motivation: Existing coreset selection methods paradoxically require expensive training-based signals computed over the entire dataset before pruning, which undermines their very purpose of efficiency.
Method: Integrates submodular coverage and density into a unified objective with a sampling strategy based on closed-form solutions to balance these objectives, controlled by a single hyperparameter.
Result: Matches training-based baselines, significantly outperforms them at high pruning rates, dramatically reduces computational overhead, and shows superior robustness to label noise.
Conclusion: SubZeroCore provides an effective and scalable training-free coreset selection approach suitable for real-world scenarios.
Abstract: The goal of coreset selection is to identify representative subsets of datasets for efficient model training. Yet, existing approaches paradoxically require expensive training-based signals, e.g., gradients, decision boundary estimates or forgetting counts, computed over the entire dataset prior to pruning, which undermines their very purpose by requiring training on samples they aim to avoid. We introduce SubZeroCore, a novel, training-free coreset selection method that integrates submodular coverage and density into a single, unified objective. To achieve this, we introduce a sampling strategy based on a closed-form solution to optimally balance these objectives, guided by a single hyperparameter that explicitly controls the desired coverage for local density measures. Despite no training, extensive evaluations show that SubZeroCore matches training-based baselines and significantly outperforms them at high pruning rates, while dramatically reducing computational overhead. SubZeroCore also demonstrates superior robustness to label noise, highlighting its practical effectiveness and scalability for real-world scenarios.
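As a rough mental model only: a greedy selector that trades off marginal coverage against local density captures the flavor of a coverage-plus-density objective, although the paper itself uses a closed-form sampling rule rather than greedy search. Every name and constant below is illustrative.

```python
import numpy as np

def greedy_coverage_density(X, budget, radius, alpha=0.5):
    """Training-free, greedy illustration of a coverage/density trade-off:
    repeatedly add the point with the best mix of *new coverage* (neighbors
    within `radius` not yet covered) and *local density* (neighbor count)."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances (O(n^2))
    neighbors = d <= radius
    density = neighbors.sum(axis=1) / len(X)
    covered = np.zeros(len(X), dtype=bool)
    selected = []
    for _ in range(budget):
        gain = (~covered & neighbors).sum(axis=1) / len(X)  # marginal coverage
        score = alpha * gain + (1 - alpha) * density
        score[selected] = -np.inf                           # never re-pick a point
        best = int(np.argmax(score))
        selected.append(best)
        covered |= neighbors[best]
    return selected
```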
[627] Reparameterizing 4DVAR with neural fields
Jaemin Oh
Main category: cs.LG
TL;DR: Neural field-based 4DVAR reformulation enables parallel-in-time optimization for numerical weather prediction, improving stability and reducing computational cost without requiring ground-truth data.
Details
Motivation: Classical 4DVAR is computationally intensive and difficult to optimize due to time-sequential dependencies. There's a need for more efficient and stable data assimilation methods.
Method: Reparameterize the spatiotemporal state as a continuous neural network function, enabling parallel-in-time optimization. Incorporate physical constraints through physics-informed loss functions.
Result: Neural reparameterized variants produce more stable initial condition estimates without spurious oscillations compared to baseline 4DVAR. Method tested on 2D incompressible Navier-Stokes equations with Kolmogorov forcing.
Conclusion: The neural field approach successfully removes time-sequential dependencies, simplifies implementation, reduces computational cost, and works without requiring ground-truth states or reanalysis data.
Abstract: Four-dimensional variational data assimilation (4DVAR) is a cornerstone of numerical weather prediction, but its cost function is difficult to optimize and computationally intensive. We propose a neural field-based reformulation in which the full spatiotemporal state is represented as a continuous function parameterized by a neural network. This reparameterization removes the time-sequential dependency of classical 4DVAR, enabling parallel-in-time optimization in parameter space. Physical constraints are incorporated directly through a physics-informed loss, simplifying implementation and reducing computational cost. We evaluate the method on the two-dimensional incompressible Navier–Stokes equations with Kolmogorov forcing. Compared to a baseline 4DVAR implementation, the neural reparameterized variants produce more stable initial condition estimates without spurious oscillations. Notably, unlike most machine learning-based approaches, our framework does not require access to ground-truth states or reanalysis data, broadening its applicability to settings with limited reference information.
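The reparameterization itself is compact: the state becomes a coordinate network queried at arbitrary space-time points, so the cost decomposes over collocation points and needs no sequential model integration. A minimal PyTorch sketch with a placeholder residual (not the paper's Navier-Stokes physics loss):

```python
import torch
import torch.nn as nn

class NeuralField(nn.Module):
    """Continuous state u(t, x, y): coordinates in, field values out."""
    def __init__(self, width=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 2))                   # e.g. two velocity components

    def forward(self, txy):
        return self.net(txy)

field = NeuralField()
txy = torch.rand(1024, 3, requires_grad=True)      # space-time collocation points
u = field(txy)
# autograd supplies the PDE derivatives needed by a physics-informed loss
du = torch.autograd.grad(u[:, 0].sum(), txy, create_graph=True)[0]
u_t = du[:, 0]                                     # d/dt of the first component
loss = (u_t ** 2).mean()                           # placeholder residual term
loss.backward()                                    # parallel over all points
```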
[628] Machine Learning and AI Applied to fNIRS Data Reveals Novel Brain Activity Biomarkers in Stable Subclinical Multiple Sclerosis
Sadman Saumik Islam, Bruna Dalcin Baldasso, Davide Cattaneo, Xianta Jiang, Michelle Ploughman
Main category: cs.LG
TL;DR: fNIRS with machine learning detected brain activity biomarkers in MS patients during hand dexterity tasks, achieving 75% classification accuracy using K-NN, with supramarginal/angular gyri and precentral gyrus as key regions showing suppressed activity.
Details
Motivation: To detect subtle brain activity biomarkers that explain subjective cognitive fatigue in MS patients during hand dexterity tasks and identify targets for future brain stimulation treatments.
Method: Used fNIRS to measure brain hemodynamic responses in 15 MS patients and 12 controls during single and dual hand dexterity tasks, analyzed with machine learning (K-NN classifier) and XAI to identify important brain regions.
Result: K-NN achieved 75.0% accuracy for single tasks and 66.7% for dual tasks. Key regions were supramarginal/angular gyri and precentral gyrus in ipsilateral hemisphere, showing suppressed activity and slower neurovascular response in MS group. Deoxygenated hemoglobin was better predictor than oxygenated hemoglobin.
Conclusion: fNIRS with machine learning revealed novel brain activity biomarkers in MS patients, providing potential targets for personalized brain stimulation treatments to address cognitive fatigue during dexterous tasks.
Abstract: People with Multiple Sclerosis (MS) complain of problems with hand dexterity and cognitive fatigue. However, in many cases, impairments are subtle and difficult to detect. Functional near-infrared spectroscopy (fNIRS) is a non-invasive neuroimaging technique that measures brain hemodynamic responses during cognitive or motor tasks. We aimed to detect brain activity biomarkers that could explain subjective reports of cognitive fatigue while completing dexterous tasks and provide targets for future brain stimulation treatments. We recruited 15 people with MS who did not have hand (Nine Hole Peg Test [NHPT]), mobility, or cognitive impairment, and 12 age- and sex-matched controls. Participants completed two types of hand dexterity tasks with their dominant hand: single task and dual task (NHPT while holding a ball between the fifth finger and hypothenar eminence of the same hand). We analyzed fNIRS data (oxygenated and deoxygenated hemoglobin levels) using a machine learning framework to classify MS patients from controls based on their brain activation patterns in bilateral prefrontal and sensorimotor cortices. The K-Nearest Neighbor classifier achieved an accuracy of 75.0% for single manual dexterity tasks and 66.7% for the more complex dual manual dexterity tasks. Using XAI, we found that the most important brain regions contributing to the machine learning model were the supramarginal/angular gyri and the precentral gyrus (sensory integration and motor regions) of the ipsilateral hemisphere, with suppressed activity and slower neurovascular response in the MS group. During both tasks, deoxygenated hemoglobin levels were better predictors than the conventional measure of oxygenated hemoglobin. This nonconventional method of fNIRS data analysis revealed novel brain activity biomarkers that can help develop personalized brain stimulation targets.
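For readers unfamiliar with the classification step, a minimal scikit-learn sketch of the K-NN pipeline follows; the features here are synthetic placeholders for the real HbO/HbR channel features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(27, 40))        # 15 MS + 12 controls, 40 fNIRS features (synthetic)
y = np.array([1] * 15 + [0] * 12)    # 1 = MS, 0 = control

clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scores = cross_val_score(clf, X, y, cv=5)
print(f"K-NN cross-validated accuracy: {scores.mean():.3f}")
```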
[629] Beyond Formula Complexity: Effective Information Criterion Improves Performance and Interpretability for Symbolic Regression
Zihan Yu, Guanren Wang, Jingtao Ding, Huandong Wang, Yong Li
Main category: cs.LG
TL;DR: The paper proposes Effective Information Criterion (EIC) to improve symbolic regression by identifying mathematically unreasonable structures in formulas, enhancing interpretability and performance.
Details
Motivation: Existing symbolic regression methods use complexity metrics that only consider formula size but ignore mathematical structure, leading to formulas that are difficult to interpret despite compact forms.
Method: Proposes Effective Information Criterion (EIC) that treats formulas as information processing systems and identifies unreasonable structures through loss of significant digits or amplification of rounding noise during data flow.
Result: EIC improves performance on Pareto frontier, reduces irrational structures, boosts sample efficiency by 2-4x with generative algorithms, and shows 70.2% agreement with human experts on interpretability preferences.
Conclusion: EIC effectively measures formula interpretability by identifying unreasonable mathematical structures, bridging the gap between symbolic regression results and real-world physical formulas.
Abstract: Symbolic regression discovers accurate and interpretable formulas to describe given data, thereby providing scientific insights for domain experts and promoting scientific discovery. However, existing symbolic regression methods often use complexity metrics as a proxy for interpretability, which only considers the size of the formula but ignores its internal mathematical structure. Therefore, while they can discover formulas with compact forms, the discovered formulas often have structures that are difficult to analyze or interpret mathematically. In this work, inspired by the observation that physical formulas are typically numerically stable under limited calculation precision, we propose the Effective Information Criterion (EIC). It treats formulas as information processing systems with specific internal structures and identifies the unreasonable structure in them by the loss of significant digits or the amplification of rounding noise as data flows through the system. We find that this criterion reveals the gap between the structural rationality of models discovered by existing symbolic regression algorithms and real-world physical formulas. Combining EIC with various search-based symbolic regression algorithms improves their performance on the Pareto frontier and reduces the irrational structure in the results. Combining EIC with generative-based algorithms reduces the number of samples required for pre-training, improving sample efficiency by 2-4 times. Finally, for different formulas with similar accuracy and complexity, EIC shows a 70.2% agreement with 108 human experts’ preferences for formula interpretability, demonstrating that EIC, by measuring the unreasonable structures in formulas, actually reflects the formula’s interpretability.
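The significant-digit intuition can be probed numerically. The sketch below is a crude proxy for EIC, not the paper's criterion: it evaluates a formula in float32 and float64 and counts the digits that survive, showing that a cancellation-heavy form loses far more precision than its algebraically equivalent rewrite.

```python
import numpy as np

def digits_kept(f, x64):
    """Evaluate f in float32 vs float64 and estimate surviving significant digits."""
    y64 = f(x64)
    y32 = f(x64.astype(np.float32)).astype(np.float64)
    rel = np.abs(y32 - y64) / (np.abs(y64) + 1e-300)
    return -np.log10(np.median(rel) + 1e-17)

x = np.random.uniform(1e4, 2e4, 10_000)
print(digits_kept(lambda v: np.sqrt(v + 1) - np.sqrt(v), x))        # cancellation-heavy
print(digits_kept(lambda v: 1 / (np.sqrt(v + 1) + np.sqrt(v)), x))  # stable rewrite
```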
[630] FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning
Yizhou Zhang, Ning Lv, Teng Wang, Jisheng Dang
Main category: cs.LG
TL;DR: A concurrency-aware speculative decoding framework that dynamically adjusts drafting and verification strategies to accelerate GRPO training, achieving 2.35x-2.72x speedup.
Details
Motivation: GRPO's practical deployment is hindered by slow training due to computationally intensive autoregressive generation, which becomes the primary bottleneck under high-concurrency conditions.
Method: Proposes concurrency-aware speculative decoding with dynamic strategy adjustment and online draft learning to adapt the draft model to the evolving target model during training.
Result: Achieves end-to-end speedups of 2.35x to 2.72x across multiple mathematical reasoning datasets and models, significantly surpassing baseline approaches.
Conclusion: The proposed method effectively addresses GRPO’s training bottleneck through adaptive speculative decoding and continuous draft model adaptation, enabling practical deployment.
Abstract: Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at https://github.com/yedaotian9/GRPO_speculative.
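The concurrency-aware idea amounts to shortening speculative drafts as batch concurrency grows, since verification becomes compute-bound. A toy heuristic sketch; the paper's actual adjustment policy is not reproduced here.

```python
def draft_length(concurrency, k_max=8, k_min=1, saturation=64):
    """Shorten speculative drafts as the number of concurrent sequences grows."""
    frac = min(concurrency / saturation, 1.0)
    return max(k_min, round(k_max * (1.0 - frac)))

for c in (1, 16, 32, 64):
    print(f"concurrency={c:>2} -> draft tokens per step: {draft_length(c)}")
```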
[631] Exploring the Relationships Between Physiological Signals During Automated Fatigue Detection
Kourosh Kakhi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya
Main category: cs.LG
TL;DR: Multi-modal physiological signal fusion improves fatigue detection accuracy, with EMG-EEG combination using XGBoost achieving best performance.
Details
Motivation: Most fatigue detection studies focus on single modalities, but examining statistical relationships between signal pairs can improve classification robustness.
Method: Used DROZY dataset to extract features from ECG, EMG, EOG, and EEG across 15 signal combinations, evaluated with Decision Tree, Random Forest, Logistic Regression, and XGBoost classifiers.
Result: XGBoost with EMG-EEG combination achieved best performance. Multi-signal models consistently outperformed single-signal ones. SHAP analysis identified ECG-EOG correlation as key feature.
Conclusion: Feature-level fusion of physiological signals enhances accuracy, interpretability, and practical applicability of fatigue monitoring systems.
Abstract: Fatigue detection using physiological signals is critical in domains such as transportation, healthcare, and performance monitoring. While most studies focus on single modalities, this work examines statistical relationships between signal pairs to improve classification robustness. Using the DROZY dataset, we extracted features from ECG, EMG, EOG, and EEG across 15 signal combinations and evaluated them with Decision Tree, Random Forest, Logistic Regression, and XGBoost. Results show that XGBoost with the EMG-EEG combination achieved the best performance. SHAP analysis highlighted the ECG-EOG correlation as a key feature, and multi-signal models consistently outperformed single-signal ones. These findings demonstrate that feature-level fusion of physiological signals enhances accuracy, interpretability, and practical applicability of fatigue monitoring systems.
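A minimal sketch of the pairwise-relationship idea: Pearson correlations between signal pairs become the fusion features for a boosted-tree classifier (scikit-learn's gradient boosting stands in for XGBoost here; the data are synthetic placeholders).

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost

def pair_features(signals):
    """Cross-signal Pearson correlations (e.g. ECG-EOG) as fusion features."""
    names = sorted(signals)
    return np.array([np.corrcoef(signals[a], signals[b])[0, 1]
                     for a, b in combinations(names, 2)])

rng = np.random.default_rng(0)
X = np.stack([pair_features({k: rng.normal(size=500)
                             for k in ("ECG", "EMG", "EOG", "EEG")})
              for _ in range(40)])
y = rng.integers(0, 2, size=40)        # alert vs. fatigued labels (synthetic)
clf = GradientBoostingClassifier().fit(X, y)
```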
[632] ChaosNexus: A Foundation Model for Universal Chaotic System Forecasting with Multi-scale Representations
Chang Liu, Bohao Zhao, Jingtao Ding, Yong Li
Main category: cs.LG
TL;DR: ChaosNexus is a foundation model for chaotic systems forecasting that achieves state-of-the-art zero-shot generalization using a multi-scale architecture with Mixture-of-Experts layers, improving long-term attractor statistics by 40% and achieving competitive weather forecasting with minimal data.
Details
Motivation: Traditional chaotic system models lack generalization capacity for real-world applications due to system-specific training and data scarcity, requiring robust zero-shot/few-shot forecasting on novel scenarios.
Method: Proposes ChaosNexus foundation model with multi-scale ScaleFormer architecture augmented with Mixture-of-Experts layers, pre-trained on diverse chaotic dynamics corpus to capture universal patterns and system-specific behaviors.
Result: 40% improvement in long-term attractor statistics on 9,000+ synthetic systems; competitive 5-day global weather forecasting with <1 degree mean error zero-shot; performance improves with few-shot fine-tuning; shows generalization stems from training diversity rather than data volume.
Conclusion: ChaosNexus demonstrates effective cross-system generalization for chaotic forecasting, establishing that diversity of training systems is key for scientific foundation models rather than sheer data quantity.
Abstract: Accurately forecasting chaotic systems, prevalent in domains such as weather prediction and fluid dynamics, remains a significant scientific challenge. The inherent sensitivity of these systems to initial conditions, coupled with a scarcity of observational data, severely constrains traditional modeling approaches. Since these models are typically trained for a specific system, they lack the generalization capacity necessary for real-world applications, which demand robust zero-shot or few-shot forecasting on novel or data-limited scenarios. To overcome this generalization barrier, we propose ChaosNexus, a foundation model pre-trained on a diverse corpus of chaotic dynamics. ChaosNexus employs a novel multi-scale architecture named ScaleFormer augmented with Mixture-of-Experts layers, to capture both universal patterns and system-specific behaviors. The model demonstrates state-of-the-art zero-shot generalization across both synthetic and real-world benchmarks. On a large-scale testbed comprising over 9,000 synthetic chaotic systems, it improves the fidelity of long-term attractor statistics by more than 40% compared to the leading baseline. This robust performance extends to real-world applications with exceptional data efficiency. For instance, in 5-day global weather forecasting, ChaosNexus achieves a competitive zero-shot mean error below 1 degree, a result that further improves with few-shot fine-tuning. Moreover, experiments on the scaling behavior of ChaosNexus provide a guiding principle for scientific foundation models: cross-system generalization stems from the diversity of training systems, rather than sheer data volume.
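The paper's ScaleFormer is not reproduced here, but the Mixture-of-Experts ingredient it is augmented with can be sketched generically. The block below is a standard top-k-routed MoE in PyTorch, written as a dense loop for clarity rather than efficiency; it is illustrative only.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Top-k routed Mixture-of-Experts block over per-token features."""
    def __init__(self, dim=64, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, tokens, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out
```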
[633] Scaling Laws for Neural Material Models
Akshay Trikha, Kyle Chu, Advait Gosai, Parker Szachta, Eric Weiner
Main category: cs.LG
TL;DR: This paper analyzes scaling laws for deep learning models in material property prediction, examining how training data, model size, and compute affect performance using transformer and EquiformerV2 networks.
Details
Motivation: Predicting material properties is crucial for designing better batteries, semiconductors, and medical devices, and deep learning helps scientists quickly find promising materials by predicting energy, forces, and stresses.
Method: Trained transformer and EquiformerV2 neural networks to predict material properties and analyzed scaling effects of training data, model size, and compute using power law relationships L = α·N^(-β).
Result: Found empirical scaling laws showing predictable relationships between hyperparameters (training data, model size, compute) and predictive performance, with loss following power law relationships.
Conclusion: Future work could investigate scaling laws for other neural network models like GemNet and fully connected networks to compare with the trained transformer and EquiformerV2 models.
Abstract: Predicting material properties is crucial for designing better batteries, semiconductors, and medical devices. Deep learning helps scientists quickly find promising materials by predicting their energy, forces, and stresses. Companies scale capacities of deep learning models in multiple domains, such as language modeling, and invest many millions of dollars into such models. Our team analyzes how scaling training data (giving models more information to learn from), model sizes (giving models more capacity to learn patterns), and compute (giving models more computational resources) for neural networks affects their performance for material property prediction. In particular, we trained both transformer and EquiformerV2 neural networks to predict material properties. We find empirical scaling laws for these models: we can predict how increasing each of the three hyperparameters (training data, model size, and compute) affects predictive performance. In particular, the loss $L$ can be modeled by a power law relationship $L = \alpha \cdot N^{-\beta}$, where $\alpha$ and $\beta$ are constants while $N$ is the relevant hyperparameter. We also incorporate command-line arguments for changing training settings such as the number of epochs, maximum learning rate, and whether mixed precision is enabled. Future work could entail further investigating scaling laws for other neural network models in this domain, such as GemNet and fully connected networks, to assess how they compare to the models we trained.
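Fitting such a power law is a one-liner in log-log space. A sketch with hypothetical loss measurements:

```python
import numpy as np

# Fit L = alpha * N^(-beta) by linear regression on log-transformed data.
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6])        # e.g. training-set sizes (hypothetical)
L = np.array([0.52, 0.41, 0.30, 0.24, 0.18])   # observed losses (hypothetical)
slope, log_alpha = np.polyfit(np.log(N), np.log(L), 1)
alpha, beta = np.exp(log_alpha), -slope
print(f"L ~ {alpha:.3f} * N^(-{beta:.3f})")
```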
[634] Sharpness-Aware Minimization Can Hallucinate Minimizers
Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang
Main category: cs.LG
TL;DR: SAM can converge to hallucinated minimizers that are not actual minimizers of the original objective, and a simple remedy is proposed to avoid them.
Details
Motivation: To investigate and demonstrate that Sharpness-Aware Minimization (SAM), despite its popularity for finding flat minimizers that generalize well, can converge to hallucinated minimizers that don't actually minimize the original objective.
Method: Theoretical analysis proving existence of hallucinated minimizers, establishing local convergence conditions, and empirical validation of convergence to these points in practice.
Result: Demonstrated that SAM can indeed converge to hallucinated minimizers, which are points that appear to be minimizers under SAM but are not actual minimizers of the original training objective.
Conclusion: A simple yet effective remedy is proposed to avoid convergence to hallucinated minimizers in SAM training.
Abstract: Sharpness-Aware Minimization (SAM) is a widely used method that steers training toward flatter minimizers, which typically generalize better. In this work, however, we show that SAM can converge to hallucinated minimizers – points that are not minimizers of the original objective. We theoretically prove the existence of such hallucinated minimizers and establish conditions for local convergence to them. We further provide empirical evidence demonstrating that SAM can indeed converge to these points in practice. Finally, we propose a simple yet effective remedy for avoiding hallucinated minimizers.
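For context, a standard SAM update ascends to the (approximately) worst nearby point within radius rho and takes the descent gradient there. A minimal PyTorch sketch of that two-step update; the paper's proposed remedy for hallucinated minimizers is not shown.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """One SAM update: perturb weights toward higher loss, then descend."""
    loss_fn(model(x), y).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / (norm + 1e-12))   # move to the ascent point
    model.zero_grad()
    loss_fn(model(x), y).backward()            # gradient at the perturbed point
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / (norm + 1e-12))   # undo the perturbation
    base_opt.step()                            # apply the SAM gradient
    model.zero_grad()
```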
[635] On the Complexity Theory of Masked Discrete Diffusion: From $\mathrm{poly}(1/ε)$ to Nearly $ε$-Free
Xunpeng Huang, Yingyu Lin, Nishant Jain, Kaibo Wang, Difan Zou, Yian Ma, Tong Zhang
Main category: cs.LG
TL;DR: This paper provides the first rigorous theoretical analysis of masked discrete diffusion for text generation, showing it achieves better complexity than uniform diffusion through a novel Mask-Aware Truncated Uniformization (MATU) method.
Details
Motivation: Existing theoretical analyses of masked discrete diffusion are insufficient - they overlook Euler samplers, impose restrictive assumptions, or fail to demonstrate advantages over uniform diffusion. There's a gap in understanding the theoretical complexity of this widely used text generation paradigm.
Method: The paper analyzes Euler samplers for masked discrete diffusion and proposes MATU (Mask-Aware Truncated Uniformization), which exploits the property that each token can be unmasked at most once and removes bounded-score assumptions while preserving unbiased discrete score approximation.
Result: Euler samplers achieve ε-accuracy in TV with Õ(d² ε^(-3/2)) score evaluations. MATU achieves nearly ε-free complexity of O(d ln d · (1-ε²)), eliminating the ln(1/ε) factor and substantially speeding up convergence compared to uniform diffusion methods.
Conclusion: The findings provide rigorous theoretical foundation for masked discrete diffusion, showcase its practical advantages over uniform diffusion for text generation, and pave the way for analyzing diffusion-based language models developed under the masking paradigm.
Abstract: We study masked discrete diffusion – a flexible paradigm for text generation in which tokens are progressively corrupted by special mask symbols before being denoised. Although this approach has demonstrated strong empirical performance, its theoretical complexity in high-dimensional settings remains insufficiently understood. Existing analyses largely focus on uniform discrete diffusion, and more recent attempts addressing masked diffusion either (1) overlook widely used Euler samplers, (2) impose restrictive bounded-score assumptions, or (3) fail to showcase the advantages of masked discrete diffusion over its uniform counterpart. To address this gap, we show that Euler samplers can achieve $\epsilon$-accuracy in total variation (TV) with $\tilde{O}(d^{2}\epsilon^{-3/2})$ discrete score evaluations, thereby providing the first rigorous analysis of the typical Euler sampler in masked discrete diffusion. We then propose a Mask-Aware Truncated Uniformization (MATU) approach that both removes bounded-score assumptions and preserves unbiased discrete score approximation. By exploiting the property that each token can be unmasked at most once, MATU attains a nearly $\epsilon$-free complexity of $O(d \ln d \cdot (1-\epsilon^2))$. This result surpasses existing uniformization methods under uniform discrete diffusion, eliminating the $\ln(1/\epsilon)$ factor and substantially speeding up convergence. Our findings not only provide a rigorous theoretical foundation for masked discrete diffusion, showcasing its practical advantages over uniform diffusion for text generation, but also pave the way for future efforts to analyze diffusion-based language models developed under the masking paradigm.
[636] Beyond Johnson-Lindenstrauss: Uniform Bounds for Sketched Bilinear Forms
Rohan Deb, Qiaobo Li, Mayank Shrivastava, Arindam Banerjee
Main category: cs.LG
TL;DR: This paper develops a general framework for analyzing sketched bilinear forms, providing uniform bounds based on geometric complexities of sets. It extends to sums of independent sketching matrices and applies to federated learning and bandit algorithms.
Details
Motivation: Existing uniform bounds for sketched inner products don't apply well to sketched bilinear forms, which are important in modern machine learning applications like randomized sketching and approximate linear algebra.
Method: The approach uses generic chaining and introduces new techniques for handling suprema over pairs of sets. It analyzes bilinear forms involving sums of independent sketching matrices.
Result: The framework provides uniform bounds in terms of geometric complexities, with deviation scaling as √T for T independent matrices. It recovers known results like Johnson-Lindenstrauss lemma and extends RIP-type guarantees.
Conclusion: The unified analysis improves convergence bounds for sketched federated learning and enables sharper regret bounds for sketched bandit algorithms that depend on geometric complexity rather than ambient dimension.
Abstract: Uniform bounds on sketched inner products of vectors or matrices underpin several important computational and statistical results in machine learning and randomized algorithms, including the Johnson-Lindenstrauss (J-L) lemma, the Restricted Isometry Property (RIP), randomized sketching, and approximate linear algebra. However, many modern analyses involve sketched bilinear forms, for which existing uniform bounds either do not apply or are not sharp on general sets. In this work, we develop a general framework to analyze such sketched bilinear forms and derive uniform bounds in terms of geometric complexities of the associated sets. Our approach relies on generic chaining and introduces new techniques for handling suprema over pairs of sets. We further extend these results to the setting where the bilinear form involves a sum of $T$ independent sketching matrices and show that the deviation scales as $\sqrt{T}$. This unified analysis recovers known results such as the J-L lemma as special cases, while extending RIP-type guarantees. Additionally, we obtain improved convergence bounds for sketched Federated Learning algorithms where such cross terms arise naturally due to sketched gradient compression, and design sketched variants of bandit algorithms with sharper regret bounds that depend on the geometric complexity of the action and parameter sets, rather than the ambient dimension.
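A toy numpy illustration of the object being bounded: a Gaussian sketch S compresses u and v into m dimensions, and (Su)^T(Sv) estimates the bilinear form u^T v. The paper's contribution is controlling such errors uniformly over sets of vectors via their geometric complexity, not for one fixed pair as here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10_000, 256
u, v = rng.normal(size=d), rng.normal(size=d)
S = rng.normal(size=(m, d)) / np.sqrt(m)       # Gaussian sketching matrix
exact, sketched = u @ v, (S @ u) @ (S @ v)
print(f"exact: {exact:.1f}  sketched: {sketched:.1f}")
```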
[637] Graph of Agents: Principled Long Context Modeling by Emergent Multi-Agent Collaboration
Taejong Joo, Shu Ishida, Ivan Sosnovik, Bryan Lim, Sahand Rezaei-Shoshtari, Adam Gaier, Robert Giaquinto
Main category: cs.LG
TL;DR: Graph of Agents (GoA) is a multi-agent framework that dynamically constructs collaboration structures to maximize information compression for long context modeling, outperforming fixed-structure approaches and achieving better performance than models with much larger context windows.
Details
Motivation: Current multi-agent systems for long context modeling rely heavily on hand-crafted collaboration strategies and prompt engineering, which limits their generalizability and effectiveness.
Method: Formalizes long context modeling as a compression problem with an information-theoretic objective, then proposes Graph of Agents (GoA) that dynamically constructs input-dependent collaboration structures to maximize this objective.
Result: GoA improves average F1 score by 5.7% over retrieval-augmented generation and by 16.35% over fixed-structure multi-agent baselines. With only 2K context window, it surpasses 128K context window Llama 3.1 8B on LongBench.
Conclusion: GoA provides a principled framework for model-agnostic long context modeling that dramatically increases effective context length without requiring retraining or architectural changes.
Abstract: As a model-agnostic approach to long context modeling, multi-agent systems can process inputs longer than a large language model’s context window without retraining or architectural modifications. However, their performance often heavily relies on hand-crafted multi-agent collaboration strategies and prompt engineering, which limit generalizability. In this work, we introduce a principled framework that formalizes the model-agnostic long context modeling problem as a compression problem, yielding an information-theoretic compression objective. Building on this framework, we propose Graph of Agents (GoA), which dynamically constructs an input-dependent collaboration structure that maximizes this objective. For Llama 3.1 8B and Qwen3 8B across six document question answering benchmarks, GoA improves the average $F_1$ score of retrieval-augmented generation by 5.7% and of a strong multi-agent baseline using a fixed collaboration structure by 16.35%. Even with only a 2K context window, GoA surpasses the 128K context window Llama 3.1 8B on LongBench, showing a dramatic increase in effective context length. Our source code is available at https://github.com/tjoo512/graph-of-agents.
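For contrast with GoA's learned, input-dependent structure, here is the kind of fixed summarize-then-answer pipeline it is compared against; `llm` is an assumed text-in/text-out callable, not a real API.

```python
def fixed_pipeline_answer(llm, chunks, question):
    """Static baseline: each agent compresses one chunk, a final agent answers.
    GoA instead learns an input-dependent collaboration graph over such agents."""
    notes = [llm(f"Summarize what is relevant to '{question}':\n{c}") for c in chunks]
    return llm(f"Question: {question}\nNotes:\n" + "\n".join(notes))
```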
[638] MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation, and 3D Structure Generation
Shuaike Shen, Jiaqing Xie, Zhuo Yang, Antong Zhang, Shuzhou Sun, Ben Gao, Tianfan Fu, Biqing Qi, Yuqiang Li
Main category: cs.LG
TL;DR: MolSpectLLM is a molecular foundation model that integrates experimental spectroscopy with 3D structural information, achieving state-of-the-art performance on spectrum-related tasks and enabling accurate 3D molecular structure generation from SMILES or spectral inputs.
Details
Motivation: Existing molecular foundation models rely exclusively on SMILES representations and overlook experimental spectra and 3D structural information, limiting their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical.
Method: Propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure by explicitly modeling molecular spectra.
Result: Achieves state-of-the-art performance on spectrum-related tasks (average accuracy of 0.53 across NMR, IR, and MS benchmarks) and strong performance on spectra analysis (15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES). Generates accurate 3D molecular structures directly from SMILES or spectral inputs.
Conclusion: MolSpectLLM bridges spectral analysis, molecular elucidation, and molecular design by integrating experimental spectroscopy with 3D structural information, overcoming limitations of existing SMILES-only approaches.
Abstract: Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information, two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design.
[639] Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding
Seong-Woong Shim, Myunsoo Kim, Jae Hyeon Cho, Byung-Jun Lee
Main category: cs.LG
TL;DR: LDAR is an adaptive retrieval method that learns to retrieve contexts while minimizing distraction from irrelevant passages, outperforming long-context approaches with reduced token usage.
Details
Motivation: Long-context LLMs face issues with token inefficiency, the 'lost in the middle' phenomenon, and distraction from redundant contexts, which degrade output quality.
Method: Proposes LDAR (Learning Distraction-Aware Retrieval), an adaptive retriever that learns to retrieve contexts while mitigating interference from distracting passages.
Result: Extensive experiments across diverse LLM architectures and six benchmarks show LDAR achieves significantly higher performance with reduced token usage compared to long-context approaches.
Conclusion: LDAR effectively balances the trade-off between information coverage and distraction, demonstrating the importance of adaptive retrieval over simply using long contexts.
Abstract: Retrieval-Augmented Generation (RAG) is a framework for grounding Large Language Models (LLMs) in external, up-to-date information. However, recent advancements in context window size allow LLMs to process inputs of up to 128K tokens or more, offering an alternative strategy: supplying the full document context directly to the model, rather than relying on RAG to retrieve a subset of contexts. Nevertheless, this emerging alternative strategy has notable limitations: (i) it is token-inefficient to handle large and potentially redundant contexts; (ii) it exacerbates the 'lost in the middle' phenomenon; and (iii) under limited model capacity, it amplifies distraction, ultimately degrading LLM output quality. In this paper, we propose LDAR (Learning Distraction-Aware Retrieval), an adaptive retriever that learns to retrieve contexts in a way that mitigates interference from distracting passages, thereby achieving significantly higher performance with reduced token usage compared to long-context approaches. Extensive experiments across diverse LLM architectures and six knowledge-intensive benchmarks demonstrate the effectiveness and robustness of our approach, highlighting the importance of balancing the trade-off between information coverage and distraction.
[640] Abductive Logical Rule Induction by Bridging Inductive Logic Programming and Multimodal Large Language Models
Yifei Peng, Yaoli Liu, Enbo Xia, Yu Jin, Wang-Zhou Dai, Zhong Ren, Yao-Xiang Ding, Kun Zhou
Main category: cs.LG
TL;DR: ILP-CoT combines Inductive Logic Programming with Multimodal LLMs for abductive logical rule induction, using MLLMs to propose rule structures and ILP to refine them with formal reasoning.
Details
Motivation: Current methods struggle with logical rule induction from unstructured inputs - ILP requires background knowledge and is computationally expensive, while MLLMs suffer from perceptual hallucinations.
Method: Automatically builds ILP tasks with pruned search spaces using rule structure proposals from MLLMs, then uses ILP systems to output rules based on rectified facts and formal inductive reasoning.
Result: Verified effectiveness on challenging logical induction benchmarks and demonstrated application in text-to-image customized generation with rule induction.
Conclusion: ILP-CoT successfully bridges ILP and MLLMs for logical rule induction, overcoming limitations of both approaches through complementary integration.
Abstract: We propose ILP-CoT, a method that bridges Inductive Logic Programming (ILP) and Multimodal Large Language Models (MLLMs) for abductive logical rule induction. The task involves both discovering logical facts and inducing logical rules from a small number of unstructured textual or visual inputs, which still remain challenging when solely relying on ILP, due to the requirement of specified background knowledge and high computational cost, or MLLMs, due to the appearance of perceptual hallucinations. Based on the key observation that MLLMs could propose structure-correct rules even under hallucinations, our approach automatically builds ILP tasks with pruned search spaces based on the rule structure proposals from MLLMs, and utilizes ILP system to output rules built upon rectified logical facts and formal inductive reasoning. Its effectiveness is verified through challenging logical induction benchmarks, as well as a potential application of our approach, namely text-to-image customized generation with rule induction. Our code and data are released at https://github.com/future-item/ILP-CoT.
[641] Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards
Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Bing Hu, Hanqun Cao, Wenqi Shi, Tianang Leng, Rui Yang, Yingjian Chen, Ziqi Wang, Irene Li, Nan Liu, Huaxiu Yao, Li Erran Li, Ge Liu, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi, Fang Wu
Main category: cs.LG
TL;DR: RLVR (Reinforcement Learning with Verifiable Rewards) shows real but often overstated gains in LLM enhancement; evaluation issues and data contamination shrink reported improvements under parity-controlled testing.
Details
Motivation: To assess the true effectiveness of RLVR by examining whether reported gains persist under strict parity-controlled evaluation and whether RLVR imposes measurable costs ('tax').
Method: Used partial-prompt contamination audit and matched-budget reproductions across base and RL models, plus proposed tax-aware training/evaluation protocol co-optimizing accuracy, grounding, and calibrated abstention.
Result: Several headline RLVR gaps shrink or vanish under clean evaluation; the proposed protocol yields more reliable gain estimates and revises some prior conclusions.
Conclusion: RLVR is valuable and industry-ready, but practical benefits should be balanced with reliability, safety, and proper measurement standards.
Abstract: Reinforcement learning with verifiable rewards (RLVR) is a practical and scalable approach to enhancing large language models in areas such as math, code, and other structured tasks. Two questions motivate this paper: how much of the reported gains survive under strictly parity-controlled evaluation, and whether RLVR is cost-free or exacts a measurable tax. We argue that progress is real, but gains are often overstated due to three forces - an RLVR tax, evaluation pitfalls, and data contamination. Using a partial-prompt contamination audit and matched-budget reproductions across base and RL models, we show that several headline gaps shrink or vanish under clean, parity-controlled evaluation. We then propose a tax-aware training and evaluation protocol that co-optimizes accuracy, grounding, and calibrated abstention and standardizes budgeting and provenance checks. Applied to recent RLVR setups, this protocol yields more reliable estimates of reasoning gains and, in several cases, revises prior conclusions. Our position is constructive: RLVR is valuable and industry-ready; we advocate keeping its practical benefits while prioritizing reliability, safety, and measurement.
[642] Zubov-Net: Adaptive Stability for Neural ODEs Reconciling Accuracy with Robustness
Chaoyang Luo, Yan Zou, Nanjing Huang
Main category: cs.LG
TL;DR: Zubov-Net is an adaptive stable learning framework that reformulates Zubov’s equation to reconcile accuracy and robustness in Neural ODEs by actively controlling regions of attraction geometry through tripartite losses and parallel boundary sampling.
Details
Motivation: To address the fundamental tension between robustness and accuracy in Neural ODEs, which stems from the difficulty in imposing appropriate stability conditions while maintaining performance.
Method: Reformulates Zubov's equation into consistency characterization between RoAs and prescribed RoAs; uses tripartite losses (consistency, classification, separation) with parallel boundary sampling; employs input-attention-based convex neural network with softmax attention for discriminative Lyapunov functions.
Result: Maintains high classification accuracy while significantly improving robustness against various stochastic noises and adversarial attacks.
Conclusion: Zubov-Net provides a theoretically-grounded framework that guarantees consistent PRoAs-RoAs alignment, trajectory stability, and non-overlapping PRoAs while achieving better balance between accuracy and robustness.
Abstract: Despite neural ordinary differential equations (Neural ODEs) exhibiting intrinsic robustness under input perturbations due to their dynamical systems nature, recent approaches often involve imposing Lyapunov-based stability conditions to provide formal robustness guarantees. However, a fundamental challenge remains: the tension between robustness and accuracy, primarily stemming from the difficulty in imposing appropriate stability conditions. To address this, we propose an adaptive stable learning framework named Zubov-Net, which innovatively reformulates Zubov’s equation into a consistency characterization between regions of attraction (RoAs) and prescribed RoAs (PRoAs). Building on this consistency, we introduce a new paradigm for actively controlling the geometry of RoAs by directly optimizing PRoAs to reconcile accuracy and robustness. Our approach is realized through tripartite losses (consistency, classification, and separation losses) and a parallel boundary sampling algorithm that co-optimizes the Neural ODE and the Lyapunov function. To enhance the discriminativity of Lyapunov functions, we design an input-attention-based convex neural network via a softmax attention mechanism that focuses on equilibrium-relevant features and also serves as weight normalization to maintain training stability in deep architectures. Theoretically, we prove that minimizing the tripartite loss guarantees consistent alignment of PRoAs-RoAs, trajectory stability, and non-overlapping PRoAs. Moreover, we establish stochastic convex separability with tighter probability bounds and fewer dimensionality requirements to justify the convex design in Lyapunov functions. Experimentally, Zubov-Net maintains high classification accuracy while significantly improving robustness against various stochastic noises and adversarial attacks.
[643] Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs
Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Masahiro Ikeda
Main category: cs.LG
TL;DR: A new Rademacher complexity bound for deep neural networks using Koopman operators, group representations, and RKHSs, explaining why models with high-rank weight matrices generalize well.
Details
Motivation: Existing Rademacher complexity bounds are limited to specific model types and fail to adequately explain why neural networks with high-rank weight matrices generalize well in practice.
Method: Developed an algebraic representation of neural networks and used Koopman operators, group representations, and reproducing kernel Hilbert spaces to construct a new theoretical framework.
Result: Derived a novel Rademacher complexity bound that applies to a wider range of realistic neural network models and explains the generalization behavior of high-rank weight matrices.
Conclusion: This work extends Koopman-based theory for Rademacher complexity bounds to more practical scenarios, providing better theoretical understanding of neural network generalization.
Abstract: We derive a new Rademacher complexity bound for deep neural networks using Koopman operators, group representations, and reproducing kernel Hilbert spaces (RKHSs). The proposed bound describes why the models with high-rank weight matrices generalize well. Although there are existing bounds that attempt to describe this phenomenon, these existing bounds can be applied to limited types of models. We introduce an algebraic representation of neural networks and a kernel function to construct an RKHS to derive a bound for a wider range of realistic models. This work paves the way for the Koopman-based theory for Rademacher complexity bounds to be valid for more practical situations.
[644] Closing the Oracle Gap: Increment Vector Transformation for Class Incremental Learning
Zihuan Qiu, Yi Xu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li
Main category: cs.LG
TL;DR: IVT is a plug-and-play framework for Class Incremental Learning that maintains linear connectivity to previous task optima, reducing catastrophic forgetting by periodically transforming model parameters along low-loss paths.
Details
Motivation: Current CIL methods have significant performance gaps compared to oracle models trained with full historical data. Inspired by Linear Mode Connectivity observations that oracle solutions maintain low-loss linear connections to previous task optima.
Method: Propose Increment Vector Transformation (IVT) that periodically teleports model parameters to transformed solutions preserving linear connectivity to previous task optimum. Uses diagonal Fisher Information Matrices for efficient approximation, works in both exemplar-free and exemplar-based scenarios.
Result: IVT consistently enhances CIL baselines: +5.12% last accuracy and -2.54% forgetting on CIFAR-100 with PASS baseline; +14.93% average accuracy and +21.95% last accuracy on FGVCAircraft with CLIP-pre-trained SLCA baseline. Tested on CIFAR-100, FGVCAircraft, ImageNet-Subset, and ImageNet-Full.
Conclusion: IVT effectively mitigates catastrophic forgetting by maintaining linear connectivity to previous task optima, demonstrating significant improvements across multiple datasets and baselines while being computationally efficient and compatible with various initialization strategies.
Abstract: Class Incremental Learning (CIL) aims to sequentially acquire knowledge of new classes without forgetting previously learned ones. Despite recent progress, current CIL methods still exhibit significant performance gaps compared to their oracle counterparts-models trained with full access to historical data. Inspired by recent insights on Linear Mode Connectivity (LMC), we revisit the geometric properties of oracle solutions in CIL and uncover a fundamental observation: these oracle solutions typically maintain low-loss linear connections to the optimum of previous tasks. Motivated by this finding, we propose Increment Vector Transformation (IVT), a novel plug-and-play framework designed to mitigate catastrophic forgetting during training. Rather than directly following CIL updates, IVT periodically teleports the model parameters to transformed solutions that preserve linear connectivity to previous task optimum. By maintaining low-loss along these connecting paths, IVT effectively ensures stable performance on previously learned tasks. The transformation is efficiently approximated using diagonal Fisher Information Matrices, making IVT suitable for both exemplar-free and exemplar-based scenarios, and compatible with various initialization strategies. Extensive experiments on CIFAR-100, FGVCAircraft, ImageNet-Subset, and ImageNet-Full demonstrate that IVT consistently enhances the performance of strong CIL baselines. Specifically, on CIFAR-100, IVT improves the last accuracy of the PASS baseline by +5.12% and reduces forgetting by 2.54%. For the CLIP-pre-trained SLCA baseline on FGVCAircraft, IVT yields gains of +14.93% in average accuracy and +21.95% in last accuracy. The code will be released.
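The Linear Mode Connectivity observation that motivates IVT is straightforward to probe: evaluate the loss along the segment between a previous-task optimum and the current weights. A PyTorch sketch of that probe (the Fisher-based transformation itself is not reproduced):

```python
import torch

@torch.no_grad()
def linear_path_loss(model, theta_prev, theta_curr, loss_fn, batch, n_pts=11):
    """Loss along the line segment between two parameter vectors.
    Note: leaves the model at theta_curr on exit."""
    x, y = batch
    losses = []
    for lam in torch.linspace(0, 1, n_pts):
        for p, a, b in zip(model.parameters(), theta_prev, theta_curr):
            p.copy_((1 - lam) * a + lam * b)
        losses.append(loss_fn(model(x), y).item())
    return losses   # uniformly low values indicate a low-loss linear connection
```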
[645] Generation Properties of Stochastic Interpolation under Finite Training Set
Yunchen Li, Shaohui Lin, Zhou Yu
Main category: cs.LG
TL;DR: The paper analyzes generative models with finite training data, showing deterministic processes recover exact training samples while stochastic processes add Gaussian noise. With estimation errors, generation produces convex combinations of training samples with mixed noise.
Details
Motivation: To understand the theoretical behavior of generative models when trained on finite populations rather than infinite data, and to characterize how estimation errors affect generation quality.
Method: Uses stochastic interpolation generative framework to derive closed-form expressions for optimal velocity field and score function. Introduces formal definitions of underfitting and overfitting for generative models.
Result: Deterministic generative process exactly recovers training samples; stochastic process produces training samples with Gaussian noise. With estimation errors, generation yields convex combinations of training samples corrupted by uniform and Gaussian noise.
Conclusion: The theoretical framework explains generative model behavior under finite training data and estimation errors, with experimental validation showing the analysis holds in practice for generation and downstream tasks.
Abstract: This paper investigates the theoretical behavior of generative models under finite training populations. Within the stochastic interpolation generative framework, we derive closed-form expressions for the optimal velocity field and score function when only a finite number of training samples are available. We demonstrate that, under some regularity conditions, the deterministic generative process exactly recovers the training samples, while the stochastic generative process manifests as training samples with added Gaussian noise. Beyond the idealized setting, we consider model estimation errors and introduce formal definitions of underfitting and overfitting specific to generative models. Our theoretical analysis reveals that, in the presence of estimation errors, the stochastic generation process effectively produces convex combinations of training samples corrupted by a mixture of uniform and Gaussian noise. Experiments on generation tasks and downstream tasks such as classification support our theory.
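The finite-sample setting admits closed forms because the marginal at time t is a mixture of n Gaussians centered at scaled training points. A numpy sketch of the exact score under assumed schedules alpha(t) = t and sigma(t) = 1 - t (illustrative, not necessarily the paper's parameterization):

```python
import numpy as np

def exact_score(x, t, data, alpha, sigma):
    """Score of p_t = (1/n) * sum_i N(alpha(t) * x_i, sigma(t)^2 I)."""
    mu = alpha(t) * data                        # (n, d) component means
    diff = x[None, :] - mu                      # (n, d)
    logw = -(diff ** 2).sum(-1) / (2 * sigma(t) ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()                                # posterior over training points
    return -(w[:, None] * diff).sum(0) / sigma(t) ** 2

data = np.random.randn(5, 2)                    # a tiny "training set"
s = exact_score(np.zeros(2), 0.5, data, lambda t: t, lambda t: 1 - t)
print(s)
```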
[646] Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
Zhengyan Wan, Yidong Ouyang, Liyan Xie, Fang Fang, Hongyuan Zha, Guang Cheng
Main category: cs.LG
TL;DR: A novel guidance framework for discrete data that provides exact transition rates for desired distributions using discrete flow matching, enabling efficient single-pass sampling and unifying existing methods.
Details
Motivation: Existing guidance approaches for discrete data rely on first-order Taylor approximations, which can have large errors in discrete state spaces, limiting their effectiveness.
Method: Derived exact transition rates for desired distributions given learned discrete flow matching models, creating a unified framework that works with single forward passes and can be applied to masked diffusion models.
Result: Demonstrated effectiveness on energy-guided simulations and preference alignment tasks in text-to-image generation and multimodal understanding, with improved sampling efficiency.
Conclusion: The proposed discrete guidance framework provides exact and efficient posterior sampling for discrete data, unifying existing methods and showing strong performance across various applications.
Abstract: Guidance provides a simple and effective framework for posterior sampling by steering the generation process towards the desired distribution. When modeling discrete data, existing approaches mostly focus on guidance with the first-order Taylor approximation to improve the sampling efficiency. However, such an approximation is inappropriate in discrete state spaces since the approximation error could be large. A novel guidance framework for discrete data is proposed to address this problem: We derive the exact transition rate for the desired distribution given a learned discrete flow matching model, leading to guidance that only requires a single forward pass in each sampling step, significantly improving efficiency. This unified novel framework is general enough, encompassing existing guidance methods as special cases, and it can also be seamlessly applied to the masked diffusion model. We demonstrate the effectiveness of our proposed guidance on energy-guided simulations and preference alignment on text-to-image generation and multimodal understanding tasks. The code is available through https://github.com/WanZhengyan/Discrete-Guidance-Matching/tree/main.
[647] Multiplicative-Additive Constrained Models:Toward Joint Visualization of Interactive and Independent Effects
Fumin Wang
Main category: cs.LG
TL;DR: MACMs combine multiplicative and additive components to improve interpretability while capturing complex feature interactions, outperforming both CESR and GAMs.
Details
Motivation: To address the trade-off between interpretability and predictive performance in machine learning models for high-stakes applications like healthcare, where GAMs sacrifice interaction effects for interpretability and CESR fails to outperform GAMs despite capturing interactions.
Method: Introduce Multiplicative-Additive Constrained Models (MACMs) that augment CESR with an additive component to disentangle interactive and independent feature effects, effectively expanding the hypothesis space while maintaining visual interpretability of shape functions.
Result: Neural network-based MACMs significantly outperform both CESR and state-of-the-art GAMs in predictive performance while maintaining interpretability through visualizable shape functions.
Conclusion: MACMs successfully bridge the gap between interpretability and performance by combining multiplicative and additive components, providing a superior alternative to both CESR and GAMs for high-stakes applications.
Abstract: Interpretability is one of the considerations when applying machine learning to high-stakes fields such as healthcare that involve matters of life safety. Generalized Additive Models (GAMs) enhance interpretability by visualizing shape functions. Nevertheless, to preserve interpretability, GAMs omit higher-order interaction effects (beyond pairwise interactions), which imposes significant constraints on their predictive performance. We observe that Curve Ergodic Set Regression (CESR), a multiplicative model, naturally enables the visualization of its shape functions and simultaneously incorporates both interactions among all features and individual feature effects. Nevertheless, CESR fails to demonstrate superior performance compared to GAMs. We introduce Multiplicative-Additive Constrained Models (MACMs), which augment CESR with an additive part to disentangle the intertwined coefficients of its interactive and independent terms, thus effectively broadening the hypothesis space. The model is composed of a multiplicative part and an additive part, whose shape functions can both be naturally visualized, thereby assisting users in interpreting how features participate in the decision-making process. Consequently, MACMs constitute an improvement over both CESR and GAMs. The experimental results indicate that neural network-based MACMs significantly outperform both CESR and the current state-of-the-art GAMs in terms of predictive performance.
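A minimal PyTorch reading of the MACM structure: per-feature shape networks enter once multiplicatively (a product coupling all features, capturing interactions) and once additively (independent effects). Each shape function can be plotted over a 1D grid, which is what makes the model visualizable; all architectural details here are assumptions.

```python
import torch
import torch.nn as nn

class MACM(nn.Module):
    """Multiplicative part (product of shape functions) + additive part (sum)."""
    def __init__(self, n_features, width=32):
        super().__init__()
        def shape_fn():
            return nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
        self.mult = nn.ModuleList([shape_fn() for _ in range(n_features)])
        self.add = nn.ModuleList([shape_fn() for _ in range(n_features)])

    def forward(self, x):                       # x: (batch, n_features)
        cols = x.split(1, dim=1)
        prod = torch.stack([f(c) for f, c in zip(self.mult, cols)]).prod(0)
        summ = torch.stack([f(c) for f, c in zip(self.add, cols)]).sum(0)
        return (prod + summ).squeeze(1)
```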
[648] Extracting Actionable Insights from Building Energy Data using Vision LLMs on Wavelet and 3D Recurrence Representations
Amine Bechar, Adel Oulefki, Abbes Amira, Fatih Kurogollu, Yassine Himeur
Main category: cs.LG
TL;DR: A framework that fine-tunes visual language models on 3D representations of building energy time-series data for anomaly detection and energy efficiency recommendations.
Details
Motivation: Analyzing complex building time-series data is challenging due to nonlinear and multi-scale characteristics of energy data.
Method: Converts 1D time-series into 3D representations using continuous wavelet transforms (CWTs) and recurrence plots (RPs), then fine-tunes visual language large models (VLLMs) on these visual encodings.
Result: Fine-tuned VLLMs successfully monitor building states, identify anomalies, and generate optimization recommendations. Idefics-7B VLLM achieved validation losses of 0.0952 with CWTs and 0.1064 with RPs, outperforming direct fine-tuning on raw time-series (0.1176).
Conclusion: This work bridges time-series analysis and visualization, providing a scalable and interpretable framework for energy analytics.
Abstract: The analysis of complex building time-series for actionable insights and recommendations remains challenging due to the nonlinear and multi-scale characteristics of energy data. To address this, we propose a framework that fine-tunes visual language large models (VLLMs) on 3D graphical representations of the data. The approach converts 1D time-series into 3D representations using continuous wavelet transforms (CWTs) and recurrence plots (RPs), which capture temporal dynamics and localize frequency anomalies. These 3D encodings enable VLLMs to visually interpret energy-consumption patterns, detect anomalies, and provide recommendations for energy efficiency. We demonstrate the framework on real-world building-energy datasets, where fine-tuned VLLMs successfully monitor building states, identify recurring anomalies, and generate optimization recommendations. Quantitatively, the Idefics-7B VLLM achieves validation losses of 0.0952 with CWTs and 0.1064 with RPs on the University of Sharjah energy dataset, outperforming direct fine-tuning on raw time-series data (0.1176) for anomaly detection. This work bridges time-series analysis and visualization, providing a scalable and interpretable framework for energy analytics.
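Both encodings are simple to generate from a 1D series. The sketch below builds a binary recurrence plot and a Ricker-wavelet scalogram with plain numpy; the threshold and widths are illustrative choices, and the resulting arrays would be rendered as images for the VLLM.

```python
import numpy as np

def recurrence_plot(x, eps=None):
    """R[i, j] = 1 where |x_i - x_j| is below a small threshold."""
    d = np.abs(x[:, None] - x[None, :])
    eps = np.percentile(d, 10) if eps is None else eps
    return (d < eps).astype(np.uint8)

def cwt_scalogram(x, widths=range(1, 64)):
    """Continuous wavelet transform via direct Ricker-wavelet convolutions."""
    def ricker(points, a):
        t = np.arange(points) - (points - 1) / 2
        return (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)
    return np.abs(np.stack([
        np.convolve(x, ricker(min(10 * w, len(x)), w), mode="same")
        for w in widths]))

t = np.linspace(0, 10, 1000)
load = np.sin(2 * np.pi * t) + 0.1 * np.random.randn(t.size)  # toy energy signal
rp, scalogram = recurrence_plot(load), cwt_scalogram(load)
```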
[649] Active Attacks: Red-teaming LLMs via Adaptive Environments
Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
Main category: cs.LG
TL;DR: Active Attacks is a novel RL-based red-teaming algorithm that adapts attacks as the victim LLM evolves through safety fine-tuning, forcing exploration of new vulnerabilities and achieving 400x improvement in cross-attack success rates compared to prior methods.
Details
Motivation: To address the challenge of generating diverse attack prompts for LLMs that elicit harmful behaviors, as existing diversity-seeking RL methods often collapse to limited modes and discourage exploration of new regions once high-reward prompts are found.
Method: Uses reinforcement learning with a toxicity classifier as reward, but periodically safety fine-tunes the victim LLM with collected attack prompts, which diminishes rewards in exploited regions and forces the attacker to seek unexplored vulnerabilities, creating an easy-to-hard exploration curriculum.
Result: Outperformed prior RL-based methods (GFlowNets, PPO, REINFORCE) by improving cross-attack success rates from 0.07% to 31.28% (400x relative gain) with only 6% increase in computation, uncovering a wide range of local attack modes step by step.
Conclusion: Active Attacks is an effective plug-and-play module that naturally induces diverse exploration by adapting to the evolving victim model, achieving superior coverage of multi-mode attack distributions compared to existing methods.
Abstract: We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods, including GFlowNets, PPO, and REINFORCE, by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than 400×) with only a 6% increase in computation. Our code is publicly available at https://github.com/dbsxodud-11/active_attacks.
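A high-level Python sketch of the alternating loop described above; generate_prompts, rl_update, and safety_finetune are hypothetical placeholders for the rollout, attacker update, and victim fine-tuning steps, not an API from the released code.

def active_attacks(attacker, victim, toxicity_reward, rounds=10, steps_per_round=100):
    collected = []
    for _ in range(rounds):
        # Phase 1: RL-train the attacker against the current victim.
        for _ in range(steps_per_round):
            prompts = generate_prompts(attacker)              # hypothetical helper
            rewards = [toxicity_reward(victim(p)) for p in prompts]
            rl_update(attacker, prompts, rewards)             # hypothetical helper
            collected.extend(prompts)
        # Phase 2: safety fine-tune the victim on collected attacks. Rewards in
        # already-exploited regions drop, pushing the attacker toward new
        # vulnerabilities and inducing the easy-to-hard curriculum.
        victim = safety_finetune(victim, collected)           # hypothetical helper
    return attacker, collected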
[650] Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová
Main category: cs.LG
TL;DR: The paper analyzes why softmax attention outperforms linear attention in large language models, showing softmax achieves Bayes optimality at population level while linear attention fundamentally fails, and identifies key properties needed for optimal performance.
Details
Motivation: To understand why softmax dominates over alternatives like linear attention in LLMs, as current theoretical works focus on easier-to-analyze linearized attention but the superiority of softmax remains poorly understood.
Method: Developed an analysis using statistical physics approaches in high-dimensional limit, studying single-location regression task where output depends on linear transformation of single input token at random location. Analyzed attention-based predictors using order parameters.
Result: At population level, softmax achieves Bayes risk while linear attention fundamentally falls short. In finite-sample regime, softmax consistently outperforms linear attention though no longer Bayes-optimal. Identified key properties needed for optimal performance across different activation functions.
Conclusion: Softmax attention’s superiority is theoretically justified - it achieves optimal performance at population level and consistently outperforms alternatives in finite samples, providing principled understanding of why softmax dominates in practice.
Abstract: Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.
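A toy numpy illustration of the intuition, not the paper's high-dimensional analysis: with one informative token planted at a random position, softmax can concentrate its weights on that token, while linear attention mixes in every token.

import numpy as np

rng = np.random.default_rng(0)
d, L = 32, 16
q = rng.normal(size=d)   # direction that marks the signal token
w = rng.normal(size=d)   # readout direction

def sample():
    X = rng.normal(size=(L, d))
    j = rng.integers(L)
    X[j] += 3.0 * q                       # single-location signal
    return X, w @ X[j]

def predict(X, use_softmax):
    s = X @ q
    if use_softmax:
        a = np.exp(s - s.max()); a /= a.sum()   # concentrates on the signal token
    else:
        a = s / L                               # linear attention weights
    return w @ (a @ X)

errs = {"softmax": [], "linear": []}
for _ in range(2000):
    X, y = sample()
    errs["softmax"].append((predict(X, True) - y) ** 2)
    errs["linear"].append((predict(X, False) - y) ** 2)
print({k: float(np.mean(v)) for k, v in errs.items()})  # softmax MSE is far lower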
[651] Structural Information-based Hierarchical Diffusion for Offline Reinforcement Learning
Xianghua Zeng, Hao Peng, Angsheng Li, Yicheng Pan
Main category: cs.LG
TL;DR: SIHD is a Structural Information-based Hierarchical Diffusion framework that adaptively constructs diffusion hierarchies for offline RL, using structural information gain as conditioning and structural entropy regularization to improve exploration and avoid distributional shifts.
Details
Motivation: Existing hierarchical diffusion methods assume fixed two-layer hierarchies with single predefined temporal scales, limiting adaptability to diverse tasks and reducing decision-making flexibility in long-horizon environments with sparse rewards.
Method: Analyzes structural information in offline trajectories to adaptively construct diffusion hierarchies across multiple temporal scales. Uses structural information gain of state communities as conditioning signals and introduces structural entropy regularizer to encourage exploration while avoiding extrapolation errors.
Result: Extensive evaluations show SIHD significantly outperforms state-of-the-art baselines in decision-making performance and demonstrates superior generalization across diverse scenarios.
Conclusion: SIHD provides an effective and stable framework for offline policy learning in long-horizon environments by adaptively constructing diffusion hierarchies and leveraging structural information to improve exploration and decision-making.
Abstract: Diffusion-based generative methods have shown promising potential for modeling trajectories from offline reinforcement learning (RL) datasets, and hierarchical diffusion has been introduced to mitigate variance accumulation and computational challenges in long-horizon planning tasks. However, existing approaches typically assume a fixed two-layer diffusion hierarchy with a single predefined temporal scale, which limits adaptability to diverse downstream tasks and reduces flexibility in decision making. In this work, we propose SIHD, a novel Structural Information-based Hierarchical Diffusion framework for effective and stable offline policy learning in long-horizon environments with sparse rewards. Specifically, we analyze structural information embedded in offline trajectories to construct the diffusion hierarchy adaptively, enabling flexible trajectory modeling across multiple temporal scales. Rather than relying on reward predictions from localized sub-trajectories, we quantify the structural information gain of each state community and use it as a conditioning signal within the corresponding diffusion layer. To reduce overreliance on offline datasets, we introduce a structural entropy regularizer that encourages exploration of underrepresented states while avoiding extrapolation errors from distributional shifts. Extensive evaluations on challenging offline RL tasks show that SIHD significantly outperforms state-of-the-art baselines in decision-making performance and demonstrates superior generalization across diverse scenarios.
[652] Think Smart, Not Hard: Difficulty Adaptive Reasoning for Large Audio Language Models
Zhichao Sheng, Shilin Zhou, Chen Gong, Zhenghua Li
Main category: cs.LG
TL;DR: Proposes a difficulty-adaptive reasoning method for Large Audio Language Models that dynamically adjusts reasoning depth based on problem complexity, improving both performance and efficiency.
Details
Motivation: Current LALMs use fixed reasoning depth, causing overthinking for simple problems and insufficient reasoning for complex ones. Need adaptive reasoning that matches problem difficulty.
Method: Developed a reward function that dynamically links reasoning length to perceived problem difficulty, encouraging shorter reasoning for easy tasks and deeper reasoning for complex ones.
Result: Method improves task performance while significantly reducing average reasoning length. Extensive experiments show effectiveness and efficiency.
Conclusion: Adaptive reasoning depth based on problem complexity is crucial for efficient LALMs. The approach provides valuable insights for future work on reasoning structure paradigms.
Abstract: Large Audio Language Models (LALMs), powered by the chain-of-thought (CoT) paradigm, have shown remarkable reasoning capabilities. Intuitively, different problems often require varying depths of reasoning. While some methods can determine whether to reason for a given problem, they typically lack a fine-grained mechanism to modulate how much to reason. This often results in a "one-size-fits-all" reasoning depth, which generates redundant overthinking for simple questions while failing to allocate sufficient thought to complex ones. In this paper, we conduct an in-depth analysis of LALMs and find that an effective and efficient LALM should reason smartly by adapting its reasoning depth to the problem's complexity. To achieve this, we propose a difficulty-adaptive reasoning method for LALMs. Specifically, we propose a reward function that dynamically links reasoning length to the model's perceived problem difficulty. This reward encourages shorter, concise reasoning for easy tasks and more elaborate, in-depth reasoning for complex ones. Extensive experiments demonstrate that our method is both effective and efficient, simultaneously improving task performance and significantly reducing the average reasoning length. Further analysis of the reasoning structure paradigm offers valuable insights for future work.
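A minimal sketch of one way such a reward could couple length to difficulty, assuming a difficulty score in [0, 1] (e.g., one minus the group success rate); the target-length mapping and penalty weight are illustrative, not the paper's exact formulation.

def adaptive_reward(correct, reasoning_len, difficulty,
                    l_min=32, l_max=1024, beta=0.5):
    """Reward correctness, penalize deviation from a difficulty-scaled budget."""
    target = l_min + difficulty * (l_max - l_min)   # harder problem, longer budget
    length_penalty = abs(reasoning_len - target) / l_max
    return float(correct) - beta * length_penalty

# Easy item (difficulty 0.1): a short trace scores better than a long one.
print(adaptive_reward(True, 80, 0.1), adaptive_reward(True, 900, 0.1))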
[653] GRAM-TDI: adaptive multimodal representation learning for drug target interaction prediction
Feng Jiang, Amina Mollaysa, Hehuan Ma, Tommaso Mansi, Junzhou Huang, Mangal Prakash, Rui Liao
Main category: cs.LG
TL;DR: GRAMDTI is a multimodal pretraining framework for drug-target interaction prediction that integrates molecular and protein data from four modalities using volume-based contrastive learning, adaptive modality dropout, and IC50 weak supervision.
Details
Motivation: Existing DTI prediction approaches rely primarily on SMILES-protein pairs and fail to leverage the rich multimodal information available for small molecules and proteins, limiting their predictive performance and generalization capabilities.
Method: GRAMDTI extends volume-based contrastive learning to four modalities, uses adaptive modality dropout to dynamically regulate each modality's contribution, and incorporates IC50 activity measurements as weak supervision to ground representations in biologically meaningful interaction strengths.
Result: Experiments on four publicly available datasets demonstrate that GRAMDTI consistently outperforms state-of-the-art baselines in DTI prediction.
Conclusion: The results highlight the benefits of higher-order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
Abstract: Drug target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES-protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. We introduce GRAMDTI, a pretraining framework that integrates multimodal molecular and protein inputs into unified representations. GRAMDTI extends volume-based contrastive learning to four modalities, capturing higher-order semantic alignment beyond conventional pairwise approaches. To handle modality informativeness, we propose adaptive modality dropout, dynamically regulating each modality's contribution during pre-training. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAMDTI consistently outperforms state-of-the-art baselines. Our results highlight the benefits of higher-order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
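A hedged PyTorch sketch of adaptive modality dropout; how GRAMDTI derives the keep probabilities from modality informativeness is not specified in the summary, so they are taken as given here.

import torch

def adaptive_modality_dropout(embeddings, keep_probs, training=True):
    # embeddings: list of per-modality tensors; keep_probs: (num_modalities,) in [0, 1]
    if not training:
        return embeddings
    keep = torch.bernoulli(keep_probs).bool()
    if not keep.any():
        keep[torch.randint(len(embeddings), (1,))] = True  # never drop every modality
    return [e if k else torch.zeros_like(e) for e, k in zip(embeddings, keep)]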
[654] Stage-wise Dynamics of Classifier-Free Guidance in Diffusion Models
Cheng Jin, Qitan Shi, Yuantao Gu
Main category: cs.LG
TL;DR: Classifier-Free Guidance (CFG) in diffusion models improves conditional fidelity but reduces diversity through three sampling stages: Direction Shift, Mode Separation, and Concentration. Stronger guidance enhances semantic alignment but inevitably diminishes diversity.
Details
Motivation: To understand the impact of CFG on sampling dynamics in diffusion models, particularly under multimodal conditional distributions, as prior studies provided only partial insights.
Method: Analyzed CFG under multimodal conditionals, identifying three successive sampling stages and proposing a time-varying guidance schedule based on theoretical insights.
Result: Experiments confirmed that early strong guidance erodes global diversity while late strong guidance suppresses fine-grained variation. The proposed time-varying guidance schedule consistently improved both quality and diversity.
Conclusion: CFG’s trade-off between semantic alignment and diversity is explained through three sampling stages, and a time-varying guidance schedule can mitigate this trade-off by improving both quality and diversity.
Abstract: Classifier-Free Guidance (CFG) is widely used to improve conditional fidelity in diffusion models, but its impact on sampling dynamics remains poorly understood. Prior studies, often restricted to unimodal conditional distributions or simplified cases, provide only a partial picture. We analyze CFG under multimodal conditionals and show that the sampling process unfolds in three successive stages. In the Direction Shift stage, guidance accelerates movement toward the weighted mean, introducing initialization bias and norm growth. In the Mode Separation stage, local dynamics remain largely neutral, but the inherited bias suppresses weaker modes, reducing global diversity. In the Concentration stage, guidance amplifies within-mode contraction, diminishing fine-grained variability. This unified view explains a widely observed phenomenon: stronger guidance improves semantic alignment but inevitably reduces diversity. Experiments support these predictions, showing that early strong guidance erodes global diversity, while late strong guidance suppresses fine-grained variation. Moreover, our theory naturally suggests a time-varying guidance schedule, and empirical results confirm that it consistently improves both quality and diversity.
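A sketch of plugging a time-varying weight into the standard CFG update; the abstract establishes only that strong guidance is harmful both early (global diversity) and late (fine-grained variation), so the mid-peaked schedule below is an illustrative assumption.

def guidance_weight(t, T, w_lo=1.5, w_hi=7.5):
    s = t / T
    return w_lo + (w_hi - w_lo) * 4.0 * s * (1.0 - s)   # peaks mid-trajectory

def cfg_step(eps_uncond, eps_cond, t, T):
    # Standard classifier-free guidance with a schedule instead of a constant.
    return eps_uncond + guidance_weight(t, T) * (eps_cond - eps_uncond)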
[655] Goal-Guided Efficient Exploration via Large Language Model in Reinforcement Learning
Yajie Qi, Wei Wei, Lin Li, Lijun Zhang, Zhidong Gao, Da Wang, Huizhong Song
Main category: cs.LG
TL;DR: SGRL is a structured goal-guided RL method that uses LLMs to generate prioritized goals and prune misaligned actions, achieving superior performance on Crafter and Craftax-Classic benchmarks.
Details
Motivation: Real-world RL tasks face challenges in complex environments with poor exploration efficiency and long-horizon planning. Existing LLM-enhanced RL methods suffer from frequent costly LLM calls and semantic mismatch issues.
Method: Uses structured goal planner with LLMs to generate reusable prioritized goal functions, and goal-conditioned action pruner with action masking to filter goal-misaligned actions.
Result: Experimental results on Crafter and Craftax-Classic show SGRL achieves superior performance compared to state-of-the-art methods.
Conclusion: SGRL effectively integrates LLM planning with RL through structured goal guidance and action pruning, enabling efficient exploration and improved decision-making in complex environments.
Abstract: Real-world decision-making tasks typically occur in complex and open environments, posing significant challenges to reinforcement learning (RL) agents’ exploration efficiency and long-horizon planning capabilities. A promising approach is LLM-enhanced RL, which leverages the rich prior knowledge and strong planning capabilities of LLMs to guide RL agents in efficient exploration. However, existing methods mostly rely on frequent and costly LLM invocations and suffer from limited performance due to the semantic mismatch. In this paper, we introduce a Structured Goal-guided Reinforcement Learning (SGRL) method that integrates a structured goal planner and a goal-conditioned action pruner to guide RL agents toward efficient exploration. Specifically, the structured goal planner utilizes LLMs to generate a reusable, structured function for goal generation, in which goals are prioritized. Furthermore, by utilizing LLMs to determine goals’ priority weights, it dynamically generates forward-looking goals to guide the agent’s policy toward more promising decision-making trajectories. The goal-conditioned action pruner employs an action masking mechanism that filters out actions misaligned with the current goal, thereby constraining the RL agent to select goal-consistent policies. We evaluate the proposed method on Crafter and Craftax-Classic, and experimental results demonstrate that SGRL achieves superior performance compared to existing state-of-the-art methods.
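A minimal numpy sketch of the goal-conditioned action pruner; goal.allows is a hypothetical predicate standing in for however SGRL tests goal consistency.

import numpy as np

def masked_policy(logits, goal, actions):
    mask = np.array([goal.allows(a) for a in actions])   # hypothetical predicate
    masked = np.where(mask, logits, -np.inf)             # prune misaligned actions
    if not mask.any():
        masked = logits                                  # fall back if nothing matches
    p = np.exp(masked - masked.max())
    return p / p.sum()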
[656] Latent Diffusion: Multi-Dimension Stable Diffusion Latent Space Explorer
Zhihua Zhong, Xuanyang Huang
Main category: cs.LG
TL;DR: Introduces a framework for integrating customizable latent space operations into diffusion models like Stable Diffusion, enabling direct manipulation of conceptual and spatial representations for enhanced creative expression in generative art.
Details
Motivation: Diffusion models lack the intuitive latent vector control found in GANs, limiting their flexibility for artistic expression and creative exploration through vector manipulation.
Method: A framework that integrates customizable latent space operations into the diffusion process, allowing direct manipulation of conceptual and spatial representations.
Result: Demonstrated through two artworks (Infinitepedia and Latent Motion) showing successful conceptual blending and dynamic motion generation. Revealed latent space structures with both semantic and meaningless regions.
Conclusion: The approach expands creative possibilities in generative art and provides insights into the geometry of diffusion models, paving the way for further explorations of latent space.
Abstract: Latent space is one of the key concepts in generative AI, offering powerful means for creative exploration through vector manipulation. However, diffusion models like Stable Diffusion lack the intuitive latent vector control found in GANs, limiting their flexibility for artistic expression. This paper introduces a framework for integrating customizable latent space operations into the diffusion process. By enabling direct manipulation of conceptual and spatial representations, this approach expands creative possibilities in generative art. We demonstrate the potential of this framework through two artworks, Infinitepedia and Latent Motion, highlighting its use in conceptual blending and dynamic motion generation. Our findings reveal latent space structures with semantic and meaningless regions, offering insights into the geometry of diffusion models and paving the way for further explorations of latent space.
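One concrete example of a customizable latent operation of this kind, spherical interpolation between two Stable Diffusion latents in PyTorch; the paper's actual operation set is not specified in the summary.

import torch

def slerp(z0, z1, t):
    """Spherical interpolation between two latent tensors."""
    a, b = z0.flatten(), z1.flatten()
    cos = (a @ b) / (a.norm() * b.norm() + 1e-12)
    theta = torch.arccos(cos.clamp(-1.0, 1.0))
    if theta.abs() < 1e-4:                    # nearly parallel: plain lerp is fine
        return (1 - t) * z0 + t * z1
    return (torch.sin((1 - t) * theta) * z0 + torch.sin(t * theta) * z1) / torch.sin(theta)

# z = slerp(z_a, z_b, 0.3) can then be decoded to blend two concepts.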
[657] Concept-SAE: Active Causal Probing of Visual Model Behavior
Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu
Main category: cs.LG
TL;DR: Concept-SAE introduces semantically grounded concept tokens using hybrid disentanglement to enable causal probing of model behavior, outperforming standard SAEs in feature fidelity and enabling direct intervention studies.
Details
Motivation: Standard Sparse Autoencoders produce ambiguous, ungrounded features that are unreliable for causal probing of model behavior, limiting interpretability beyond correlational analysis.
Method: Hybrid disentanglement strategy that forges semantically grounded concept tokens through dual-supervision approach to achieve faithful and spatially localized feature representations.
Result: Produces tokens with remarkable fidelity and spatial localization, outperforming alternative methods in disentanglement. Enables causal probing through direct intervention and systematic localization of adversarial vulnerabilities.
Conclusion: Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to mechanistic, causal probing of model behavior, establishing reliable instruments for active model analysis.
Abstract: Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model’s learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model’s failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.
[658] AEGIS: Authentic Edge Growth In Sparsity for Link Prediction in Edge-Sparse Bipartite Knowledge Graphs
Hugh Xuechen Liu, Kıvanç Tatar
Main category: cs.LG
TL;DR: AEGIS is an edge-only augmentation framework for sparse bipartite knowledge graphs that resamples existing training edges to improve link prediction without introducing fabricated endpoints.
Details
Motivation: Bipartite knowledge graphs in niche domains are typically data-poor and edge-sparse, which hinders link prediction performance.
Method: Edge-only augmentation framework that resamples existing training edges, either uniformly (the simple variant) or with an inverse-degree bias (the degree-aware variant), preserving original nodes and avoiding fabricated endpoints. Evaluated on naturally sparse graphs and induced-sparsity benchmarks.
Result: On Amazon and MovieLens, copy-based AEGIS variants match baseline, while semantic KNN augmentation restores AUC and calibration. On GDP graph, semantic KNN achieves largest AUC improvement and Brier score reduction.
Conclusion: Authenticity-constrained resampling is a data-efficient strategy for sparse bipartite link prediction, with semantic augmentation providing additional boost when informative node descriptions are available.
Abstract: Bipartite knowledge graphs in niche domains are typically data-poor and edge-sparse, which hinders link prediction. We introduce AEGIS (Authentic Edge Growth In Sparsity), an edge-only augmentation framework that resamples existing training edges, either uniformly (simple) or with an inverse-degree bias (degree-aware), thereby preserving the original node set and sidestepping fabricated endpoints. To probe authenticity across regimes, we consider naturally sparse graphs (the game design pattern's game-pattern network) and induce sparsity in denser benchmarks (Amazon, MovieLens) via high-rate bond percolation. We evaluate augmentations on two complementary metrics: AUC-ROC (higher is better) and the Brier score (lower is better), using two-tailed paired t-tests against sparse baselines. On Amazon and MovieLens, copy-based AEGIS variants match the baseline, while the semantic KNN augmentation is the only method that restores AUC and calibration; random and synthetic edges remain detrimental. On the text-rich GDP graph, semantic KNN achieves the largest AUC improvement and Brier score reduction, and the simple variant also lowers the Brier score relative to the sparse control. These findings position authenticity-constrained resampling as a data-efficient strategy for sparse bipartite link prediction, with semantic augmentation providing an additional boost when informative node descriptions are available.
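A numpy sketch of the two copy-based variants, with edges as (u, v) rows; only existing edges are duplicated, so no endpoints are fabricated. Reading the degree-aware bias as inverse to the product of endpoint degrees is an assumption.

import numpy as np

def aegis_resample(edges, n_new, mode="simple", seed=0):
    rng = np.random.default_rng(seed)
    edges = np.asarray(edges)
    if mode == "simple":
        p = None                                   # uniform over existing edges
    else:                                          # degree-aware
        nodes, counts = np.unique(edges, return_counts=True)
        deg = dict(zip(nodes.tolist(), counts.tolist()))
        w = np.array([1.0 / (deg[u] * deg[v]) for u, v in edges])
        p = w / w.sum()
    idx = rng.choice(len(edges), size=n_new, replace=True, p=p)
    return np.vstack([edges, edges[idx]])          # authentic edges only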
[659] The Rogue Scalpel: Activation Steering Compromises LLM Safety
Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
Main category: cs.LG
TL;DR: Activation steering, often considered a safe alternative to fine-tuning, actually systematically breaks LLM alignment safeguards and increases harmful compliance rates, even with random or benign steering vectors.
Details
Motivation: To challenge the assumption that activation steering is a precise, interpretable, and safe method for controlling LLM behavior, and to demonstrate its potential to compromise model safety.
Method: Conducted extensive experiments on different model families using activation steering with random directions and benign features from sparse autoencoders (SAEs), and tested universal attacks by combining multiple steering vectors.
Result: Random steering increased harmful compliance from 0% to 2-27%, SAE-based steering further increased rates by 2-4%, and combining 20 random vectors created universal attacks that significantly increased harmful compliance on unseen requests.
Conclusion: Precise control over model internals through activation steering does not guarantee precise control over model behavior, challenging the paradigm of safety through interpretability.
Abstract: Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model’s hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making it comply with harmful requests. Through extensive experiments on different model families, we show that even steering in a random direction can increase the probability of harmful compliance from 0% to 2-27%. Alarmingly, steering benign features from a sparse autoencoder (SAE), a common source of interpretable directions, increases these rates by a further 2-4%. Finally, we show that combining 20 randomly sampled vectors that jailbreak a single prompt creates a universal attack, significantly increasing harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability, showing that precise control over model internals does not guarantee precise control over model behavior.
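A minimal PyTorch sketch of activation steering via a forward hook, with an illustrative layer and scale; the paper's point is that even such a benign-looking edit can erode alignment.

import torch

def add_steering_hook(layer, vector, alpha=4.0):
    # Assumes `vector` lives on the same device as the layer's activations.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# handle = add_steering_hook(model.layers[12], v)   # hypothetical model/layer
# ... run generation ...
# handle.remove()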
[660] Task-Adaptive Parameter-Efficient Fine-Tuning for Weather Foundation Models
Shilei Cao, Hehai Lin, Jiashun Cheng, Yang Liu, Guowen Li, Xuehe Wang, Juepeng Zheng, Haoyuan Liang, Meng Jin, Chengwei Qin, Hong Cheng, Haohuan Fu
Main category: cs.LG
TL;DR: WeatherPEFT is a parameter-efficient fine-tuning framework for Weather Foundation Models that addresses unique challenges in weather tasks through dynamic prompting and adaptive parameter selection, achieving full-tuning performance with fewer parameters.
Details
Motivation: Current PEFT methods designed for vision/language tasks fail to handle weather-specific challenges like variable heterogeneity, resolution diversity, and spatiotemporal variations, leading to suboptimal performance when applied to Weather Foundation Models.
Method: WeatherPEFT introduces two innovations: 1) Task-Adaptive Dynamic Prompting (TADP) that dynamically injects embedding weights to input tokens via pattern extraction for context-aware feature recalibration, and 2) Stochastic Fisher-Guided Adaptive Selection (SFAS) that identifies and updates task-critical parameters using Fisher information with randomness for stability.
Result: WeatherPEFT achieves performance parity with Full-Tuning on three downstream weather tasks while using fewer trainable parameters, outperforming existing PEFT methods that show significant gaps versus Full-Tuning.
Conclusion: WeatherPEFT provides an effective and efficient fine-tuning solution for Weather Foundation Models, addressing domain-specific challenges and enabling practical deployment by reducing computational requirements while maintaining performance.
Abstract: While recent advances in machine learning have equipped Weather Foundation Models (WFMs) with substantial generalization capabilities across diverse downstream tasks, the escalating computational requirements associated with their expanding scale increasingly hinder practical deployment. Current Parameter-Efficient Fine-Tuning (PEFT) methods, designed for vision or language tasks, fail to address the unique challenges of weather downstream tasks, such as variable heterogeneity, resolution diversity, and spatiotemporal coverage variations, leading to suboptimal performance when applied to WFMs. To bridge this gap, we introduce WeatherPEFT, a novel PEFT framework for WFMs incorporating two synergistic innovations. First, during the forward pass, Task-Adaptive Dynamic Prompting (TADP) dynamically injects the embedding weights within the encoder to the input tokens of the pre-trained backbone via internal and external pattern extraction, enabling context-aware feature recalibration for specific downstream tasks. Furthermore, during backpropagation, Stochastic Fisher-Guided Adaptive Selection (SFAS) not only leverages Fisher information to identify and update the most task-critical parameters, thereby preserving invariant pre-trained knowledge, but also introduces randomness to stabilize the selection. We demonstrate the effectiveness and efficiency of WeatherPEFT on three downstream tasks, where existing PEFT methods show significant gaps versus Full-Tuning, and WeatherPEFT achieves performance parity with Full-Tuning using fewer trainable parameters. The code of this work will be released.
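A hedged PyTorch sketch of the SFAS selection step: squared gradients serve as an empirical Fisher proxy, and a random admixture keeps the selection stochastic. The fractions and the exact mixing rule are assumptions.

import torch

def sfas_mask(grads, keep_frac=0.05, random_frac=0.2):
    fisher = grads.pow(2).flatten()                  # empirical Fisher proxy
    k = max(1, int(keep_frac * fisher.numel()))
    k_rand = int(random_frac * k)                    # stochastic share
    mask = torch.zeros_like(fisher, dtype=torch.bool)
    mask[fisher.topk(k - k_rand).indices] = True     # task-critical parameters
    mask[torch.randperm(fisher.numel())[:k_rand]] = True
    return mask.view_as(grads)                       # update only masked entries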
[661] Teaching Transformers to Solve Combinatorial Problems through Efficient Trial & Error
Panagiotis Giannoulis, Yorgos Pantis, Christos Tzamos
Main category: cs.LG
TL;DR: A novel approach using GPT-2 with imitation learning and DFS achieves 99% accuracy on Sudoku, addressing LLMs’ limitations in combinatorial problems.
Details
Motivation: Large Language Models struggle with combinatorial problems like Satisfiability and basic arithmetic, creating a gap in their capabilities that needs addressing.
Method: Uses vanilla GPT-2 with imitation learning of Sudoku rules combined with explicit Depth-First Search exploration involving informed guessing and backtracking.
Result: Achieves state-of-the-art 99% accuracy on Sudoku, significantly outperforming prior neuro-symbolic approaches.
Conclusion: The method successfully bridges LLMs’ gap in combinatorial problem-solving and provides formal analysis connecting it to Min-Sum Set Cover.
Abstract: Despite their proficiency in various language tasks, Large Language Models (LLMs) struggle with combinatorial problems like Satisfiability, Traveling Salesman Problem, or even basic arithmetic. We address this gap through a novel approach for solving problems in the class NP. We focus on the paradigmatic task of Sudoku and achieve state-of-the-art accuracy (99%) compared to prior neuro-symbolic approaches. Unlike prior work that used custom architectures, our method employs a vanilla decoder-only Transformer (GPT-2) without external tools or function calling. Our method integrates imitation learning of simple Sudoku rules with an explicit Depth-First Search (DFS) exploration strategy involving informed guessing and backtracking. Moving beyond imitation learning, we seek to minimize the number of guesses until reaching a solution. We provide a rigorous analysis of this setup formalizing its connection to a contextual variant of Min-Sum Set Cover, a well-studied problem in algorithms and stochastic optimization.
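For concreteness, a plain DFS-with-backtracking Sudoku solver; the paper trains GPT-2 to emit comparable guess-and-backtrack steps itself rather than calling an external routine like this.

def valid(b, r, c, v):
    if v in b[r] or any(b[i][c] == v for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(b[br + i][bc + j] != v for i in range(3) for j in range(3))

def solve(b):                                # b: 9x9 lists, 0 marks an empty cell
    for r in range(9):
        for c in range(9):
            if b[r][c] == 0:
                for v in range(1, 10):       # informed guessing would order these
                    if valid(b, r, c, v):
                        b[r][c] = v
                        if solve(b):
                            return True
                        b[r][c] = 0          # backtrack on a dead end
                return False
    return True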
[662] MCGM: Multi-stage Clustered Global Modeling for Long-range Interactions in Molecules
Haodong Pan, Yusong Wang, Nanning Zheng, Caijui Jiang
Main category: cs.LG
TL;DR: MCGM is a plug-and-play module that adds hierarchical global context to geometric GNNs through efficient clustering, overcoming locality limitations without computational overhead.
Details
Motivation: Geometric GNNs struggle with long-range interactions due to locality-biased message passing, and current solutions have limitations like high computational costs, lack of generality, or parameter tuning complexity.
Method: Multi-stage Clustered Global Modeling builds multi-resolution atomic clusters, distills global information via dynamic hierarchical clustering, and propagates context through learned transformations with residual connections.
Result: MCGM reduces OE62 energy prediction error by 26.2% on average, achieves SOTA accuracy on AQM (17.0 meV for energy, 4.9 meV/Å for forces) with 20% fewer parameters than Neural P3M.
Conclusion: MCGM provides an efficient, generalizable solution for incorporating global context into geometric GNNs, significantly improving performance on molecular property prediction tasks.
Abstract: Geometric graph neural networks (GNNs) excel at capturing molecular geometry, yet their locality-biased message passing hampers the modeling of long-range interactions. Current solutions have fundamental limitations: extending cutoff radii causes computational costs to scale cubically with distance; physics-inspired kernels (e.g., Coulomb, dispersion) are often system-specific and lack generality; Fourier-space methods require careful tuning of multiple parameters (e.g., mesh size, k-space cutoff) with added computational overhead. We introduce Multi-stage Clustered Global Modeling (MCGM), a lightweight, plug-and-play module that endows geometric GNNs with hierarchical global context through efficient clustering operations. MCGM builds a multi-resolution hierarchy of atomic clusters, distills global information via dynamic hierarchical clustering, and propagates this context back through learned transformations, ultimately reinforcing atomic features via residual connections. Seamlessly integrated into four diverse backbone architectures, MCGM reduces OE62 energy prediction error by an average of 26.2%. On AQM, MCGM achieves state-of-the-art accuracy (17.0 meV for energy, 4.9 meV/Å for forces) while using 20% fewer parameters than Neural P3M. Code will be made available upon acceptance.
[663] Reinforcement Learning for Durable Algorithmic Recourse
Marina Ceccon, Alessandro Fabris, Goran Radanović, Asia J. Biega, Gian Antonio Susto
Main category: cs.LG
TL;DR: A novel time-aware framework for algorithmic recourse that models population adaptation dynamics and uses reinforcement learning to generate durable recommendations that remain valid over time.
Details
Motivation: Existing recourse methods lack consideration of temporal dynamics in competitive, resource-constrained settings where recommendations shape future applicant pools, leading to recommendations that may become invalid over time.
Method: Proposed a reinforcement learning-based recourse algorithm that captures evolving environmental dynamics and generates recommendations designed to be durable over a predefined time horizon T.
Result: Extensive experiments show the approach substantially outperforms existing baselines, offering superior balance between feasibility and long-term validity of recommendations.
Conclusion: Temporal and behavioral dynamics are crucial for designing practical recourse systems, and the proposed framework successfully addresses these challenges with durable, time-aware recommendations.
Abstract: Algorithmic recourse seeks to provide individuals with actionable recommendations that increase their chances of receiving favorable outcomes from automated decision systems (e.g., loan approvals). While prior research has emphasized robustness to model updates, considerably less attention has been given to the temporal dynamics of recourse, particularly in competitive, resource-constrained settings where recommendations shape future applicant pools. In this work, we present a novel time-aware framework for algorithmic recourse, explicitly modeling how candidate populations adapt in response to recommendations. Additionally, we introduce a novel reinforcement learning (RL)-based recourse algorithm that captures the evolving dynamics of the environment to generate recommendations that are both feasible and valid. We design our recommendations to be durable, supporting validity over a predefined time horizon T. This durability allows individuals to confidently reapply after taking time to implement the suggested changes. Through extensive experiments in complex simulation environments, we show that our approach substantially outperforms existing baselines, offering a superior balance between feasibility and long-term validity. Together, these results underscore the importance of incorporating temporal and behavioral dynamics into the design of practical recourse systems.
[664] OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
Main category: cs.LG
TL;DR: Orthogonal SAE (OrtSAE) introduces orthogonality constraints to mitigate feature absorption and composition issues in sparse autoencoders, improving feature discovery and downstream task performance.
Details
Motivation: Current sparse autoencoders suffer from feature absorption (specialized features capturing general ones) and feature composition (independent features merging), creating representation holes and reducing interpretability.
Method: OrtSAE enforces orthogonality between learned features by penalizing high pairwise cosine similarity during training, promoting disentangled features with linear computational scaling relative to SAE size.
Result: OrtSAE discovers 9% more distinct features, reduces feature absorption by 65% and composition by 15%, improves spurious correlation removal by 6%, and maintains comparable performance on other downstream tasks.
Conclusion: Orthogonal constraints effectively mitigate feature absorption and composition in sparse autoencoders, leading to more interpretable and disentangled feature representations without significant computational overhead.
Abstract: Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed at mitigating these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance for other downstream tasks compared to traditional SAEs.
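A PyTorch sketch of the pairwise-cosine penalty on the feature dictionary. The full Gram matrix shown here is the naive quadratic version; since the paper reports linear scaling with SAE size, its actual procedure must be more economical (e.g., batched or subsampled pairs).

import torch
import torch.nn.functional as F

def orthogonality_penalty(W):
    """Penalize high pairwise cosine similarity between SAE features (rows of W)."""
    Wn = F.normalize(W, dim=1)
    G = Wn @ Wn.T                                     # pairwise cosine similarities
    off = G - torch.eye(G.size(0), device=G.device)   # ignore self-similarity
    return (off.clamp(min=0) ** 2).mean()             # penalizing only positive overlap is an assumption

# total = recon_loss + sparsity_loss + lam * orthogonality_penalty(decoder_weight)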
[665] Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization
Chao Wang, Tao Yang, Hongtao Tian, Yunsheng Shi, Qiyao Ma, Xiaotao Liu, Ting Yao, Wenbo Ding
Main category: cs.LG
TL;DR: D$^3$S is a dynamic dual-level down-sampling framework that improves policy optimization efficiency by prioritizing informative samples and tokens, achieving state-of-the-art performance with fewer samples and tokens.
Details
Motivation: Critic-free methods like GRPO reduce memory demands but converge slowly due to diluted learning signals from uninformative samples and tokens.
Method: D$^3$S operates at two levels: sample-level selects rollouts to maximize advantage variance, and token-level prioritizes tokens with high advantage magnitude and policy entropy. It uses a dynamic down-sampling schedule inspired by curriculum learning.
Result: Extensive experiments on Qwen2.5 and Llama3.1 show state-of-the-art performance and generalization across diverse reasoning benchmarks while requiring fewer samples and tokens.
Conclusion: D$^3$S effectively improves policy optimization efficiency by focusing on informative data, achieving better performance with reduced computational resources.
Abstract: Critic-free methods like GRPO reduce memory demands by estimating advantages from multiple rollouts but tend to converge slowly, as critical learning signals are diluted by an abundance of uninformative samples and tokens. To tackle this challenge, we propose the Dynamic Dual-Level Down-Sampling (D$^3$S) framework, which prioritizes the most informative samples and tokens across groups to improve the efficiency of policy optimization. D$^3$S operates along two levels: (1) the sample level, which selects a subset of rollouts to maximize advantage variance ($\text{Var}(A)$). We theoretically prove that this selection is positively correlated with the upper bound of the policy gradient norms, yielding higher policy gradients. (2) the token level, which prioritizes tokens with a high product of advantage magnitude and policy entropy ($|A_{i,t}|\times H_{i,t}$), focusing updates on tokens where the policy is both uncertain and impactful. Moreover, to prevent overfitting to high-signal data, D$^3$S employs a dynamic down-sampling schedule inspired by curriculum learning. This schedule starts with aggressive down-sampling to accelerate early learning and gradually relaxes to promote robust generalization. Extensive experiments on Qwen2.5 and Llama3.1 demonstrate that integrating D$^3$S into advanced RL algorithms achieves state-of-the-art performance and generalization while requiring fewer samples and tokens across diverse reasoning benchmarks. Our code is included in the supplementary materials and will be made publicly available.
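A PyTorch sketch of the token-level criterion, keeping the highest |A| x H tokens; the curriculum over keep_frac follows the schedule described above, with illustrative values.

import torch

def select_tokens(advantages, entropies, keep_frac):
    score = advantages.abs() * entropies           # |A_{i,t}| * H_{i,t}
    k = max(1, int(keep_frac * score.numel()))
    mask = torch.zeros(score.numel(), dtype=torch.bool, device=score.device)
    mask[score.flatten().topk(k).indices] = True
    return mask.view_as(score)                     # gate the per-token PG loss

# Curriculum: start aggressive, relax over training (illustrative schedule).
# keep_frac = 0.2 + 0.6 * min(step / warmup_steps, 1.0)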
[666] Convexity-Driven Projection for Point Cloud Dimensionality Reduction
Suman Sanyal
Main category: cs.LG
TL;DR: CDP is a boundary-free linear dimensionality reduction method that preserves detour-induced local non-convexity using k-NN graphs and spectral projection.
Details
Motivation: To develop a dimensionality reduction method that preserves local non-convexity structures in point clouds, particularly focusing on detour-induced geometric properties.
Method: Builds k-NN graph, identifies admissible pairs with low Euclidean-to-shortest-path ratios, aggregates normalized directions to form positive semidefinite non-convexity structure matrix, and projects using top-k eigenvectors.
Result: Provides two verifiable guarantees: pairwise a-posteriori certificate for bounding post-projection distortion per admissible pair, and average-case spectral bound linking captured direction energy to structure matrix spectrum.
Conclusion: CDP offers a principled approach with practical evaluation protocol enabling practitioners to verify guarantees on their data through detour errors and certificate quantiles.
Abstract: We propose Convexity-Driven Projection (CDP), a boundary-free linear method for dimensionality reduction of point clouds that targets preserving detour-induced local non-convexity. CDP builds a $k$-NN graph, identifies admissible pairs whose Euclidean-to-shortest-path ratios are below a threshold, and aggregates their normalized directions to form a positive semidefinite non-convexity structure matrix. The projection uses the top-$k$ eigenvectors of the structure matrix. We give two verifiable guarantees. A pairwise a-posteriori certificate that bounds the post-projection distortion for each admissible pair, and an average-case spectral bound that links expected captured direction energy to the spectrum of the structure matrix, yielding quantile statements for typical distortion. Our evaluation protocol reports fixed- and reselected-pairs detour errors and certificate quantiles, enabling practitioners to check guarantees on their data.
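A direct sketch of the pipeline with numpy/scipy/scikit-learn; k, the ratio threshold tau, and the output dimension are illustrative, and the O(n^2) pair loop is kept for clarity.

import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import kneighbors_graph

def cdp(X, k=10, tau=0.8, out_dim=2):
    geo = shortest_path(kneighbors_graph(X, k, mode="distance"), directed=False)
    euc = squareform(pdist(X))
    n, d = X.shape
    S = np.zeros((d, d))                       # PSD non-convexity structure matrix
    for i in range(n):
        for j in range(i + 1, n):
            if np.isfinite(geo[i, j]) and euc[i, j] / geo[i, j] < tau:  # admissible pair
                u = X[i] - X[j]
                u /= np.linalg.norm(u) + 1e-12
                S += np.outer(u, u)            # aggregate normalized directions
    _, vecs = np.linalg.eigh(S)
    return X @ vecs[:, -out_dim:]              # project on top eigenvectors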
[667] MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems
Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe
Main category: cs.LG
TL;DR: MO-GRPO extends GRPO with automatic reward normalization to address reward hacking in multi-objective RL, ensuring balanced optimization across all objectives without manual tuning.
Details
Motivation: GRPO is vulnerable to reward hacking in multi-objective settings where it may optimize only one objective at the cost of others, especially when reliable reward models are unavailable in real-world tasks.
Method: MO-GRPO extends GRPO with a simple normalization method that automatically reweights reward functions according to their value variances, ensuring all rewards contribute evenly to the loss while preserving preference order.
Result: MO-GRPO achieves stable learning by evenly distributing correlations among reward components, outperforming GRPO in multi-armed bandits, simulated control, machine translation (WMT En-Ja, En-Zh), and instruction following tasks.
Conclusion: MO-GRPO is a promising algorithm for multi-objective reinforcement learning that eliminates the need for manual reward scaling and prevents reward hacking through automatic normalization.
Abstract: Group Relative Policy Optimization (GRPO) has been shown to be an effective algorithm when an accurate reward model is available. However, such a highly reliable reward model is not available in many real-world tasks. In this paper, we particularly focus on multi-objective settings, in which we identify that GRPO is vulnerable to reward hacking, optimizing only one of the objectives at the cost of the others. To address this issue, we propose MO-GRPO, an extension of GRPO with a simple normalization method to reweight the reward functions automatically according to the variances of their values. We first show analytically that MO-GRPO ensures that all reward functions contribute evenly to the loss function while preserving the order of preferences, eliminating the need for manual tuning of the reward functions’ scales. Then, we evaluate MO-GRPO experimentally in four domains: (i) the multi-armed bandits problem, (ii) simulated control task (Mo-Gymnasium), (iii) machine translation tasks on the WMT benchmark (En-Ja, En-Zh), and (iv) instruction following task. MO-GRPO achieves stable learning by evenly distributing correlations among the components of rewards, outperforming GRPO, showing MO-GRPO to be a promising algorithm for multi-objective reinforcement learning problems.
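A numpy sketch of the variance-based reweighting, with rewards as a (samples x objectives) matrix; scaling each column by its standard deviation preserves per-objective preference order. Combining the normalized components by summation is an assumption.

import numpy as np

def mo_grpo_reward(R, eps=1e-8):
    R_norm = R / (R.std(axis=0, keepdims=True) + eps)   # equalize objective scales
    return R_norm.sum(axis=1)                           # scalar reward for GRPO

R = np.array([[0.9, 120.0],
              [0.4,  80.0],
              [0.7,  30.0]])   # e.g., translation quality, formatting score
print(mo_grpo_reward(R))       # no single objective dominates by scale alone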
[668] Pushing Toward the Simplex Vertices: A Simple Remedy for Code Collapse in Smoothed Vector Quantization
Takashi Morita
Main category: cs.LG
TL;DR: The paper introduces a new regularization method for smoothed vector quantization that simultaneously ensures tight approximation to hard quantization and prevents code collapse by minimizing distances between simplex vertices and their K-nearest smoothed quantizers.
Details
Motivation: Vector quantization is widely used but faces the challenge of non-differentiable quantization steps blocking gradient backpropagation. Existing smoothed quantization methods address approximation tightness and codebook utilization separately, which is suboptimal.
Method: Proposes a simple regularization that minimizes the distance between each simplex vertex and its K-nearest smoothed quantizers, promoting both tight approximation (close to onehot vectors) and full codebook utilization simultaneously.
Result: Experiments on discrete image autoencoding and contrastive speech representation learning show improved codebook utilization and better performance compared to prior methods.
Conclusion: The proposed regularization effectively addresses both key requirements of smoothed vector quantization simultaneously, leading to more reliable performance and better codebook usage than existing approaches.
Abstract: Vector quantization, which discretizes a continuous vector space into a finite set of representative vectors (a codebook), has been widely adopted in modern machine learning. Despite its effectiveness, vector quantization poses a fundamental challenge: the non-differentiable quantization step blocks gradient backpropagation. Smoothed vector quantization addresses this issue by relaxing the hard assignment of a codebook vector into a weighted combination of codebook entries, represented as the matrix product of a simplex vector and the codebook. Effective smoothing requires two properties: (1) smoothed quantizers should remain close to a onehot vector, ensuring tight approximation, and (2) all codebook entries should be utilized, preventing code collapse. Existing methods typically address these desiderata separately. By contrast, the present study introduces a simple and intuitive regularization that promotes both simultaneously by minimizing the distance between each simplex vertex and its $K$-nearest smoothed quantizers. Experiments on representative benchmarks, including discrete image autoencoding and contrastive speech representation learning, demonstrate that the proposed method achieves more reliable codebook utilization and improves performance compared to prior approaches.
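A PyTorch sketch of the proposed regularizer, with A a (batch x codebook_size) matrix of smoothed simplex assignments; the batch is assumed to contain at least K rows.

import torch

def vertex_pull_regularizer(A, K=4):
    n_codes = A.size(1)
    vertices = torch.eye(n_codes, device=A.device)   # one-hot simplex vertices
    d = torch.cdist(vertices, A)                     # (codes, batch) distances
    knn = d.topk(K, dim=1, largest=False).values     # K nearest smoothed quantizers
    # Pulls quantizers toward onehots (tightness) and touches every vertex,
    # so each code gets gradient signal (utilization).
    return knn.pow(2).mean()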
[669] BrainPro: Towards Large-scale Brain State-aware EEG Representation Learning
Yi Ding, Muyun Jiang, Weibang Jiang, Shuailei Zhang, Xinliang Zhou, Chenyu Liu, Shanglin Li, Yong Li, Cuntai Guan
Main category: cs.LG
TL;DR: BrainPro is a large EEG foundation model that addresses limitations in capturing spatial interactions and brain state awareness through retrieval-based spatial learning and state-decoupling blocks, achieving SOTA performance across multiple BCI datasets.
Details
Motivation: Existing EEG foundation models fail to explicitly capture channel-to-channel and region-to-region interactions, and lack state-aware representation learning during pre-training, limiting their flexibility and effectiveness across diverse datasets and brain states.
Method: Proposes BrainPro with two key components: 1) retrieval-based spatial learning block to flexibly capture channel- and region-level interactions across varying electrode layouts, and 2) brain state-decoupling block with parallel encoders using decoupling and region-aware reconstruction losses for state-aware representation learning.
Result: Pre-trained on an extensive EEG corpus, BrainPro achieves state-of-the-art performance and robust generalization across nine public BCI datasets.
Conclusion: BrainPro demonstrates superior adaptability to diverse tasks and hardware settings through its novel spatial learning and state-decoupling architecture, advancing EEG foundation models for BCI and healthcare applications.
Abstract: Electroencephalography (EEG) is a non-invasive technique for recording brain electrical activity, widely used in brain-computer interface (BCI) and healthcare. Recent EEG foundation models trained on large-scale datasets have shown improved performance and generalizability over traditional decoding methods, yet significant challenges remain. Existing models often fail to explicitly capture channel-to-channel and region-to-region interactions, which are critical sources of information inherently encoded in EEG signals. Due to varying channel configurations across datasets, they either approximate spatial structure with self-attention or restrict training to a limited set of common channels, sacrificing flexibility and effectiveness. Moreover, although EEG datasets reflect diverse brain states such as emotion, motor, and others, current models rarely learn state-aware representations during self-supervised pre-training. To address these gaps, we propose BrainPro, a large EEG model that introduces a retrieval-based spatial learning block to flexibly capture channel- and region-level interactions across varying electrode layouts, and a brain state-decoupling block that enables state-aware representation learning through parallel encoders with decoupling and region-aware reconstruction losses. This design allows BrainPro to adapt seamlessly to diverse tasks and hardware settings. Pre-trained on an extensive EEG corpus, BrainPro achieves state-of-the-art performance and robust generalization across nine public BCI datasets. Our codes and the pre-trained weights will be released.
[670] Lightweight error mitigation strategies for post-training N:M activation sparsity in LLMs
Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov
Main category: cs.LG
TL;DR: This paper presents a comprehensive analysis of post-training N:M activation pruning for LLMs, showing it preserves generative capabilities better than weight pruning at equivalent sparsity levels, with 8:16 pattern identified as optimal.
Details
Motivation: The demand for efficient LLM inference has intensified focus on sparsification techniques, but activation pruning remains underexplored despite its potential for dynamic, input-adaptive compression and reductions in I/O overhead.
Method: Comprehensive analysis of post-training N:M activation pruning methods across multiple LLMs, evaluating lightweight plug-and-play error mitigation techniques and pruning criteria with minimal calibration requirements.
Result: Activation pruning enables superior preservation of generative capabilities compared to weight pruning at equivalent sparsity levels. The 16:32 pattern achieves performance nearly on par with unstructured sparsity, but 8:16 pattern is identified as superior considering flexibility-hardware complexity trade-off.
Conclusion: The findings provide effective practical methods for activation pruning and motivate future hardware to support more flexible sparsity patterns beyond standard 2:4 patterns.
Abstract: The demand for efficient large language model (LLM) inference has intensified the focus on sparsification techniques. While semi-structured (N:M) pruning is well-established for weights, its application to activation pruning remains underexplored despite its potential for dynamic, input-adaptive compression and reductions in I/O overhead. This work presents a comprehensive analysis of methods for post-training N:M activation pruning in LLMs. Across multiple LLMs, we demonstrate that pruning activations enables superior preservation of generative capabilities compared to weight pruning at equivalent sparsity levels. We evaluate lightweight, plug-and-play error mitigation techniques and pruning criteria, establishing strong hardware-friendly baselines that require minimal calibration. Furthermore, we explore sparsity patterns beyond NVIDIA's standard 2:4, showing that the 16:32 pattern achieves performance nearly on par with unstructured sparsity. However, considering the trade-off between flexibility and hardware implementation complexity, we focus on the 8:16 pattern as a superior candidate. Our findings provide both effective practical methods for activation pruning and a motivation for future hardware to support more flexible sparsity patterns. Our code is available at https://anonymous.4open.science/r/Structured-Sparse-Activations-Inference-EC3C/README.md.
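A PyTorch sketch of applying an 8:16 pattern to activations at inference time, keeping the 8 largest-magnitude entries in each contiguous group of 16; the paper's error-mitigation terms are omitted, and the hidden size is assumed divisible by 16.

import torch

def nm_sparsify(x, n=8, m=16):
    shape = x.shape
    g = x.reshape(-1, m)                                  # groups of m entries
    idx = g.abs().topk(n, dim=-1).indices                 # n survivors per group
    mask = torch.zeros_like(g, dtype=torch.bool).scatter_(-1, idx, True)
    return (g * mask).reshape(shape)

h = torch.randn(2, 64)
print((nm_sparsify(h) != 0).float().mean())               # 0.5 density for 8:16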
[671] Enriching Knowledge Distillation with Intra-Class Contrastive Learning
Hua Yuan, Ning Xu, Xin Geng, Yong Rui
Main category: cs.LG
TL;DR: Proposes intra-class contrastive loss with margin for teacher training in knowledge distillation to enrich intra-class diversity in soft labels, improving student generalization.
Details
Motivation: Existing distillation methods use teacher models trained with ground-truth labels, ignoring diverse representations within the same class, which limits the effectiveness of soft labels for student learning.
Method: Incorporates intra-class contrastive loss with margin during teacher training to enrich intra-class information in soft labels while maintaining training stability and convergence speed.
Result: Theoretical analysis shows intra-class contrastive loss enriches intra-class diversity. Experimental results demonstrate the method’s effectiveness in improving knowledge distillation.
Conclusion: Integrating intra-class contrastive learning with margin loss during teacher training enhances the quality of soft labels by capturing intra-class diversity, leading to better student model generalization.
Abstract: Since the advent of knowledge distillation, much research has focused on how the soft labels generated by the teacher model can be utilized effectively. Existing studies point out that the implicit knowledge within soft labels originates from the multi-view structure present in the data. Feature variations within samples of the same class allow the student model to generalize better by learning diverse representations. However, in existing distillation methods, teacher models predominantly adhere to ground-truth labels as targets, without considering the diverse representations within the same class. Therefore, we propose incorporating an intra-class contrastive loss during teacher training to enrich the intra-class information contained in soft labels. In practice, we find that the intra-class loss causes instability in training and slows convergence. To mitigate these issues, a margin loss is integrated into intra-class contrastive learning to improve the training stability and convergence speed. Simultaneously, we theoretically analyze the impact of this loss on intra-class and inter-class distances, proving that the intra-class contrastive loss enriches intra-class diversity. Experimental results demonstrate the effectiveness of the proposed method.
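To make the auxiliary objective concrete, here is a rough sketch of a margin-hinged intra-class term added to ordinary teacher training; the pairwise formulation, sign convention, and loss weight are our illustrative assumptions rather than the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def intra_class_margin_loss(feats, labels, margin=1.0):
    """Encourage same-class features to sit at least `margin` apart, so the
    teacher's soft labels retain intra-class diversity (assumed formulation)."""
    d = torch.cdist(feats, feats)                     # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    same.fill_diagonal_(False)                        # drop self-pairs
    if same.sum() == 0:
        return feats.new_zeros(())
    return F.relu(margin - d[same]).mean()            # zero once pairs are margin apart

# teacher step: cross-entropy on hard labels plus the auxiliary term
# loss = F.cross_entropy(logits, labels) + 0.1 * intra_class_margin_loss(feats, labels)
```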
[672] Towards Understanding Feature Learning in Parameter Transfer
Hua Yuan, Xuran Meng, Qiufeng Wang, Shiyu Xia, Ning Xu, Xu Yang, Jing Wang, Xin Geng, Yong Rui
Main category: cs.LG
TL;DR: Theoretical analysis of partial parameter transfer in ReLU CNNs, identifying conditions where parameter reuse benefits downstream tasks and explaining cases where it can hurt performance compared to training from scratch.
Details
Motivation: There's a lack of theoretical understanding about when partial parameter transfer is beneficial and what factors govern its effectiveness in transfer learning.
Method: Theoretical analysis of ReLU convolutional neural networks with numerical experiments and real-world data validation.
Result: Characterized how inherited parameters carry universal knowledge and identified key factors that amplify their beneficial impact on target tasks.
Conclusion: Provides theoretical insights into parameter transfer effectiveness and explains why transferring parameters can sometimes lead to worse performance than training from scratch.
Abstract: Parameter transfer is a central paradigm in transfer learning, enabling knowledge reuse across tasks and domains by sharing model parameters between upstream and downstream models. However, when only a subset of parameters from the upstream model is transferred to the downstream model, there remains a lack of theoretical understanding of the conditions under which such partial parameter reuse is beneficial and of the factors that govern its effectiveness. To address this gap, we analyze a setting in which both the upstream and downstream models are ReLU convolutional neural networks (CNNs). Within this theoretical framework, we characterize how the inherited parameters act as carriers of universal knowledge and identify key factors that amplify their beneficial impact on the target task. Furthermore, our analysis provides insight into why, in certain cases, transferring parameters can lead to lower test accuracy on the target task than training a new model from scratch. Numerical experiments and real-world data experiments are conducted to empirically validate our theoretical findings.
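The partial-transfer setting under study can be made concrete in a few lines of PyTorch; the prefix-based selection below is just one hypothetical way to choose which upstream parameters the downstream model inherits.

```python
import torch.nn as nn

def transfer_prefix(upstream: nn.Module, downstream: nn.Module,
                    prefixes=("conv1", "conv2")):
    """Copy only parameters whose names start with the given prefixes from
    the upstream model into the downstream model (shapes must match)."""
    up, down = upstream.state_dict(), downstream.state_dict()
    moved = {k: v for k, v in up.items()
             if k.startswith(prefixes) and k in down and down[k].shape == v.shape}
    down.update(moved)
    downstream.load_state_dict(down)
    return sorted(moved)          # names of the inherited parameters
```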
[673] Efficiency Boost in Decentralized Optimization: Reimagining Neighborhood Aggregation with Minimal Overhead
Durgesh Kalwar, Mayank Baranwal, Harshad Khadilkar
Main category: cs.LG
TL;DR: DYNAWEIGHT is a novel framework for dynamic weight allocation in decentralized learning that accelerates training by favoring servers with diverse information, especially in data-heterogeneous scenarios.
Details
Motivation: Distributed learning is crucial for privacy and computational efficiency in decentralized infrastructures where local processing is necessary due to lack of centralized aggregation.
Method: DYNAWEIGHT dynamically allocates weights to neighboring servers based on their relative losses on local datasets, unlike traditional static weight assignments like Metropolis weights.
Result: Experiments on MNIST, CIFAR10, and CIFAR100 datasets with various server counts and graph topologies demonstrate notable enhancements in training speeds with minimal communication and memory overhead.
Conclusion: DYNAWEIGHT functions as a versatile aggregation scheme compatible with any underlying server-level optimization algorithm, showing potential for widespread integration in decentralized learning systems.
Abstract: In today's data-sensitive landscape, distributed learning emerges as a vital tool, not only fortifying privacy measures but also streamlining computational operations. This becomes especially crucial within fully decentralized infrastructures where local processing is imperative due to the absence of centralized aggregation. Here, we introduce DYNAWEIGHT, a novel framework for information aggregation in multi-agent networks. DYNAWEIGHT offers substantial acceleration in decentralized learning with minimal additional communication and memory overhead. Unlike traditional static weight assignments, such as Metropolis weights, DYNAWEIGHT dynamically allocates weights to neighboring servers based on their relative losses on local datasets. Consequently, it favors servers possessing diverse information, particularly in scenarios of substantial data heterogeneity. Our experiments on the MNIST, CIFAR10, and CIFAR100 datasets, incorporating various server counts and graph topologies, demonstrate notable enhancements in training speeds. Notably, DYNAWEIGHT functions as an aggregation scheme compatible with any underlying server-level optimization algorithm, underscoring its versatility and potential for widespread integration.
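A toy version of the loss-driven weighting rule is shown below, under our assumption (the paper's exact rule may differ) that a neighbor whose model incurs higher loss on my local data carries more complementary information and therefore earns a larger aggregation weight.

```python
import numpy as np

def dynaweight_weights(neighbor_losses, temperature=1.0):
    """Softmax-style weights over neighbors from their losses on my local
    data: higher loss, more diverse information, larger weight (assumed)."""
    scores = np.asarray(neighbor_losses, dtype=float) / temperature
    w = np.exp(scores - scores.max())     # numerically stable softmax
    return w / w.sum()

# neighbor 2's model is most "surprised" by my data, so it gets the most weight
print(dynaweight_weights([0.3, 0.5, 1.2]))
```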
[674] Non-Linear Trajectory Modeling for Multi-Step Gradient Inversion Attacks in Federated Learning
Li Xia, Zheng Liu, Sili Huang, Wei Tang, Xuan Liu
Main category: cs.LG
TL;DR: NL-SME introduces nonlinear parametric trajectory modeling for gradient inversion attacks in federated learning, significantly outperforming linear methods by capturing SGD’s curved characteristics through quadratic Bézier curves.
Details
Motivation: Existing surrogate model methods for gradient inversion attacks assume linear parameter trajectories, which severely underestimates SGD's nonlinear complexity and fundamentally limits attack effectiveness in multi-step federated learning scenarios.
Method: Proposes Non-Linear Surrogate Model Extension (NL-SME) that replaces linear interpolation with learnable quadratic Bézier curves to capture SGD's nonlinear characteristics, combined with regularization and dvec scaling mechanisms for enhanced expressiveness.
Result: Extensive experiments on CIFAR-100 and FEMNIST datasets show NL-SME significantly outperforms baselines across all metrics, achieving order-of-magnitude improvements in cosine similarity loss while maintaining computational efficiency.
Conclusion: This work exposes heightened privacy vulnerabilities in FL’s multi-step update paradigm and offers novel perspectives for developing robust defense strategies against gradient inversion attacks.
Abstract: Federated Learning (FL) preserves privacy by keeping raw data local, yet Gradient Inversion Attacks (GIAs) pose significant threats. In FedAVG multi-step scenarios, attackers observe only aggregated gradients, making data reconstruction challenging. Existing surrogate model methods like SME assume linear parameter trajectories, but we demonstrate this severely underestimates SGD's nonlinear complexity, fundamentally limiting attack effectiveness. We propose Non-Linear Surrogate Model Extension (NL-SME), the first method to introduce nonlinear parametric trajectory modeling for GIAs. Our approach replaces linear interpolation with learnable quadratic Bézier curves that capture SGD's curved characteristics through control points, combined with regularization and dvec scaling mechanisms for enhanced expressiveness. Extensive experiments on CIFAR-100 and FEMNIST datasets show NL-SME significantly outperforms baselines across all metrics, achieving order-of-magnitude improvements in cosine similarity loss while maintaining computational efficiency. This work exposes heightened privacy vulnerabilities in FL's multi-step update paradigm and offers novel perspectives for developing robust defense strategies.
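The trajectory model at the heart of NL-SME is a quadratic Bézier curve between two observed parameter snapshots with a learnable control point. A minimal sketch (variable names are ours; the paper's regularization and dvec scaling are omitted):

```python
import torch

def quadratic_bezier(theta0, theta1, c, t):
    """Point on the quadratic Bézier curve from theta0 to theta1 with
    control point c, for t in [0, 1]."""
    return (1 - t) ** 2 * theta0 + 2 * (1 - t) * t * c + t ** 2 * theta1

theta0, theta1 = torch.zeros(10), torch.ones(10)     # observed round endpoints
c = torch.nn.Parameter((theta0 + theta1) / 2)        # init on the linear path
ts = torch.linspace(0, 1, 5)
path = torch.stack([quadratic_bezier(theta0, theta1, c, t) for t in ts])
```

Optimizing c against the observed aggregated gradients is what lets the surrogate bend away from the linear interpolation that SME assumes.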
[675] Learning Equivariant Functions via Quadratic Forms
Pavan Karjol, Vivek V Kashyap, Rohan Kashyap, Prathosh A P
Main category: cs.LG
TL;DR: A method for learning group equivariant functions by discovering the underlying quadratic form from data, leveraging orthogonal group properties to build simplified neural network architectures with appropriate inductive biases.
Details
Motivation: To develop efficient neural network models that can automatically discover and incorporate underlying symmetry groups from data, enabling better generalization and performance in tasks involving equivariant functions.
Method: Learn the quadratic form x^T A x corresponding to the symmetry group from data, use the diagonal form of the symmetric matrix to incorporate inductive biases, and decompose equivariant functions into norm-invariant and scale-invariant components. Extended to handle tuples of vectors with diagonal group actions.
Result: The framework consistently outperforms baseline methods in discovering underlying symmetries and learning equivariant functions across tasks including polynomial regression, top quark tagging, and moment of inertia matrix prediction.
Conclusion: The proposed approach effectively discovers symmetry groups from data and builds efficient equivariant models through quadratic form learning and proper architectural inductive biases, demonstrating superior performance across multiple applications.
Abstract: In this study, we introduce a method for learning group (known or unknown) equivariant functions by learning the associated quadratic form $x^T A x$ corresponding to the group from the data. Certain groups, known as orthogonal groups, preserve a specific quadratic form, and we leverage this property to uncover the underlying symmetry group under the assumption that it is orthogonal. By utilizing the corresponding unique symmetric matrix and its inherent diagonal form, we incorporate suitable inductive biases into the neural network architecture, leading to models that are both simplified and efficient. Our approach results in an invariant model that preserves norms, while the equivariant model is represented as a product of a norm-invariant model and a scale-invariant model, where the "product" refers to the group action. Moreover, we extend our framework to a more general setting where the function acts on tuples of input vectors via a diagonal (or product) group action. In this extension, the equivariant function is decomposed into an angular component extracted solely from the normalized first vector and a scale-invariant component that depends on the full Gram matrix of the tuple. This decomposition captures the inter-dependencies between multiple inputs while preserving the underlying group symmetry. We assess the effectiveness of our framework across multiple tasks, including polynomial regression, top quark tagging, and moment of inertia matrix prediction. Comparative analysis with baseline methods demonstrates that our model consistently excels in both discovering the underlying symmetry and efficiently learning the corresponding equivariant function.
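As a sketch of the invariant half of this construction, one can parametrize a symmetric matrix A, evaluate the learned quadratic form x^T A x, and feed it to a small network; the diagonal-form parametrization and the norm-invariant/scale-invariant product structure of the full method are omitted here.

```python
import torch
import torch.nn as nn

class QuadraticFormInvariant(nn.Module):
    """Toy invariant model: learn A (symmetric) and predict from x^T A x."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.L = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):                    # x: (batch, dim)
        A = self.L + self.L.T                # symmetrize the learned matrix
        q = torch.einsum("bi,ij,bj->b", x, A, x).unsqueeze(-1)
        return self.mlp(q)                   # depends on x only through x^T A x
```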
[676] SHAKE-GNN: Scalable Hierarchical Kirchhoff-Forest Graph Neural Network
Zhipu Cui, Johannes Lutzeyer
Main category: cs.LG
TL;DR: SHAKE-GNN is a scalable graph-level GNN framework using Kirchhoff Forests for multi-resolution graph decompositions, enabling flexible efficiency-performance trade-offs.
Details
Motivation: Scaling GNNs to large graphs remains challenging, especially for graph-level tasks, requiring more efficient and scalable approaches.
Method: Uses a hierarchy of Kirchhoff Forests (random spanning forests) to create stochastic multi-resolution graph decompositions, with data-driven parameter selection.
Result: Achieves competitive performance on large-scale graph classification benchmarks while offering improved scalability compared to existing methods.
Conclusion: SHAKE-GNN provides an effective solution for scalable graph-level learning with flexible trade-offs between efficiency and performance.
Abstract: Graph Neural Networks (GNNs) have achieved remarkable success across a range of learning tasks. However, scaling GNNs to large graphs remains a significant challenge, especially for graph-level tasks. In this work, we introduce SHAKE-GNN, a novel scalable graph-level GNN framework based on a hierarchy of Kirchhoff Forests, a class of random spanning forests used to construct stochastic multi-resolution decompositions of graphs. SHAKE-GNN produces multi-scale representations, enabling flexible trade-offs between efficiency and performance. We introduce an improved, data-driven strategy for selecting the trade-off parameter and analyse the time-complexity of SHAKE-GNN. Experimental results on multiple large-scale graph classification benchmarks demonstrate that SHAKE-GNN achieves competitive performance while offering improved scalability.
[677] Modeling Psychological Profiles in Volleyball via Mixed-Type Bayesian Networks
Maria Iannario, Dae-Jin Lee, Manuele Leonelli
Main category: cs.LG
TL;DR: The paper introduces latent MMHC, a hybrid structure learning method for discovering directed relationships among mixed-type psychological variables in volleyball players, showing improved performance over existing methods and providing interpretable networks for athlete development.
Details
Motivation: Psychological attributes operate in networks rather than isolation, and there's a need to analyze relationships among mixed-type variables (ordinal, categorical, continuous) in sports psychology to understand how mental skills and traits interact.
Method: Latent MMHC - a hybrid structure learner combining latent Gaussian copula with constraint-based skeleton and constrained score-based refinement to learn directed acyclic graphs (DAGs) from mixed-type data, with bootstrap-aggregated variant for stability.
Result: In simulations, latent MMHC achieved lower structural Hamming distance and higher edge recall than recent copula-based learners while maintaining high specificity. Applied to volleyball data, it revealed networks organizing mental skills around goal setting and self-confidence, with emotional arousal linking motivation and anxiety.
Conclusion: The approach provides an interpretable, data-driven framework for profiling psychological traits in sports and supports decision-making in athlete development through scenario analyses that quantify how improvements in specific skills propagate through the network.
Abstract: Psychological attributes rarely operate in isolation: coaches reason about networks of related traits. We analyze a new dataset of 164 female volleyball players from Italy’s C and D leagues that combines standardized psychological profiling with background information. To learn directed relationships among mixed-type variables (ordinal questionnaire scores, categorical demographics, continuous indicators), we introduce latent MMHC, a hybrid structure learner that couples a latent Gaussian copula and a constraint-based skeleton with a constrained score-based refinement to return a single DAG. We also study a bootstrap-aggregated variant for stability. In simulations spanning sample size, sparsity, and dimension, latent Max-Min Hill-Climbing (MMHC) attains lower structural Hamming distance and higher edge recall than recent copula-based learners while maintaining high specificity. Applied to volleyball, the learned network organizes mental skills around goal setting and self-confidence, with emotional arousal linking motivation and anxiety, and locates Big-Five traits (notably neuroticism and extraversion) upstream of skill clusters. Scenario analyses quantify how improvements in specific skills propagate through the network to shift preparation, confidence, and self-esteem. The approach provides an interpretable, data-driven framework for profiling psychological traits in sport and for decision support in athlete development.
[678] Reversible GNS for Dissipative Fluids with Consistent Bidirectional Dynamics
Mu Huang, Linning Xu, Mingyue Dai, Yidi Shao, Bo Dai
Main category: cs.LG
TL;DR: R-GNS is a reversible graph network simulator that unifies forward and inverse fluid dynamics simulation through mathematically invertible residual reversible message passing, achieving high accuracy and 100x faster inverse inference than optimization-based methods.
Details
Motivation: Inverse inference in dissipative fluid systems is challenging due to irreversible dynamics, slow optimization-based solvers, and convergence failures. Existing neural simulators struggle with accurate backward dynamics approximation.
Method: Proposes Reversible Graph Network Simulator (R-GNS) with bidirectional consistency using residual reversible message passing with shared parameters. Unlike prior methods, it doesn't reverse physics but uses mathematically invertible design coupling forward and inverse inference.
Result: Achieves higher accuracy with 1/4 parameters, 100x faster inverse inference than baselines. Matches GNS speed in forward simulation, eliminates iterative optimization in goal-conditioned tasks, and handles complex target shapes with physically consistent trajectories.
Conclusion: R-GNS is the first reversible framework unifying forward and inverse simulation for dissipative fluid systems, demonstrating efficient and accurate bidirectional fluid dynamics modeling.
Abstract: Simulating physically plausible trajectories toward user-defined goals is a fundamental yet challenging task in fluid dynamics. While particle-based simulators can efficiently reproduce forward dynamics, inverse inference remains difficult, especially in dissipative systems where dynamics are irreversible and optimization-based solvers are slow, unstable, and often fail to converge. In this work, we introduce the Reversible Graph Network Simulator (R-GNS), a unified framework that enforces bidirectional consistency within a single graph architecture. Unlike prior neural simulators that approximate inverse dynamics by fitting backward data, R-GNS does not attempt to reverse the underlying physics. Instead, we propose a mathematically invertible design based on residual reversible message passing with shared parameters, coupling forward dynamics with inverse inference to deliver accurate predictions and efficient recovery of plausible initial states. Experiments on three dissipative benchmarks (Water-3D, WaterRamps, and WaterDrop) show that R-GNS achieves higher accuracy and consistency with only one quarter of the parameters, and performs inverse inference more than 100 times faster than optimization-based baselines. For forward simulation, R-GNS matches the speed of strong GNS baselines, while in goal-conditioned tasks it eliminates iterative optimization and achieves orders-of-magnitude speedups. On goal-conditioned tasks, R-GNS further demonstrates its ability to form complex target shapes (e.g., characters "L" and "N") through vivid, physically consistent trajectories. To our knowledge, this is the first reversible framework that unifies forward and inverse simulation for dissipative fluid systems.
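The invertibility trick underneath residual reversible message passing is standard additive coupling: the inverse is exact by construction, with no separately trained backward model. In the sketch below, F and G are plain MLP stand-ins for the paper's shared-parameter message-passing blocks.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Additive coupling: exactly invertible whatever F and G compute."""
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.G = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x1, x2):               # forward dynamics
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):               # exact inverse, shared parameters
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2
```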
[679] Countering adversarial evasion in regression analysis
David Benfield, Phan Tu Vuong, Alain Zemkoho
Main category: cs.LG
TL;DR: This paper proposes a pessimistic bilevel optimization framework for adversarial regression scenarios, extending previous game-theoretic approaches from classification to regression problems without convexity or uniqueness assumptions.
Details
Motivation: Adversarial evasion poses significant challenges in applications like spam filtering and malware detection, where adversaries adapt data to manipulate prediction models. While game-theoretic approaches have been effective for classification, they haven't been adapted to regression scenarios.
Method: The authors develop a pessimistic bilevel optimization program specifically designed for regression problems that removes assumptions about the convexity and uniqueness of the adversary's optimal strategy.
Result: The proposed framework captures the antagonistic nature of adversaries in regression settings, providing a more robust approach to handling adversarial threats.
Conclusion: This work successfully extends pessimistic bilevel optimization from classification to regression scenarios, offering a more comprehensive defense against adversarial attacks across different types of prediction tasks.
Abstract: Adversarial machine learning challenges the assumption that the underlying distribution remains consistent throughout the training and implementation of a prediction model. In particular, adversarial evasion considers scenarios where adversaries adapt their data to influence particular outcomes from established prediction models. Such scenarios arise in applications such as spam email filtering, malware detection and fake-image generation, where security methods must be actively updated to keep up with the ever-improving generation of malicious data. Game-theoretic models have been shown to be effective at modelling these scenarios and hence training resilient predictors against such adversaries. Recent advancements in the use of pessimistic bilevel optimisation, which remove assumptions about the convexity and uniqueness of the adversary's optimal strategy, have proved to be particularly effective at mitigating threats to classifiers due to their ability to capture the antagonistic nature of the adversary. However, this formulation has not yet been adapted to regression scenarios. This article proposes a pessimistic bilevel optimisation program for regression scenarios which makes no assumptions on the convexity or uniqueness of the adversary's solutions.
[680] Automatic Discovery of One Parameter Subgroups of $SO(n)$
Pavan Karjol, Vivek V Kashyap, Rohan Kashyap, Prathosh A P
Main category: cs.LG
TL;DR: A framework for automatically discovering one-parameter subgroups of SO(n) using Jordan forms of skew-symmetric matrices to establish canonical forms and learn invariant functions.
Details
Motivation: One-parameter subgroups of SO(n) are crucial in robotics, quantum mechanics, and molecular structure analysis, but their automatic discovery remains challenging.
Method: Uses standard Jordan form of skew-symmetric matrices (Lie algebra of SO(n)) to establish canonical forms for orbits and derive standardized representations for invariant functions, then learns parameters to uncover subgroups.
Result: Successfully applied to double pendulum modeling, moment of inertia prediction, top quark tagging, and invariant polynomial regression, recovering meaningful subgroup structure.
Conclusion: The framework effectively discovers one-parameter subgroups and produces interpretable, symmetry-aware representations for various applications.
Abstract: We introduce a novel framework for the automatic discovery of one-parameter subgroups ($H_{\gamma}$) of $SO(3)$ and, more generally, $SO(n)$. One-parameter subgroups of $SO(n)$ are crucial in a wide range of applications, including robotics, quantum mechanics, and molecular structure analysis. Our method utilizes the standard Jordan form of skew-symmetric matrices, which define the Lie algebra of $SO(n)$, to establish a canonical form for orbits under the action of $H_{\gamma}$. This canonical form is then employed to derive a standardized representation for $H_{\gamma}$-invariant functions. By learning the appropriate parameters, the framework uncovers the underlying one-parameter subgroup $H_{\gamma}$. The effectiveness of the proposed approach is demonstrated through tasks such as double pendulum modeling, moment of inertia prediction, top quark tagging and invariant polynomial regression, where it successfully recovers meaningful subgroup structure and produces interpretable, symmetry-aware representations.
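Concretely, a one-parameter subgroup of SO(n) is the curve t ↦ exp(tK) for some skew-symmetric generator K, which is what the learned parameters pin down. A minimal sketch:

```python
import torch

def one_parameter_subgroup(K: torch.Tensor, t: float) -> torch.Tensor:
    """exp(tK) for skew-symmetric K lies in SO(n) for every t."""
    assert torch.allclose(K, -K.T), "generator must be skew-symmetric"
    return torch.matrix_exp(t * K)

# generator of rotations in the (x, y)-plane of R^3
K = torch.tensor([[0., -1., 0.],
                  [1., 0., 0.],
                  [0., 0., 0.]])
R = one_parameter_subgroup(K, torch.pi / 2)   # a 90-degree rotation, det(R) = 1
```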
[681] Mind the Missing: Variable-Aware Representation Learning for Irregular EHR Time Series using Large Language Models
Jeong Eul Kwon, Joo Heung Yoon, Hyo Kyung Lee
Main category: cs.LG
TL;DR: VITAL is a variable-aware LLM framework that handles irregular EHR time series by differentiating between frequently recorded vital signs and sporadic lab tests, using language reprogramming for temporal context and robust missing value handling.
Details
Motivation: To address challenges of irregular sampling and high missingness in EHR time series data, where clinical variables are measured at uneven intervals depending on clinical workflow and intervention timing.
Method: Differentiates vital signs (frequent, temporal patterns) from lab tests (sporadic, no temporal structure). Reprograms vital signs into language space for temporal context and missing value reasoning. Embeds lab variables using summary values or learnable [Not measured] tokens.
Result: Outperforms state-of-the-art methods on PhysioNet benchmark datasets and maintains robust performance under high missingness levels common in real clinical scenarios.
Conclusion: VITAL effectively handles irregular EHR time series by leveraging LLM capabilities with variable-aware processing, demonstrating superior performance in realistic clinical settings with high missing data.
Abstract: Irregular sampling and high missingness are intrinsic challenges in modeling time series derived from electronic health records (EHRs), where clinical variables are measured at uneven intervals depending on workflow and intervention timing. To address this, we propose VITAL, a variable-aware, large language model (LLM)-based framework tailored for learning from irregularly sampled physiological time series. VITAL differentiates between two distinct types of clinical variables: vital signs, which are frequently recorded and exhibit temporal patterns, and laboratory tests, which are measured sporadically and lack temporal structure. It reprograms vital signs into the language space, enabling the LLM to capture temporal context and reason over missing values through explicit encoding. In contrast, laboratory variables are embedded either using representative summary values or a learnable [Not measured] token, depending on their availability. Extensive evaluations on benchmark datasets from PhysioNet demonstrate that VITAL outperforms state-of-the-art methods designed for irregular time series. Furthermore, it maintains robust performance under high levels of missingness, which are prevalent in real-world clinical scenarios where key variables are often unavailable.
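The lab-variable branch can be sketched as follows: observed labs embed a scalar summary value, unobserved ones fall back to a learnable [Not measured] token. Class and parameter names here are ours, not the paper's.

```python
import torch
import torch.nn as nn

class LabEmbedder(nn.Module):
    """Per-lab embedding: summary value if observed, else a learned token."""
    def __init__(self, n_labs, dim):
        super().__init__()
        self.value_proj = nn.Linear(1, dim)                          # embed scalars
        self.not_measured = nn.Parameter(torch.randn(n_labs, dim))   # one token per lab

    def forward(self, values, observed):     # (batch, n_labs), bool mask
        emb = self.value_proj(values.unsqueeze(-1))                  # (batch, n_labs, dim)
        return torch.where(observed.unsqueeze(-1), emb, self.not_measured)

emb = LabEmbedder(n_labs=5, dim=32)
out = emb(torch.randn(2, 5), torch.rand(2, 5) > 0.5)                 # (2, 5, 32)
```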
[682] Slicing Wasserstein Over Wasserstein Via Functional Optimal Transport
Moritz Piening, Robert Beinert
Main category: cs.LG
TL;DR: The paper introduces Double-Sliced Wasserstein (DSW) as a computationally efficient and numerically stable alternative to Wasserstein over Wasserstein (WoW) distances for comparing datasets or distributions over images and shapes.
Details
Motivation: Existing sliced WoW accelerations suffer from numerical instability due to reliance on parametric meta-measures or high-order moments, making them computationally costly and unstable.
Method: Leverages the isometry between 1d Wasserstein space and quantile functions in L_2([0,1]), introduces a general sliced Wasserstein framework for arbitrary Banach spaces, and defines DSW via infinite-dimensional L_2-projections parametrized by Gaussian processes combined with integration over the Euclidean unit sphere.
Result: DSW minimization is equivalent to WoW minimization for discretized meta-measures while avoiding unstable higher-order moments and achieving computational savings. Numerical experiments validate DSW as a scalable substitute for WoW distance.
Conclusion: DSW provides a numerically stable and computationally efficient alternative to WoW distances for comparing meta-measures, with applications in datasets, shapes, and images.
Abstract: Wasserstein distances define a metric between probability measures on arbitrary metric spaces, including meta-measures (measures over measures). The resulting Wasserstein over Wasserstein (WoW) distance is a powerful, but computationally costly tool for comparing datasets or distributions over images and shapes. Existing sliced WoW accelerations rely on parametric meta-measures or the existence of high-order moments, leading to numerical instability. As an alternative, we propose to leverage the isometry between the 1d Wasserstein space and the quantile functions in the function space $L_2([0,1])$. For this purpose, we introduce a general sliced Wasserstein framework for arbitrary Banach spaces. Due to the 1d Wasserstein isometry, this framework defines a sliced distance between 1d meta-measures via infinite-dimensional $L_2$-projections, parametrized by Gaussian processes. Combining this 1d construction with classical integration over the Euclidean unit sphere yields the double-sliced Wasserstein (DSW) metric for general meta-measures. We show that DSW minimization is equivalent to WoW minimization for discretized meta-measures, while avoiding unstable higher-order moments and offering computational savings. Numerical experiments on datasets, shapes, and images validate DSW as a scalable substitute for the WoW distance.
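The 1d isometry the construction rests on is easy to state: for equal-size empirical measures on the real line, the 2-Wasserstein distance is the L_2 distance between quantile functions, i.e. between sorted samples. A small numpy sketch:

```python
import numpy as np

def w2_1d(x, y):
    """W_2 between two equal-size 1d empirical measures via sorted samples."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

rng = np.random.default_rng(0)
a, b = rng.normal(0, 1, 1000), rng.normal(2, 1, 1000)
print(w2_1d(a, b))   # approximately the mean shift, 2
```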
[683] Fairness-Aware Reinforcement Learning (FAReL): A Framework for Transparent and Balanced Sequential Decision-Making
Alexandra Cimpean, Nicole Orzan, Catholijn Jonker, Pieter Libin, Ann Nowé
Main category: cs.LG
TL;DR: A framework for exploring performance-fairness trade-offs in sequential decision problems using extended Markov decision processes (fMDP) that explicitly model individuals and groups, with applications in job hiring and fraud detection.
Details
Motivation: Need for algorithms that can make transparent trade-offs between performance and fairness in real-world sequential decision problems, where the optimal trade-off is hard to specify in advance.
Method: Proposed fMDP (extended Markov decision process) that explicitly encodes individuals and groups, formalizes fairness notions in sequential contexts, and computes fairness measures over time to explore multiple performance-fairness trade-offs.
Result: Framework learns policies that are more fair across multiple scenarios with only minor performance loss, and shows that group and individual fairness notions don’t necessarily imply each other.
Conclusion: The framework enables stakeholders to explore multiple trade-offs and select appropriate policies, with guidelines provided for applying it across different problem settings.
Abstract: Equity in real-world sequential decision problems can be enforced using fairness-aware methods. Therefore, we require algorithms that can make suitable and transparent trade-offs between performance and the desired fairness notions. As the desired performance-fairness trade-off is hard to specify a priori, we propose a framework where multiple trade-offs can be explored. Insights provided by the reinforcement learning algorithm regarding the obtainable performance-fairness trade-offs can then guide stakeholders in selecting the most appropriate policy. To capture fairness, we propose an extended Markov decision process, $f$MDP, that explicitly encodes individuals and groups. Given this $f$MDP, we formalise fairness notions in the context of sequential decision problems and formulate a fairness framework that computes fairness measures over time. We evaluate our framework in two scenarios with distinct fairness requirements: job hiring, where strong teams must be composed while treating applicants equally, and fraud detection, where fraudulent transactions must be detected while ensuring the burden on customers is fairly distributed. We show that our framework learns policies that are more fair across multiple scenarios, with only a minor loss in performance. Moreover, we observe that group and individual fairness notions do not necessarily imply one another, highlighting the benefit of our framework in settings where both fairness types are desired. Finally, we provide guidelines on how to apply this framework across different problem settings.
[684] Mechanistic Independence: A Principle for Identifiable Disentangled Representations
Stefan Matthes, Zhiwei Han, Hao Shen
Main category: cs.LG
TL;DR: A unified framework for disentangled representations using mechanistic independence criteria that achieves identifiability without relying on latent distribution assumptions.
Details
Motivation: Current disentangled representation methods lack full understanding of identifiability conditions, particularly when latent factors may be statistically dependent.
Method: Proposes mechanistic independence framework with various criteria (support-based, sparsity-based, higher-order) that characterize latent factors by how they act on observed variables rather than their distribution.
Result: Shows that each independence criterion yields identifiability of latent subspaces even under nonlinear, non-invertible mixing, and establishes a hierarchy among criteria with graph-theoretic characterization.
Conclusion: The framework clarifies conditions for disentangled representation identifiability without statistical assumptions, providing a unified perspective invariant to changes in latent density.
Abstract: Disentangled representations seek to recover latent factors of variation underlying observed data, yet their identifiability is still not fully understood. We introduce a unified framework in which disentanglement is achieved through mechanistic independence, which characterizes latent factors by how they act on observed variables rather than by their latent distribution. This perspective is invariant to changes of the latent density, even when such changes induce statistical dependencies among factors. Within this framework, we propose several related independence criteria – ranging from support-based and sparsity-based to higher-order conditions – and show that each yields identifiability of latent subspaces, even under nonlinear, non-invertible mixing. We further establish a hierarchy among these criteria and provide a graph-theoretic characterization of latent subspaces as connected components. Together, these results clarify the conditions under which disentangled representations can be identified without relying on statistical assumptions.
[685] ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
Xiaoyang Liu, Tao Zhu, Zineng Dong, Yuntian Liu, Qingfeng Guo, Zhaoxuan Liu, Yu Chen, Tao Luo
Main category: cs.LG
TL;DR: ASSESS is a new framework that combines semantic and structural information to evaluate formal statement similarity, addressing limitations of existing metrics that either focus only on syntax or semantics but not both.
Details
Motivation: Existing metrics for formal statement similarity fail to balance semantic and structural information - string-based methods ignore semantics while proof-based methods lack graded similarity scores and structural awareness.
Method: The framework transforms formal statements into Operator Trees to capture syntactic structure, then computes similarity using TransTED (Transformation Tree Edit Distance) Similarity metric that enhances traditional Tree Edit Distance with semantic awareness through transformations.
Result: Experiments on the new EPLA benchmark (524 expert-annotated formal statement pairs) show TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and highest Kappa coefficient.
Conclusion: ASSESS provides a comprehensive evaluation framework that successfully integrates both semantic and structural information for formal statement similarity assessment, demonstrating superior performance over existing approaches.
Abstract: Statement autoformalization, the automated translation of statements from natural language into formal languages, has seen significant advancements, yet the development of automated evaluation metrics remains limited. Existing metrics for formal statement similarity often fail to balance semantic and structural information. String-based approaches capture syntactic structure but ignore semantic meaning, whereas proof-based methods validate semantic equivalence but disregard structural nuances and, critically, provide no graded similarity score in the event of proof failure. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which comprehensively integrates semantic and structural information to provide a continuous similarity score. Our framework first transforms formal statements into Operator Trees to capture their syntactic structure and then computes a similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric, which enhances traditional Tree Edit Distance by incorporating semantic awareness through transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a new benchmark of 524 expert-annotated formal statement pairs derived from miniF2F and ProofNet, with labels for both semantic provability and structural likeness. Experiments on EPLA demonstrate that TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and the highest Kappa coefficient. The benchmark and implementation code will be made public soon.
[686] Kernel Regression of Multi-Way Data via Tensor Trains with Hadamard Overparametrization: The Dynamic Graph Flow Case
Duc Thien Nguyen, Konstantinos Slavakis, Eleftherios Kofidis, Dimitris Pados
Main category: cs.LG
TL;DR: KReTTaH is a regression-based framework for interpretable multi-way data imputation that uses kernel regression with tensor-train decomposition and Hadamard overparametrization for parameter efficiency.
Details
Motivation: To develop an interpretable and parameter-efficient method for multi-way data imputation that can handle missing data in complex structures like dynamic graph flows.
Method: Uses nonparametric regression via reproducing kernel Hilbert spaces with tensor-train rank tensors on Riemannian manifolds, enhanced by Hadamard overparametrization for sparsity. Learning is done by solving smooth inverse problems on the manifold.
Result: KReTTaH consistently outperforms state-of-the-art alternatives (including nonparametric tensor and neural network methods) for imputing missing time-varying edge flows on real-world graph datasets.
Conclusion: The framework provides an effective and interpretable solution for multi-way data imputation, particularly for dynamic graph flow estimation, with demonstrated superiority over existing methods.
Abstract: A regression-based framework for interpretable multi-way data imputation, termed Kernel Regression via Tensor Trains with Hadamard overparametrization (KReTTaH), is introduced. KReTTaH adopts a nonparametric formulation by casting imputation as regression via reproducing kernel Hilbert spaces. Parameter efficiency is achieved through tensors of fixed tensor-train (TT) rank, which reside on low-dimensional Riemannian manifolds, and is further enhanced via Hadamard overparametrization, which promotes sparsity within the TT parameter space. Learning is accomplished by solving a smooth inverse problem posed on the Riemannian manifold of fixed TT-rank tensors. As a representative application, the estimation of dynamic graph flows is considered. In this setting, KReTTaH exhibits flexibility by seamlessly incorporating graph-based (topological) priors via its inverse problem formulation. Numerical tests on real-world graph datasets demonstrate that KReTTaH consistently outperforms state-of-the-art alternatives, including nonparametric tensor-based and neural-network-based methods, for imputing missing, time-varying edge flows.
[687] A Law of Data Reconstruction for Random Features (and Beyond)
Leonardo Iurada, Simone Bombari, Tatiana Tommasi, Marco Mondelli
Main category: cs.LG
TL;DR: The paper shows that when the number of parameters p exceeds dn (data dimensionality × number of samples), deep learning models can fully reconstruct the entire training dataset from model parameters.
Details
Motivation: To understand memorization in deep learning from a data reconstruction perspective, going beyond classical interpolation theory that focuses on label fitting.
Method: Theoretical analysis in random features model plus optimization-based reconstruction method applied to various architectures (random features, two-layer networks, ResNets).
Result: Demonstrates successful reconstruction of entire training datasets when p > dn, revealing a threshold for complete data memorization.
Conclusion: Establishes a law of data reconstruction where full training data recovery becomes possible once model parameters exceed the dn threshold.
Abstract: Large-scale deep learning models are known to memorize parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. In this work, we consider memorization from the perspective of data reconstruction, demonstrating that this can be achieved when $p$ is larger than $dn$, where $d$ is the dimensionality of the data. More specifically, we show that, in the random features model, when $p \gg dn$, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.
[688] Wavelet-Induced Rotary Encodings: RoPE Meets Graphs
Isaac Reid, Arijit Sehanobish, Cedrik Höfs, Bruno Mlodozeniec, Leonhard Vulpius, Federico Barbero, Adrian Weller, Krzysztof Choromanski, Richard E. Turner, Petar Veličković
Main category: cs.LG
TL;DR: WIRE extends Rotary Position Encodings (RoPE) to graph data, offering theoretical advantages like permutation equivariance and compatibility with linear attention, while being effective in graph-structured tasks.
Details
Motivation: To generalize position encoding methods from sequence and grid data to arbitrary graph-structured data, addressing the need for effective positional representations in graph neural networks.
Method: WIRE uses wavelet-induced rotary encodings that extend RoPE to graphs, maintaining desirable properties like permutation equivariance and working with linear attention mechanisms.
Result: WIRE demonstrates effectiveness in synthetic tasks (identifying monochromatic subgraphs), real-world applications (point cloud segmentation), and standard graph benchmarks, particularly when graph structure is important.
Conclusion: WIRE provides a principled extension of position encodings to graphs with strong theoretical foundations and practical effectiveness across various graph-based tasks.
Abstract: We introduce WIRE: Wavelet-Induced Rotary Encodings. WIRE extends Rotary Position Encodings (RoPE), a popular algorithm in LLMs and ViTs, to graph-structured data. We demonstrate that WIRE is more general than RoPE, recovering the latter in the special case of grid graphs. WIRE also enjoys a host of desirable theoretical properties, including equivariance under node ordering permutation, compatibility with linear attention, and (under select assumptions) asymptotic dependence on graph resistive distance. We test WIRE on a range of synthetic and real-world tasks, including identifying monochromatic subgraphs, semantic segmentation of point clouds, and more standard graph benchmarks. We find it to be effective in settings where the underlying graph structure is important.
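For orientation, the grid-graph special case that WIRE recovers is standard RoPE, which rotates consecutive feature pairs by position-dependent angles. A minimal 1d-sequence sketch (our variable names):

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    """Rotate each consecutive feature pair of x by an angle pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2) / d)      # (d/2,) frequencies
    ang = pos[:, None] * freqs[None, :]               # (seq, d/2) angles
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

x = torch.randn(32, 64)                               # 32 tokens, 64-dim features
rotated = rope_rotate(x, torch.arange(32).float())
```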
[689] Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning
Nakyeong Yang, Dong-Kyum Kim, Jea Kwon, Minsung Kim, Kyomin Jung, Meeyoung Cha
Main category: cs.LG
TL;DR: The paper introduces Ssiuu, a new unlearning method that uses attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge from language models, outperforming existing methods.
Details
Motivation: Current unlearning methods for language models are vulnerable to "relearning" during subsequent training, allowing forgotten knowledge to resurface, which poses privacy risks.
Method: Ssiuu employs attribution-guided regularization to prevent spurious unlearning neurons and faithfully remove target knowledge, addressing the shallow alignment problem in existing methods.
Result: Experimental results show Ssiuu reliably erases target knowledge and outperforms strong baselines in both adversarial injection of private data and benign attack scenarios using instruction-following benchmarks.
Conclusion: The findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models to address privacy concerns.
Abstract: Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to “relearning” during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.
[690] Towards a more realistic evaluation of machine learning models for bearing fault diagnosis
João Paulo Vieira, Victor Afonso Bauler, Rodrigo Kobashikawa Rosa, Danilo Silva
Main category: cs.LG
TL;DR: This paper addresses data leakage issues in bearing fault diagnosis using machine learning, proposing a leakage-free evaluation methodology with bearing-wise data partitioning and multi-label classification to improve real-world generalization.
Details
Motivation: Current ML methods for bearing fault diagnosis often fail in real-world applications due to data leakage from improper dataset partitioning strategies that create spurious correlations and inflate performance metrics.
Method: Proposes a rigorous leakage-free evaluation methodology using bearing-wise data partitioning (no overlap between training and testing physical components), reformulates classification as multi-label problem, and evaluates on three datasets (CWRU, Paderborn University, University of Ottawa).
Result: Demonstrates that common dataset partitioning strategies introduce data leakage and spurious correlations, while bearing-wise partitioning prevents leakage. Shows dataset diversity (number of unique training bearings) is crucial for robust performance.
Conclusion: Highlights the importance of leakage-aware evaluation protocols and provides practical guidelines for dataset partitioning, model selection, and validation to develop more trustworthy ML systems for industrial fault diagnosis.
Abstract: Reliable detection of bearing faults is essential for maintaining the safety and operational efficiency of rotating machinery. While recent advances in machine learning (ML), particularly deep learning, have shown strong performance in controlled settings, many studies fail to generalize to real-world applications due to methodological flaws, most notably data leakage. This paper investigates the issue of data leakage in vibration-based bearing fault diagnosis and its impact on model evaluation. We demonstrate that common dataset partitioning strategies, such as segment-wise and condition-wise splits, introduce spurious correlations that inflate performance metrics. To address this, we propose a rigorous, leakage-free evaluation methodology centered on bearing-wise data partitioning, ensuring no overlap between the physical components used for training and testing. Additionally, we reformulate the classification task as a multi-label problem, enabling the detection of co-occurring fault types and the use of prevalence-independent metrics such as Macro AUROC. Beyond preventing leakage, we also examine the effect of dataset diversity on generalization, showing that the number of unique training bearings is a decisive factor for achieving robust performance. We evaluate our methodology on three widely adopted datasets: CWRU, Paderborn University (PU), and University of Ottawa (UORED-VAFCLS). This study highlights the importance of leakage-aware evaluation protocols and provides practical guidelines for dataset partitioning, model selection, and validation, fostering the development of more trustworthy ML systems for industrial fault diagnosis applications.
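In practice, the leakage-free protocol amounts to grouping the split by physical bearing ID. A minimal sketch with scikit-learn on toy stand-in data (the real pipeline splits vibration segments with multi-label fault targets):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                  # 1000 vibration segments
y = rng.integers(0, 2, size=(1000, 3))           # multi-label fault indicators
bearing_ids = rng.integers(0, 20, size=1000)     # 20 distinct physical bearings

# bearing-wise partition: no physical bearing appears in both splits
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=bearing_ids))
assert set(bearing_ids[train_idx]).isdisjoint(bearing_ids[test_idx])
```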
[691] Fine-Grained Uncertainty Decomposition in Large Language Models: A Spectral Approach
Nassim Walha, Sebastian G. Gruber, Thomas Decker, Yinchong Yang, Alireza Javanmardi, Eyke Hüllermeier, Florian Buettner
Main category: cs.LG
TL;DR: Spectral Uncertainty is a novel method using Von Neumann entropy to quantify and decompose uncertainty in LLMs into aleatoric and epistemic components, outperforming existing methods.
Details
Motivation: As LLMs are increasingly used in applications, reliable uncertainty measures are crucial. Distinguishing between aleatoric uncertainty (from data ambiguities) and epistemic uncertainty (from model limitations) is essential to address each source effectively.
Method: Leverages Von Neumann entropy from quantum information theory to separate total uncertainty into aleatoric and epistemic components. Incorporates fine-grained semantic similarity representation for nuanced differentiation among semantic interpretations.
Result: Empirical evaluations show Spectral Uncertainty outperforms state-of-the-art methods in estimating both aleatoric and total uncertainty across diverse models and benchmark datasets.
Conclusion: Spectral Uncertainty provides a rigorous theoretical foundation for uncertainty quantification in LLMs and demonstrates superior performance compared to existing baseline methods.
Abstract: As Large Language Models (LLMs) are increasingly integrated in diverse applications, obtaining reliable measures of their predictive uncertainty has become critically important. A precise distinction between aleatoric uncertainty, arising from inherent ambiguities within input data, and epistemic uncertainty, originating exclusively from model limitations, is essential to effectively address each uncertainty source. In this paper, we introduce Spectral Uncertainty, a novel approach to quantifying and decomposing uncertainties in LLMs. Leveraging the Von Neumann entropy from quantum information theory, Spectral Uncertainty provides a rigorous theoretical foundation for separating total uncertainty into distinct aleatoric and epistemic components. Unlike existing baseline methods, our approach incorporates a fine-grained representation of semantic similarity, enabling nuanced differentiation among various semantic interpretations in model responses. Empirical evaluations demonstrate that Spectral Uncertainty outperforms state-of-the-art methods in estimating both aleatoric and total uncertainty across diverse models and benchmark datasets.
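At its core the estimator takes the Von Neumann entropy of a unit-trace PSD matrix built from pairwise semantic similarities of sampled responses; exactly how the paper constructs that matrix and splits the aleatoric and epistemic components is not reproduced in this sketch.

```python
import numpy as np

def von_neumann_entropy(S):
    """Entropy -tr(rho log rho) of a similarity matrix normalized to unit trace."""
    rho = S / np.trace(S)
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]                 # drop numerical zeros
    return float(-(evals * np.log(evals)).sum())

# three sampled responses: two near-paraphrases plus one divergent answer
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.2],
              [0.2, 0.2, 1.0]])
print(von_neumann_entropy(S))   # higher when responses disagree semantically
```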
[692] HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space
Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang
Main category: cs.LG
TL;DR: HEAPr is a novel pruning algorithm for Mixture-of-Experts (MoE) LLMs that decomposes experts into atomic experts for fine-grained pruning using second-order information, achieving nearly lossless compression at 20-25% ratios while reducing FLOPs by 20%.
Details
Motivation: MoE architectures have large parameter counts leading to prohibitive memory requirements, and existing expert-level pruning methods cause substantial accuracy degradation due to coarse granularity.
Method: HEAPr decomposes experts into atomic experts and uses second-order information similar to Optimal Brain Surgeon theory, transforming the computation to reduce space complexity from O(d^4) to O(d^2) with only two forward passes and one backward pass on a small calibration set.
Result: HEAPr outperforms existing expert-level pruning methods across various compression ratios and benchmarks, achieving nearly lossless compression at 20-25% ratios while reducing FLOPs by nearly 20% on models like DeepSeek MoE and Qwen MoE.
Conclusion: HEAPr enables precise and flexible atomic expert pruning for MoE models, addressing memory limitations while maintaining performance, making practical deployment more feasible.
Abstract: Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where d is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of 20%-25% in most models, while also reducing FLOPs by nearly 20%. The code can be found at https://github.com/LLIKKE/HEAPr.
[693] Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics
Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang
Main category: cs.LG
TL;DR: PatchMoE is a novel Mixture-of-Experts framework for time series analysis that introduces task-aware routing and channel correlation modeling to overcome limitations of traditional MoE approaches in time series tasks.
Details
Motivation: Traditional Mixture-of-Experts architectures are task-agnostic and lack capability to model channel correlations, making them suboptimal for versatile time series analytics tasks like forecasting, classification, and imputation.
Method: Proposes Recurrent Noisy Gating to utilize hierarchical information for task-specific routing, operates routing on time series tokens in temporal and channel dimensions, and uses Temporal & Channel Load Balancing Loss to model correlations.
Result: Comprehensive experiments on five downstream tasks demonstrate state-of-the-art performance, showing effectiveness across various time series applications.
Conclusion: PatchMoE successfully addresses the limitations of traditional MoE in time series analysis by providing task-aware routing and effective modeling of temporal and channel correlations, achieving superior performance across multiple tasks.
Abstract: Time Series Analysis is widely used in various real-world applications such as weather forecasting, financial fraud detection, imputation for missing data in IoT systems, and classification for action recognition. Mixture-of-Experts (MoE), as a powerful architecture, though demonstrating effectiveness in NLP, still falls short in adapting to versatile tasks in time series analytics due to its task-agnostic router and the lack of capability in modeling channel correlations. In this study, we propose a novel, general MoE-based time series framework called PatchMoE to support the intricate “knowledge” utilization for distinct tasks, making it task-aware. Based on the observation that hierarchical representations often vary across tasks, e.g., forecasting vs. classification, we propose a Recurrent Noisy Gating to utilize the hierarchical information in routing, thus obtaining task-specific capability. The routing strategy operates on time series tokens in both temporal and channel dimensions, and is encouraged by a meticulously designed Temporal & Channel Load Balancing Loss to model the intricate temporal and channel correlations. Comprehensive experiments on five downstream tasks demonstrate the state-of-the-art performance of PatchMoE.
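PatchMoE's Recurrent Noisy Gating extends the standard noisy top-k gate of Shazeer et al.; a sketch of that base mechanism (the recurrent, hierarchy-conditioned part is omitted, and parameter names are ours):

```python
import torch
import torch.nn.functional as F

def noisy_topk_gating(x: torch.Tensor, w_gate: torch.Tensor,
                      w_noise: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Noisy top-k gate: perturb gate logits with input-dependent
    Gaussian noise, keep the k largest, renormalise with softmax."""
    clean = x @ w_gate                                        # (batch, num_experts)
    noisy = clean + torch.randn_like(clean) * F.softplus(x @ w_noise)
    topk = noisy.topk(k, dim=-1)
    masked = torch.full_like(noisy, float("-inf"))
    masked.scatter_(-1, topk.indices, topk.values)
    return F.softmax(masked, dim=-1)  # zero weight on non-selected experts
```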
[694] Adaptive Policy Backbone via Shared Network
Bumgeun Park, Donghwan Lee
Main category: cs.LG
TL;DR: APB is a meta-transfer RL method that uses lightweight linear layers around a shared backbone for parameter-efficient fine-tuning, enabling better adaptation to out-of-distribution tasks.
Details
Motivation: RL requires extensive interaction data and existing methods struggle with task mismatch, especially in out-of-distribution settings where prior knowledge degrades.
Method: Insert lightweight linear layers before and after a shared backbone network, enabling parameter-efficient fine-tuning while preserving prior knowledge during adaptation.
Result: APB improves sample efficiency over standard RL and successfully adapts to out-of-distribution tasks where existing meta-RL baselines typically fail.
Conclusion: APB provides an effective approach for meta-transfer RL that maintains prior knowledge while enabling efficient adaptation to new tasks, including challenging out-of-distribution scenarios.
Abstract: Reinforcement learning (RL) has achieved impressive results across domains, yet learning an optimal policy typically requires extensive interaction data, limiting practical deployment. A common remedy is to leverage priors, such as pre-collected datasets or reference policies, but their utility degrades under task mismatch between training and deployment. While prior work has sought to address this mismatch, it has largely been restricted to in-distribution settings. To address this challenge, we propose Adaptive Policy Backbone (APB), a meta-transfer RL method that inserts lightweight linear layers before and after a shared backbone, thereby enabling parameter-efficient fine-tuning (PEFT) while preserving prior knowledge during adaptation. Our results show that APB improves sample efficiency over standard RL and adapts to out-of-distribution (OOD) tasks where existing meta-RL baselines typically fail.
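The described architecture is simple enough to sketch directly; class and argument names below are ours, and the freezing policy during adaptation is our assumption:

```python
import torch.nn as nn

class AdaptivePolicyBackbone(nn.Module):
    """Sketch of the APB idea: trainable linear adapters wrap a shared,
    frozen backbone (a plausible reading of the paper's design)."""

    def __init__(self, backbone: nn.Module, obs_dim: int, hid_dim: int, act_dim: int):
        super().__init__()
        self.pre = nn.Linear(obs_dim, hid_dim)    # task-specific input map
        self.backbone = backbone                  # shared across tasks
        self.post = nn.Linear(hid_dim, act_dim)   # task-specific output map
        for p in self.backbone.parameters():      # preserve prior knowledge
            p.requires_grad = False

    def forward(self, obs):
        return self.post(self.backbone(self.pre(obs)))
```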
[695] Conditional Denoising Diffusion Autoencoders for Wireless Semantic Communications
Mehdi Letafati, Samad Ali, Matti Latva-aho
Main category: cs.LG
TL;DR: The paper proposes diffusion autoencoder models for wireless semantic communication, addressing limitations of traditional autoencoder-based approaches by learning semantic-to-clean mapping using conditional diffusion models.
Details
Motivation: Existing semantic communication systems focus on channel-adaptive neural encoding-decoding but lack full exploration of signal distribution and suffer from scalability issues due to tightly coupled encoder-decoder architectures.
Method: A neural encoder extracts high-level semantics at the transmitter, and a conditional diffusion model at the receiver performs signal-space denoising using received semantic latents as conditioning input to guide the decoding process.
Result: The proposed decoder is analytically proven to be a consistent estimator of ground-truth data. Extensive simulations on CIFAR-10 and MNIST datasets show superior performance compared to legacy autoencoders and VAEs, with extensions to multi-user scenarios.
Conclusion: Diffusion autoencoder models effectively address scalability and distribution modeling issues in semantic communication, providing a robust framework for semantic-to-clean mapping with theoretical guarantees and practical performance improvements.
Abstract: Semantic communication (SemCom) systems aim to learn the mapping from low-dimensional semantics to high-dimensional ground-truth. While this is more akin to a “domain translation” problem, existing frameworks typically emphasize channel-adaptive neural encoding-decoding schemes, lacking full exploration of signal distribution. Moreover, such methods so far have employed autoencoder-based architectures, where the encoding is tightly coupled to a matched decoder, causing scalability issues in practice. To address these gaps, diffusion autoencoder models are proposed for wireless SemCom. The goal is to learn a “semantic-to-clean” mapping, from the semantic space to the ground-truth probability distribution. A neural encoder at the semantic transmitter extracts the high-level semantics, and a conditional diffusion model (CDiff) at the semantic receiver exploits the source distribution for signal-space denoising, while the received semantic latents are incorporated as the conditioning input to “steer” the decoding process towards the semantics intended by the transmitter. It is analytically proved that the proposed decoder model is a consistent estimator of the ground-truth data. Furthermore, extensive simulations over CIFAR-10 and MNIST datasets are provided along with design insights, highlighting the performance compared to legacy autoencoders and variational autoencoders (VAE). Simulations are further extended to multi-user SemCom, identifying the dominating factors in a more realistic setup.
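The receiver-side decoding is an ordinary conditional denoising loop; below is a textbook DDPM reverse step with the received semantic latent as conditioning input. The paper's CDiff parameterisation may differ, and `eps_model` is a hypothetical noise predictor:

```python
import torch

@torch.no_grad()
def ddpm_reverse_step(eps_model, x_t, t, semantic_latent, alphas, alphas_bar, betas):
    """One conditional DDPM reverse step: the semantic latent steers the
    denoiser toward the transmitted semantics. Schedule tensors
    (alphas, alphas_bar, betas) are assumed precomputed."""
    eps = eps_model(x_t, t, semantic_latent)
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mean                                   # final, noise-free sample
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
```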
[696] Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments
Hyunwoo Kim, Junha Lee, Mincheol Choi, Jeonghwan Lee, Jaeshin Cho
Main category: cs.LG
TL;DR: Progressive Weight Loading (PWL) enables fast initial inference using lightweight student models, then incrementally replaces layers with teacher model weights to improve accuracy without compromising initial speed.
Details
Motivation: Address the trade-off between model compression via Knowledge Distillation and performance loss, particularly in mobile/latency-sensitive environments where frequent model loading and initial inference speed are critical.
Method: PWL first deploys a lightweight student model, then progressively replaces its layers with pre-trained teacher model layers. A training method aligns intermediate feature representations between student and teacher layers while improving student output performance.
Result: Experiments on VGG, ResNet, and ViT show PWL maintains competitive distillation performance, gradually improves accuracy as teacher layers are loaded, and matches final teacher model accuracy without compromising initial inference speed.
Conclusion: PWL is well-suited for dynamic, resource-constrained deployments where both responsiveness and performance are critical, offering a practical solution to the speed-accuracy trade-off in model compression.
Abstract: Deep learning models have become increasingly large and complex, resulting in higher memory consumption and computational demands. Consequently, model loading times and initial inference latency have increased, posing significant challenges in mobile and latency-sensitive environments where frequent model loading and unloading are required, which directly impacts user experience. While Knowledge Distillation (KD) offers a solution by compressing large teacher models into smaller student ones, it often comes at the cost of reduced performance. To address this trade-off, we propose Progressive Weight Loading (PWL), a novel technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model. To support seamless layer substitution, we introduce a training method that not only aligns intermediate feature representations between student and teacher layers, but also improves the overall output performance of the student model. Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded, matching the final accuracy of the full teacher model without compromising initial inference speed. This makes PWL particularly suited for dynamic, resource-constrained deployments where both responsiveness and performance are critical.
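A minimal sketch of the layer-substitution step, assuming student and teacher layers share interfaces thanks to the alignment training (the helper name is ours):

```python
import torch.nn as nn

def progressive_stack(student: nn.ModuleList, teacher: nn.ModuleList,
                      n_loaded: int) -> nn.Sequential:
    """Hybrid stack once `n_loaded` teacher layers have been loaded:
    teacher layers replace the student's from the bottom up. The
    feature-alignment training is what makes the splice seamless."""
    return nn.Sequential(*teacher[:n_loaded], *student[n_loaded:])
```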
[697] A Multi-Level Framework for Multi-Objective Hypergraph Partitioning: Combining Minimum Spanning Tree and Proximal Gradient
Yingying Li, Mingxuan Xie, Hailong You, Yongqiang Yao, Hongwei Liu
Main category: cs.LG
TL;DR: Efficient hypergraph partitioning using multi-objective non-convex constrained relaxation with MST-based strategies and refinement techniques, achieving 2-5% cut size reduction vs KaHyPar and up to 35% improvement on specific instances.
Details
Motivation: To develop a more efficient hypergraph partitioning framework that avoids local optima and enhances partition quality through diverse vertex features and scalable MST-based approaches.
Method: Multi-objective non-convex constrained relaxation model with accelerated proximal gradient algorithm for diverse k-dimensional vertex features. Uses Prim algorithm for small-scale data and representative node subset MST for large-scale data, plus refinement strategies including greedy migration, swapping, and recursive MST-based clustering.
Result: Achieves 2-5% average cut size reduction compared to KaHyPar in 2,3,4-way partitioning, with up to 35% improvement on specific instances. Outperforms KaHyPar, hMetis, Mt-KaHyPar, and K-SpecPart on weighted vertex sets. Refinement strategy improves hMetis partitions by up to 16%.
Conclusion: The proposed framework demonstrates superior partitioning quality and competitiveness, validated through comprehensive evaluation and parameter sensitivity analysis.
Abstract: This paper proposes an efficient hypergraph partitioning framework based on a novel multi-objective non-convex constrained relaxation model. A modified accelerated proximal gradient algorithm is employed to generate diverse $k$-dimensional vertex features to avoid local optima and enhance partition quality. Two MST-based strategies are designed for different data scales: for small-scale data, the Prim algorithm constructs a minimum spanning tree followed by pruning and clustering; for large-scale data, a subset of representative nodes is selected to build a smaller MST, while the remaining nodes are assigned accordingly to reduce complexity. To further improve partitioning results, refinement strategies including greedy migration, swapping, and recursive MST-based clustering are introduced for partitions. Experimental results on public benchmark sets demonstrate that the proposed algorithm achieves reductions in cut size of approximately 2%–5% on average compared to KaHyPar in 2, 3, and 4-way partitioning, with improvements of up to 35% on specific instances. Particularly on weighted vertex sets, our algorithm outperforms state-of-the-art partitioners including KaHyPar, hMetis, Mt-KaHyPar, and K-SpecPart, highlighting its superior partitioning quality and competitiveness. Furthermore, the proposed refinement strategy improves hMetis partitions by up to 16%. A comprehensive evaluation based on virtual instance methodology and parameter sensitivity analysis validates the algorithm’s competitiveness and characterizes its performance trade-offs.
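The small-scale strategy reduces to a classic recipe: build an MST with Prim's algorithm, then cut heavy edges to obtain clusters. A simplified sketch with networkx (the paper's pruning heuristics and refinement passes are omitted):

```python
import networkx as nx

def mst_partition(G: nx.Graph, k: int):
    """Partition a weighted graph into k clusters by cutting the k-1
    heaviest edges of its minimum spanning tree (simplified sketch)."""
    mst = nx.minimum_spanning_tree(G, weight="weight", algorithm="prim")
    heaviest = sorted(mst.edges(data="weight"),
                      key=lambda e: e[2], reverse=True)[: k - 1]
    mst.remove_edges_from([(u, v) for u, v, _ in heaviest])
    return [set(c) for c in nx.connected_components(mst)]
```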
[698] Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning
Naicheng He, Kaicheng Guo, Arjun Prakash, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris
Main category: cs.LG
TL;DR: Deep neural networks suffer from loss of plasticity in continual learning due to Hessian spectral collapse, where meaningful curvature directions vanish. The paper introduces τ-trainability framework and proposes regularization methods to maintain plasticity.
Details
Motivation: To understand why deep neural networks lose plasticity in continual learning scenarios and fail to learn new tasks without parameter reinitialization.
Method: Introduces τ-trainability framework, analyzes Hessian spectral collapse, and proposes two regularization enhancements: maintaining high effective feature rank and applying L2 penalties based on Kronecker factored Hessian approximation.
Result: Experiments on continual supervised and reinforcement learning tasks show that combining the two proposed regularizers effectively preserves plasticity in deep neural networks.
Conclusion: Hessian spectral collapse precedes loss of plasticity, and the proposed regularization methods successfully maintain network trainability for new tasks in continual learning.
Abstract: We investigate why deep neural networks suffer from “loss of plasticity” in deep continual learning, failing to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at new-task initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. To characterize the necessary condition for successful training, we introduce the notion of $\tau$-trainability and show that current plasticity preserving algorithms can be unified under this framework. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying $L_2$ penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.
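One of the two regularizers targets effective feature rank. A common differentiable definition is Roy and Vetterli's entropy-based effective rank, which we assume is close to the quantity the paper maintains:

```python
import torch

def effective_rank(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Effective rank exp(H(p)), where p are the normalised singular
    values of the feature matrix. Penalising its decrease is one way to
    keep meaningful curvature directions alive (our assumed usage)."""
    s = torch.linalg.svdvals(features)      # singular values, descending
    p = s / (s.sum() + eps)                 # normalise to a distribution
    return torch.exp(-(p * torch.log(p + eps)).sum())
```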
[699] Aurora: Towards Universal Generative Multimodal Time Series Forecasting
Xingjian Wu, Jianxin Jin, Wanghui Qiu, Peng Chen, Yang Shu, Bin Yang, Chenjuan Guo
Main category: cs.LG
TL;DR: Aurora is a multimodal time series foundation model that supports zero-shot inference and cross-domain generalization by extracting domain knowledge from text and image modalities to guide temporal modeling.
Details
Motivation: Existing unimodal time series models lack explicit utilization of domain-specific knowledge from other modalities like text, while end-to-end multimodal models don't support zero-shot inference for cross-domain scenarios.
Method: Uses tokenization, encoding, and distillation to extract multimodal domain knowledge, employs Modality-Guided Multi-head Self-Attention to inject knowledge into temporal modeling, and uses Prototype-Guided Flow Matching for generative probabilistic forecasting.
Result: Achieves state-of-the-art performance on TimeMMD, TSFM-Bench and ProbTS benchmarks in both unimodal and multimodal scenarios.
Conclusion: Aurora demonstrates strong cross-domain generalization capability through adaptive extraction of multimodal domain knowledge and novel attention mechanisms for time series forecasting.
Abstract: Cross-domain generalization is very important in Time Series Forecasting because similar historical information may lead to distinct future trends due to the domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities like texts, the former lacks the explicit utilization of them, thus hindering the performance. The latter is tailored for end-to-end scenarios and does not support zero-shot inference for cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on a Cross-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in corresponding text or image modalities, thus possessing strong cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora can extract multimodal domain knowledge as guidance and then utilize a Modality-Guided Multi-head Self-Attention to inject it into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora in both unimodal and multimodal scenarios.
[700] SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis
Marie Brockschmidt, Maresa Schröder, Stefan Feuerriegel
Main category: cs.LG
TL;DR: SurvDiff is a novel diffusion model for generating synthetic survival data that jointly models covariates, event times, and censoring mechanisms using a survival-tailored loss function.
Details
Motivation: Survival analysis faces unique challenges with incomplete event information due to censoring, requiring synthetic data generation to faithfully reproduce both event-time distributions and censoring mechanisms for clinical research.
Method: Proposed SurvDiff, an end-to-end diffusion model that jointly generates mixed-type covariates, event times, and right-censoring using a survival-tailored loss function that encodes time-to-event structure and optimizes for downstream survival tasks.
Result: SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics across multiple medical datasets.
Conclusion: SurvDiff is the first diffusion model explicitly designed for generating synthetic survival data and successfully addresses the unique challenges of survival analysis data generation.
Abstract: Survival analysis is a cornerstone of clinical research, modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. We show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics across multiple medical datasets. To the best of our knowledge, SurvDiff is the first diffusion model explicitly designed for generating synthetic survival data.
[701] SoDaDE: Solvent Data-Driven Embeddings with Small Transformer Models
Gabriel Kitso Gibberd, Jose Pablo Folch, Antonio Del Rio Chanona
Main category: cs.LG
TL;DR: SoDaDE is a new solvent representation method using a small transformer model and solvent property data to create specialized fingerprints that outperform generic representations in predicting solvent-related properties like reaction yields.
Details
Motivation: Generic chemical representations lack physical context specific to solvents, which is problematic since harmful solvents are a major climate issue and there's growing interest in green solvent replacement. Current data-driven representations are too generic as they're trained on broad datasets with shallow information.
Method: Developed Solvent Data Driven Embeddings (SoDaDE) using a small transformer model trained on solvent property datasets to create specialized solvent fingerprints.
Result: SoDaDE outperformed previous representations when used to predict yields on a recently published dataset, demonstrating better performance for solvent-specific applications.
Conclusion: Data-driven fingerprints can be effectively created with small datasets, and the SoDaDE workflow can be adapted for other specialized chemical applications beyond solvents.
Abstract: Computational representations have become crucial in unlocking the recent growth of machine learning algorithms for chemistry. While representations were initially hand-designed, machine learning has shown that meaningful representations can be learnt from data. Chemical datasets are limited, and so the representations learnt from data are generic, being trained on broad datasets which contain shallow information on many different molecule types. For example, generic fingerprints lack physical context specific to solvents. However, the use of harmful solvents is a leading climate-related issue in the chemical industry, and there is a surge of interest in green solvent replacement. To empower this research, we propose a new solvent representation scheme by developing Solvent Data Driven Embeddings (SoDaDE). SoDaDE uses a small transformer model and a solvent property dataset to create a fingerprint for solvents. To showcase their effectiveness, we use SoDaDE to predict yields on a recently published dataset, outperforming previous representations. We demonstrate through this paper that data-driven fingerprints can be made with small datasets, and set up a workflow that can be explored for other applications.
[702] Context and Diversity Matter: The Emergence of In-Context Learning in World Models
Fan Wang, Zhiyuan Chen, Yuxuan Zhong, Sunjian Zheng, Pengtao Shao, Bo Yu, Shaoshan Liu, Jianan Wang, Ning Ding, Yang Cao, Yu Kang
Main category: cs.LG
TL;DR: The paper introduces in-context environment learning (ICEL) for world models, focusing on their ability to adapt to novel configurations rather than zero-shot performance. It formalizes ICEL mechanisms, derives error bounds, and empirically validates how data distribution and architecture affect ICEL emergence.
Details
Motivation: Current static world models fail with novel or rare configurations. The paper aims to shift focus to how world models can learn and adapt to environments in-context, moving beyond zero-shot limitations.
Method: The authors formalize in-context learning of world models, identifying environment recognition and environment learning mechanisms. They derive theoretical error upper-bounds for these mechanisms and conduct empirical validation using different data distributions and model architectures.
Result: The research confirms distinct in-context learning mechanisms exist in world models. It shows how data distribution and model architecture affect ICEL in ways consistent with theoretical predictions, particularly highlighting the importance of long context and diverse environments.
Conclusion: The findings demonstrate the potential of self-adapting world models and identify key factors for ICEL emergence, most notably the necessity of long context and diverse environments for effective in-context environment learning.
Abstract: The capability of predicting environmental dynamics underpins both biological neural systems and general embodied AI in adapting to their surroundings. Yet prevailing approaches rest on static world models that falter when confronted with novel or rare configurations. We investigate in-context environment learning (ICEL), shifting attention from zero-shot performance to the growth and asymptotic limits of the world model. Our contributions are three-fold: (1) we formalize in-context learning of a world model and identify two core mechanisms: environment recognition and environment learning; (2) we derive error upper-bounds for both mechanisms that expose how the mechanisms emerge; and (3) we empirically confirm that distinct ICL mechanisms exist in the world model, and we further investigate how data distribution and model architecture affect ICL in a manner consistent with theory. These findings demonstrate the potential of self-adapting world models and highlight the key factors behind the emergence of ICEL, most notably the necessity of long context and diverse environments.
[703] Distributed Associative Memory via Online Convex Optimization
Bowen Wang, Matteo Zecchin, Osvaldo Simeone
Main category: cs.LG
TL;DR: A distributed associative memory framework where agents maintain local AMs and share selective information through communication over routing trees, with theoretical sublinear regret guarantees and superior performance over baselines.
Details
Motivation: Associative memory enables cue-response recall and underlies modern neural architectures like Transformers. This work addresses distributed settings where agents need to maintain local associative memories while selectively sharing information with others.
Method: Proposed a distributed online gradient descent method that optimizes local associative memories at different agents through communication over routing trees.
Result: Theoretical analysis established sublinear regret guarantees, and experiments demonstrated that the proposed protocol consistently outperforms existing online optimization baselines.
Conclusion: The distributed associative memory framework with communication over routing trees provides an effective approach for agents to maintain local memories while selectively sharing information, with proven theoretical guarantees and superior empirical performance.
Abstract: An associative memory (AM) enables cue-response recall, and associative memorization has recently been noted to underlie the operation of modern neural architectures such as Transformers. This work addresses a distributed setting where agents maintain a local AM to recall their own associations as well as selective information from others. Specifically, we introduce a distributed online gradient descent method that optimizes local AMs at different agents through communication over routing trees. Our theoretical analysis establishes sublinear regret guarantees, and experiments demonstrate that the proposed protocol consistently outperforms existing online optimization baselines.
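For a linear associative memory trained on cue-response pairs, a local online gradient step is one line of algebra; a sketch of the per-agent update, with the routing-tree communication omitted:

```python
import numpy as np

def local_ogd_step(W: np.ndarray, cue: np.ndarray,
                   target: np.ndarray, lr: float) -> np.ndarray:
    """One online gradient step on a linear associative memory W for the
    squared recall loss 0.5 * ||W @ cue - target||^2, whose gradient in W
    is the outer product of the recall error with the cue."""
    error = W @ cue - target
    return W - lr * np.outer(error, cue)
```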
[704] Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova
Main category: cs.LG
TL;DR: The paper introduces asymptotically optimal description length objectives for neural networks like Transformers, grounded in Kolmogorov complexity theory, and shows these objectives can achieve optimal compression with strong generalization guarantees.
Details
Motivation: To address the challenge of applying MDL principles to neural networks due to the lack of principled model complexity measures, and to provide theoretical foundations for training networks with better compression and generalization.
Method: Developed asymptotically optimal description length objectives based on Kolmogorov complexity theory, proved their existence for Transformers using computational universality, and constructed a tractable variational objective with adaptive Gaussian mixture prior.
Result: The variational objective successfully selects low-complexity solutions with strong generalization on algorithmic tasks, though standard optimizers struggle to find such solutions from random initialization.
Conclusion: The framework provides a theoretical path toward training neural networks that achieve greater compression and generalization through principled description length objectives with strong asymptotic guarantees.
Abstract: The Minimum Description Length (MDL) principle offers a formal framework for applying Occam’s razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.
[705] Stochastic activations
Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou
Main category: cs.LG
TL;DR: Stochastic activations randomly select between SILU and RELU functions during training, addressing RELU’s gradient flow issues while enabling sparse inference and improved text generation diversity.
Details
Motivation: To overcome RELU's optimization problems (constant shape for negative inputs that prevents gradient flow) while maintaining inference efficiency and enabling controlled diversity in text generation.
Method: Randomly select between SILU or RELU activations using Bernoulli draws during training, then fine-tune with RELU for sparse inference or use stochastic activations directly for generation tasks.
Result: Pre-training with stochastic activations followed by RELU fine-tuning achieves better performance than training from scratch with RELU, reduces inference FLOPs, and provides a significant CPU speedup. For generation, stochastic activations perform reasonably well, slightly inferior to SILU with temperature scaling, while offering controlled text diversity.
Conclusion: Stochastic activations provide an effective strategy to address RELU’s limitations while enabling sparse inference and controlled diversity in text generation, offering a viable alternative to existing activation strategies.
Abstract: We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup on CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
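The training-time mechanism is a per-forward Bernoulli draw between the two non-linearities; a minimal sketch, where the selection probability `p_silu` is our assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def stochastic_activation(x: torch.Tensor, p_silu: float = 0.5,
                          training: bool = True) -> torch.Tensor:
    """Bernoulli draw between SiLU and ReLU per forward pass during
    training; at inference a single fixed non-linearity is used."""
    if training and torch.rand(()).item() < p_silu:
        return F.silu(x)
    return F.relu(x)
```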
[706] Neural Feature Geometry Evolves as Discrete Ricci Flow
Moritz Hehl, Max von Renesse, Melanie Weber
Main category: cs.LG
TL;DR: The paper investigates neural feature geometry using discrete geometry, showing that neural networks evolve feature representations similar to discrete Ricci flow, with nonlinear activations playing a key role in shaping geometry.
Details
Motivation: To better understand how neural networks learn feature representations through geometric transformations of the input data manifold, as current understanding is incomplete despite empirical success.
Method: Approximates the input data manifold using geometric graphs encoding local similarity structure, provides theoretical analysis of graph evolution during training, and conducts experiments on over 20,000 feedforward networks across synthetic and real datasets.
Result: Shows that neural feature geometry evolves analogous to discrete Ricci flow, with class separability corresponding to community structure emergence in graph representations. Nonlinear activations are crucial for shaping feature geometry.
Conclusion: Introduces a framework for evaluating geometric transformations via discrete Ricci flow comparison, suggesting practical design principles including geometry-informed early-stopping and network depth selection criteria.
Abstract: Deep neural networks learn feature representations via complex geometric transformations of the input data manifold. Despite the models’ empirical success across domains, our understanding of neural feature representations is still incomplete. In this work we investigate neural feature geometry through the lens of discrete geometry. Since the input data manifold is typically unobserved, we approximate it using geometric graphs that encode local similarity structure. We provide theoretical results on the evolution of these graphs during training, showing that nonlinear activations play a crucial role in shaping feature geometry in feedforward neural networks. Moreover, we discover that the geometric transformations resemble a discrete Ricci flow on these graphs, suggesting that neural feature geometry evolves analogously to Ricci flow. This connection is supported by experiments on over 20,000 feedforward neural networks trained on binary classification tasks across both synthetic and real-world datasets. We observe that the emergence of class separability corresponds to the emergence of community structure in the associated graph representations, which is known to relate to discrete Ricci flow dynamics. Building on these insights, we introduce a novel framework for locally evaluating geometric transformations through comparison with discrete Ricci flow dynamics. Our results suggest practical design principles, including a geometry-informed early-stopping heuristic and a criterion for selecting network depth.
[707] IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method
Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu
Main category: cs.LG
TL;DR: IIET Transformer uses iterative implicit Euler method to simplify high-order ODE-based Transformers, achieving better performance and efficiency than PCformer and vanilla Transformers.
Details
Motivation: High-order numerical methods improve Transformer performance but create performance-efficiency trade-offs, and conventional efficiency techniques like distillation can harm performance of models like PCformer.
Method: Proposed IIET (Iterative Implicit Euler Transformer), which simplifies high-order methods using an iterative implicit Euler approach, and introduced IIAD (Iteration Influence-Aware Distillation) with a flexible threshold to balance the performance-efficiency trade-off.
Result: IIET boosts average accuracy by 2.65% over vanilla Transformers and 0.8% over PCformer. E-IIET variant cuts inference overhead by 55% while retaining 99.4% of original task accuracy. Most efficient variant achieves >1.6% performance gain over vanilla Transformer with comparable speed.
Conclusion: IIET provides superior performance and facilitates model compression compared to PCformer, with IIAD enabling effective balance between performance and efficiency through flexible thresholding.
Abstract: High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques, such as distillation, can be detrimental to the performance of these models, exemplified by PCformer. To explore more optimizable ODE-based Transformer architectures, we propose the Iterative Implicit Euler Transformer (IIET), which simplifies high-order methods using an iterative implicit Euler approach. This simplification not only leads to superior performance but also facilitates model compression compared to PCformer. To enhance inference efficiency, we introduce Iteration Influence-Aware Distillation (IIAD). Through a flexible threshold, IIAD allows users to effectively balance the performance-efficiency trade-off. On lm-evaluation-harness, IIET boosts average accuracy by 2.65% over vanilla Transformers and 0.8% over PCformer. Its efficient variant, E-IIET, significantly cuts inference overhead by 55% while retaining 99.4% of the original task accuracy. Moreover, the most efficient IIET variant achieves an average performance gain exceeding 1.6% over vanilla Transformer with comparable speed.
[708] Role-Aware Multi-modal federated learning system for detecting phishing webpages
Bo Wang, Imran Khan, Martin White, Natalia Beloff
Main category: cs.LG
TL;DR: Federated multi-modal phishing detector with URL, HTML, and IMAGE inputs using role-aware bucket aggregation on FedProx, achieving high accuracy (up to 97.5%) with low false positive rates across different modalities.
Details
Motivation: To create a flexible phishing detection system that supports multiple modalities (URL, HTML, IMAGE) without requiring clients to commit to a fixed modality, while maintaining strict privacy through federated learning.
Method: Proposed role-aware bucket aggregation on FedProx with hard-gated modality experts (IMAGE/HTML/URL) instead of learnable routing, enabling separate aggregation of modality-specific parameters to prevent cross-embedding conflicts and stabilize convergence.
Result: Achieved 97.5% accuracy with 2.4% FPR on TR-OP dataset across two data types; 95.5% accuracy with 5.9% FPR on image subset; 96.5% accuracy with 1.8% FPR on WebPhish (HTML); 95.1% accuracy with 4.6% FPR on TR-OP (raw HTML).
Conclusion: Bucket aggregation with hard-gated experts enables stable federated training under strict privacy constraints while improving usability and flexibility of multi-modal phishing detection systems.
Abstract: We present a federated, multi-modal phishing website detector that supports URL, HTML, and IMAGE inputs without binding clients to a fixed modality at inference: any client can invoke any modality head trained elsewhere. Methodologically, we propose role-aware bucket aggregation on top of FedProx, inspired by Mixture-of-Experts and FedMM. We drop learnable routing and use hard gating (selecting the IMAGE/HTML/URL expert by sample modality), enabling separate aggregation of modality-specific parameters to isolate cross-embedding conflicts and stabilize convergence. On TR-OP, the Fusion head reaches Acc 97.5% with FPR 2.4% across two data types; on the image subset (ablation) it attains Acc 95.5% with FPR 5.9%. For text, we use GraphCodeBERT for URLs and an early three-way embedding for raw, noisy HTML. On WebPhish (HTML) we obtain Acc 96.5% / FPR 1.8%; on TR-OP (raw HTML) we obtain Acc 95.1% / FPR 4.6%. Results indicate that bucket aggregation with hard-gated experts enables stable federated training under strict privacy, while improving the usability and flexibility of multi-modal phishing detection.
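Hard gating makes the router trivial: the sample's modality tag, not a learned gate, selects the expert, so each expert's parameters can be aggregated server-side in its own bucket. A sketch, with dictionary keys as our assumed convention:

```python
def hard_gated_forward(sample: dict, experts: dict):
    """Hard gating: dispatch by the sample's modality tag rather than a
    learnable router, so modality-specific parameters stay separable
    for per-bucket aggregation on the server."""
    expert = experts[sample["modality"]]  # one of "URL", "HTML", "IMAGE"
    return expert(sample["input"])
```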
[709] SpinGPT: A Large-Language-Model Approach to Playing Poker Correctly
Narada Maugin, Tristan Cazenave
Main category: cs.LG
TL;DR: SpinGPT is the first LLM tailored for 3-player Spin & Go poker, trained via supervised fine-tuning on expert decisions and RL on solver-generated hands, achieving competitive performance against existing bots.
Details
Motivation: CFR algorithms struggle with computational complexity in multi-player games and Nash equilibrium doesn't guarantee non-losing outcomes in 3+ player games, limiting applicability to popular tournament formats like Spin & Go.
Method: Two-stage training: (1) Supervised Fine-Tuning on 320k high-stakes expert decisions; (2) Reinforcement Learning on 270k solver-generated hands.
Result: Matches solver’s actions in 78% of decisions (tolerant accuracy). With deep-stack heuristic, achieves 13.4 ± 12.9 BB/100 versus Slumbot in heads-up over 30k hands (95% CI).
Conclusion: LLMs could be a new approach for handling multi-player imperfect-information games like poker, overcoming limitations of traditional CFR methods.
Abstract: The Counterfactual Regret Minimization (CFR) algorithm and its variants have enabled the development of pokerbots capable of beating the best human players in heads-up (1v1) cash games and competing with them in six-player formats. However, CFR’s computational complexity rises exponentially with the number of players. Furthermore, in games with three or more players, following Nash equilibrium no longer guarantees a non-losing outcome. These limitations, along with others, significantly restrict the applicability of CFR to the most popular formats: tournaments. Motivated by the recent success of Large Language Models (LLMs) in chess and Diplomacy, we present SpinGPT, the first LLM tailored to Spin & Go, a popular three-player online poker format. SpinGPT is trained in two stages: (1) Supervised Fine-Tuning on 320k high-stakes expert decisions; (2) Reinforcement Learning on 270k solver-generated hands. Our results show that SpinGPT matches the solver’s actions in 78% of decisions (tolerant accuracy). With a simple deep-stack heuristic, it achieves 13.4 ± 12.9 BB/100 versus Slumbot in heads-up over 30,000 hands (95% CI). These results suggest that LLMs could be a new way to deal with multi-player imperfect-information games like poker.
[710] Enhancing Credit Risk Prediction: A Meta-Learning Framework Integrating Baseline Models, LASSO, and ECOC for Superior Accuracy
Haibo Wang, Lutfu S. Sua, Jun Huang, Figen Balo, Burak Dolar
Main category: cs.LG
TL;DR: A meta-learning framework combining multiple ML models for credit risk assessment that improves classification accuracy and default probability prediction while addressing high-dimensional data and class imbalance issues.
Details
Motivation: Traditional ML models struggle with high-dimensional data, limited interpretability, rare event detection, and multi-class imbalance in credit risk assessment, necessitating a more robust approach.
Method: Meta-learning framework combining supervised learning (XGBoost, Random Forest, SVM, Decision Tree), unsupervised methods (K-NN), deep learning (MLP), LASSO for feature selection, and Error-Correcting Output Codes as meta-classifier, with Permutation Feature Importance for transparency.
Result: Significantly enhanced accuracy in financial entity classification for credit rating migrations (upgrades/downgrades) and default probability estimation on Corporate Credit Ratings dataset with 2,029 US companies.
Conclusion: The proposed framework provides a more holistic and accurate approach to credit risk modeling, addressing key challenges and improving reliability for financial decision support.
Abstract: Effective credit risk management is fundamental to financial decision-making, necessitating robust models for default probability prediction and financial entity classification. Traditional machine learning approaches face significant challenges when confronted with high-dimensional data, limited interpretability, rare event detection, and multi-class imbalance problems in risk assessment. This research proposes a comprehensive meta-learning framework that synthesizes multiple complementary models: supervised learning algorithms, including XGBoost, Random Forest, Support Vector Machine, and Decision Tree; unsupervised methods such as K-Nearest Neighbors; deep learning architectures like Multilayer Perceptron; alongside LASSO regularization for feature selection and dimensionality reduction; and Error-Correcting Output Codes as a meta-classifier for handling imbalanced multi-class problems. We implement Permutation Feature Importance analysis for each prediction class across all constituent models to enhance model transparency. Our framework aims to optimize predictive performance while providing a more holistic approach to credit risk assessment. This research contributes to the development of more accurate and reliable computational models for strategic financial decision support by addressing three fundamental challenges in credit risk modeling. The empirical validation of our approach involves an analysis of the Corporate Credit Ratings dataset with credit ratings for 2,029 publicly listed US companies. Results demonstrate that our meta-learning framework significantly enhances the accuracy of financial entity classification regarding credit rating migrations (upgrades and downgrades) and default probability estimation.
[711] (Sometimes) Less is More: Mitigating the Complexity of Rule-based Representation for Interpretable Classification
Luca Bergamin, Roberto Confalonieri, Fabio Aiolli
Main category: cs.LG
TL;DR: This paper adapts L0 regularization to a logic-based neural network (MLLP) to reduce complexity of interpretable models while maintaining performance, comparing it against random binarization methods.
Details
Motivation: Deep neural networks lack interpretability, which is crucial for many AI applications where high performance alone is insufficient. Model transparency is a key requirement in multiple scenarios.
Method: A differentiable approximation of L0 regularization is adapted into the Multi-layer Logical Perceptron (MLLP) to sparsify the network based on the loss function rather than a random distribution.
Result: The study compares L0 regularization against alternative heuristics like Random Binarization to evaluate complexity reduction in Concept Rule Set (CRS) while retaining performance.
Conclusion: The paper discusses the trade-off between CRS complexity and performance, suggesting that less-noisy sparsification techniques based on loss function may achieve better results than random methods.
Abstract: Deep neural networks are widely used in practical applications of AI; however, their inner structure and complexity make them generally not easily interpretable. Model transparency and interpretability are key requirements for multiple scenarios where high performance is not enough to adopt the proposed solution. In this work, a differentiable approximation of $L_0$ regularization is adapted into a logic-based neural network, the Multi-layer Logical Perceptron (MLLP), to study its efficacy in reducing the complexity of its discrete interpretable version, the Concept Rule Set (CRS), while retaining its performance. The results are compared to alternative heuristics like Random Binarization of the network weights, to determine if better results can be achieved when using a less-noisy technique that sparsifies the network based on the loss function instead of a random distribution. The trade-off between the CRS complexity and its performance is discussed.
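The standard differentiable approximation of $L_0$ is the hard-concrete gate of Louizos et al.; a sketch of the sampling step that we assume the paper adapts into the MLLP (default constants follow the original hard-concrete paper):

```python
import torch

def hard_concrete_gate(log_alpha: torch.Tensor, beta: float = 2 / 3,
                       gamma: float = -0.1, zeta: float = 1.1) -> torch.Tensor:
    """Sample a hard-concrete gate in [0, 1]: a stretched, clamped binary
    concrete variable whose expected L0 norm is differentiable in
    log_alpha, letting the loss drive weights exactly to zero."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)
    return (s * (zeta - gamma) + gamma).clamp(0.0, 1.0)
```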
[712] EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
Xu Wujiang, Wentian Zhao, Zhenting Wang, Li Yu-Jhe, Jin Can, Jin Mingyu, Mei Kai, Wan Kun, Metaxas Dimitris
Main category: cs.LG
TL;DR: EPO framework addresses exploration-exploitation cascade failure in multi-turn sparse-reward environments through entropy regularization, smoothing, and adaptive weighting, achieving significant performance improvements.
Details
Motivation: Training LLM agents in multi-turn environments with sparse rewards (30+ turns per task) faces unique challenges where conventional RL methods fail due to premature convergence and policy collapse cycles.
Method: Proposed Entropy-regularized Policy Optimization (EPO) with three mechanisms: entropy regularization for exploration, entropy smoothing to bound policy entropy within historical averages, and adaptive phase-based weighting to balance exploration-exploitation.
Result: EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld, with guaranteed monotonically decreasing entropy variance while maintaining convergence.
Conclusion: Multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with EPO providing an effective solution that breaks the exploration-exploitation cascade failure cycle.
Abstract: Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
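Our reading of the entropy smoothing idea: penalise the policy entropy once it drifts beyond a tolerance band around its historical average, preventing abrupt fluctuations. The constants and names below are illustrative, not the paper's:

```python
from collections import deque

def entropy_smoothing_penalty(entropy: float, history: deque,
                              tol: float = 0.1) -> float:
    """Quadratic penalty once the current policy entropy leaves a
    tolerance band around its running historical average."""
    avg = sum(history) / max(len(history), 1)
    excess = abs(entropy - avg) - tol
    return max(excess, 0.0) ** 2
```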
[713] Improving accuracy in short mortality rate series: Exploring Multi-step Forecasting Approaches in Hybrid Systems
Filipe C. L. Duarte, Paulo S. G. de Mattos Neto, Paulo R. A. Firmino
Main category: cs.LG
TL;DR: Hybrid ARIMA-LSTM model with recursive approach outperforms other models for multi-step mortality forecasting when data is limited.
Details
Motivation: Accurate mortality forecasting is crucial for insurance and pension markets, especially with declining interest rates and economic stabilization. Multi-step predictions are needed for public health and risk assessment but face challenges with limited data.
Method: Evaluated hybrid systems combining statistical and ML models using different multi-step forecasting approaches (Recursive, Direct, Multi-Input Multi-Output) and various ML models across 12 datasets.
Result: ARIMA-LSTM hybrid using recursive approach outperformed other models in most cases. Selection of both multi-step approach and ML model significantly impacts performance.
Conclusion: The choice of multi-step forecasting approach and ML model is essential for improving hybrid system performance in mortality forecasting with limited data.
Abstract: The decline in interest rates and economic stabilization has heightened the importance of accurate mortality rate forecasting, particularly in insurance and pension markets. Multi-step-ahead predictions are crucial for public health, demographic planning, and insurance risk assessments; however, they face challenges when data are limited. Hybrid systems that combine statistical and Machine Learning (ML) models offer a promising solution for handling both linear and nonlinear patterns. This study evaluated the impact of different multi-step forecasting approaches (Recursive, Direct, and Multi-Input Multi-Output) and ML models on the accuracy of hybrid systems. Results from 12 datasets and 21 models show that the selection of both the multi-step approach and the ML model is essential for improving performance, with the ARIMA-LSTM hybrid using a recursive approach outperforming other models in most cases.
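The winning Recursive approach feeds each one-step prediction back into the input window; a sketch assuming a scikit-learn style one-step model over fixed-length windows:

```python
import numpy as np

def recursive_forecast(model, history, horizon: int):
    """Recursive multi-step forecasting: predict one step, append the
    prediction to the window, repeat. `model` is assumed to expose a
    scikit-learn style predict() over fixed-length windows."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        y_hat = float(model.predict(np.array([window]))[0])
        preds.append(y_hat)
        window = window[1:] + [y_hat]  # slide the window forward
    return preds
```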
[714] Partial Parameter Updates for Efficient Distributed Training
Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert
Main category: cs.LG
TL;DR: A memory- and compute-efficient distributed training method that restricts backpropagation to fixed parameter subsets during local updates, reducing communication while maintaining model quality.
Details
Motivation: To improve efficiency of distributed training by reducing communication overhead and computational costs while maintaining model performance.
Method: Restricts backpropagation to only update fixed parameter subsets during local steps, keeping other parameters frozen, while maintaining full forward passes to avoid cross-node activation exchange.
Result: Matches perplexity of prior low-communication methods on a 1.3B-parameter language model across 32 nodes while reducing training FLOPs and peak memory usage under identical token and bandwidth budgets.
Conclusion: The proposed constrained backpropagation approach provides significant efficiency improvements for distributed training without sacrificing model quality.
Abstract: We introduce a memory- and compute-efficient method for low-communication distributed training. Existing methods reduce communication by performing multiple local updates between infrequent global synchronizations. We demonstrate that their efficiency can be significantly improved by restricting backpropagation: instead of updating all the parameters, each node updates only a fixed subset while keeping the remainder frozen during local steps. This constraint substantially reduces peak memory usage and training FLOPs, while a full forward pass over all parameters eliminates the need for cross-node activation exchange. Experiments on a 1.3B-parameter language model trained across 32 nodes show that our method matches the perplexity of prior low-communication approaches under identical token and bandwidth budgets while reducing training FLOPs and peak memory.
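The core trick is freezing all but a fixed per-node parameter subset while keeping the forward pass full; a sketch with a round-robin assignment (the paper's partitioning scheme may differ):

```python
import torch.nn as nn

def assign_trainable_subset(model: nn.Module, node_rank: int, num_nodes: int):
    """Freeze all parameters except a fixed, node-specific subset,
    chosen round-robin by parameter index here. The forward pass still
    uses every parameter, so no cross-node activation exchange is needed."""
    for i, p in enumerate(model.parameters()):
        p.requires_grad = (i % num_nodes == node_rank)
```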
[715] ReLAM: Learning Anticipation Model for Rewarding Visual Robotic Manipulation
Nan Tang, Jing-Cheng Pang, Guanlin Li, Chao Qian, Yang Yu
Main category: cs.LG
TL;DR: ReLAM is a novel framework that automatically generates dense, structured rewards from action-free video demonstrations for visual RL in robotic manipulation, using keypoint-based subgoals and an anticipation model to accelerate learning.
Details
Motivation: Reward design is a critical bottleneck in visual RL for robotic manipulation, as precise positional information is often unavailable in real-world settings due to sensory limitations.
Method: ReLAM learns an anticipation model that proposes intermediate keypoint-based subgoals, creating a structured learning curriculum. It then provides continuous reward signals to train a goal-conditioned policy under a hierarchical RL framework.
Result: Extensive experiments on complex, long-horizon manipulation tasks show that ReLAM significantly accelerates learning and achieves superior performance compared to state-of-the-art methods.
Conclusion: ReLAM effectively addresses the reward design problem in visual RL by automatically generating structured rewards from demonstrations, enabling efficient learning in real-world robotic manipulation tasks.
Abstract: Reward design remains a critical bottleneck in visual reinforcement learning (RL) for robotic manipulation. In simulated environments, rewards are conventionally designed based on the distance to a target position. However, such precise positional information is often unavailable in real-world visual settings due to sensory and perceptual limitations. In this study, we propose a method that implicitly infers spatial distances through keypoints extracted from images. Building on this, we introduce Reward Learning with Anticipation Model (ReLAM), a novel framework that automatically generates dense, structured rewards from action-free video demonstrations. ReLAM first learns an anticipation model that serves as a planner and proposes intermediate keypoint-based subgoals on the optimal path to the final goal, creating a structured learning curriculum directly aligned with the task’s geometric objectives. Based on the anticipated subgoals, a continuous reward signal is provided to train a low-level, goal-conditioned policy under the hierarchical reinforcement learning (HRL) framework with provable sub-optimality bound. Extensive experiments on complex, long-horizon manipulation tasks show that ReLAM significantly accelerates learning and achieves superior performance compared to state-of-the-art methods.
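The reward shape such keypoint-subgoal methods use can be illustrated in a few lines: a dense signal that decays with the distance between the current keypoints and the anticipated subgoal keypoints. The function below is an illustrative sketch; the exponential shaping and the `alpha` parameter are our assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a dense keypoint-based subgoal reward (shaping is illustrative).
import torch

def keypoint_reward(kp_current, kp_subgoal, alpha=5.0):
    """kp_*: (K, 2) keypoint coordinates in normalized image space."""
    dist = torch.norm(kp_current - kp_subgoal, dim=-1).mean()
    return torch.exp(-alpha * dist)   # dense reward in (0, 1], peaks at the subgoal

kp_now, kp_goal = torch.rand(8, 2), torch.rand(8, 2)
print(float(keypoint_reward(kp_now, kp_goal)))
```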
[716] IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Aayush Mishra, Daniel Khashabi, Anqi Liu
Main category: cs.LG
TL;DR: ICL Activation Alignment (IA2) is a self-distillation technique that uses ICL’s activation patterns to improve SFT models, enhancing accuracy and calibration.
Details
Motivation: ICL offers better generalizability and calibrated responses than SFT in data-scarce settings, but SFT is more efficient. The paper explores whether ICL's internal computations can improve SFT quality.
Method: IA2 aligns SFT models with ICL’s activation patterns through self-distillation, serving as a priming step before SFT to incentivize ICL-like internal reasoning.
Result: IA2 significantly improves accuracy and calibration across 12 benchmarks and 2 model families, showing distinct activation patterns between ICL and SFT.
Conclusion: IA2 effectively bridges ICL and SFT benefits, offering practical improvements and insights into model adaptation mechanics.
Abstract: Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL’s internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL’s rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL’s activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.
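An activation-alignment objective in this spirit can be sketched as follows, assuming HuggingFace-style models that expose hidden states: match the student's activations on the bare query to the teacher's activations at the same query positions when ICL demonstrations are prepended. The layer choice, MSE criterion, and slicing are our illustrative assumptions.

```python
# Hedged sketch of an IA2-style activation-alignment loss; assumes models with
# a HuggingFace-style API (output_hidden_states=True, .hidden_states).
import torch
import torch.nn.functional as F

def ia2_loss(student, teacher, query_ids, icl_ids, layer=-1):
    with torch.no_grad():
        t_out = teacher(torch.cat([icl_ids, query_ids], dim=1),
                        output_hidden_states=True)
        t_h = t_out.hidden_states[layer][:, -query_ids.size(1):]  # query positions
    s_h = student(query_ids, output_hidden_states=True).hidden_states[layer]
    return F.mse_loss(s_h, t_h)  # pull SFT activations toward ICL activations
```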
[717] MoveFM-R: Advancing Mobility Foundation Models via Language-driven Semantic Reasoning
Fanjin Meng, Yuan Yuan, Jingtao Ding, Jie Feng, Chonghua Han, Yong Li
Main category: cs.LG
TL;DR: MoveFM-R bridges mobility foundation models and large language models to enable semantic reasoning for human mobility modeling, overcoming vocabulary mismatches and representation gaps through novel encoding, curriculum learning, and self-reflection mechanisms.
Details
Motivation: Current Mobility Foundation Models (MFMs) are limited by data scale and semantic understanding, while LLMs lack spatio-temporal statistical understanding. There's a need to combine MFMs' statistical power with LLMs' semantic reasoning for more comprehensive mobility modeling.
Method: Three core innovations: semantically enhanced location encoding to bridge the geography-language gap, progressive curriculum to align LLM reasoning with mobility patterns, and interactive self-reflection mechanism for conditional trajectory generation.
Result: Significantly outperforms existing MFM-based and LLM-based baselines, shows robust generalization in zero-shot settings, and excels at generating realistic trajectories from natural language instructions.
Conclusion: MoveFM-R pioneers a new paradigm that synthesizes statistical power of MFMs with semantic understanding of LLMs, enabling more comprehensive, interpretable, and powerful human mobility modeling.
Abstract: Mobility Foundation Models (MFMs) have advanced the modeling of human movement patterns, yet they face a ceiling due to limitations in data scale and semantic understanding. While Large Language Models (LLMs) offer powerful semantic reasoning, they lack the innate understanding of spatio-temporal statistics required for generating physically plausible mobility trajectories. To address these gaps, we propose MoveFM-R, a novel framework that unlocks the full potential of mobility foundation models by leveraging language-driven semantic reasoning capabilities. It tackles two key challenges: the vocabulary mismatch between continuous geographic coordinates and discrete language tokens, and the representation gap between the latent vectors of MFMs and the semantic world of LLMs. MoveFM-R is built on three core innovations: a semantically enhanced location encoding to bridge the geography-language gap, a progressive curriculum to align the LLM’s reasoning with mobility patterns, and an interactive self-reflection mechanism for conditional trajectory generation. Extensive experiments demonstrate that MoveFM-R significantly outperforms existing MFM-based and LLM-based baselines. It also shows robust generalization in zero-shot settings and excels at generating realistic trajectories from natural language instructions. By synthesizing the statistical power of MFMs with the deep semantic understanding of LLMs, MoveFM-R pioneers a new paradigm that enables a more comprehensive, interpretable, and powerful modeling of human mobility. The implementation of MoveFM-R is available online at https://anonymous.4open.science/r/MoveFM-R-CDE7/.
[718] Global Convergence in Neural ODEs: Impact of Activation Functions
Tianxiang Gao, Siyuan Sun, Hailiang Liu, Hongyang Gao
Main category: cs.LG
TL;DR: The paper analyzes how activation function properties (smoothness and nonlinearity) affect Neural ODE training, establishing global convergence guarantees in overparameterized regimes.
Details
Motivation: Neural ODEs face training challenges due to their continuous nature, particularly regarding gradient computation accuracy and convergence analysis. The unique characteristics of Neural ODEs introduce difficulties that need to be addressed.
Method: Theoretical investigation of activation function properties (smoothness and nonlinearity) and their impact on training dynamics. Smoothness ensures globally unique solutions for forward/backward ODEs, while nonlinearity maintains NTK spectral properties during training.
Result: Established global convergence of Neural ODEs under gradient descent in overparameterized regimes. Numerical experiments validate theoretical findings and provide practical scaling guidelines.
Conclusion: Proper activation function selection (smooth and sufficiently nonlinear) enables stable Neural ODE training with global convergence guarantees, potentially leading to faster training and improved real-world performance.
Abstract: Neural Ordinary Differential Equations (ODEs) have been successful in various applications due to their continuous nature and parameter-sharing efficiency. However, these unique characteristics also introduce challenges in training, particularly with respect to gradient computation accuracy and convergence analysis. In this paper, we address these challenges by investigating the impact of activation functions. We demonstrate that the properties of activation functions, specifically smoothness and nonlinearity, are critical to the training dynamics. Smooth activation functions guarantee globally unique solutions for both forward and backward ODEs, while sufficient nonlinearity is essential for maintaining the spectral properties of the Neural Tangent Kernel (NTK) during training. Together, these properties enable us to establish the global convergence of Neural ODEs under gradient descent in overparameterized regimes. Our theoretical findings are validated by numerical experiments, which not only support our analysis but also provide practical guidelines for scaling Neural ODEs, potentially leading to faster training and improved performance in real-world applications.
[719] Fast-Forward Lattice Boltzmann: Learning Kinetic Behaviour with Physics-Informed Neural Operators
Xiao Xue, Marco F. P. ten Eikelder, Mingyang Gao, Xiaoyuan Cheng, Yiming Yang, Yi He, Shuo Wang, Sibo Cheng, Yukun Hu, Peter V. Coveney
Main category: cs.LG
TL;DR: A physics-informed neural operator framework for the lattice Boltzmann equation that enables long-term prediction without step-by-step integration, bypassing computational bottlenecks of traditional collision kernels.
Details
Motivation: Traditional lattice Boltzmann equation (LBE) solvers are computationally intensive due to strict time-step restrictions imposed by collision kernels, limiting their efficiency for large-scale simulations.
Method: Physics-informed neural operator framework incorporating intrinsic moment-matching constraints and global equivariance of distribution fields, enabling discretization-invariant models that work across different collision models.
Result: Demonstrated robustness across complex flow scenarios including von Karman vortex shedding, ligament breakup, and bubble adhesion, with ability to generalize from coarse to fine lattices (kinetic super-resolution).
Conclusion: Establishes a new data-driven pathway for modeling kinetic systems that overcomes computational limitations of traditional LBE solvers while maintaining physical accuracy.
Abstract: The lattice Boltzmann equation (LBE), rooted in kinetic theory, provides a powerful framework for capturing complex flow behaviour by describing the evolution of single-particle distribution functions (PDFs). Despite its success, solving the LBE numerically remains computationally intensive due to strict time-step restrictions imposed by collision kernels. Here, we introduce a physics-informed neural operator framework for the LBE that enables prediction over large time horizons without step-by-step integration, effectively bypassing the need to explicitly solve the collision kernel. We incorporate intrinsic moment-matching constraints of the LBE, along with global equivariance of the full distribution field, enabling the model to capture the complex dynamics of the underlying kinetic system. Our framework is discretization-invariant, enabling models trained on coarse lattices to generalise to finer ones (kinetic super-resolution). In addition, it is agnostic to the specific form of the underlying collision model, which makes it naturally applicable across different kinetic datasets regardless of the governing dynamics. Our results demonstrate robustness across complex flow scenarios, including von Karman vortex shedding, ligament breakup, and bubble adhesion. This establishes a new data-driven pathway for modelling kinetic systems.
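One of the moment-matching constraints is simple to state concretely: the predicted distribution functions must reproduce the density and momentum moments of the reference field. A hedged sketch for a D2Q9 lattice follows; the velocities are the standard lattice Boltzmann convention, and the paper's full constraint set is richer.

```python
# Hedged sketch of a moment-matching penalty for D2Q9 distribution functions.
import torch

C = torch.tensor([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
                  [1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=torch.float32)

def moment_loss(f_pred, f_true):
    """f_*: (batch, 9, H, W) D2Q9 distribution functions."""
    def moments(f):
        rho = f.sum(dim=1)                          # zeroth moment: density
        mom = torch.einsum('bqhw,qd->bdhw', f, C)   # first moment: momentum
        return rho, mom
    rho_p, mom_p = moments(f_pred)
    rho_t, mom_t = moments(f_true)
    return (rho_p - rho_t).pow(2).mean() + (mom_p - mom_t).pow(2).mean()

print(float(moment_loss(torch.rand(2, 9, 8, 8), torch.rand(2, 9, 8, 8))))
```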
[720] One Prompt Fits All: Universal Graph Adaptation for Pretrained Models
Yongqi Huang, Jitao Zhao, Dongxiao He, Xiaobao Wang, Yawen Li, Yuxiao Huang, Di Jin, Zhiyong Feng
Main category: cs.LG
TL;DR: UniPrompt is a novel Graph Prompt Learning method that adapts pretrained models to downstream tasks by unleashing their capabilities while preserving graph structure, achieving strong performance across diverse scenarios.
Details
Motivation: Address limitations in current Graph Prompt Learning: lack of consensus on prompt mechanisms and poor adaptability to diverse downstream scenarios, especially under data distribution shifts.
Method: Theoretically analyze existing GPL approaches, reveal that representation-level prompts essentially function as fine-tuning a downstream classifier, and propose UniPrompt, which adapts pretrained models while preserving the input graph structure.
Result: Extensive experiments show UniPrompt effectively integrates with various pretrained models and achieves strong performance across in-domain and cross-domain scenarios.
Conclusion: Graph prompt learning should focus on unleashing pretrained model capabilities while letting classifiers adapt to downstream scenarios, with UniPrompt providing an effective solution.
Abstract: Graph Prompt Learning (GPL) has emerged as a promising paradigm that bridges graph pretraining models and downstream scenarios, mitigating label dependency and the misalignment between upstream pretraining and downstream tasks. Although existing GPL studies explore various prompt strategies, their effectiveness and underlying principles remain unclear. We identify two critical limitations: (1) Lack of consensus on underlying mechanisms: Although current GPL methods have advanced the field, there is no consensus on how prompts interact with pretrained models, as different strategies intervene at varying spaces within the model, i.e., input-level, layer-wise, and representation-level prompts. (2) Limited scenario adaptability: Most methods fail to generalize across diverse downstream scenarios, especially under data distribution shifts (e.g., homophilic-to-heterophilic graphs). To address these issues, we theoretically analyze existing GPL approaches and reveal that representation-level prompts essentially function as fine-tuning a simple downstream classifier, proposing that graph prompt learning should focus on unleashing the capability of pretrained models while the classifier adapts to downstream scenarios. Based on our findings, we propose UniPrompt, a novel GPL method that adapts any pretrained model, unleashing the capability of pretrained models while preserving the structure of the input graph. Extensive experiments demonstrate that our method can effectively integrate with various pretrained models and achieve strong performance across in-domain and cross-domain scenarios.
[721] Physics-informed GNN for medium-high voltage AC power flow with edge-aware attention and line search correction operator
Changhun Kim, Timon Conrad, Redwanul Karim, Julian Oelhaf, David Riebesel, Tomás Arias-Vergara, Andreas Maier, Johann Jäger, Siming Bayer
Main category: cs.LG
TL;DR: PIGNN-Attn-LS improves physics-informed graph neural networks for AC power-flow solving by combining edge-aware attention with backtracking line-search, achieving higher accuracy and faster inference than Newton-Raphson solvers.
Details
Motivation: Current physics-informed graph neural networks (PIGNNs) need accuracy improvements and lack an operative physics loss at inference, which limits their operational adoption as AC power-flow solvers.
Method: Combines an edge-aware attention mechanism that encodes line physics via per-edge biases with a backtracking line-search-based globalized correction operator that restores an operative decrease criterion at inference.
Result: Achieves test RMSE of 0.00033 p.u. in voltage and 0.08° in angle on 4-32-bus grids, outperforming baseline by 99.5% and 87.1% respectively, with 2-5× faster batched inference than Newton-Raphson on 4-1024-bus grids.
Conclusion: PIGNN-Attn-LS provides an accurate and fast AC power-flow solver that can replace classic Newton-Raphson methods, especially for scenarios requiring thousands of evaluations.
Abstract: Physics-informed graph neural networks (PIGNNs) have emerged as fast AC power-flow solvers that can replace classic Newton–Raphson (NR) solvers, especially when thousands of scenarios must be evaluated. However, current PIGNNs still need accuracy improvements at parity speed; in particular, the physics loss is inoperative at inference, which can deter operational adoption. We address this with PIGNN-Attn-LS, combining an edge-aware attention mechanism that explicitly encodes line physics via per-edge biases, capturing the grid’s anisotropy, with a backtracking line-search-based globalized correction operator that restores an operative decrease criterion at inference. Training and testing use a realistic High-/Medium-Voltage scenario generator, with NR used only to construct reference states. On held-out HV cases consisting of 4–32-bus grids, PIGNN-Attn-LS achieves a test RMSE of 0.00033 p.u. in voltage and 0.08$^\circ$ in angle, outperforming the PIGNN-MLP baseline by 99.5% and 87.1%, respectively. With streaming micro-batches, it delivers 2–5$\times$ faster batched inference than NR on 4–1024-bus grids.
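The line-search correction is a standard backtracking loop: shrink the step along the network's predicted update until the power-flow residual norm satisfies a sufficient-decrease criterion. A generic sketch follows; the residual function, constants, and the toy linear demo are illustrative, not the paper's implementation.

```python
# Hedged sketch of an Armijo-style backtracking line search on a residual norm.
import numpy as np

def line_search_correction(v, delta, residual, c=1e-4, tau=0.5, max_iter=20):
    """v: state; delta: correction direction; residual: v -> mismatch vector."""
    r0 = np.linalg.norm(residual(v))
    alpha = 1.0
    for _ in range(max_iter):
        v_new = v + alpha * delta
        if np.linalg.norm(residual(v_new)) <= (1 - c * alpha) * r0:
            return v_new          # operative decrease achieved
        alpha *= tau              # backtrack
    return v                      # no acceptable step; keep the prediction

A = np.array([[3., 1.], [1., 2.]]); b = np.array([1., 0.])
residual = lambda v: A @ v - b
v0 = np.zeros(2)
delta = np.linalg.solve(A, -residual(v0))     # Newton-style step on a toy system
print(line_search_correction(v0, delta, residual))
```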
[722] The Flood Complex: Large-Scale Persistent Homology on Millions of Points
Florian Graf, Paolo Pellizzoni, Martin Uray, Stefan Huber, Roland Kwitt
Main category: cs.LG
TL;DR: The Flood complex is introduced as a scalable alternative to Vietoris-Rips for persistent homology computation on large point clouds, enabling PH computation on millions of points while maintaining competitive classification performance.
Details
Motivation: Existing methods like the Vietoris-Rips complex face computational limitations due to the exponential growth in the number of simplices, making them impractical for large-scale point cloud data in machine learning applications.
Method: The Flood complex uses a Delaunay triangulation on a subset of points and includes simplices covered by balls of radius r emanating from the full point cloud (a process called flooding), allowing efficient PH computation and GPU parallelization.
Result: Successfully computed PH up to dimension 2 on several million 3D points, with superior object classification performance on real-world and synthetic data compared to other PH-based methods and neural networks.
Conclusion: The Flood complex provides the scaling capability needed for geometrically or topologically complex objects, offering a practical solution for large-scale persistent homology computation in machine learning tasks.
Abstract: We consider the problem of computing persistent homology (PH) for large-scale Euclidean point cloud data, aimed at downstream machine learning tasks, where the exponential growth of the most widely-used Vietoris-Rips complex imposes serious computational limitations. Although more scalable alternatives such as the Alpha complex or sparse Rips approximations exist, they often still result in a prohibitively large number of simplices. This poses challenges in the complex construction and in the subsequent PH computation, prohibiting their use on large-scale point clouds. To mitigate these issues, we introduce the Flood complex, inspired by the advantages of the Alpha and Witness complex constructions. Informally, at a given filtration value $r\geq 0$, the Flood complex contains all simplices from a Delaunay triangulation of a small subset of the point cloud $X$ that are fully covered by balls of radius $r$ emanating from $X$, a process we call flooding. Our construction allows for efficient PH computation, possesses several desirable theoretical properties, and is amenable to GPU parallelization. Scaling experiments on 3D point cloud data show that we can compute PH of up to dimension 2 on several millions of points. Importantly, when evaluating object classification performance on real-world and synthetic data, we provide evidence that this scaling capability is needed, especially if objects are geometrically or topologically complex, yielding performance superior to other PH-based methods and neural networks for point cloud data.
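The flooding value of a simplex, the smallest radius r at which balls centered on the full cloud cover it, can be crudely approximated by sampling points inside the simplex and taking the worst-case distance to the cloud. The sketch below does exactly that with SciPy; the paper computes these values exactly and on GPU, so this is only a Monte-Carlo illustration.

```python
# Hedged sketch: Monte-Carlo estimate of a simplex's flooding value.
import numpy as np
from scipy.spatial import Delaunay, cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))                     # full point cloud
landmarks = X[rng.choice(len(X), 200, replace=False)]
tri = Delaunay(landmarks)                          # complex on a small subset
tree = cKDTree(X)

def flood_value(simplex_vertices, n_samples=64):
    bary = rng.dirichlet(np.ones(len(simplex_vertices)), size=n_samples)
    pts = bary @ simplex_vertices                  # random points inside the simplex
    dists, _ = tree.query(pts)
    return dists.max()                             # radius needed to cover all samples

values = [flood_value(landmarks[s]) for s in tri.simplices[:10]]
print(np.round(values, 3))
```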
[723] Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining
Boshra Ariguib, Mathias Niepert, Andrei Manolache
Main category: cs.LG
TL;DR: C-FREE is a self-supervised learning framework that integrates 2D molecular graphs with 3D conformers to learn molecular representations without contrastive learning, achieving state-of-the-art performance on MoleculeNet.
Details
Motivation: Existing molecular representation learning methods often rely on hand-crafted augmentations or complex generative objectives and primarily use 2D topology, leaving valuable 3D structural information underutilized.
Method: C-FREE predicts subgraph embeddings from complementary neighborhoods using fixed-radius ego-nets across different conformers, integrating geometric and topological information through a hybrid GNN-Transformer backbone without negatives or positional encodings.
Result: Pretrained on GEOM dataset, C-FREE achieves state-of-the-art results on MoleculeNet, outperforming contrastive, generative, and other multimodal self-supervised methods.
Conclusion: The framework demonstrates effective transfer learning across diverse chemical domains, highlighting the importance of 3D-informed molecular representations for property prediction and molecular design.
Abstract: High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches either depend on hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-Free Representation learning on Ego-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negatives, positional encodings, or expensive pre-processing. Pretraining on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.
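The contrast-free objective can be sketched as a simple regression in latent space: predict an ego-net's embedding from the embedding of its complementary neighborhood, with no negative pairs. The predictor, normalization, and stop-gradient below are illustrative choices, not the paper's exact recipe.

```python
# Hedged sketch of a contrast-free embedding-prediction loss in C-FREE's spirit.
import torch
import torch.nn as nn
import torch.nn.functional as F

predictor = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

def cfree_loss(z_ego, z_context):
    """z_ego, z_context: (batch, 128) embeddings of an ego-net and its complement."""
    p = predictor(z_context)                     # predict the ego-net from its context
    return F.mse_loss(F.normalize(p, dim=-1),
                      F.normalize(z_ego.detach(), dim=-1))   # no negatives needed

print(float(cfree_loss(torch.randn(32, 128), torch.randn(32, 128))))
```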
[724] Overclocking Electrostatic Generative Models
Daniil Shlenskii, Alexander Korotin
Main category: cs.LG
TL;DR: IPFM is a novel distillation framework that accelerates electrostatic generative models like PFGM++ by learning a generator whose induced electrostatic field matches the teacher’s, enabling fast sampling with few function evaluations while maintaining sample quality.
Details
Motivation: Electrostatic generative models like PFGM++ achieve state-of-the-art performance but rely on expensive ODE simulations for sampling, making them computationally costly. There is a need to accelerate these models while preserving their high sample quality.
Method: Proposed Inverse Poisson Flow Matching (IPFM), a distillation framework that formulates distillation as an inverse problem: learning a generator whose induced electrostatic field matches that of the teacher model. Derived a tractable training objective for this problem.
Result: IPFM produces distilled generators that achieve near-teacher or even superior sample quality using only a few function evaluations. Distillation converges faster for finite D than in the diffusion limit (D→∞), consistent with prior findings about PFGM++’s favorable optimization properties.
Conclusion: IPFM effectively accelerates electrostatic generative models across all D values, with faster convergence for finite D models. The method bridges to Score Identity Distillation (SiD) in the diffusion limit, providing a unified framework for accelerating both electrostatic and diffusion models.
Abstract: Electrostatic generative models such as PFGM++ have recently emerged as a powerful framework, achieving state-of-the-art performance in image synthesis. PFGM++ operates in an extended data space with auxiliary dimensionality $D$, recovering the diffusion model framework as $D\to\infty$, while yielding superior empirical results for finite $D$. Like diffusion models, PFGM++ relies on expensive ODE simulations to generate samples, making it computationally costly. To address this, we propose Inverse Poisson Flow Matching (IPFM), a novel distillation framework that accelerates electrostatic generative models across all values of $D$. Our IPFM reformulates distillation as an inverse problem: learning a generator whose induced electrostatic field matches that of the teacher. We derive a tractable training objective for this problem and show that, as $D \to \infty$, our IPFM closely recovers Score Identity Distillation (SiD), a recent method for distilling diffusion models. Empirically, our IPFM produces distilled generators that achieve near-teacher or even superior sample quality using only a few function evaluations. Moreover, we observe that distillation converges faster for finite $D$ than in the $D \to \infty$ (diffusion) limit, which is consistent with prior findings that finite-$D$ PFGM++ models exhibit more favorable optimization and sampling properties.
[725] OFMU: Optimization-Driven Framework for Machine Unlearning
Sadia Asif, Mohammad Mohammadi Amiri
Main category: cs.LG
TL;DR: OFMU is a penalty-based bi-level optimization framework for machine unlearning that prioritizes forgetting while preserving model utility through hierarchical optimization and gradient decorrelation.
Details
Motivation: Large language models need to unlearn specific knowledge for regulatory compliance, privacy, and safety without retraining from scratch. Current methods using weighted sum scalarization suffer from unstable training and degraded utility due to conflicting gradient directions.
Method: Proposes OFMU with penalty-based bi-level optimization: inner maximization enforces forgetting with similarity-aware penalty to decorrelate gradients, outer minimization restores utility. Uses two-loop algorithm with convergence guarantees for convex and non-convex cases.
Result: Extensive experiments on vision and language benchmarks show OFMU consistently outperforms existing unlearning methods in both forgetting efficacy and retained utility. Theoretical analysis confirms better trade-offs between forgetting and model utility.
Conclusion: OFMU provides an effective and scalable solution for machine unlearning with provable convergence guarantees, addressing the limitations of previous scalarization approaches through hierarchical optimization and gradient decorrelation.
Abstract: Large language models deployed in sensitive applications increasingly require the ability to unlearn specific knowledge, such as user requests, copyrighted materials, or outdated information, without retraining from scratch to ensure regulatory compliance, user privacy, and safety. This task, known as machine unlearning, aims to remove the influence of targeted data (forgetting) while maintaining performance on the remaining data (retention). A common approach is to formulate this as a multi-objective problem and reduce it to a single-objective problem via scalarization, where forgetting and retention losses are combined using a weighted sum. However, this often results in unstable training dynamics and degraded model utility due to conflicting gradient directions. To address these challenges, we propose OFMU, a penalty-based bi-level optimization framework that explicitly prioritizes forgetting while preserving retention through a hierarchical structure. Our method enforces forgetting via an inner maximization step that incorporates a similarity-aware penalty to decorrelate the gradients of the forget and retention objectives, and restores utility through an outer minimization step. To ensure scalability, we develop a two-loop algorithm with provable convergence guarantees under both convex and non-convex regimes. We further provide a rigorous theoretical analysis of convergence rates and show that our approach achieves better trade-offs between forgetting efficacy and model utility compared to prior methods. Extensive experiments across vision and language benchmarks demonstrate that OFMU consistently outperforms existing unlearning methods in both forgetting efficacy and retained utility.
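A similarity-aware penalty of this kind can be illustrated directly: compute the forget and retain gradients and penalize their alignment, so the forgetting step avoids directions that also harm retention. The cosine form and flattening below are our illustrative choices, and the sign convention depends on whether the inner step is written as a maximization.

```python
# Hedged sketch of a gradient-decorrelation penalty in OFMU's spirit.
import torch

def decorrelated_forget_loss(model, forget_loss, retain_loss, lam=1.0):
    g_f = torch.autograd.grad(forget_loss, model.parameters(), create_graph=True)
    g_r = torch.autograd.grad(retain_loss, model.parameters(), create_graph=True)
    g_f = torch.cat([g.flatten() for g in g_f])
    g_r = torch.cat([g.flatten() for g in g_r])
    cos = torch.dot(g_f, g_r) / (g_f.norm() * g_r.norm() + 1e-12)
    return forget_loss + lam * cos.pow(2)   # forget, but stay decorrelated

net = torch.nn.Linear(4, 1)
x_f, x_r = torch.randn(8, 4), torch.randn(8, 4)
loss = decorrelated_forget_loss(net, net(x_f).pow(2).mean(), net(x_r).pow(2).mean())
loss.backward()
```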
[726] Nonlinear Optimization with GPU-Accelerated Neural Network Constraints
Robert Parker, Oscar Dowson, Nicole LoGiudice, Manuel Garcia, Russell Bent
Main category: cs.LG
TL;DR: A reduced-space formulation for optimizing over trained neural networks that treats networks as gray boxes, leading to faster solves and fewer iterations in interior point methods.
Details
Motivation: To improve optimization efficiency over neural networks by avoiding exposure of intermediate variables and constraints to the solver, which occurs in full-space formulations.
Method: Treat neural networks as gray boxes where only outputs and derivatives are evaluated on GPU, without exposing intermediate variables and constraints to the optimization solver.
Result: The reduced-space formulation achieves faster solves and fewer iterations compared to full-space formulation in interior point methods.
Conclusion: The reduced-space gray box approach provides computational benefits for optimization problems involving neural networks, as demonstrated on adversarial generation and power flow applications.
Abstract: We propose a reduced-space formulation for optimizing over trained neural networks where the network’s outputs and derivatives are evaluated on a GPU. To do this, we treat the neural network as a “gray box” where intermediate variables and constraints are not exposed to the optimization solver. Compared to the full-space formulation, in which intermediate variables and constraints are exposed to the optimization solver, the reduced-space formulation leads to faster solves and fewer iterations in an interior point method. We demonstrate the benefits of this method on two optimization problems: Adversarial generation for a classifier trained on MNIST images and security-constrained optimal power flow with transient feasibility enforced using a neural network surrogate.
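The gray-box interface is concrete: the solver only ever asks the network for output values and derivatives, both evaluated on GPU. A hedged sketch of such a wrapper in PyTorch follows; the solver-side registration depends on the modeling tool and is omitted.

```python
# Hedged sketch of a "gray box" wrapper exposing only values and Jacobians.
import torch

class GrayBox:
    def __init__(self, net, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.net = net.to(device).eval()
        self.device = device

    def value(self, x):
        x = torch.as_tensor(x, dtype=torch.float32, device=self.device)
        with torch.no_grad():
            return self.net(x).cpu().numpy()

    def jacobian(self, x):
        x = torch.as_tensor(x, dtype=torch.float32, device=self.device)
        return torch.autograd.functional.jacobian(self.net, x).cpu().numpy()

gb = GrayBox(torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Tanh(),
                                 torch.nn.Linear(16, 2)))
x0 = [0.1, 0.2, 0.3, 0.4]
print(gb.value(x0), gb.jacobian(x0).shape)   # outputs and a (2, 4) Jacobian
```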
[727] A Machine Learning Pipeline for Multiple Sclerosis Biomarker Discovery: Comparing explainable AI and Traditional Statistical Approaches
Samuele Punzo, Silvia Giulia Galfrè, Francesco Massafra, Alessandro Maglione, Corrado Priami, Alina Sîrbu
Main category: cs.LG
TL;DR: Machine learning pipeline using XGBoost and SHAP identifies MS biomarkers from PBMC microarray data, revealing both overlapping and unique biomarkers compared to traditional differential expression analysis.
Details
Motivation: To discover biomarkers for Multiple Sclerosis (MS) by integrating multiple publicly available microarray datasets and combining explainable AI with traditional statistical methods for deeper disease insights.
Method: Integrated 8 PBMC microarray datasets, preprocessed data, trained XGBoost classifier optimized via Bayesian search, used SHAP for feature importance analysis, and compared with classical Differential Expression Analysis (DEA).
Result: Identified both overlapping and unique biomarkers between SHAP and DEA methods. SHAP-selected genes were biologically relevant, linked to MS-associated pathways including sphingolipid signaling, Th1/Th2/Th17 cell differentiation, and Epstein-Barr virus infection.
Conclusion: Combining explainable AI (xAI) with traditional statistical methods provides complementary strengths and deeper insights into MS disease mechanisms.
Abstract: We present a machine learning pipeline for biomarker discovery in Multiple Sclerosis (MS), integrating eight publicly available microarray datasets from Peripheral Blood Mononuclear Cells (PBMC). After robust preprocessing, we trained an XGBoost classifier optimized via Bayesian search. SHapley Additive exPlanations (SHAP) were used to identify key features for model prediction, thus indicating possible biomarkers. These were compared with genes identified through classical Differential Expression Analysis (DEA). Our comparison revealed both overlapping and unique biomarkers between SHAP and DEA, suggesting complementary strengths. Enrichment analysis confirmed the biological relevance of SHAP-selected genes, linking them to pathways such as sphingolipid signaling, Th1/Th2/Th17 cell differentiation, and Epstein-Barr virus infection, all known to be associated with MS. This study highlights the value of combining explainable AI (xAI) with traditional statistical methods to gain deeper insights into disease mechanisms.
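The core of the pipeline uses standard, well-documented APIs; a condensed sketch follows. The hyperparameters and the Bayesian search are omitted, and the expression matrix is a random placeholder for the PBMC data.

```python
# Hedged sketch: XGBoost classifier + SHAP-based gene ranking (data is a placeholder).
import numpy as np
import shap
import xgboost as xgb

X, y = np.random.rand(200, 500), np.random.randint(0, 2, 200)  # stand-in for PBMC data
gene_names = [f"gene_{i}" for i in range(X.shape[1])]

clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X, y)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)              # (n_samples, n_genes)
importance = np.abs(shap_values).mean(axis=0)       # candidate biomarker ranking
top = np.argsort(importance)[::-1][:20]
print([gene_names[i] for i in top])
```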
[728] Bayesian Transfer Operators in Reproducing Kernel Hilbert Spaces
Septimus Boshoff, Sebastian Peitz, Stefan Klus
Main category: cs.LG
TL;DR: This paper unifies Gaussian process regression with dynamic mode decomposition to improve kernel-based Koopman operator methods, addressing computational sparsity and hyperparameter optimization challenges.
Details
Motivation: To address two key problems in kernel-based Koopman operator methods: computational sparsity/scalability issues and hyperparameter optimization challenges for adapting models to dynamical systems.
Method: Combines Gaussian process regression with dynamic mode decomposition to create a unified approach that leverages reproducing kernel Hilbert spaces and Koopman operator theory.
Result: The method reduces computational demands while improving resilience against sensor noise, and provides better adaptation to dynamical systems through improved hyperparameter optimization.
Conclusion: The main contribution is the successful unification of Gaussian process regression and dynamic mode decomposition, offering practical improvements for kernel-based Koopman operator applications.
Abstract: The Koopman operator, as a linear representation of a nonlinear dynamical system, has been attracting attention in many fields of science. Recently, Koopman operator theory has been combined with another concept that is popular in data science: reproducing kernel Hilbert spaces. We follow this thread into Gaussian process methods, and illustrate how these methods can alleviate two pervasive problems with kernel-based Koopman algorithms. The first is sparsity: most kernel methods do not scale well and require an approximation to become practical. We show not only that the computational demands can be reduced, but also that resilience against sensor noise improves. The second problem involves hyperparameter optimization and dictionary learning to adapt the model to the dynamical system. In summary, the main contribution of this work is the unification of Gaussian process regression and dynamic mode decomposition.
[729] Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise
Juan Ramirez, Simon Lacoste-Julien
Main category: cs.LG
TL;DR: Dual optimistic ascent on Lagrangian is equivalent to gradient descent-ascent on Augmented Lagrangian, enabling transfer of theoretical guarantees and principled hyperparameter tuning.
Details
Motivation: Constrained optimization for neural networks often uses first-order methods that suffer from oscillations and can fail to find all local solutions. Practitioners favor dual optimistic ascent, which performs well empirically but lacks formal guarantees.
Method: Established the equivalence between dual optimistic ascent on the Lagrangian and gradient descent-ascent on the Augmented Lagrangian, allowing transfer of ALM guarantees.
Result: Proved dual optimistic ascent converges linearly to all local solutions and provided principled guidance for tuning optimism hyper-parameter.
Conclusion: Closed critical gap between empirical success of dual optimistic methods and their theoretical foundation by establishing equivalence with ALM.
Abstract: Constrained optimization is a powerful framework for enforcing requirements on neural networks. These constrained deep learning problems are typically solved using first-order methods on their min-max Lagrangian formulation, but such approaches often suffer from oscillations and can fail to find all local solutions. While the Augmented Lagrangian method (ALM) addresses these issues, practitioners often favor dual optimistic ascent schemes (PI control) on the standard Lagrangian, which perform well empirically but lack formal guarantees. In this paper, we establish a previously unknown equivalence between these approaches: dual optimistic ascent on the Lagrangian is equivalent to gradient descent-ascent on the Augmented Lagrangian. This finding allows us to transfer the robust theoretical guarantees of the ALM to the dual optimistic setting, proving it converges linearly to all local solutions. Furthermore, the equivalence provides principled guidance for tuning the optimism hyper-parameter. Our work closes a critical gap between the empirical success of dual optimistic methods and their theoretical foundation.
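For orientation, the two objects the equivalence connects can be written schematically, in our own notation (the exact correspondence and conditions are in the paper; $\beta$ denotes the optimism weight and $h$ the constraint function):

```latex
% Schematic only; notation ours. Dual optimistic ascent (PI control) on the
% standard Lagrangian:
L(x,\lambda) = f(x) + \lambda^\top h(x), \qquad
\lambda_{t+1} = \lambda_t + \eta\big[h(x_t) + \beta\,\big(h(x_t) - h(x_{t-1})\big)\big].
% Gradient descent-ascent is instead run on the Augmented Lagrangian:
L_\rho(x,\lambda) = f(x) + \lambda^\top h(x) + \tfrac{\rho}{2}\,\lVert h(x)\rVert^2 .
```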
[730] Adaptive Dual-Mode Distillation with Incentive Schemes for Scalable, Heterogeneous Federated Learning on Non-IID Data
Zahid Iqbal
Main category: cs.LG
TL;DR: The paper proposes three federated learning methods (DL-SH, DL-MH, I-DL-MH) to address statistical heterogeneity, model heterogeneity, and client incentive challenges, achieving significant accuracy improvements and reduced communication costs.
Details
Motivation: To overcome key challenges in federated learning including: 1) model heterogeneity where clients have different computational capabilities and business needs, 2) statistical heterogeneity (non-IID data) that degrades global model performance, and 3) the need for cost-effective incentives to encourage client participation.
Method: Three proposed approaches: DL-SH for statistical heterogeneity with privacy preservation and communication efficiency; DL-MH for managing fully heterogeneous models while addressing statistical disparities; and I-DL-MH as an incentive-based extension of DL-MH to promote client engagement.
Result: Comprehensive experiments across various model architectures, data distributions (IID and non-IID), and datasets show significant improvements: DL-SH improves global model accuracy by 153%, and I-DL-MH achieves 225% improvement under non-IID conditions while reducing communication costs compared to state-of-the-art approaches.
Conclusion: The proposed approaches effectively address statistical and model heterogeneity in federated learning, significantly enhancing accuracy and reducing communication costs, while providing incentives to encourage client participation in complex federated learning environments.
Abstract: Federated Learning (FL) has emerged as a promising decentralized learning (DL) approach that enables the use of distributed data without compromising user privacy. However, FL poses several key challenges. First, it is frequently assumed that every client can train the same machine learning models; however, not all clients are able to meet this assumption because of differences in their business needs and computational resources. Second, statistical heterogeneity (a.k.a. non-IID data) poses a major challenge in FL, which can lead to lower global model performance. Third, while addressing these challenges, there is a need for a cost-effective incentive mechanism to encourage clients to participate in FL training. In response to these challenges, we propose several methodologies: DL-SH, which facilitates efficient, privacy-preserving, and communication-efficient learning in the context of statistical heterogeneity; DL-MH, designed to manage fully heterogeneous models while tackling statistical disparities; and I-DL-MH, an incentive-based extension of DL-MH that promotes client engagement in federated learning training by providing incentives within this complex federated learning framework. Comprehensive experiments were carried out to assess the performance and scalability of the proposed approaches across a range of complex experimental settings. This involved various model architectures, diverse data distributions (IID and several non-IID scenarios), and multiple datasets. Experimental results demonstrate that the proposed approaches significantly enhance accuracy and decrease communication costs while effectively addressing statistical heterogeneity and model heterogeneity in comparison to existing state-of-the-art approaches and baselines, with DL-SH improving global model accuracy by 153%, and I-DL-MH achieving a 225% improvement under non-IID conditions.
[731] Activation Function Design Sustains Plasticity in Continual Learning
Lute Lillo, Nick Cheney
Main category: cs.LG
TL;DR: Activation functions play a crucial role in mitigating plasticity loss in continual learning, with specific nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) showing effectiveness across supervised and reinforcement learning settings.
Details
Motivation: In continual learning, models can progressively lose plasticity beyond catastrophic forgetting, and the role of activation functions in this failure mode remains underexplored compared to i.i.d. training regimes.
Method: Introduced two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) based on analysis of negative-branch shape and saturation behavior, evaluated in supervised class-incremental benchmarks and RL with non-stationary MuJoCo environments, plus a stress protocol for diagnostics.
Result: Activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss, offering lightweight solutions without extra capacity or task-specific tuning.
Conclusion: Thoughtful activation design provides a domain-general way to sustain plasticity in continual learning, with specific activation shapes directly linked to adaptation under change.
Abstract: In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity) and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.
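The abstract does not give Smooth-Leaky's formula, so the following is a purely hypothetical illustration of an activation with the two properties the analysis highlights: smooth everywhere, with a non-saturating, leaky negative branch.

```python
# Hypothetical illustration only; NOT the paper's Smooth-Leaky. This function is
# smooth, behaves like alpha * x as x -> -inf (leaky, non-saturating), and like
# x as x -> +inf.
import torch
import torch.nn.functional as F

def smooth_leaky_like(x, alpha=0.1):
    return alpha * x + (1 - alpha) * F.softplus(x)

x = torch.linspace(-5, 5, 11)
print(smooth_leaky_like(x))
```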
[732] JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation
Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo
Main category: cs.LG
TL;DR: JointDiff is a diffusion framework that unifies continuous spatio-temporal data and discrete events generation, validated in sports with trajectory and possession event modeling, achieving SOTA performance.
Details
Motivation: To bridge the gap between modeling continuous data and discrete events as separate processes in complex interactive systems where they interact synchronously.
Method: JointDiff diffusion framework with CrossGuid conditioning operation for multi-agent domains, supporting non-controllable and controllable generation (weak-possessor-guidance and text-guidance).
Result: Achieves state-of-the-art performance, demonstrates joint modeling is crucial for realistic and controllable generative models in interactive systems.
Conclusion: Joint modeling of continuous and discrete processes is essential for building realistic and controllable generative models for complex interactive systems.
Abstract: Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text-guidance, which enables fine-grained, language-driven generation. To enable the conditioning with these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi-agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems.
[733] From Parameters to Behavior: Unsupervised Compression of the Policy Space
Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Main category: cs.LG
TL;DR: The paper proposes an unsupervised method to compress DRL policy parameters into a low-dimensional latent space using behavioral reconstruction, enabling more efficient learning and task adaptation.
Details
Motivation: DRL is sample-inefficient because it optimizes policies directly in high-dimensional, highly redundant parameter spaces, which is especially problematic in multi-task settings.
Method: Train a generative model to map a low-dimensional latent space to policy parameters using a behavioral reconstruction loss, organizing the latent space by functional similarity rather than parameter proximity.
Result: Policy parameterization can be compressed up to five orders of magnitude while retaining expressivity, and the learned manifold enables task-specific adaptation via Policy Gradient in latent space.
Conclusion: Compressing policy parameters into a functionally-organized latent space significantly improves DRL efficiency and enables effective multi-task learning.
Abstract: Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient. We argue that this inefficiency stems from the standard practice of optimizing policies directly in the high-dimensional and highly redundant parameter space $\Theta$. This challenge is greatly compounded in multi-task settings. In this work, we develop a novel, unsupervised approach that compresses the policy parameter space $\Theta$ into a low-dimensional latent space $\mathcal{Z}$. We train a generative model $g:\mathcal{Z}\to\Theta$ by optimizing a behavioral reconstruction loss, which ensures that the latent space is organized by functional similarity rather than proximity in parameterization. We conjecture that the inherent dimensionality of this manifold is a function of the environment’s complexity, rather than the size of the policy network. We validate our approach in continuous control domains, showing that the parameterization of standard policy networks can be compressed up to five orders of magnitude while retaining most of its expressivity. As a byproduct, we show that the learned manifold enables task-specific adaptation via Policy Gradient operating in the latent space $\mathcal{Z}$.
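The behavioral reconstruction loss can be sketched directly: decode a latent z into policy parameters, load them into the policy network, and match its actions to the original policy's actions on sampled states, functional similarity rather than parameter-space distance. The shapes, networks, and the use of `torch.func.functional_call` below are illustrative assumptions.

```python
# Hedged sketch of a behavioral reconstruction objective (torch >= 2.0).
import torch
import torch.nn as nn
from torch.func import functional_call

policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 2))
n_params = sum(p.numel() for p in policy.parameters())
decoder = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, n_params))

def load_flat(flat):
    out, i = {}, 0
    for name, p in policy.named_parameters():
        out[name] = flat[i:i + p.numel()].view_as(p)
        i += p.numel()
    return out

def behavioral_loss(z, theta_true_flat, states):
    theta_hat = load_flat(decoder(z))
    a_hat = functional_call(policy, theta_hat, (states,))
    with torch.no_grad():
        a_true = functional_call(policy, load_flat(theta_true_flat), (states,))
    return (a_hat - a_true).pow(2).mean()   # match behavior, not parameters

z = torch.randn(16)
theta = torch.cat([p.detach().flatten() for p in policy.parameters()])
print(float(behavioral_loss(z, theta, torch.randn(64, 8))))
```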
[734] ECHO: Toward Contextual Seq2Seq Paradigms in Large EEG Models
Chenyu Liu, Yuqiu Deng, Tianyu Liu, Jinan Zhou, Xinliang Zhou, Ziyu Jia, Yi Ding
Main category: cs.LG
TL;DR: ECHO is a decoder-centric Large EEG Model that reformulates EEG modeling as sequence-to-sequence learning, enabling in-context learning and superior multi-task performance without parameter updates.
Details
Motivation: Current Large EEG Models (LEMs) focus on encoder-centric architectures but lack powerful decoders, limiting full utilization of learned features and generalization across diverse EEG tasks and datasets.
Method: ECHO captures layered relationships among signals, labels, and tasks in sequence space, incorporating discrete support samples to construct contextual cues for sequence-to-sequence learning with in-context learning capabilities.
Result: Extensive experiments show ECHO consistently outperforms state-of-the-art single-task LEMs in multi-task settings, demonstrating superior generalization and adaptability across multiple datasets.
Conclusion: The decoder-centric paradigm of ECHO provides a more effective approach for EEG modeling, enabling dynamic adaptation to heterogeneous tasks through in-context learning without requiring parameter updates.
Abstract: Electroencephalography (EEG), with its broad range of applications, necessitates models that can generalize effectively across various tasks and datasets. Large EEG Models (LEMs) address this by pretraining encoder-centric architectures on large-scale unlabeled data to extract universal representations. While effective, these models lack decoders of comparable capacity, limiting the full utilization of the learned features. To address this issue, we introduce ECHO, a novel decoder-centric LEM paradigm that reformulates EEG modeling as sequence-to-sequence learning. ECHO captures layered relationships among signals, labels, and tasks within sequence space, while incorporating discrete support samples to construct contextual cues. This design equips ECHO with in-context learning, enabling dynamic adaptation to heterogeneous tasks without parameter updates. Extensive experiments across multiple datasets demonstrate that, even with basic model components, ECHO consistently outperforms state-of-the-art single-task LEMs in multi-task settings, showing superior generalization and adaptability.
[735] Quantile Advantage Estimation for Entropy-Safe Reasoning
Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
Main category: cs.LG
TL;DR: RLVR improves LLM reasoning but suffers from entropy collapse/explosion due to mean baseline issues. QAE replaces mean with quantile baseline, providing two-regime gating and proven entropy safety, stabilizing training and improving performance.
Details
Motivation: Address entropy collapse and entropy explosion in RLVR training, caused by the mean baseline in value-free RL methods improperly penalizing negative-advantage samples under reward outliers.
Method: Propose Quantile Advantage Estimation (QAE), which replaces the mean baseline with a group-wise K-quantile baseline, creating a response-level two-regime gate: it reinforces rare successes on hard queries and targets remaining failures on easy queries.
Result: QAE stabilizes entropy, sparsifies credit assignment (roughly 80% of responses get zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023 benchmarks.
Conclusion: Baseline design, rather than token-level heuristics, is the primary mechanism for scaling RLVR, with QAE providing effective entropy control and performance improvements.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries ($p \le 1 - K$) it reinforces rare successes, while on easy queries ($p > 1 - K$) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design – rather than token-level heuristics – as the primary mechanism for scaling RLVR.
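The baseline swap itself is one line; a sketch of the group-wise computation follows. The choice K = 0.8 is illustrative, echoing the roughly 80% zero-advantage figure reported above.

```python
# Hedged sketch of the QAE baseline swap: quantile instead of mean, per group.
import torch

def quantile_advantages(rewards, k=0.8):
    """rewards: (group_size,) verifiable rewards for one query's responses."""
    baseline = torch.quantile(rewards, k)
    return rewards - baseline   # contrast with the mean baseline: rewards - rewards.mean()

r = torch.tensor([1., 0., 0., 0., 1., 0., 0., 0.])
print(quantile_advantages(r))
```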
[736] Learning to Price Bundles: A GCN Approach for Mixed Bundling
Liangyu Ding, Chenghan Wu, Guokai Li, Zizhuo Wang
Main category: cs.LG
TL;DR: The paper proposes a GCN-based framework for solving the computationally challenging bundle pricing problem, achieving near-optimal solutions efficiently.
Details
Motivation: Bundle pricing is a classic revenue management problem that is typically intractable due to the exponential number of candidate bundles, requiring efficient solution methods.
Method: Develop a graph representation of the mixed bundling model, train a GCN to learn optimal bundle patterns, and propose inference strategies plus a local-search technique to improve solutions.
Result: Achieves near-optimal solutions (better than 97%) with a fraction of the computational time for small-to-medium problems, superior performance for larger problems compared to BSP, and handles instances with more than 30 products even with non-additive utilities.
Conclusion: GCN-based framework is effective and efficient for bundle pricing, providing high-quality solutions across various problem sizes and challenging utility scenarios.
Abstract: Bundle pricing refers to designing several product combinations (i.e., bundles) and determining their prices in order to maximize the expected profit. It is a classic problem in revenue management and arises in many industries, such as e-commerce, tourism, and video games. However, the problem is typically intractable due to the exponential number of candidate bundles. In this paper, we explore the usage of graph convolutional networks (GCNs) in solving the bundle pricing problem. Specifically, we first develop a graph representation of the mixed bundling model (where every possible bundle is assigned a specific price) and then train a GCN to learn the latent patterns of optimal bundles. Based on the trained GCN, we propose two inference strategies to derive high-quality feasible solutions. A local-search technique is further proposed to improve the solution quality. Numerical experiments validate the effectiveness and efficiency of our proposed GCN-based framework. Using a GCN trained on instances with 5 products, our methods consistently achieve near-optimal solutions (better than 97%) with only a fraction of the computational time for problems of small to medium size. It also achieves superior solutions for larger problems compared with other heuristic methods such as bundle size pricing (BSP). The method can also provide high-quality solutions for instances with more than 30 products, even in the challenging cases where product utilities are non-additive.
[737] A Theoretical Analysis of Discrete Flow Matching Generative Models
Maojiang Su, Mingcheng Lu, Jerry Yao-Chieh Hu, Shang Wu, Zhao Song, Alex Reneau, Han Liu
Main category: cs.LG
TL;DR: Theoretical analysis of Discrete Flow Matching (DFM) generative models, proving that the generated distribution converges to the true data distribution as training set size increases.
Details
Motivation: To provide formal theoretical guarantees for DFM generative models by analyzing the distribution estimation error and establishing convergence properties.
Method: Decomposed the final distribution estimation error into the risk of the learned velocity field, then bounded this risk by analyzing its two sources: approximation error (Transformer architecture capacity) and estimation error (statistical convergence rates from finite-dataset training).
Result: Proved that total variation distance between generated and target distributions is controlled by learned velocity field risk, and provided formal proof that DFM-generated distribution converges to true data distribution with increasing training data.
Conclusion: First formal proof establishing theoretical convergence guarantees for DFM models, providing rigorous foundation for discrete generative modeling using flow matching framework.
Abstract: We provide a theoretical analysis for end-to-end training of Discrete Flow Matching (DFM) generative models. DFM is a promising discrete generative modeling framework that learns the underlying generative dynamics by training a neural network to approximate the transformative velocity field. Our analysis establishes a clear chain of guarantees by decomposing the final distribution estimation error. We first prove that the total variation distance between the generated and target distributions is controlled by the risk of the learned velocity field. We then bound this risk by analyzing its two primary sources: (i) Approximation Error, where we quantify the capacity of the Transformer architecture to represent the true velocity, and (ii) Estimation Error, where we derive statistical convergence rates that bound the error from training on a finite dataset. By composing these results, we provide the first formal proof that the distribution generated by a trained DFM model provably converges to the true data distribution as the training set size increases.
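Schematically, the chain of guarantees reads as follows (constants, norms, and rate exponents elided; the symbols here are illustrative stand-ins for the paper's quantities):

```latex
\mathrm{TV}\!\left(p_{\hat{\theta}},\, p_{\mathrm{data}}\right)
\;\lesssim\; \mathcal{R}(\hat{v})
\;\le\; \underbrace{\varepsilon_{\mathrm{approx}}}_{\text{Transformer capacity}}
\;+\; \underbrace{\varepsilon_{\mathrm{est}}(n)}_{\text{finite-sample error}},
\qquad \varepsilon_{\mathrm{est}}(n) \to 0 \;\text{ as }\; n \to \infty .
```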
[738] Machine learning approaches to seismic event classification in the Ostrava region
Marek Pecha, Michael Skotnica, Jana Rušajová, Bohdan Rieznikov, Vít Wandrol, Markéta Rösnerová, Jaromír Knejzlík
Main category: cs.LG
TL;DR: Machine learning methods (LSTM and XGBoost) were applied to classify seismic events in the Czech Republic, achieving high F1-scores (0.94-0.95) for distinguishing between tectonic and mining-induced events.
Details
Motivation: The northeastern Czech Republic is seismically active with both mining-induced and natural tectonic events, requiring rapid differentiation between them despite mining cessation.Method: Applied Long Short-Term Memory (LSTM) recurrent neural network and XGBoost to labeled seismic data from the Seismic Polygon Frenštát dataset containing tectonic and mining-induced events.
Result: Both machine learning methods achieved high performance with F1-scores of 0.94-0.95 for binary classification of seismic event types.
Conclusion: Modern machine learning techniques show strong potential for rapid characterization and classification of seismic events in this region.
Abstract: The northeastern region of the Czech Republic is among the most seismically active areas in the country. The most frequent seismic events are mining-induced since there used to be strong mining activity in the past. However, natural tectonic events may also occur. In addition, seismic stations often record explosions in quarries in the region. Despite the cessation of mining activities, mine-induced seismic events still occur. Therefore, a rapid differentiation between tectonic and anthropogenic events is still important. The region is currently monitored by the OKC seismic station in Ostrava-Krásné Pole, built in 1983, which is part of the Czech Regional Seismic Network. The station has been providing digital continuous waveform data at 100 Hz since 2007. In the years 1992–2002, the region was co-monitored by the Seismic Polygon Frenštát (SPF), which consisted of five seismic stations using a triggered STA/LTA system. In this study, we apply and compare machine learning methods to the SPF dataset, which contains labeled records of tectonic and mining-induced events. For binary classification, a Long Short-Term Memory recurrent neural network and XGBoost achieved an F1-score of 0.94–0.95, demonstrating the potential of modern machine learning techniques for rapid event characterization.
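A minimal sketch of the XGBoost branch of such a comparison (synthetic stand-in features; the paper's waveform preprocessing and the SPF dataset itself are not reproduced):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 20))  # stand-in for waveform-derived features
labels = rng.integers(0, 2, size=1000)  # 0 = tectonic, 1 = mining-induced

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```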
[739] Learning Admissible Heuristics for A*: Theory and Practice
Ehsan Futuhi, Nathan R. Sturtevant
Main category: cs.LG
TL;DR: This paper introduces Cross-Entropy Admissibility (CEA) to learn admissible heuristics for A-star search, provides theoretical generalization bounds for heuristic learning, and demonstrates improved performance on Rubik’s Cube.
Details
Motivation: Existing deep learning approaches for heuristic functions often disregard admissibility (which guarantees solution optimality) and have limited generalization guarantees beyond training data.Method: Poses heuristic learning as constrained optimization with Cross-Entropy Admissibility (CEA) loss function; leverages PDB abstractions and graph structural properties; uses ReLU neural networks for theoretical analysis.
Result: Achieves near-admissible heuristics on Rubik’s Cube with stronger guidance than compressed pattern database heuristics; provides tighter bounds on training samples needed for generalization.
Conclusion: The approach successfully enforces admissibility during training while providing theoretical generalization guarantees for both standard and goal-dependent heuristics.
Abstract: Heuristic functions are central to the performance of search algorithms such as A-star, where admissibility - the property of never overestimating the true shortest-path cost - guarantees solution optimality. Recent deep learning approaches often disregard admissibility and provide limited guarantees on generalization beyond the training data. This paper addresses both of these limitations. First, we pose heuristic learning as a constrained optimization problem and introduce Cross-Entropy Admissibility (CEA), a loss function that enforces admissibility during training. On the Rubik’s Cube domain, this method yields near-admissible heuristics with significantly stronger guidance than compressed pattern database (PDB) heuristics. Theoretically, we study the sample complexity of learning heuristics. By leveraging PDB abstractions and the structural properties of graphs such as the Rubik’s Cube, we tighten the bound on the number of training samples needed for A-star to generalize. Replacing a general hypothesis class with a ReLU neural network gives bounds that depend primarily on the network’s width and depth, rather than on graph size. Using the same network, we also provide the first generalization guarantees for goal-dependent heuristics.
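One hedged way to picture an admissibility-enforcing loss (this is a generic asymmetric penalty, not the paper's exact CEA formulation): penalize overestimates of the true cost-to-go much harder than underestimates, since only overestimation breaks admissibility.

```python
import torch
import torch.nn.functional as F

def admissibility_penalized_loss(h_pred, h_true, lam=10.0):
    # Regression loss plus an asymmetric penalty on h_pred > h_true,
    # the direction that violates admissibility for A-star heuristics.
    overestimate = torch.clamp(h_pred - h_true, min=0.0)
    return F.mse_loss(h_pred, h_true) + lam * (overestimate ** 2).mean()
```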
[740] The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?
Guannan Lai, Da-Wei Zhou, Xin Yang, Han-Jia Ye
Main category: cs.LG
TL;DR: Current CIL evaluation protocols using random sequence sampling fail to capture true performance distribution. EDGE protocol uses inter-task similarity to identify extreme sequences for more accurate evaluation.
Details
Motivation: Mainstream CIL evaluation protocols calculate mean/variance from small random samples, which fails to capture full performance range and underestimates true variance.Method: Propose EDGE protocol that adaptively identifies extreme class sequences using inter-task similarity to better approximate ground-truth performance distribution.
Result: EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries compared to random sampling approaches.
Conclusion: EDGE provides more reliable CIL evaluation by characterizing entire performance distribution through extreme sequence identification, offering actionable insights for model selection.
Abstract: Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones, while maintaining stable performance across all possible class sequences. In real-world settings, the order in which classes arrive is diverse and unpredictable, and model performance can vary substantially across different sequences. Yet mainstream evaluation protocols calculate mean and variance from only a small set of randomly sampled sequences. Our theoretical analysis and empirical results demonstrate that this sampling strategy fails to capture the full performance range, resulting in biased mean estimates and a severe underestimation of the true variance in the performance distribution. We therefore contend that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. To this end, we introduce the concept of extreme sequences and provide theoretical justification for their crucial role in the reliable evaluation of CIL. Moreover, we observe a consistent positive correlation between inter-task similarity and model performance, a relation that can be leveraged to guide the search for extreme sequences. Building on these insights, we propose EDGE (Extreme case-based Distribution and Generalization Evaluation), an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity, offering a closer approximation of the ground-truth performance distribution. Extensive experiments demonstrate that EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries, providing actionable insights for model selection and robustness checking. Our code is available at https://github.com/AIGNLAI/EDGE.
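A greedy sketch of similarity-guided extreme-sequence search (a plausible reading of the idea, not the authors' exact procedure): chain tasks so that consecutive tasks are maximally or minimally similar, yielding candidate best- and worst-case orderings.

```python
import numpy as np

def extreme_sequence(sim, maximize=True):
    # sim: symmetric inter-task similarity matrix, shape (n, n)
    n = sim.shape[0]
    order, remaining = [0], set(range(1, n))
    while remaining:
        pick = (max if maximize else min)(remaining, key=lambda j: sim[order[-1], j])
        order.append(pick)
        remaining.remove(pick)
    return order
```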
[741] Transport Based Mean Flows for Generative Modeling
Elaheh Akbari, Ping He, Ahmadreza Moradipari, Yikun Bai, Soheil Kolouri
Main category: cs.LG
TL;DR: The paper proposes an improved one-step generation method for flow-matching models by incorporating optimal transport-based sampling strategies into Mean Flows, achieving better fidelity and diversity while maintaining fast inference.
Details
Motivation: Flow-matching models suffer from slow inference due to multiple sequential sampling steps. While Mean Flows offer one-step generation with speedups, they often fail to faithfully approximate the original multi-step process in continuous domains.Method: The authors incorporate optimal transport-based sampling strategies into the Mean Flow framework to better preserve the fidelity and diversity of the original multi-step flow process.
Result: Experiments on low-dimensional settings and high-dimensional tasks (image generation, image-to-image translation, point cloud generation) show superior inference accuracy in one-step generative modeling compared to previous approaches.
Conclusion: The proposed method enables one-step generators that better maintain the quality of original multi-step flow-matching processes while providing substantial inference speedups.
Abstract: Flow-matching generative models have emerged as a powerful paradigm for continuous data generation, achieving state-of-the-art results across domains such as images, 3D shapes, and point clouds. Despite their success, these models suffer from slow inference due to the requirement of numerous sequential sampling steps. Recent work has sought to accelerate inference by reducing the number of sampling steps. In particular, Mean Flows offer a one-step generation approach that delivers substantial speedups while retaining strong generative performance. Yet, in many continuous domains, Mean Flows fail to faithfully approximate the behavior of the original multi-step flow-matching process. In this work, we address this limitation by incorporating optimal transport-based sampling strategies into the Mean Flow framework, enabling one-step generators that better preserve the fidelity and diversity of the original multi-step flow process. Experiments on controlled low-dimensional settings and on high-dimensional tasks such as image generation, image-to-image translation, and point cloud generation demonstrate that our approach achieves superior inference accuracy in one-step generative modeling.
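A common way to realize transport-based pairing at the minibatch level (assumed here for illustration; the paper's exact sampling strategy may differ) is to match noise and data samples with an assignment that minimizes total squared distance, so the one-step model learns straighter transport paths:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_pairing(x0, x1):
    # x0: noise samples, x1: data samples, both of shape (batch, dim)
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)  # minibatch optimal-transport plan
    return x0[rows], x1[cols]                 # matched pairs for flow targets
```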
[742] Efficient Epistemic Uncertainty Estimation in Regression Ensemble Models Using Pairwise-Distance Estimators
Lucas Berry, David Meger
Main category: cs.LG
TL;DR: Introduces PaiDEs for fast epistemic uncertainty estimation in ensemble models, achieving 100x speedup over Monte Carlo methods while improving performance in high-dimensional regression tasks.
Details
Motivation: Need for efficient epistemic uncertainty estimation in ensemble models, as traditional Monte Carlo methods are slow and struggle with high-dimensional inputs.Method: Uses pairwise-distance estimators (PaiDEs) between model components to establish entropy bounds and enhance Bayesian Active Learning by Disagreement (BALD).
Result: PaiDEs achieve up to 100x faster uncertainty estimation, cover more inputs simultaneously, and outperform existing methods on high-dimensional regression benchmarks like Hopper, Ant, and Humanoid.
Conclusion: PaiDEs provide an efficient and effective approach for epistemic uncertainty estimation, particularly beneficial for high-dimensional regression tasks in active learning frameworks.
Abstract: This work introduces an efficient novel approach to epistemic uncertainty estimation in ensemble models for regression tasks using pairwise-distance estimators (PaiDEs). Utilizing the pairwise-distance between model components, these estimators establish bounds on entropy. We leverage this capability to enhance the performance of Bayesian Active Learning by Disagreement (BALD). Notably, unlike sample-based Monte Carlo estimators, PaiDEs exhibit a remarkable capability to estimate epistemic uncertainty at speeds up to 100 times faster while covering a significantly larger number of inputs at once and demonstrating superior performance in higher dimensions. To validate our approach, we conducted a varied series of regression experiments on commonly used benchmarks: 1D sinusoidal data, $\textit{Pendulum}$, $\textit{Hopper}$, $\textit{Ant}$ and $\textit{Humanoid}$. For each experimental setting, an active learning framework was applied to demonstrate the advantages of PaiDEs for epistemic uncertainty estimation. We compare our approach to existing active learning methods and find that it outperforms them on high-dimensional regression tasks.
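A minimal sketch of a pairwise-distance entropy estimator for a uniform Gaussian ensemble (Kolchinsky-Tracey style; the choice of pairwise distance sets the bound direction, and the paper's exact estimator may differ):

```python
import numpy as np

def paide_entropy(mus, sigma):
    # Ensemble of components N(mu_i, sigma^2 I); the Bhattacharyya distance
    # between equal-covariance Gaussians is ||mu_i - mu_j||^2 / (8 sigma^2).
    d = mus.shape[1]
    comp_entropy = 0.5 * d * np.log(2 * np.pi * np.e * sigma**2)
    sq = ((mus[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
    pair_term = np.mean(np.log(np.mean(np.exp(-sq / (8 * sigma**2)), axis=1)))
    return comp_entropy - pair_term  # closed form, no sampling: hence the speedup
```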
[743] Machine Learning-Assisted Sustainable Remanufacturing, Reusing and Recycling for Lithium-ion Batteries
Shengyu Tao
Main category: cs.LG
TL;DR: A machine learning framework for sustainable lithium-ion battery management addressing data scarcity through physics-informed quality control, generative learning for residual value assessment, federated learning for privacy-preserving cathode sorting, and unified diagnostics/prognostics.
Details
Motivation: Data scarcity and heterogeneity are major barriers to sustainable battery utilization across remanufacturing, reusing, and recycling, which is crucial for global energy transition and carbon neutrality.Method: Developed a machine learning assisted framework with: physics-informed quality control for degradation prediction, generative learning for residual value assessment, federated learning for privacy-preserving cathode material sorting, and unified diagnostics/prognostics using correlation alignment.
Result: The framework enables long-term degradation prediction from limited early-cycle data, rapid and accurate evaluation of retired batteries under random conditions, privacy-preserving high-precision cathode sorting, and enhanced adaptability across various battery management tasks.
Conclusion: The contributions advance sustainable battery management by integrating physics, data generation, privacy-preserving collaboration, and adaptive learning, offering methodological innovations to promote circular economy and global carbon neutrality.
Abstract: The sustainable utilization of lithium-ion batteries (LIBs) is crucial to the global energy transition and carbon neutrality, yet data scarcity and heterogeneity remain major barriers across remanufacturing, reusing, and recycling. This dissertation develops a machine-learning-assisted framework to address these challenges throughout the battery lifecycle. A physics-informed quality control model predicts long-term degradation from limited early-cycle data, while a generative-learning-based residual value assessment method enables rapid and accurate evaluation of retired batteries under random conditions. A federated learning strategy achieves privacy-preserving and high-precision cathode material sorting, supporting efficient recycling. Furthermore, a unified diagnostics and prognostics framework based on correlation alignment enhances adaptability across tasks such as state-of-health estimation, state-of-charge estimation, and remaining useful life prediction under varied testing protocols. Collectively, these contributions advance sustainable battery management by integrating physics, data generation, privacy-preserving collaboration, and adaptive learning, offering methodological innovations to promote the circular economy and global carbon neutrality.
[744] Diverse Subset Selection via Norm-Based Sampling and Orthogonality
Noga Bar, Raja Giryes
Main category: cs.LG
TL;DR: Proposes a simple method combining feature norms, randomization, and orthogonality to select informative and diverse samples from large unlabeled datasets for efficient annotation.
Details
Motivation: Labeling data is expensive, especially in domains like medical imaging, creating need for effective subset selection methods to identify most informative examples for annotation.Method: Uses feature norms as proxy for informativeness, combined with randomization and orthogonality (via Gram-Schmidt process) to select diverse samples that cover feature space while reducing redundancy.
Result: Extensive experiments on image and text benchmarks (CIFAR-10/100, Tiny ImageNet, ImageNet, OrganAMNIST, Yelp) show consistent improvement in subset selection performance, both standalone and integrated with existing techniques.
Conclusion: The proposed method effectively combines feature norms, randomization, and orthogonality to select diverse and informative samples, improving subset selection performance across various domains.
Abstract: Large annotated datasets are crucial for the success of deep neural networks, but labeling data can be prohibitively expensive in domains such as medical imaging. This work tackles the subset selection problem: selecting a small set of the most informative examples from a large unlabeled pool for annotation. We propose a simple and effective method that combines feature norms, randomization, and orthogonality (via the Gram-Schmidt process) to select diverse and informative samples. Feature norms serve as a proxy for informativeness, while randomization and orthogonalization reduce redundancy and encourage coverage of the feature space. Extensive experiments on image and text benchmarks, including CIFAR-10/100, Tiny ImageNet, ImageNet, OrganAMNIST, and Yelp, show that our method consistently improves subset selection performance, both as a standalone approach and when integrated with existing techniques.
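A minimal sketch of the norm-plus-orthogonality idea (an assumed reading of the method, not the authors' code): repeatedly pick a high-norm candidate from a random pool, then project its direction out of all remaining features so later picks cover new directions of the feature space.

```python
import numpy as np

def select_diverse(features, k, pool=512, seed=0):
    rng = np.random.default_rng(seed)
    X = features.astype(np.float64).copy()
    chosen = []
    for _ in range(k):
        cand = rng.choice(len(X), size=min(pool, len(X)), replace=False)
        i = cand[np.argmax(np.linalg.norm(X[cand], axis=1))]  # randomized norm pick
        chosen.append(i)
        u = X[i] / (np.linalg.norm(X[i]) + 1e-12)
        X -= np.outer(X @ u, u)  # Gram-Schmidt step: remove the chosen direction
    return chosen
```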
[745] VeriFlow: Modeling Distributions for Neural Network Verification
Faried Abu Zaid, Daniel Neider, Mustafa Yalçıner
Main category: cs.LG
TL;DR: VeriFlow architecture enables neural network verification to focus on relevant data distributions using flow-based density models with piecewise affine transformations and linear constraint handling.
Details
Motivation: Current verification methods check neural networks on unrealistic inputs. VeriFlow restricts verification to meaningful data distributions to improve practical relevance.Method: Proposes VeriFlow as a flow-based density model with piecewise affine transformations, enabling linear constraint solving and computable upper density level sets in latent space.
Result: The architecture allows probabilistically interpretable control over input typicality during verification and supports existing verifiers with linear arithmetic constraints.
Conclusion: VeriFlow provides an effective framework for distribution-aware neural network verification with fine-grained control over input relevance.
Abstract: Formal verification has emerged as a promising method to ensure the safety and reliability of neural networks. However, many relevant properties, such as fairness or global robustness, pertain to the entire input space. If one applies verification techniques naively, the neural network is checked even on inputs that do not occur in the real world and have no meaning. To tackle this shortcoming, we propose the VeriFlow architecture as a flow-based density model tailored to allow any verification approach to restrict its search to some data distribution of interest. We argue that our architecture is particularly well suited for this purpose because of two major properties. First, we show that the transformation that is defined by our model is piecewise affine. Therefore, the model allows the use of verifiers based on constraint solving with linear arithmetic. Second, upper density level sets (UDL) of the data distribution are definable via linear constraints in the latent space. As a consequence, representations of UDLs specified by a given probability are effectively computable in the latent space. This property allows for effective verification with a fine-grained, probabilistically interpretable control of how atypical the inputs subject to verification are.
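One hedged way to see the linear-definability claim, assuming for illustration a factorized Laplace latent (the paper's latent choice may differ): the latent log-density is piecewise linear, so an upper density level set is an $\ell_1$-ball, i.e., a polytope; composing with the piecewise affine transform then keeps verification within linear arithmetic.

```latex
p_Z(z) \propto e^{-\lVert z \rVert_1}
\quad\Longrightarrow\quad
\{\, z : p_Z(z) \ge \tau \,\} \;=\; \{\, z : \lVert z \rVert_1 \le r(\tau) \,\}.
```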
[746] Large Language Models versus Classical Machine Learning: Performance in COVID-19 Mortality Prediction Using High-Dimensional Tabular Data
Mohammadreza Ghaffarzadeh-Esfahani, Mahdi Ghaffarzadeh-Esfahani, Arian Salahi-Niri, Hossein Toreyhi, Zahra Atf, Amirali Mohsenzadeh-Kermani, Mahshad Sarikhani, Zohreh Tajabadi, Fatemeh Shojaeian, Mohammad Hassan Bagheri, Aydin Feyzi, Mohammadamin Tarighatpayma, Narges Gazmeh, Fateme Heydari, Hossein Afshar, Amirreza Allahgholipour, Farid Alimardani, Ameneh Salehi, Naghmeh Asadimanesh, Mohammad Amin Khalafi, Hadis Shabanipour, Ali Moradi, Sajjad Hossein Zadeh, Omid Yazdani, Romina Esbati, Moozhan Maleki, Danial Samiei Nasr, Amirali Soheili, Hossein Majlesi, Saba Shahsavan, Alireza Soheilipour, Nooshin Goudarzi, Erfan Taherifard, Hamidreza Hatamabadi, Jamil S Samaan, Thomas Savage, Ankit Sakhuja, Ali Soroush, Girish Nadkarni, Ilad Alavi Darazam, Mohamad Amin Pourhoseingholi, Seyed Amir Ahmad Safavi-Naini
Main category: cs.LG
TL;DR: This study compares classical machine learning models (CMLs) and large language models (LLMs) for COVID-19 mortality prediction using tabular data from 9,134 patients. XGBoost and random forest outperformed LLMs, with fine-tuning significantly improving LLM performance but not matching CML effectiveness.
Details
Motivation: To evaluate and compare the performance of classical feature-based machine learning models versus large language models in predicting COVID-19 mortality from high-dimensional tabular medical data.Method: Compared 7 CML models (including XGBoost and random forest) with 8 LLMs (including GPT-4 and Mistral-7b) using data from 9,134 patients across four hospitals. LLMs performed zero-shot classification on text-converted structured data, and Mistral-7b was fine-tuned using QLoRA approach.
Result: XGBoost and RF achieved F1 scores of 0.87 and 0.83 for internal/external validation. GPT-4 led LLMs with F1 score of 0.43. Fine-tuning Mistral-7b improved recall from 1% to 79% and achieved F1 score of 0.74 in external validation. CMLs consistently outperformed LLMs.
Conclusion: While fine-tuning significantly enhances LLM performance for medical prediction tasks, classical machine learning models remain superior for handling high-dimensional tabular data. Both approaches show potential in medical predictive modeling, with CMLs currently better suited for structured data analysis.
Abstract: This study compared the performance of classical feature-based machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19 mortality using high-dimensional tabular data from 9,134 patients across four hospitals. Seven CML models, including XGBoost and random forest (RF), were evaluated alongside eight LLMs, such as GPT-4 and Mistral-7b, which performed zero-shot classification on text-converted structured data. Additionally, Mistral-7b was fine-tuned using the QLoRA approach. XGBoost and RF demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks. This study highlights the potential of both CMLs and fine-tuned LLMs in medical predictive modeling, while emphasizing the current superiority of CMLs for structured data analysis.
[747] Closed-Form Interpretation of Neural Network Latent Spaces with Symbolic Gradients
Sebastian J. Wetzel, Zakaria Patel
Main category: cs.LG
TL;DR: A framework for extracting human-readable closed-form equations from neural network latent spaces by matching symbolic expressions’ gradients with neuron gradients.
Details
Motivation: To systematically extract meaningful mathematical concepts from neural network latent spaces without prior knowledge, since concepts in quantitative disciplines are typically formulated as equations.Method: Embed neural networks into equivalence classes of functions encoding the same concept, then find intersections with human-readable symbolic expressions by matching normalized gradients of symbolic expressions with neuron gradients.
Result: Successfully retrieved matrix invariants and conserved quantities of dynamical systems from Siamese neural network latent spaces.
Conclusion: The framework effectively extracts interpretable mathematical concepts from neural network representations through symbolic gradient matching.
Abstract: It has been demonstrated that artificial neural networks like autoencoders or Siamese networks encode meaningful concepts in their latent spaces. However, there does not exist a comprehensive framework for retrieving this information in a human-readable form without prior knowledge. In quantitative disciplines concepts are typically formulated as equations. Hence, in order to extract these concepts, we introduce a framework for finding closed-form interpretations of neurons in latent spaces of artificial neural networks. The interpretation framework is based on embedding trained neural networks into an equivalence class of functions that encode the same concept. We interpret these neural networks by finding an intersection between the equivalence class and human-readable equations defined by a symbolic search space. Computationally, this framework is based on finding a symbolic expression whose normalized gradients match the normalized gradients of a specific neuron with respect to the input variables. The effectiveness of our approach is demonstrated by retrieving invariants of matrices and conserved quantities of dynamical systems from latent spaces of Siamese neural networks.
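A minimal sketch of the gradient-matching test (the symbolic candidate and interfaces are hypothetical): a symbolic expression interprets a latent neuron if their normalized input-gradients align, up to sign, across sample points.

```python
import numpy as np
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
candidate = x1 * x2  # hypothetical expression drawn from the symbolic search space
grad_fns = [sp.lambdify((x1, x2), sp.diff(candidate, v)) for v in (x1, x2)]

def gradient_match(neuron_grad_fn, points):
    # neuron_grad_fn(p) -> gradient of the neuron output w.r.t. inputs at p
    scores = []
    for p in points:
        g_sym = np.array([f(*p) for f in grad_fns], dtype=float)
        g_net = np.asarray(neuron_grad_fn(p), dtype=float)
        g_sym /= np.linalg.norm(g_sym) + 1e-12
        g_net /= np.linalg.norm(g_net) + 1e-12
        scores.append(abs(g_sym @ g_net))  # sign-invariant cosine similarity
    return float(np.mean(scores))
```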
[748] DOTA: Distributional Test-Time Adaptation of Vision-Language Models
Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, Changqing Zhang
Main category: cs.LG
TL;DR: DOTA is a distribution-based test-time adaptation method that continuously estimates test data distributions to mitigate catastrophic forgetting in vision-language models, achieving state-of-the-art performance.
Details
Motivation: Vision-language models suffer from unreliable deployment due to distribution gaps between training and test data, while fine-tuning is costly. Existing cache-based adapters have limited capacity and suffer from catastrophic forgetting when samples are dropped.Method: DOTA continuously estimates the underlying distribution of test data streams and computes test-time posterior probabilities using these dynamically estimated distributions via Bayes’ theorem, enabling continual adaptation to deployment environments.
Result: Extensive experiments show DOTA significantly mitigates forgetting and achieves state-of-the-art performance compared to existing methods.
Conclusion: The distribution-centric approach of DOTA effectively addresses catastrophic forgetting in test-time adaptation for vision-language models, providing a simple yet powerful solution for reliable deployment.
Abstract: Vision-language foundation models (VLMs), such as CLIP, exhibit remarkable performance across a wide range of tasks. However, deploying these models can be unreliable when significant distribution gaps exist between training and test data, while fine-tuning for diverse scenarios is often costly. Cache-based test-time adapters offer an efficient alternative by storing representative test samples to guide subsequent classifications. Yet, these methods typically employ naive cache management with limited capacity, leading to severe catastrophic forgetting when samples are inevitably dropped during updates. In this paper, we propose DOTA (DistributiOnal Test-time Adaptation), a simple yet effective method addressing this limitation. Crucially, instead of merely memorizing individual test samples, DOTA continuously estimates the underlying distribution of the test data stream. Test-time posterior probabilities are then computed using these dynamically estimated distributions via Bayes’ theorem for adaptation. This distribution-centric approach enables the model to continually learn and adapt to the deployment environment. Extensive experiments validate that DOTA significantly mitigates forgetting and achieves state-of-the-art performance compared to existing methods.
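A minimal sketch of distribution-tracking adaptation (an assumed simplification of DOTA: isotropic Gaussians with running means; the paper's estimator is richer): instead of caching individual samples, maintain per-class statistics of the test stream and classify via Bayes' theorem.

```python
import numpy as np

class RunningGaussianBayes:
    def __init__(self, num_classes, dim, sigma=1.0):
        self.mu = np.zeros((num_classes, dim))
        self.count = np.zeros(num_classes)
        self.sigma = sigma

    def update(self, feat, pseudo_label):
        c = pseudo_label
        self.count[c] += 1
        self.mu[c] += (feat - self.mu[c]) / self.count[c]  # running class mean

    def posterior(self, feat):
        log_lik = -((feat - self.mu) ** 2).sum(-1) / (2 * self.sigma**2)
        prior = (self.count + 1) / (self.count.sum() + len(self.count))
        logits = log_lik + np.log(prior)                   # Bayes' theorem
        p = np.exp(logits - logits.max())
        return p / p.sum()
```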
[749] Degree-Conscious Spiking Graph for Cross-Domain Adaptation
Yingxu Wang, Mengzhu Wang, Houcheng Su, Nan Yin, Quanming Yao, James Kwok
Main category: cs.LG
TL;DR: DeSGraDA is a novel framework for cross-domain adaptation in Spiking Graph Networks that addresses distribution shifts through degree-conscious spiking representation, temporal distribution alignment, and consistent pseudo-label generation.
Details
Motivation: Existing Spiking Graph Networks are constrained to in-distribution scenarios and struggle with distribution shifts, limiting their practical application in real-world settings where data distributions may vary.Method: Three key components: 1) Degree-conscious spiking representation with adaptive spike thresholds based on node degrees, 2) Temporal distribution alignment via adversarial matching of membrane potentials between domains, 3) Consistent prediction extraction for reliable pseudo-label generation using unlabeled data.
Result: Extensive experiments show DeSGraDA consistently outperforms state-of-the-art methods in both classification accuracy and energy efficiency on benchmark datasets.
Conclusion: The framework successfully addresses domain adaptation in Spiking Graph Networks, provides theoretical generalization bounds, and demonstrates superior performance while maintaining energy efficiency.
Abstract: Spiking Graph Networks (SGNs) have demonstrated significant potential in graph classification by emulating brain-inspired neural dynamics to achieve energy-efficient computation. However, existing SGNs are generally constrained to in-distribution scenarios and struggle with distribution shifts. In this paper, we first formulate the domain adaptation problem in SGNs, and introduce a novel framework named Degree-Conscious Spiking Graph for Cross-Domain Adaptation (DeSGraDA). DeSGraDA enhances generalization across domains with three key components. First, we introduce the degree-conscious spiking representation module by adapting spike thresholds based on node degrees, enabling more expressive and structure-aware signal encoding. Then, we perform temporal distribution alignment by adversarially matching membrane potentials between domains, ensuring effective performance under domain shift while preserving energy efficiency. Additionally, we extract consistent predictions across two spaces to create reliable pseudo-labels, effectively leveraging unlabeled data to enhance graph classification performance. Furthermore, we establish the first generalization bound for SGDA, providing theoretical insights into its adaptation performance. Extensive experiments on benchmark datasets validate that DeSGraDA consistently outperforms state-of-the-art methods in both classification accuracy and energy efficiency.
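A one-line sketch of degree-conscious thresholds (the scaling rule is an assumption for illustration): high-degree nodes aggregate more input current, so their spike thresholds are raised to keep the encoding expressive without saturation.

```python
import torch

def degree_conscious_thresholds(deg, v_base=1.0, alpha=0.5):
    # Hypothetical rule: spike threshold grows with each node's log-degree.
    return v_base * (1.0 + alpha * torch.log1p(deg.float()))
```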
[750] Capacity-Aware Planning and Scheduling in Budget-Constrained Multi-Agent MDPs: A Meta-RL Approach
Manav Vora, Ilan Shomorony, Melkior Ornik
Main category: cs.LG
TL;DR: The paper proposes a two-stage approach for capacity- and budget-constrained multi-agent MDPs that partitions agents into diverse groups and uses meta-trained PPO policies to solve sub-MDPs efficiently.
Details
Motivation: To address combinatorial complexity in maintenance and scheduling tasks where agents can irreversibly fail, and planners must decide when to apply restorative actions and which subsets to treat simultaneously under global budget and capacity constraints.Method: A two-stage solution: (1) LSAP-based grouping partitions agents into r disjoint sets maximizing diversity in expected time-to-failure with proportional budget allocation, (2) meta-trained PPO policy solves each sub-MDP with transfer learning across groups.
Result: The method outperforms baselines in maximizing average uptime for industrial robot teams, especially for large team sizes, and demonstrates scalability through complexity analysis.
Conclusion: The proposed approach provides a tractable solution for large-scale CB-MA-MDPs by combining strategic grouping with meta-reinforcement learning, effectively handling capacity and budget constraints while maintaining computational efficiency.
Abstract: We study capacity- and budget-constrained multi-agent MDPs (CB-MA-MDPs), a class that captures many maintenance and scheduling tasks in which each agent can irreversibly fail and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning naïve dynamic programming into a combinatorial search that scales exponentially with the number of agents. We propose a two-stage solution that remains tractable for large systems. First, a Linear Sum Assignment Problem (LSAP)-based grouping partitions the agents into r disjoint sets (r = capacity) that maximise diversity in expected time-to-failure, allocating budget to each set proportionally. Second, a meta-trained PPO policy solves each sub-MDP, leveraging transfer across groups to converge rapidly. To validate our approach, we apply it to the problem of scheduling repairs for a large team of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot team, particularly for large team sizes. Lastly, we confirm the scalability of our approach through a computational complexity analysis across varying numbers of robots and repair technicians.
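A simple stand-in for the diversity-maximizing grouping (a round-robin heuristic toward the same objective, plainly not the paper's LSAP formulation): sort agents by expected time-to-failure and deal them out, so each of the r groups spans the full range.

```python
import numpy as np

def diversity_groups(ttf, r):
    # ttf: expected time-to-failure per agent; returns r diverse groups
    order = np.argsort(ttf)
    return [order[g::r].tolist() for g in range(r)]
```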
[751] Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance
Dongmin Park, Sebin Kim, Taehong Moon, Minkyu Kim, Kangwook Lee, Jaewoong Cho
Main category: cs.LG
TL;DR: R2F is a training-free approach that enhances text-to-image diffusion models’ ability to generate rare concept compositions by using LLM guidance to expose relevant frequent concepts during diffusion sampling.
Details
Motivation: State-of-the-art text-to-image diffusion models struggle with generating rare compositions of concepts (e.g., objects with unusual attributes), limiting their compositional generation capabilities.Method: Proposes R2F framework that leverages LLM guidance to plan and execute rare-to-frequent concept guidance throughout diffusion inference, exposing relevant frequent concepts during sampling without requiring training.
Result: R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1% in T2I alignment on various benchmarks including the new RareBench dataset.
Conclusion: The approach effectively enhances compositional generation power for rare concepts, is flexible across pre-trained models, and can integrate with region-guided diffusion methods.
Abstract: State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with region-guided diffusion approaches. In extensive experiments on three datasets, including our newly proposed benchmark RareBench, which contains various prompts with rare compositions of concepts, R2F significantly surpasses existing models, including SD3.0 and FLUX, by up to 28.1%p in T2I alignment. Code is available at https://github.com/krafton-ai/Rare-to-Frequent.
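A hypothetical sketch of the rare-to-frequent schedule (`model.denoise_step` and the switch point are assumed interfaces, not a real diffusers API): condition the early, structure-defining denoising steps on an LLM-suggested frequent concept, then switch to the rare target prompt.

```python
def rare_to_frequent_sample(model, rare_prompt, frequent_prompt,
                            steps=50, switch=0.3):
    x = model.sample_noise()
    for t in range(steps):
        # early steps use the frequent surrogate concept, later steps the rare one
        prompt = frequent_prompt if t < switch * steps else rare_prompt
        x = model.denoise_step(x, t, prompt)
    return x
```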
[752] Adaptive Policy Learning to Additional Tasks
Wenjian Hao, Zehui Lu, Zihao Liang, Tianyu Zhou, Shaoshuai Mou
Main category: cs.LG
TL;DR: APG method adapts pre-trained policies to new tasks without affecting original performance, combining Bellman’s principle with policy gradients for faster convergence.
Details
Motivation: To enable efficient adaptation of pre-trained policies to additional tasks while maintaining original task performance, addressing the need for sample-efficient policy tuning.Method: Proposed Adaptive Policy Gradient (APG) that integrates Bellman’s principle of optimality with policy gradient approach to enhance convergence speed.
Result: Theoretical guarantees show O(1/T) convergence rate and O(1/ε) sample complexity. Experiments on cartpole, lunar lander, and robot arm demonstrate comparable performance to existing methods with less data and faster convergence.
Conclusion: APG provides an effective method for policy adaptation with strong theoretical guarantees and practical efficiency in sample usage and convergence speed.
Abstract: This paper develops a policy learning method for tuning a pre-trained policy to adapt to additional tasks without altering the original task. A method named Adaptive Policy Gradient (APG) is proposed in this paper, which combines Bellman’s principle of optimality with the policy gradient approach to improve the convergence rate. This paper provides a theoretical analysis that guarantees a convergence rate and sample complexity of $\mathcal{O}(1/T)$ and $\mathcal{O}(1/\epsilon)$, respectively, where $T$ denotes the number of iterations and $\epsilon$ denotes the accuracy of the resulting stationary policy. Furthermore, several challenging numerical simulations, including cartpole, lunar lander, and robot arm, are provided to show that APG obtains similar performance compared to existing deterministic policy gradient methods while utilizing much less data and converging at a faster rate.
[753] How Strategic Agents Respond: Comparing Analytical Models with LLM-Generated Responses in Strategic Classification
Tian Xie, Pavan Rauch, Xueru Zhang
Main category: cs.LG
TL;DR: LLMs can generate effective and socially responsible strategies in Strategic Classification settings, performing similarly or better than theoretical models at population level while producing more diverse individual strategies.
Details
Motivation: To investigate if LLMs can generate effective strategies in Strategic Classification settings and whether existing theoretical models accurately capture agent behavior when agents follow LLM-generated advice.Method: Simulated agents with diverse profiles interacting with three commercial LLMs (GPT-4o, GPT-4.1, GPT-5) in five SC scenarios: hiring, loan applications, school admissions, personal income, and public assistance programs.
Result: LLMs generated effective strategies that improved both agents’ scores and qualifications without policy access. At population level, LLM-guided strategies yielded similar or higher score improvements, qualification rates, and fairness metrics than theoretical models.
Conclusion: Theoretical SC models may serve as reasonable proxies for LLM-influenced behavior, and LLMs produce more diverse individual strategies than theoretical models.
Abstract: When ML algorithms are deployed to automate human-related decisions, human agents may learn the underlying decision policies and adapt their behavior. Strategic Classification (SC) has emerged as a framework for studying this interaction between agents and decision-makers to design more trustworthy ML systems. Prior theoretical models in SC assume that agents are perfectly or approximately rational and respond to decision policies by optimizing their utility. However, the growing prevalence of LLMs raises the possibility that real-world agents may instead rely on these tools for strategic advice. This shift prompts two questions: (i) Can LLMs generate effective and socially responsible strategies in SC settings? (ii) Can existing SC theoretical models accurately capture agent behavior when agents follow LLM-generated advice? To investigate these questions, we examine five critical SC scenarios: hiring, loan applications, school admissions, personal income, and public assistance programs. We simulate agents with diverse profiles who interact with three commercial LLMs (GPT-4o, GPT-4.1, and GPT-5), following their suggestions on effort allocation across features. We compare the resulting agent behaviors with the best responses in existing SC models. Our findings show that: (i) Even without access to the decision policy, LLMs can generate effective strategies that improve both agents’ scores and qualification; (ii) At the population level, LLM-guided effort allocation strategies yield score improvements, qualification rates, and fairness metrics similar to or even higher than those predicted by the SC theoretical model, suggesting that the theoretical model may still serve as a reasonable proxy for LLM-influenced behavior; and (iii) At the individual level, LLMs tend to produce more diverse and balanced effort allocations than theoretical models.
[754] Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$
Ole-Christian Galbo Engstrøm, Martin Holm Jensen
Main category: cs.LG
TL;DR: Algorithms that accelerate partition-based cross-validation for machine learning models requiring X^TX and X^TY computations, supporting all combinations of column-wise centering and scaling while preventing data leakage.
Details
Motivation: To speed up cross-validation for model selection in PCA, PCR, ridge regression, OLS, and PLS by eliminating redundant computations between training partitions and avoiding data leakage from preprocessing.Method: Manipulate X^TX and X^TY using only validation samples to obtain preprocessed training partition-wise matrices, eliminating redundant computations in partition overlaps while maintaining preprocessing integrity.
Result: Algorithms achieve same time complexity as computing X^TX and X^TY, independent of number of folds, with manageable constant overhead for preprocessing. Space complexity equivalent to storing X, Y, X^TX, and X^TY.
Conclusion: First correct and efficient cross-validation algorithms for all 16 combinations of column-wise centering/scaling (12 distinct matrix products), preventing data leakage while maintaining computational efficiency.
Abstract: We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require matrix products $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. Our algorithms have applications in model selection for, for example, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). Our algorithms support all combinations of column-wise centering and scaling of $\mathbf{X}$ and $\mathbf{Y}$, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time complexity as that of computing $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ and space complexity equivalent to storing $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{X}^\mathbf{T}\mathbf{X}$, and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. Importantly, unlike alternatives found in the literature, we avoid data leakage due to preprocessing. We achieve these results by eliminating redundant computations in the overlap between training partitions. Concretely, we show how to manipulate $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ using only samples from the validation partition to obtain the preprocessed training partition-wise $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. To our knowledge, we are the first to derive correct and efficient cross-validation algorithms for any of the $16$ combinations of column-wise centering and scaling, for which we also prove only $12$ give distinct matrix products.
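The core downdating trick, in two lines (ignoring centering and scaling, which the paper handles explicitly): the training fold's matrix products follow from the precomputed full-data products minus the validation samples' contribution.

```python
import numpy as np

def training_fold_products(XtX, XtY, X_val, Y_val):
    # Subtract only the validation partition's contribution; the cost is
    # independent of the number of folds, as the paper proves.
    return XtX - X_val.T @ X_val, XtY - X_val.T @ Y_val
```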
[755] Avoiding $\mathbf{exp(R_{max})}$ scaling in RLHF through Preference-based Exploration
Mingyu Chen, Yiding Chen, Wen Sun, Xuezhou Zhang
Main category: cs.LG
TL;DR: SE-POPO is a new online RLHF algorithm that achieves polynomial sample complexity scaling with reward scale, solving the exponential scaling problem in existing methods.
Details
Motivation: Existing online RLHF algorithms suffer from exponential sample complexity scaling with reward function scale, which limits effectiveness in scenarios with heavily skewed preferences like questions with unique correct solutions.Method: Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO) - an online RLHF algorithm that uses self-exploration and preference incentives to improve sample efficiency.
Result: SE-POPO achieves polynomial sample complexity scaling with reward scale, theoretically dominating existing exploration algorithms and empirically outperforming both exploratory and non-exploratory baselines in RLHF application scenarios and public benchmarks.
Conclusion: SE-POPO represents a significant advancement in RLHF algorithm design by solving the exponential sample complexity problem and providing more efficient alignment for large language models.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focuses on improving sample efficiency. All existing algorithms in online RLHF, whether doing passive exploration or active exploration, suffer from a sample complexity that scales exponentially with the scale of the reward function. This fundamental limitation hinders their effectiveness in scenarios with heavily skewed preferences, e.g. questions with a unique correct solution. To address this, we introduce Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO), an online RLHF algorithm that for the first time achieves a sample complexity that scales polynomially with the reward scale, answering an open problem raised by Xie et al. (2024). Theoretically, we demonstrate that the sample complexity of SE-POPO dominates that of existing exploration algorithms. Empirically, our systematic evaluation confirms that SE-POPO is more sample-efficient than both exploratory and non-exploratory baselines, in two primary application scenarios of RLHF as well as on public benchmarks, marking a significant step forward in RLHF algorithm design. The code is available at https://github.com/MYC000801/SE-POPO.
[756] A Notion of Uniqueness for the Adversarial Bayes Classifier
Natalie S. Frank
Main category: cs.LG
TL;DR: The paper introduces a new concept of uniqueness for adversarial Bayes classifiers in binary classification, develops a method to compute all such classifiers for 1D data distributions, and shows that increasing perturbation radius improves classifier regularity.
Details
Motivation: To better understand adversarial Bayes classifiers and their relationship with standard Bayes classifiers, particularly in one-dimensional settings where analytical characterization is possible.Method: Proposed a new notion of uniqueness for adversarial Bayes classifiers and developed a computational procedure to find all such classifiers for well-motivated 1D data distributions.
Result: Characterized all adversarial Bayes classifiers for 1D distributions and demonstrated that increasing perturbation radius leads to improved regularity of these classifiers.
Conclusion: The analysis provides tools for understanding relationships between Bayes and adversarial Bayes classifiers in one dimension, showing that adversarial training can lead to more regular decision boundaries.
Abstract: We propose a new notion of uniqueness for the adversarial Bayes classifier in the setting of binary classification. Analyzing this concept produces a simple procedure for computing all adversarial Bayes classifiers for a well-motivated family of one dimensional data distributions. This characterization is then leveraged to show that as the perturbation radius increases, certain notions of regularity for the adversarial Bayes classifiers improve. Furthermore, these results provide tools for understanding relationships between the Bayes and adversarial Bayes classifiers in one dimension.
[757] Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Main category: cs.LG
TL;DR: PRIME enables online process reward model updates using only policy rollouts and outcome labels through implicit process rewards, eliminating the need for dedicated reward model training.
Details
Motivation: Dense process rewards are more effective than sparse outcome rewards for LLM reasoning tasks, but training process reward models online is challenging due to expensive label collection and vulnerability to reward hacking.Method: PRIME combines implicit process rewards with various advantage functions, using only policy rollouts and outcome labels without requiring dedicated reward model training.
Result: PRIME achieves 15.1% average improvement on reasoning benchmarks over SFT model, and Eurus-2-7B-PRIME surpasses Qwen2.5-Math-7B-Instruct on seven benchmarks with only 10% of training data.
Conclusion: PRIME effectively enables online process reward model updates, substantially reducing development overhead while achieving significant performance improvements in complex reasoning tasks.
Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME’s effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
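A minimal sketch of an implicit process reward (following the log-ratio parameterization this line of work builds on; `beta` and shapes are assumptions): a reward model trained only on outcome labels, parameterized as a scaled log-ratio against a reference model, yields per-token process rewards for free.

```python
import torch

def implicit_process_rewards(logp_model, logp_ref, beta=0.05):
    # logp_*: per-token log-probabilities of the sampled response,
    # shape [batch, seq_len]; returns token-level process rewards.
    return beta * (logp_model - logp_ref)
```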
[758] A Critical Look At Tokenwise Reward-Guided Text Generation
Ahmad Rashid, Ruotian Wu, Julia Grosse, Agustinus Kristiadi, Pascal Poupart
Main category: cs.LG
TL;DR: Proposes training Bradley-Terry reward models on partial sequences for tokenwise reward-guided text generation, outperforming previous methods without requiring expensive LLM fine-tuning.
Details
Motivation: RLHF fine-tuning is too costly for many users, and existing prediction-time tokenwise reward-guided generation methods are heuristically motivated and poorly analyzed, with reward models trained on full sequences being incompatible with partial sequence scoring.Method: Train a Bradley-Terry reward model on partial sequences explicitly, then autoregressively sample from the implied tokenwise policy during decoding. This policy is shown to be proportional to the ratio of two distinct RLHF policies.
Result: The proposed approach outperforms previous RGTG methods and performs similarly to strong offline baselines without requiring large-scale LLM fine-tuning.
Conclusion: Training reward models specifically on partial sequences enables effective tokenwise reward-guided generation that matches strong baselines while avoiding the prohibitive costs of LLM fine-tuning.
Abstract: Large language models (LLMs) can be improved by aligning with human preferences through fine-tuning – the so-called reinforcement learning from human feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for many users. Due to their ability to bypass LLM fine-tuning, prediction-time tokenwise reward-guided text generation (RGTG) methods have recently been proposed. They use a reward model trained on full sequences to score partial sequences during decoding in a bid to steer the generation towards sequences with high rewards. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this, we propose to train a Bradley-Terry reward model on partial sequences explicitly, and autoregressively sample from the implied tokenwise policy during decoding. We study the properties of this reward model and the resulting policy: we show that this policy is proportional to the ratio of two distinct RLHF policies. Our simple approach outperforms previous RGTG methods and performs similarly to strong offline baselines without large-scale LLM fine-tuning. Code for our work is available at https://github.com/ahmadrash/PARGS
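A minimal sketch of the partial-sequence training objective (reward-model details assumed): apply the Bradley-Terry loss to reward-model scores of preferred vs. dispreferred prefixes, so the model is explicitly trained on the partial sequences it will score during decoding.

```python
import torch.nn.functional as F

def partial_bt_loss(r_preferred, r_dispreferred):
    # r_*: reward-model scores for matched prefix pairs, shape [batch]
    return -F.logsigmoid(r_preferred - r_dispreferred).mean()
```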
[759] Tokenizing Single-Channel EEG with Time-Frequency Motif Learning
Jathurshan Pradeepkumar, Xihao Piao, Zheng Chen, Jimeng Sun
Main category: cs.LG
TL;DR: TFM-Tokenizer is a novel EEG tokenization framework that learns time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens, achieving improved accuracy, generalization, and scalability across diverse EEG benchmarks.
Details
Motivation: Foundation models are reshaping EEG analysis, but EEG tokenization remains a challenge. The paper aims to address this by creating a robust tokenization method that can work with various foundation models and handle diverse EEG data formats.Method: Proposes a dual-path architecture with time-frequency masking to capture robust motif representations. The framework is model-agnostic and operates at the single-channel level, supporting both lightweight transformers and existing foundation models for downstream tasks.
Result: Experiments on four EEG benchmarks show up to 17% improvement in Cohen’s Kappa over baselines. The tokenizer consistently boosts performance of foundation models like BIOT and LaBraM, and achieves 14% improvement on ear-EEG sleep staging despite differences in signal format, channel configuration, and recording device.
Conclusion: TFM-Tokenizer provides strong class-discriminative, frequency-aware, and consistent token structures that enable improved representation quality and interpretability in EEG analysis, with potential for device-agnostic applications.
Abstract: Foundation models are reshaping EEG analysis, yet the important problem of EEG tokenization remains a challenge. This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. We propose a dual-path architecture with time-frequency masking to capture robust motif representations; the tokenizer is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks. Our study demonstrates three key benefits: Accuracy: Experiments on four diverse EEG benchmarks demonstrate consistent performance gains across both single- and multi-dataset pretraining settings, achieving up to 17% improvement in Cohen’s Kappa over strong baselines. Generalization: Moreover, as a plug-and-play component, it consistently boosts the performance of diverse foundation models, including BIOT and LaBraM. Scalability: By operating at the single-channel level rather than relying on the strict 10-20 EEG system, our method has the potential to be device-agnostic. Experiments on ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, show that our tokenizer outperforms baselines by 14%. A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent structure, enabling improved representation quality and interpretability. Code is available at https://github.com/Jathurshan0330/TFM-Tokenizer.
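To make the tokenization idea concrete, here is a toy single-channel pipeline: short-time Fourier frames mapped to their nearest entry in a motif codebook. This replaces the paper’s learned dual-path encoder with plain STFT magnitudes, and assumes `codebook` has been learned elsewhere:

```python
import numpy as np
from scipy.signal import stft

def tokenize_eeg(signal, codebook, fs=256, nperseg=64):
    """Toy tokenizer: one discrete token per STFT frame of a single EEG channel.

    codebook: (n_tokens, nperseg // 2 + 1) array of time-frequency prototypes,
              a stand-in for TFM-Tokenizer's learned motif vocabulary.
    """
    _, _, Z = stft(signal, fs=fs, nperseg=nperseg)
    frames = np.abs(Z).T                                   # (n_frames, n_freq_bins)
    d = ((frames[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.argmin(axis=1)                                # token id per frame

# e.g.: tokens = tokenize_eeg(np.random.randn(2560), np.random.randn(128, 33))
```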
[760] Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models
Vinith M. Suriyakumar, Rohan Alur, Ayush Sekhari, Manish Raghavan, Ashia C. Wilson
Main category: cs.LG
TL;DR: Fine-tuning text-to-image diffusion models after unlearning can cause previously removed concepts to “resurge,” revealing a critical vulnerability in incremental model updates.
Details
Motivation: To investigate the vulnerability where fine-tuning diffusion models on unrelated images causes previously unlearned concepts to reappear, highlighting safety concerns in incremental model updates.
Method: Performed experiments composing concept unlearning with subsequent fine-tuning on Stable Diffusion v1.4 and v2.1 under benign, non-adversarial conditions.
Result: Demonstrated that fine-tuning can cause “concept resurgence” - previously unlearned concepts reappear even when fine-tuning on unrelated content.
Conclusion: This reveals serious fragility in current approaches to model safety and alignment, raising concerns about the reliability of incremental updates to text-to-image diffusion models.
Abstract: Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with “unlearning” steps (to “forget” existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to “relearn” concepts that were previously “unlearned.” We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, by performing a series of experiments which compose “concept unlearning” with subsequent fine-tuning of Stable Diffusion v1.4 and Stable Diffusion v2.1. Our findings underscore the fragility of composing incremental model updates, and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.
[761] Measurability in the Fundamental Theorem of Statistical Learning
Lothar Sebastian Krapp, Laura Wirth
Main category: cs.LG
TL;DR: This paper provides a rigorous measure-theoretic analysis of the Fundamental Theorem of Statistical Learning, explicitly identifying minimal measurability requirements for agnostic PAC learning and applying these results to hypothesis spaces in o-minimal structures.
Details
Motivation: Previous proofs of the Fundamental Theorem of Statistical Learning in agnostic PAC learning often rely on implicit measurability assumptions. The authors aim to make these assumptions explicit and provide a rigorous foundation for the theorem.
Method: The authors conduct a detailed measure-theoretic analysis of existing proofs, extracting explicit measurability requirements and providing a self-contained proof of the Fundamental Theorem with minimal assumptions.
Result: The paper presents a sound statement and rigorous proof of the Fundamental Theorem of Statistical Learning in the agnostic setting, identifying the minimal measurability conditions needed. It also establishes sufficient conditions for PAC learnability of hypothesis spaces over o-minimal expansions of the reals.
Conclusion: Careful measure-theoretic analysis is essential for the Fundamental Theorem of Statistical Learning, especially in settings where measure-theoretic subtleties matter. The results have foundational importance and apply to neural networks with common activation functions.
Abstract: The Fundamental Theorem of Statistical Learning states that a hypothesis space is PAC learnable if and only if its VC dimension is finite. For the agnostic model of PAC learning, the literature so far presents proofs of this theorem that often tacitly impose several measurability assumptions on the involved sets and functions. We scrutinize these proofs from a measure-theoretic perspective in order to explicitly extract the assumptions needed for a rigorous argument. This leads to a sound statement as well as a detailed and self-contained proof of the Fundamental Theorem of Statistical Learning in the agnostic setting, showcasing the minimal measurability requirements needed. As the Fundamental Theorem of Statistical Learning underpins a wide range of further theoretical developments, our results are of foundational importance: A careful analysis of measurability aspects is essential, especially when the theorem is used in settings where measure-theoretic subtleties play a role. We particularly discuss applications in Model Theory, considering NIP and o-minimal structures. Our main theorem presents sufficient conditions for the PAC learnability of hypothesis spaces defined over o-minimal expansions of the reals. This class of hypothesis spaces covers all artificial neural networks for binary classification that use commonly employed activation functions like ReLU and the sigmoid function.
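For reference, the standard statement the paper scrutinizes, in the agnostic case; the measurability hypotheses on the hypothesis space are tacit in this textbook form, which is exactly the gap the authors make explicit:

```latex
% Fundamental Theorem of Statistical Learning (agnostic PAC setting).
% Measurability assumptions on \mathcal{H} are left implicit here;
% the paper pins down the minimal ones required.
\[
  \mathcal{H} \subseteq \{0,1\}^{\mathcal{X}} \text{ is agnostic PAC learnable}
  \iff \operatorname{VCdim}(\mathcal{H}) < \infty,
\]
\[
  m_{\mathcal{H}}(\varepsilon, \delta)
  = \Theta\!\left(\frac{\operatorname{VCdim}(\mathcal{H}) + \log(1/\delta)}{\varepsilon^{2}}\right).
\]
```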
[762] Machine Unlearning for Speaker-Agnostic Detection of Gender-Based Violence Condition in Speech
Emma Reyner-Fuentes, Esther Rituerto-Gonzalez, Carmen Pelaez-Moreno
Main category: cs.LG
TL;DR: This paper introduces a speaker-agnostic AI approach using domain-adversarial training to detect gender-based violence victims from speech, reducing speaker bias while improving classification accuracy.
Details
Motivation: Gender-based violence severely impacts women's mental health, but current speech-based AI tools often fail with unseen speakers due to speaker trait confounding. There's a need for robust models that can generalize across speakers and focus on relevant paralinguistic biomarkers.
Method: Used domain-adversarial training to reduce the influence of speaker identity on model predictions, making the AI models speaker-agnostic while maintaining their ability to detect gender-based violence victim conditions from speech.
Result: Achieved 26.95% relative reduction in speaker identification accuracy while improving gender-based violence victim condition classification accuracy by 6.37% (relative). Model predictions showed moderate correlation with pre-clinical PTSD symptoms.
Conclusion: The speaker-agnostic approach effectively captures paralinguistic biomarkers linked to gender-based violence victim condition rather than speaker-specific traits, laying foundation for ethical, privacy-preserving AI systems for clinical screening of gender-based violence survivors.
Abstract: Gender-based violence is a pervasive public health issue that severely impacts women’s mental health, often leading to conditions such as anxiety, depression, post-traumatic stress disorder, and substance abuse. Identifying the combination of these various mental health conditions could then point to someone who is a victim of gender-based violence. While speech-based artificial intelligence tools are a promising solution for mental health screening, their performance often deteriorates when encountering speech from previously unseen speakers, a sign that speaker traits may be confounding factors. This study introduces a speaker-agnostic approach to detecting the gender-based violence victim condition from speech, aiming to develop robust artificial intelligence models capable of generalizing across speakers. By employing domain-adversarial training to reduce the influence of speaker identity on model predictions, we achieve a 26.95% relative reduction in speaker identification accuracy while improving gender-based violence victim condition classification accuracy by 6.37% (relative). These results suggest that our models effectively capture paralinguistic biomarkers linked to the gender-based violence victim condition, rather than speaker-specific traits. Additionally, the model’s predictions show moderate correlation with pre-clinical post-traumatic stress disorder symptoms, supporting the relevance of speech as a non-invasive tool for mental health monitoring. This work lays the foundation for ethical, privacy-preserving artificial intelligence systems to support clinical screening of gender-based violence survivors.
[763] Can Diffusion Models Disentangle? A Theoretical Perspective
Liming Wang, Muhammad Jehanzeb Mirza, Yishu Gong, Yuan Gong, Jiaqi Zhang, Brian H. Tracey, Katerina Placek, Marco Vilela, James R. Glass
Main category: cs.LG
TL;DR: This paper develops a theoretical framework for understanding how diffusion models learn disentangled representations, establishes identifiability conditions, analyzes training dynamics, and provides sample complexity bounds.
Details
Motivation: To provide a theoretical foundation for understanding disentangled representation learning in diffusion models, addressing the lack of formal theoretical understanding in this area.
Method: Developed a novel theoretical framework for disentangled latent variable models, established identifiability conditions, analyzed training dynamics, derived sample complexity bounds, and validated through experiments on various tasks including subspace recovery, image colorization, denoising, and voice conversion.
Result: The framework successfully explains how diffusion models learn disentangled representations, and experiments validate the theory across diverse tasks and modalities. Training strategies like style guidance regularization consistently improve disentanglement performance.
Conclusion: The paper provides a comprehensive theoretical foundation for disentangled representation learning in diffusion models, with practical validation showing that theory-inspired training strategies enhance disentanglement performance across multiple applications.
Abstract: This paper presents a novel theoretical framework for understanding how diffusion models can learn disentangled representations. Within this framework, we establish identifiability conditions for general disentangled latent variable models, analyze training dynamics, and derive sample complexity bounds for disentangled latent subspace models. To validate our theory, we conduct disentanglement experiments across diverse tasks and modalities, including subspace recovery in latent subspace Gaussian mixture models, image colorization, image denoising, and voice conversion for speech classification. Additionally, our experiments show that training strategies inspired by our theory, such as style guidance regularization, consistently enhance disentanglement performance.
[764] Efficient Prior Selection in Gaussian Process Bandits with Thompson Sampling
Jack Sandberg, Morteza Haghir Chehreghani
Main category: cs.LG
TL;DR: Proposes two algorithms (PE-GP-TS and HP-GP-TS) for joint prior selection and regret minimization in Gaussian process bandits, addressing the practical issue of unknown hyperparameters.
Details
Motivation: Most GP bandit work assumes known priors, but in practice hyperparameters are unknown and typically selected via maximum likelihood estimation without theoretical guarantees.
Method: Developed two algorithms based on GP Thompson sampling: Prior-Elimination GP-TS (PE-GP-TS) and HyperPrior GP-TS (HP-GP-TS) for joint prior selection and optimization.
Result: Established theoretical upper bounds for regret and demonstrated effectiveness through experiments with synthetic and real-world data.
Conclusion: The proposed algorithms provide theoretically grounded solutions for GP bandits with unknown priors, outperforming existing alternatives.
Abstract: Gaussian process (GP) bandits provide a powerful framework for performing blackbox optimization of unknown functions. The characteristics of the unknown function depend heavily on the assumed GP prior. Most work in the literature assumes that this prior is known, but in practice this seldom holds. Instead, practitioners often rely on maximum likelihood estimation to select the hyperparameters of the prior, which lacks theoretical guarantees. In this work, we propose two algorithms for joint prior selection and regret minimization in GP bandits based on GP Thompson sampling (GP-TS): Prior-Elimination GP-TS (PE-GP-TS) and HyperPrior GP-TS (HP-GP-TS). We theoretically analyze the algorithms and establish upper bounds for their respective regret. In addition, we demonstrate the effectiveness of our algorithms compared to the alternatives through experiments with synthetic and real-world data.
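A compact sketch of the Thompson-sampling core that both algorithms build on: form the GP posterior under one candidate prior (here reduced to a lengthscale hypothesis) and query the argmax of a posterior sample. Prior selection, e.g. eliminating candidates whose data likelihood falls behind as in PE-GP-TS, would wrap around this step; names and defaults are illustrative:

```python
import numpy as np

def rbf(A, B, ls):
    # Squared-exponential kernel between the rows of A (n, d) and B (m, d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_thompson_step(Xobs, yobs, Xgrid, ls, noise=1e-6, rng=np.random):
    """One GP-TS step under a single candidate prior (lengthscale `ls`)."""
    K = rbf(Xobs, Xobs, ls) + noise * np.eye(len(Xobs))
    Ks = rbf(Xobs, Xgrid, ls)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ yobs
    cov = rbf(Xgrid, Xgrid, ls) - Ks.T @ Kinv @ Ks
    f = rng.multivariate_normal(mu, cov + 1e-9 * np.eye(len(Xgrid)))
    return int(f.argmax())  # index of the next point to query
```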
[765] Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?
Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, Pierre-Yves Oudeyer
Main category: cs.LG
TL;DR: LLMs trained on synthetic data create feedback loops leading to distribution shifts (model collapse). This study examines how human data properties affect these shifts, finding lexical diversity amplifies shifts while semantic diversity and quality mitigate them.
Details
Motivation: To understand how different human data properties influence distribution shifts in LLMs when trained recursively on synthetic data, as current understanding of this feedback loop effect is limited.
Method: Empirical examination using different human datasets, exhaustive manipulation of dataset properties combined with regression analyses to identify properties predicting distribution shift magnitudes.
Result: Lexical diversity amplifies distribution shifts, while semantic diversity and data quality mitigate them. Effects are modular - data from one internet domain has little influence on content from another domain. Political bias experiments show human data properties determine whether initial bias is amplified or reduced.
Conclusion: Different parts of the internet may undergo different types of distribution shift, providing a novel view of how synthetic data feedback loops affect LLM training across various domains.
Abstract: Large language models (LLMs) are increasingly used in the creation of online content, creating feedback loops as subsequent generations of models will be trained on this synthetic data. Such loops were shown to lead to distribution shifts - models misrepresenting the true underlying distributions of human data (also called model collapse). However, how human data properties affect such shifts remains poorly understood. In this paper, we provide the first empirical examination of the effect of such properties on the outcome of recursive training. We first confirm that using different human datasets leads to distribution shifts of different magnitudes. Through exhaustive manipulation of dataset properties combined with regression analyses, we then identify a set of properties predicting distribution shift magnitudes. Lexical diversity is found to amplify these shifts, while semantic diversity and data quality mitigate them. Furthermore, we find that these influences are highly modular: data scraped from a given internet domain has little influence on the content generated for another domain. Finally, experiments on political bias reveal that human data properties affect whether the initial bias will be amplified or reduced. Overall, our results portray a novel view, where different parts of the internet may undergo different types of distribution shift.
[766] GNN-DT: Graph Neural Network Enhanced Decision Transformer for Efficient Optimization in Dynamic Environments
Stavros Orfanoudakis, Nanda Kishor Panda, Peter Palensky, Pedro P. Vergara
Main category: cs.LG
TL;DR: GNN-DT is a novel Decision Transformer architecture that combines Graph Neural Networks with residual connections to handle dynamic environments in reinforcement learning, achieving superior performance on EV charging optimization with better sample efficiency and generalization.
Details
Motivation: Address challenges in RL for real-world optimization problems, including dynamic state-action spaces, large scale, sparse rewards, and poor convergence/scalability.
Method: Integrates GNN embedders with Decision Transformer architecture using residual connections between input and output tokens, learns from previously collected trajectories to handle sparse rewards.
Result: Superior performance on EV charging optimization, requires significantly fewer training trajectories, improves sample efficiency compared to existing DT and offline RL baselines, exhibits robust generalization to unseen environments and larger action spaces.
Conclusion: GNN-DT effectively addresses critical gaps in prior RL approaches by handling dynamic environments, sparse rewards, and achieving better generalization with improved sample efficiency.
Abstract: Reinforcement Learning (RL) methods used for solving real-world optimization problems often involve dynamic state-action spaces, larger scale, and sparse rewards, leading to significant challenges in convergence, scalability, and efficient exploration of the solution space. This study introduces GNN-DT, a novel Decision Transformer (DT) architecture that integrates Graph Neural Network (GNN) embedders with a novel residual connection between input and output tokens crucial for handling dynamic environments. By learning from previously collected trajectories, GNN-DT tackles the sparse rewards limitations of online RL algorithms and delivers high-quality solutions in real-time. We evaluate GNN-DT on the complex electric vehicle (EV) charging optimization problem and prove that its performance is superior and requires significantly fewer training trajectories, thus improving sample efficiency compared to existing DT and offline RL baselines. Furthermore, GNN-DT exhibits robust generalization to unseen environments and larger action spaces, addressing a critical gap in prior offline and online RL approaches.
[767] ReciNet: Reciprocal Space-Aware Long-Range Modeling for Crystalline Property Prediction
Jianan Nie, Peiyao Xiao, Kaiyi Ji, Peng Gao
Main category: cs.LG
TL;DR: ReciNet is a novel architecture that uses reciprocal space and Fourier series to capture both short-range and long-range interactions in crystal structures, achieving state-of-the-art performance on crystal property prediction tasks.
Details
Motivation: Current methods for crystal property prediction fail to capture long-range interactions in periodic crystal structures, which is crucial for accurate property prediction.
Method: Leverages reciprocal space with Fourier series representation from fractional coordinates and reciprocal lattice vectors, combining geometric GNNs for short-range interactions and reciprocal blocks for long-range interactions.
Result: Achieves state-of-the-art predictive accuracy on JARVIS, Materials Project, and MatBench benchmarks, with efficient multi-property prediction using mixture-of-experts showing positive transfer between correlated properties.
Conclusion: ReciNet provides a scalable and accurate solution for crystal property prediction by effectively modeling both local and global periodic interactions.
Abstract: Predicting properties of crystals from their structures is a fundamental yet challenging task in materials science. Unlike molecules, crystal structures exhibit infinite periodic arrangements of atoms, requiring methods capable of capturing both local and global information effectively. However, current works fall short of capturing long-range interactions within periodic structures. To address this limitation, we leverage reciprocal space, the natural domain for periodic crystals, and construct a Fourier series representation from fractional coordinates and reciprocal lattice vectors with learnable filters. Building on this principle, we introduce the reciprocal space-based geometry network (ReciNet), a novel architecture that integrates geometric GNNs and reciprocal blocks to model short-range and long-range interactions, respectively. Experimental results on standard benchmarks JARVIS, Materials Project, and MatBench demonstrate that ReciNet achieves state-of-the-art predictive accuracy across a range of crystal property prediction tasks. Additionally, we explore a model extension to multi-property prediction with the mixture-of-experts, which demonstrates high computational efficiency and reveals positive transfer between correlated properties. These findings highlight the potential of our model as a scalable and accurate solution for crystal property prediction.
[768] Geometry aware inference of steady state PDEs using Equivariant Neural Fields representations
Giovanni Catalani, Michael Bauerheim, Frédéric Tost, Xavier Bertrand, Joseph Morlier
Main category: cs.LG
TL;DR: enf2enf is a neural field approach that encodes local geometric features to predict steady-state PDEs with geometric variability, achieving competitive performance with real-time inference.
Details
Motivation: Existing neural operators struggle to efficiently encode local geometric structure and handle variable domains for PDEs on general geometries.
Method: Encodes geometries into latent features anchored at spatial locations to preserve locality, combines with global parameters, and decodes to continuous physical fields.
Result: Demonstrates competitive or superior performance on aerodynamic and structural benchmarks compared to graph-based, neural operator, and recent neural field methods.
Conclusion: The method enables effective modeling of complex shape variations with real-time inference and efficient scaling to high-resolution meshes.
Abstract: Advances in neural operators have introduced discretization invariant surrogate models for PDEs on general geometries, yet many approaches struggle to encode local geometric structure and variable domains efficiently. We introduce enf2enf, a neural field approach for predicting steady-state PDEs with geometric variability. Our method encodes geometries into latent features anchored at specific spatial locations, preserving locality throughout the network. These local representations are combined with global parameters and decoded to continuous physical fields, enabling effective modeling of complex shape variations. Experiments on aerodynamic and structural benchmarks demonstrate competitive or superior performance compared to graph-based, neural operator, and recent neural field methods, with real-time inference and efficient scaling to high-resolution meshes.
[769] Mechanisms of Projective Composition of Diffusion Models
Arwen Bradley, Preetum Nakkiran, David Berthelot, James Thornton, Joshua M. Susskind
Main category: cs.LG
TL;DR: Theoretical analysis of composition in diffusion models, focusing on out-of-distribution extrapolation and length-generalization. Defines projective composition and examines when linear score combinations work, reverse-diffusion sampling generates desired compositions, and conditions for failure.
Details
Motivation: To address fundamental gaps in understanding how and why composition works in diffusion models, particularly for out-of-distribution extrapolation and length-generalization, building on prior empirical observations.
Method: Define projective composition as desired outcome, theoretically analyze when linear score combinations achieve projective composition, investigate reverse-diffusion sampling capabilities, and identify failure conditions.
Result: Provides theoretical foundations connecting to prior empirical observations, explains reasons for composition success/failure, and proposes heuristic for predicting composition outcomes.
Conclusion: Establishes theoretical understanding of composition mechanisms in diffusion models, offering predictive insights for successful composition applications.
Abstract: We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to “work”. This paper starts to address these fundamental gaps. We begin by precisely defining one possible desired result of composition, which we call projective composition. Then, we investigate: (1) when linear score combinations provably achieve projective composition, (2) whether reverse-diffusion sampling can generate the desired composition, and (3) the conditions under which composition fails. We connect our theoretical analysis to prior empirical observations where composition has either worked or failed, for reasons that were unclear at the time. Finally, we propose a simple heuristic to help predict the success or failure of new compositions.
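The linear score combination under study is simple to state in code. A sketch of a product-style composition of two conditional scores that share a base model; whether such a combination actually achieves projective composition is precisely what the paper’s conditions characterize:

```python
import torch

def composed_score(x, t, score_a, score_b, score_base=None):
    # Linear score combination: s(x, t) = s_a(x, t) + s_b(x, t) - s_base(x, t).
    # Dropping score_base gives a plain additive combination; both are special
    # cases of the linear combinations analyzed in the paper.
    s = score_a(x, t) + score_b(x, t)
    if score_base is not None:
        s = s - score_base(x, t)
    return s  # plug into a reverse-diffusion sampler in place of a single score
```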
[770] LDC-MTL: Balancing Multi-Task Learning through Scalable Loss Discrepancy Control
Peiyao Xiao, Chaosheng Dong, Shaofeng Zou, Kaiyi Ji
Main category: cs.LG
TL;DR: LDC-MTL is a scalable multi-task learning method that uses bilevel optimization for loss discrepancy control with O(1) time/memory complexity, outperforming existing gradient methods.
Details
Motivation: Existing gradient manipulation methods for multi-task learning incur significant O(K) computational overhead in time and memory, where K is the number of tasks, making them inefficient for large-scale applications.
Method: Proposes LDC-MTL with two components: (1) bilevel formulation for fine-grained loss discrepancy control, and (2) scalable first-order bilevel algorithm requiring only O(1) time and memory complexity.
Result: Theoretically proven to converge to stationary points and Pareto stationary points. Extensive experiments show superior performance in both accuracy and efficiency across diverse multi-task datasets.
Conclusion: LDC-MTL provides an efficient and effective solution for multi-task learning with guaranteed convergence and significantly reduced computational overhead compared to existing methods.
Abstract: Multi-task learning (MTL) has been widely adopted for its ability to simultaneously learn multiple tasks. While existing gradient manipulation methods often yield more balanced solutions than simple scalarization-based approaches, they typically incur a significant computational overhead of $\mathcal{O}(K)$ in both time and memory, where $K$ is the number of tasks. In this paper, we propose LDC-MTL, a simple and scalable loss discrepancy control approach for MTL, formulated from a bilevel optimization perspective. Our method incorporates two key components: (i) a bilevel formulation for fine-grained loss discrepancy control, and (ii) a scalable first-order bilevel algorithm that requires only $\mathcal{O}(1)$ time and memory. Theoretically, we prove that LDC-MTL guarantees convergence not only to a stationary point of the bilevel problem with loss discrepancy control but also to an $\epsilon$-accurate Pareto stationary point for all $K$ loss functions under mild conditions. Extensive experiments on diverse multi-task datasets demonstrate the superior performance of LDC-MTL in both accuracy and efficiency.
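A generic first-order sketch in the spirit of loss discrepancy control, not the paper’s exact bilevel update: the model descends a weighted sum of task losses, while the weights descend a penalty on the spread of the weighted losses, and both extra steps cost O(1) memory over plain scalarization. `losses_fn`, `lam`, and the softmax parameterization are assumptions:

```python
import torch

def ldc_step(model, losses_fn, w, opt_model, opt_w, lam=0.1):
    """One joint step. losses_fn(model) -> (K,) tensor of task losses;
    w is a learnable (K,) logit vector defining simplex task weights."""
    losses = losses_fn(model)
    weights = torch.softmax(w, dim=0)
    # Lower level: update model parameters on the weighted loss sum.
    opt_model.zero_grad()
    (weights.detach() * losses).sum().backward()
    opt_model.step()
    # Upper level: update w to shrink the discrepancy among weighted losses.
    opt_w.zero_grad()
    wl = torch.softmax(w, dim=0) * losses.detach()
    ((wl - wl.mean()) ** 2).sum().mul(lam).backward()
    opt_w.step()
```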
[771] Quantization Meets Reasoning: Exploring and Mitigating Degradation of Low-Bit LLMs in Mathematical Reasoning
Zhen Li, Yupeng Su, Songmiao Wang, Runming Yang, Congkai Xie, Aofan Liu, Ming Li, Jiannong Cao, Yuan Xie, Ngai Wong, Hongxia Yang
Main category: cs.LG
TL;DR: Low-bit PTQ severely impairs LLM math reasoning. The paper identifies that failures start early in solution steps and proposes a lightweight intervention to restore performance using minimal data and compute.
Details
Motivation: Low-bit post-training quantization (PTQ) is essential for deploying LLMs under memory constraints but severely degrades mathematical reasoning capabilities, with drops up to 69.81%. The research aims to understand where degradation occurs in step-structured solutions and how to mitigate it while maintaining low-bit efficiency.
Method: The paper uses format-aligned chain-of-thought with step-aligned attribution across PTQ methods (AWQ, GPTQ, SmoothQuant) and models (Qwen, LLaMA; 0.5-7B). It identifies two regularities: PTQ elevates method/execution errors disproportionately and failures emerge early. The proposed solution is a measure→locate→restore loop that detects first faulty steps, constructs “Silver Bullet” datasets, and applies small-scale supervised/preference tuning.
Result: The intervention successfully recovers 4-bit weight math reasoning toward full-precision baseline using only 332 curated examples and 3-5 minutes of compute on a single GPU, while preserving PTQ efficiency.
Conclusion: The framework transforms low-bit degradation from a global accuracy problem into a local, reproducible process intervention that is quantizer- and architecture-agnostic within evaluated regimes.
Abstract: Low-bit post-training quantization (PTQ) is a practical route to deploy reasoning-capable LLMs under tight memory and latency budgets, yet it can markedly impair mathematical reasoning (drops up to 69.81% in our harder settings). We address two deployment-critical questions with process-level precision: Where along a step-structured solution does degradation first arise? How to mitigate it while staying in the low-bit regime? Across widely used PTQ methods (AWQ, GPTQ, SmoothQuant), open-source model families (Qwen, LLaMA; 0.5–7B), and math reasoning benchmarks (GSM8K, MATH, AIME), we perform format-aligned chain-of-thought with step-aligned attribution and uncover two robust regularities: (i) PTQ disproportionately elevates method and execution errors relative to high-level conceptual mistakes; and (ii) failures emerge early, with the first vulnerable step flipping and cascading to the final answer. These regularities suggest a general intervention principle: restore local token-level margins exactly at the earliest failure frontier. We instantiate this principle as a lightweight measure$\rightarrow$locate$\rightarrow$restore loop that operates directly on the quantized model: detect the first faulty step, construct our “Silver Bullet” datasets, and apply small-scale supervised/preference tuning. In our settings, as few as 332 curated examples and 3–5 minutes of compute on a single GPU recover 4-bit weight math reasoning toward the full-precision baseline while preserving PTQ efficiency. Our framework is quantizer- and architecture-agnostic within the evaluated regimes, and turns low-bit degradation from a global accuracy problem into a local, reproducible process intervention.
[772] Fused Partial Gromov-Wasserstein for Structured Objects
Yikun Bai, Shuang Wang, Huy Tran, Hengrong Du, Juexin Wang, Soheil Kolouri
Main category: cs.LG
TL;DR: The paper proposes Fused Partial Gromov-Wasserstein (FPGW), an extension of FGW that relaxes the equal mass constraint to handle unbalanced structured data like graphs.
Details
Motivation: Classical Fused Gromov-Wasserstein (FGW) distance assumes equal mass constraint on compared data, which limits its applicability to unbalanced structured data. There is a need to extend FGW to accommodate data with unequal masses.
Method: The authors relax the mass constraint of FGW and propose the FPGW framework. They establish theoretical relationships between FPGW and FGW, prove metric properties, and introduce both Frank-Wolfe and Sinkhorn solvers for computational efficiency.
Result: FPGW demonstrates robust performance in graph matching, graph classification, and graph clustering experiments, showing its effectiveness in handling unbalanced structured data.
Conclusion: FPGW successfully extends FGW to handle unbalanced structured data while maintaining theoretical properties and achieving practical performance across various graph analysis tasks.
Abstract: Structured data, such as graphs, is vital in machine learning due to its capacity to capture complex relationships and interactions. In recent years, the Fused Gromov-Wasserstein (FGW) distance has attracted growing interest because it enables the comparison of structured data by jointly accounting for feature similarity and geometric structure. However, as a variant of optimal transport (OT), classical FGW assumes an equal mass constraint on the compared data. In this work, we relax this mass constraint and propose the Fused Partial Gromov-Wasserstein (FPGW) framework, which extends FGW to accommodate unbalanced data. Theoretically, we establish the relationship between FPGW and FGW and prove the metric properties of FPGW. Numerically, we introduce Frank-Wolfe solvers and Sinkhorn solvers for the proposed FPGW framework. Finally, we evaluate the FPGW distance through graph matching, graph classification and graph clustering experiments, demonstrating its robust performance.
[773] TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang
Main category: cs.LG
TL;DR: TokUR is a token-level uncertainty estimation framework that uses low-rank random weight perturbation during LLM decoding to generate predictive distributions for uncertainty estimation in mathematical reasoning tasks.
Details
Motivation: LLMs have inconsistent output quality across different scenarios, making it difficult to identify trustworthy responses, especially in complex multi-step reasoning tasks.
Method: Introduces low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, then aggregates these uncertainties to capture semantic uncertainty of responses.
Result: Experiments show TokUR exhibits strong correlation with answer correctness and model robustness, and the uncertainty signals can enhance reasoning performance at test time.
Conclusion: TokUR is an effective, principled, and scalable approach for improving reliability and interpretability of LLMs in challenging reasoning tasks.
Abstract: While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model’s reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.
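A minimal sketch of the two ingredients named above: a low-rank random perturbation of a weight matrix, and per-token predictive entropy aggregated over several perturbed forward passes. `rank` and `scale` are illustrative, and the paper’s exact perturbation and aggregation may differ:

```python
import torch

@torch.no_grad()
def lowrank_perturb(weight, rank=4, scale=1e-3):
    # W' = W + scale * U @ V^T with random Gaussian low-rank factors (assumption).
    out_dim, in_dim = weight.shape
    U = torch.randn(out_dim, rank) * scale
    V = torch.randn(in_dim, rank)
    return weight + U @ V.T

def token_uncertainty(logits_samples):
    # logits_samples: (n_samples, seq_len, vocab) from perturbed forward passes.
    probs = torch.softmax(logits_samples, dim=-1)
    mean_p = probs.mean(0)
    # Predictive entropy per token; averaging over tokens gives a crude
    # response-level score in the spirit of TokUR's aggregation step.
    return -(mean_p * mean_p.clamp_min(1e-12).log()).sum(-1)
```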
[774] Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz
Main category: cs.LG
TL;DR: RepE is a new approach for controlling LLMs by directly manipulating internal representations, offering more effective, interpretable, and flexible control compared to traditional methods.
Details
Motivation: To provide a comprehensive survey of Representation Engineering methods for LLMs, addressing key questions about existing methods, applications, and comparative strengths/weaknesses.
Method: Proposed a unified framework describing RepE as a pipeline with three components: representation identification, operationalization, and control. Conducted literature review and analysis.
Result: Identified that RepE methods offer significant potential but face challenges including managing multiple concepts, ensuring reliability, and preserving model performance.
Conclusion: RepE represents a promising paradigm for LLM control, with identified opportunities for improvement and best practices guidance for future research.
Abstract: Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model’s internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models’ behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models’ performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.
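As an illustration of the survey’s control stage, the common activation-steering pattern adds a direction, identified beforehand (e.g. from contrastive prompts), to a layer’s hidden states at inference. The hook and `alpha` are a generic sketch, and the layer path in the usage comment assumes a Hugging Face-style decoder:

```python
import torch

def steering_hook(direction: torch.Tensor, alpha: float = 4.0):
    """Forward hook that adds a unit-normalized steering vector to hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction / direction.norm()
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on one decoder block:
# handle = model.model.layers[12].register_forward_hook(steering_hook(v))
# ... model.generate(...) ...
# handle.remove()
```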
[775] Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
David Chanin, Tomáš Dulka, Adrià Garriga-Alonso
Main category: cs.LG
TL;DR: Sparse autoencoders (SAEs) fail to produce interpretable features when they are narrower than the number of underlying features and features are correlated, a phenomenon called ‘feature hedging’ that worsens with narrower SAEs.
Details
Motivation: To understand why SAEs underperform supervised baselines and identify the conditions under which they fail to produce interpretable linear directions from polysemantic activations.
Method: Theoretical analysis in toy models and empirical evaluation of SAEs trained on LLMs, plus proposing an improved variant of matryoshka SAEs based on understanding feature hedging.
Result: Feature hedging occurs when SAEs are narrower than true features and features are correlated, causing SAEs to merge correlated feature components and destroy monosemanticity. This phenomenon is more severe in narrower SAEs.
Conclusion: SAE width is not a neutral hyperparameter - narrower SAEs suffer more from feature hedging, which may explain their consistent underperformance compared to supervised baselines.
Abstract: It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is more narrow than the number of underlying “true features” on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Importantly, our work shows that SAE width is not a neutral hyperparameter: narrower SAEs suffer more from hedging than wider SAEs.
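The narrow-SAE regime is easy to set up in a toy model: two correlated ground-truth features and an SAE of width one. A self-contained sketch with arbitrarily chosen correlation strengths; per the paper’s analysis, the single latent’s decoder direction should mix both features (hedging) rather than stay monosemantic:

```python
import torch

torch.manual_seed(0)
f1, f2 = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])   # true features
n = 4096
a1 = (torch.rand(n) < 0.3).float()                             # feature 1 fires 30% of the time
a2 = ((torch.rand(n) < 0.8) & (a1 > 0)).float() \
   + ((torch.rand(n) < 0.05) & (a1 == 0)).float()              # feature 2 correlates with feature 1
X = a1[:, None] * f1 + a2[:, None] * f2

enc = torch.nn.Linear(2, 1)                                    # SAE narrower than the 2 true features
dec = torch.nn.Linear(1, 2, bias=False)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-2)
for _ in range(2000):
    z = torch.relu(enc(X))
    loss = ((dec(z) - X) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()

print(dec.weight.data.squeeze())  # weight on *both* axes indicates hedging
```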
[776] Multi-View Causal Discovery without Non-Gaussianity: Identifiability and Algorithms
Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort, Aapo Hyvärinen
Main category: cs.LG
TL;DR: A multi-view linear Structural Equation Model for causal discovery that leverages correlation across multiple views of the same system, enabling causal graph estimation with weaker assumptions than traditional non-Gaussian methods.
Details
Motivation: Traditional causal discovery methods rely on strong assumptions like non-Gaussianity, but many real-world applications provide multiple related views of the same system, which hasn't been sufficiently exploited for causal discovery.
Method: Proposed a multi-view linear SEM framework that extends non-Gaussian disturbance models by leveraging correlation across views. Developed several multi-view causal discovery algorithms inspired by single-view methods (DirectLiNGAM, PairwiseLiNGAM, ICA-LiNGAM).
Result: Proved identifiability of the model for acyclic SEMs. Validated methods through simulations and neuroimaging applications, successfully estimating causal graphs between brain regions.
Conclusion: Multi-view structure enables causal discovery with weaker assumptions, providing a practical framework for applications where multiple related views of the same system are available.
Abstract: Causal discovery is a difficult problem that typically relies on strong assumptions on the data-generating model, such as non-Gaussianity. In practice, many modern applications provide multiple related views of the same system, which has rarely been considered for causal discovery. Here, we leverage this multi-view structure to achieve causal discovery with weak assumptions. We propose a multi-view linear Structural Equation Model (SEM) that extends the well-known framework of non-Gaussian disturbances by alternatively leveraging correlation over views. We prove the identifiability of the model for acyclic SEMs. Subsequently, we propose several multi-view causal discovery algorithms, inspired by single-view algorithms (DirectLiNGAM, PairwiseLiNGAM, and ICA-LiNGAM). The new methods are validated through simulations and applications on neuroimaging data, where they enable the estimation of causal graphs between brain regions.
[777] Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio
Main category: cs.LG
TL;DR: The paper proposes Veracity Search (VS), a discrete search algorithm that augments Chain-of-Thought reasoning with latent veracity variables to identify and correct inaccurate statements in reasoning chains, and introduces Amortized Veracity Inference (AVI) for zero-shot veracity inference.
Details
Motivation: Chain-of-Thought reasoning chains often contain inaccurate statements that reduce performance and trustworthiness of language models, creating a need for methods to identify and correct these errors.
Method: Proposes Veracity Search (VS) - a discrete search algorithm over latent veracity assignments that uses LM's joint likelihood as proxy reward, and Amortized Veracity Inference (AVI) for supervised fine-tuning using pseudo-labels from VS.
Result: VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy.
Conclusion: Latent veracity inference is useful for providing feedback during self-correction and self-improvement, enabling more reliable and trustworthy reasoning in language models.
Abstract: Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM’s joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
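A small sketch of the search itself: enumerate sparse flips of a binary veracity assignment and keep the one that maximizes a proxy reward. `score_fn` is a hypothetical callable standing in for the LM’s joint likelihood over veracity values and the final answer; the paper’s actual search may explore the space differently:

```python
from itertools import combinations

def veracity_search(num_steps, score_fn, max_flips=2):
    """Search binary veracity assignments over `num_steps` reasoning steps.

    score_fn(mask) -> proxy reward for a candidate assignment, where mask[i] = 1
    marks step i as correct and 0 as erroneous.
    """
    best = tuple([1] * num_steps)           # start from "all steps correct"
    best_score = score_fn(best)
    for k in range(1, max_flips + 1):       # flip up to max_flips steps to 0
        for idx in combinations(range(num_steps), k):
            cand = tuple(0 if i in idx else 1 for i in range(num_steps))
            s = score_fn(cand)
            if s > best_score:
                best, best_score = cand, s
    return best, best_score
```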
[778] BPINN-EM-Post: Bayesian Physics-Informed Neural Network based Stochastic Electromigration Damage Analysis in the Post-void Phase
Subed Lamichhane, Haotian Lu, Sheldon X. -D. Tan
Main category: cs.LG
TL;DR: BPINN-EM-Post is a machine learning framework that combines closed-form analytical solutions with Bayesian Physics-Informed Neural Networks for efficient stochastic analysis of electromigration-induced post-voiding aging processes, achieving significant speedup over traditional methods.
Details
Motivation: Traditional EM analysis tools assume deterministic stress evolution, but real-world EM stress is non-deterministic due to input current fluctuations and manufacturing variations. Existing Monte Carlo simulations are computationally expensive and inefficient.
Method: Integrates closed-form analytical solutions with Bayesian Physics-Informed Neural Network (BPINN) framework. Analytical solutions enforce physical laws at wire segments, while BPINN ensures junction physics constraints and models stochastic behaviors. Reduces variables in loss functions using analytical solutions.
Result: Achieves over 240x speedup compared to FEM-based COMSOL solver and more than 67x speedup compared to FDM-based EMSpice, with marginal accuracy loss. Significantly improves training efficiency without sacrificing accuracy.
Conclusion: The proposed BPINN-EM-Post framework provides an efficient and accurate approach for stochastic EM analysis, overcoming computational limitations of traditional methods while naturally incorporating variational effects and initial stress distributions.
Abstract: In contrast to the assumptions of most existing Electromigration (EM) analysis tools, the evolution of EM-induced stress is inherently non-deterministic, influenced by factors such as input current fluctuations and manufacturing non-idealities. Traditional approaches for estimating stress variations typically involve computationally expensive and inefficient Monte Carlo simulations with industrial solvers, which quantify variations using mean and variance metrics. In this work, we introduce a novel machine learning-based framework, termed BPINN-EM-Post, for efficient stochastic analysis of EM-induced post-voiding aging processes. For the first time, our new approach integrates closed-form analytical solutions with a Bayesian Physics-Informed Neural Network (BPINN) framework to accelerate the analysis. The closed-form solutions enforce physical laws at the individual wire segment level, while the BPINN ensures that physics constraints at inter-segment junctions are satisfied and stochastic behaviors are accurately modeled. By reducing the number of variables in the loss functions through utilizing analytical solutions, our method significantly improves training efficiency without accuracy loss and naturally incorporates variational effects. Additionally, the analytical solutions effectively address the challenge of incorporating initial stress distributions in interconnect structures during post-void stress calculations. Numerical results demonstrate that BPINN-EM-Post achieves over 240x and more than 67x speedup compared to Monte Carlo simulations using the FEM-based COMSOL solver and FDM-based EMSpice, respectively, with marginal accuracy loss.
[779] Structured Relational Representations
Arun Kumar, Paul Schrater
Main category: cs.LG
TL;DR: The paper proposes that invariant representations should be defined as partitions in abstract knowledge spaces, where knowledge is stored as relational path closures, and inter-partition connectors enable task-relevant transitions.
Details
Motivation: To address the challenge of finding invariant representations that are stable and transferable without suppressing task-relevant signals, and to determine the appropriate abstraction level for such invariants.
Method: Formalizes invariant structures as partitions defined by relational path closures in abstract knowledge spaces, using closed semiring as the computational foundation for structured relational representations.
Result: Proposes that invariant partitions serve as core representations where knowledge resides and learning occurs, while inter-partition connectors enable deployment of task-relevant transitions.
Conclusion: Invariant partitions provide the foundational primitives for structured representation, with closed semiring serving as the relational algebraic foundation for computational implementation.
Abstract: Invariant representations are core to representation learning, yet a central challenge remains: uncovering invariants that are stable and transferable without suppressing task-relevant signals. This raises fundamental questions, requiring further inquiry, about the appropriate level of abstraction at which such invariants should be defined and which aspects of a system they should characterize. Interpretation of the environment relies on abstract knowledge structures to make sense of the current state, which leads to interactions, essential drivers of learning and knowledge acquisition. Interpretation operates at the level of higher-order relational knowledge; hence, we propose that invariant structures must be where knowledge resides, specifically as partitions defined by the closure of relational paths within an abstract knowledge space. These partitions serve as the core invariant representations, forming the structural substrate where knowledge is stored and learning occurs. On the other hand, inter-partition connectors enable the deployment of these knowledge partitions encoding task-relevant transitions. Thus, invariant partitions provide the foundational primitives of structured representation. We formalize the computational foundations for structured relational representations of the invariant partitions based on closed semiring, a relational algebraic structure.
[780] MNT-TNN: Spatiotemporal Traffic Data Imputation via Compact Multimode Nonlinear Transform-based Tensor Nuclear Norm
Yihang Lu, Mahwish Yousaf, Xianwei Meng, Enhong Chen
Main category: cs.LG
TL;DR: Proposes MNT-TNN and ATTNNs frameworks for spatiotemporal traffic data imputation, addressing random missing values in modern ITS with theoretical convergence guarantees and superior performance.
Details
Motivation: Modern communication technologies like GNSS have created new challenges for random missing value imputation in traffic data, requiring better spatiotemporal dependency modeling.
Method: Uses Multimode Nonlinear Transformed Tensor Nuclear Norm (MNT-TNN) to capture spatiotemporal correlations and low-rankness, solved via proximal alternating minimization (PAM) algorithm with convergence guarantees.
Result: Extensive experiments on real datasets show MNT-TNN and ATTNNs outperform state-of-the-art methods, especially at high missing rates, completing the random missing traffic value imputation benchmark.
Conclusion: The proposed methods effectively address spatiotemporal traffic data imputation challenges and provide superior performance compared to existing approaches.
Abstract: Imputation of random or non-random missing data is a long-standing research topic and a crucial application for Intelligent Transportation Systems (ITS). However, with the advent of modern communication technologies such as Global Satellite Navigation Systems (GNSS), traffic data collection has introduced new challenges in random missing value imputation and increasing demands for spatiotemporal dependency modelings. To address these issues, we propose a novel spatiotemporal traffic imputation method based on a Multimode Nonlinear Transformed Tensor Nuclear Norm (MNT-TNN), which can effectively capture the intrinsic multimode spatiotemporal correlations and low-rankness of the traffic tensor, represented as location $\times$ location $\times$ time. To solve the nonconvex optimization problem, we design a proximal alternating minimization (PAM) algorithm with theoretical convergence guarantees. We also suggest an Augmented Transform-based Tensor Nuclear Norm Families (ATTNNs) framework to enhance the imputation results of TTNN techniques, especially at very high miss rates. Extensive experiments on real datasets demonstrate that our proposed MNT-TNN and ATTNNs can outperform the compared state-of-the-art imputation methods, completing the benchmark of random missing traffic value imputation.
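To ground the low-rank-completion idea, here is a toy soft-impute loop on a mode-1 unfolding of the location x location x time tensor, using singular value thresholding (the proximal operator of the plain nuclear norm). It is a stand-in only: MNT-TNN applies the nuclear norm under learned multimode nonlinear transforms and is solved with PAM; `tau` and `iters` are arbitrary:

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: prox of the nuclear norm.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def impute_lowrank(T, mask, tau=1.0, iters=100):
    """T: (L1, L2, Tm) traffic tensor; mask: boolean array of observed entries."""
    L1, L2, Tm = T.shape
    X = np.where(mask, T, 0.0).reshape(L1, L2 * Tm)
    obs = mask.reshape(L1, L2 * Tm)
    for _ in range(iters):
        X = svt(X, tau)                            # low-rank step
        X[obs] = T.reshape(L1, L2 * Tm)[obs]       # keep observed entries fixed
    return X.reshape(L1, L2, Tm)
```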
[781] CSF: Fixed-outline Floorplanning Based on the Conjugate Subgradient Algorithm Assisted by Q-Learning
Xinyan Meng, Huabin Cheng, Rujie Chen, Ning Xu, Yu Chen, Wei Zhang
Main category: cs.LG
TL;DR: Proposes CSAQ (conjugate subgradient algorithm with Q-learning) for floorplanning, achieving better wirelength optimization and legal floorplans than existing methods.
Details
Motivation: Gradient-based optimization algorithms for smooth models suffer from local convergence, making it challenging to generate compact floorplans with good wirelength optimization.
Method: Constructs a nonsmooth analytic floorplanning model solved by conjugate subgradient algorithm (CSA) accelerated by Q-learning for adaptive stepsize regulation and balance between exploration and exploitation.
Result: Experimental results on MCNC and GSRC benchmarks show CSF algorithm effectively addresses global floorplanning, generates legal floorplans more efficiently than constraint graph-based methods, and is competitive with state-of-the-art algorithms for hard modules.
Conclusion: The proposed CSF algorithm based on CSAQ successfully overcomes local convergence issues and achieves efficient, high-quality floorplanning for complex scenarios.
Abstract: State-of-the-art research indicates that analytic algorithms are promising in handling complex floorplanning scenarios. However, it is challenging to generate compact floorplans with excellent wirelength optimization due to the local convergence of gradient-based optimization algorithms designed for constructed smooth optimization models. Accordingly, we propose to construct a nonsmooth analytic floorplanning model addressed by the conjugate subgradient algorithm (CSA), which is accelerated by a population-based scheme adaptively regulating the stepsize with the assistance of Q-learning. In this way, the proposed CSA assisted by Q-learning (CSAQ) can strike a good balance between exploration and exploitation. Experimental results on the MCNC and GSRC benchmarks demonstrate that the proposed fixed-outline floorplanning algorithm based on CSAQ (CSF) not only addresses global floorplanning effectively, but also obtains legal floorplans more efficiently than the constraint graph-based legalization algorithm as well as its improved variants. It is also demonstrated that CSF is competitive with state-of-the-art algorithms on floorplanning scenarios containing only hard modules.
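The Q-learning component can be pictured as a small tabular controller over a few discrete stepsize actions. A sketch under assumed states, actions, and rewards (e.g. state = a coarse progress bucket, reward = objective improvement after the CSA step); the paper’s population-based scheme is richer than this:

```python
import random

class StepsizeQ:
    """Tabular Q-learning over discrete stepsize multipliers (illustrative)."""
    def __init__(self, actions=(0.5, 1.0, 2.0), eps=0.1, alpha=0.3, gamma=0.9):
        self.actions, self.eps, self.alpha, self.gamma = actions, eps, alpha, gamma
        self.Q = {}

    def act(self, state):
        if random.random() < self.eps:                  # epsilon-greedy exploration
            return random.randrange(len(self.actions))
        qs = [self.Q.get((state, a), 0.0) for a in range(len(self.actions))]
        return qs.index(max(qs))

    def update(self, s, a, reward, s_next):
        best = max(self.Q.get((s_next, b), 0.0) for b in range(len(self.actions)))
        q = self.Q.get((s, a), 0.0)
        self.Q[(s, a)] = q + self.alpha * (reward + self.gamma * best - q)
```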
[782] Learning Flexible Forward Trajectories for Masked Molecular Diffusion
Hyunjin Seo, Taewon Kim, Sihyun Yu, SungSoo Ahn
Main category: cs.LG
TL;DR: Masked diffusion models (MDMs) underperform in molecular generation due to state-clashing where distinct molecules collapse into common states during diffusion. The proposed MELD method uses element-wise learnable noise scheduling to avoid collisions, significantly improving performance.
Details
Motivation: To address the poor performance of standard masked diffusion models in molecular generation, which suffer from state-clashing problems where different molecules collapse into identical states during forward diffusion.
Method: Proposed Masked Element-wise Learnable Diffusion (MELD) that uses a parameterized noise scheduling network to assign distinct corruption rates to individual graph elements (atoms and bonds), orchestrating per-element corruption trajectories to avoid collisions.
Result: MELD dramatically improved chemical validity from 15% to 93% on ZINC250K benchmark and achieved state-of-the-art property alignment in conditional generation tasks.
Conclusion: Element-wise learnable noise scheduling effectively mitigates state-clashing in masked diffusion models for molecular generation, enabling significant performance improvements over element-agnostic approaches.
Abstract: Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and present the surprising result that naively applying standard MDMs severely degrades performance. We identify the critical cause of this issue as a state-clashing problem, where the forward diffusion of distinct molecules collapses into a common state, resulting in a mixture of reconstruction targets that cannot be learned using a typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD), which orchestrates per-element corruption trajectories to avoid collisions between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%. Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.
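The core mechanism, element-wise corruption rates, can be sketched in a few lines. The tiny rate network, the exponential schedule, and all tensor shapes below are assumptions for illustration; MELD's actual parameterization is more involved.

```python
# Minimal sketch of per-element learnable masking rates for a molecular graph.
import torch

torch.manual_seed(0)

n_atoms, feat_dim = 6, 16
atom_feats = torch.randn(n_atoms, feat_dim)        # placeholder atom features

rate_net = torch.nn.Sequential(                    # predicts a per-atom rate
    torch.nn.Linear(feat_dim, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)

def mask_probs(feats, t):
    """Per-element probability of being masked by diffusion time t in [0, 1].

    Each element gets its own corruption speed, so two distinct graphs are
    less likely to collapse onto the same heavily-masked intermediate state.
    """
    rates = torch.nn.functional.softplus(rate_net(feats)).squeeze(-1)
    return 1.0 - torch.exp(-rates * t)             # monotone in t, in [0, 1)

t = 0.5
p = mask_probs(atom_feats, t)
masked = torch.bernoulli(p.detach())               # 1 = replace with [MASK]
print(p, masked)                                   # gradient tricks omitted
```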
[783] Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu
Main category: cs.LG
TL;DR: Sparsity Forcing is an RL-based post-training framework that explicitly enforces token sparsity in MLLMs, achieving up to 75% token reduction with minimal accuracy loss and 3.3x decoding speedup.
Details
Motivation: Existing sparse attention methods either exploit inherent model sparsity (plateauing at ~50% reduction) or use rigid patterns/regularizers without direct budget control, limiting further efficiency gains.
Method: Uses RL-based post-training with multiple rollouts at different token budgets, formulating efficiency (token reduction) and performance (answer correctness) as joint rewards to optimize sparsity end-to-end.
Result: Achieves 20-75% token reduction on Qwen2-VL/Qwen2.5-VL across 13 benchmarks with minimal accuracy decline, reducing memory by 3x and speeding up decoding by 3.3x.
Conclusion: Sparsity Forcing successfully turns token saving into an inference-consistent optimization objective, enabling significant efficiency improvements in MLLMs while maintaining accuracy.
Abstract: Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model’s inherent sparsity and thus plateau at moderate budgets (about 50% token reduction), with little headroom to push the budget lower without hurting accuracy. Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets. In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named Sparsity Forcing. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards. By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective. Across thirteen image and video benchmarks, Sparsity Forcing raises the token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20% to 75% with minimal accuracy decline, significantly reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.
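The joint reward and group contrast can be illustrated directly. The linear combination, the weight `lam`, and the group-mean baseline below are assumptions, not the paper's exact formulation; they only show how efficient, correct rollouts earn higher advantage.

```python
# Illustrative joint reward for a group of rollouts at different token budgets.
def joint_reward(correct: bool, tokens_kept: int, tokens_total: int,
                 lam: float = 0.5) -> float:
    reduction = 1.0 - tokens_kept / tokens_total   # higher = more efficient
    return float(correct) + lam * reduction        # correctness dominates

rollouts = [  # (is_answer_correct, visual tokens kept, visual tokens total)
    (True, 200, 1000), (True, 800, 1000), (False, 100, 1000),
]
rewards = [joint_reward(*r) for r in rollouts]
baseline = sum(rewards) / len(rewards)
advantages = [r - baseline for r in rewards]       # contrast within the group
print(rewards, advantages)   # the efficient, correct rollout wins
```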
[784] The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm
Noah Amsel, David Persson, Christopher Musco, Robert M. Gower
Main category: cs.LG
TL;DR: Polar Express is a new GPU-friendly method for computing polar decomposition that adapts update rules via minimax optimization, enabling fast convergence and practical use in bfloat16 precision for deep learning applications.
Details
Motivation: Deep learning requires GPU-friendly polar decomposition algorithms that prioritize high throughput over high precision, differing from classical numerical analysis requirements.
Method: Uses matrix-matrix multiplications only, adapts update rules at each iteration by solving minimax optimization problems to minimize worst-case error, and addresses finite-precision issues for bfloat16.
Result: Converges rapidly in both early iterations and asymptotically, outperforms recent alternatives when integrated into Muon training framework for GPT-2 models on FineWeb dataset.
Conclusion: Polar Express provides an effective GPU-friendly solution for polar decomposition in deep learning, demonstrating consistent improvements in validation loss across various learning rates.
Abstract: Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. Recently, it has emerged as an important subroutine within the Muon algorithm for training deep neural networks. However, the requirements of this application differ sharply from classical settings: deep learning demands GPU-friendly algorithms that prioritize high throughput over high precision. We introduce Polar Express, a new method for computing the polar decomposition. Like Newton-Schulz and other classical polynomial methods, our approach uses only matrix-matrix multiplications, making it very efficient on GPUs. Inspired by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the update rule at each iteration by solving a minimax optimization problem. We prove that this strategy minimizes error in a worst-case sense, allowing Polar Express to converge as rapidly as possible both in the early iterations and asymptotically. We also address finite-precision issues, making it practical to use in bfloat16. When integrated into the Muon training framework, our method leads to consistent improvements in validation loss when training a GPT-2 model on one billion tokens from the FineWeb dataset, outperforming recent alternatives across a range of learning rates.
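For context, the classical Newton-Schulz iteration below computes the polar factor using only matmuls; Polar Express keeps this structure but replaces the fixed cubic coefficients with per-iteration coefficients obtained from a minimax optimization. The iteration count and Frobenius normalization are illustrative choices.

```python
# Baseline Newton-Schulz iteration for the polar factor (matmuls only).
import numpy as np

def newton_schulz_polar(M, iters=25):
    X = M / np.linalg.norm(M, ord="fro")   # scale singular values into (0, 1]
    I = np.eye(M.shape[1])
    for _ in range(iters):
        X = X @ (1.5 * I - 0.5 * (X.T @ X))   # X <- 1.5 X - 0.5 X X^T X
    return X

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
U = newton_schulz_polar(A)
print(np.allclose(U.T @ U, np.eye(5), atol=1e-6))  # approximately orthogonal
```

Because every operation is a matmul, the loop maps directly onto GPU tensor cores; the minimax-optimized coefficients simply make each sweep contract the singular values toward 1 as fast as possible.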
[785] Trial and Trust: Addressing Byzantine Attacks with Comprehensive Defense Strategy
Gleb Molodtsov, Daniil Medyakov, Sergey Skorik, Nikolas Khachaturov, Shahane Tigranyan, Vladimir Aletov, Aram Avetisyan, Martin Takáč, Aleksandr Beznosikov
Main category: cs.LG
TL;DR: The paper proposes a Byzantine-resilient federated learning method using trust scores and trial functions to filter malicious updates, working even when Byzantine nodes are in majority and adapting to popular optimizers like Adam.
Details
Motivation: Federated learning systems are vulnerable to Byzantine attacks where compromised clients inject adversarial updates to disrupt global convergence, requiring robust defense mechanisms.
Method: Combines trust scores concept with trial function methodology to dynamically filter outliers, adapting to scaled methods like Adam and RMSProp, and supporting local training and partial participation.
Result: Extensive experiments on synthetic and real ECG data validate robustness, with convergence guarantees comparable to classical algorithms without Byzantine interference.
Conclusion: The proposed methods effectively defend against Byzantine attacks even in majority Byzantine scenarios while maintaining performance comparable to non-attacked systems.
Abstract: Recent advancements in machine learning have improved performance while also increasing computational demands. While federated and distributed setups address these issues, their structure is vulnerable to malicious influences. In this paper, we address a specific threat, Byzantine attacks, where compromised clients inject adversarial updates to derail global convergence. We combine the trust scores concept with trial function methodology to dynamically filter outliers. Our methods address the critical limitations of previous approaches, allowing functionality even when Byzantine nodes are in the majority. Moreover, our algorithms adapt to widely used scaled methods like Adam and RMSProp, as well as practical scenarios, including local training and partial participation. We validate the robustness of our methods by conducting extensive experiments on both synthetic and real ECG data collected from medical institutions. Furthermore, we provide a broad theoretical analysis of our algorithms and their extensions to aforementioned practical setups. The convergence guarantees of our methods are comparable to those of classical algorithms developed without Byzantine interference.
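A schematic of the trial-function idea: score each client update by the loss it induces on a small trusted batch held by the server, convert scores to trust weights, and aggregate. The quadratic toy model and the exponential score-to-weight map are assumptions, not the paper's exact construction.

```python
# Schematic trust-score aggregation against Byzantine clients.
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients = 10, 8
w = np.zeros(dim)                                   # current global model

X, y = rng.normal(size=(32, dim)), rng.normal(size=32)  # server's trial batch
def trial_loss(model):                              # trial function
    return float(np.mean((X @ model - y) ** 2))

honest = -0.1 * (w - 1.0) + 0.01 * rng.normal(size=(n_clients - 3, dim))
byzantine = 10.0 * rng.normal(size=(3, dim))        # large malicious updates
updates = np.vstack([honest, byzantine])

losses = np.array([trial_loss(w + u) for u in updates])
trust = np.exp(-(losses - losses.min()))            # low trial loss -> trust
trust /= trust.sum()
w = w + (trust[:, None] * updates).sum(axis=0)      # trust-weighted average
print(np.round(trust, 3))   # Byzantine rows receive near-zero weight
```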
[786] Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning
Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang
Main category: cs.LG
TL;DR: The paper introduces Bottlenecked Transformer, which enhances LLMs with memory consolidation/reconsolidation via KV cache rewrites to improve reasoning performance, achieving up to +6.6pp gains on math benchmarks.
Details
Motivation: Existing ALSC methods are limited, while memory consolidation/reconsolidation from neuroscience offers an underexplored alternative for improving Transformer reasoning by making memory traces plastic and integrable with new context.
Method: Augments the backbone LLM with a Cache Processor, an auxiliary Transformer that performs periodic, non-causal, in-place KV rewrites at reasoning step boundaries, consolidating recent KV entries and reconsolidating top-k attention-selected prior entries.
Result: Consistent performance gains over vanilla Transformers and pause-token baselines, with up to +6.6pp improvement on math reasoning benchmarks for selected tasks/backbones.
Conclusion: Memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning, as demonstrated by the Bottlenecked Transformer architecture’s superior performance on math reasoning tasks.
Abstract: Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space “thinking” chains of thought. A growing line of work pushes extra computation into the model’s latent space, which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent rollouts, (ii) residual/activation steering, and (iii) memory (KV) compression. An underexplored alternative is memory consolidation/reconsolidation, two processes in the brain that are responsible for stabilising newly formed memory traces, and, upon recall, transiently rendering established traces plastic such that they can integrate new contextual information before restabilising. In Transformer LLMs, this can be seen as analogous to performing in-place rewrites of new KV segments, and rewrites of recalled past segments. In this work, we give a theoretical justification as to why memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning. We do this through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input information compression and retention of predictive information in latent representations. We then introduce the Bottlenecked Transformer, which augments a backbone LLM with a Cache Processor, an auxiliary Transformer that performs periodic, non-causal, in-place KV rewrites at newline-delimited reasoning step boundaries. The Processor consolidates recently written KV entries and reconsolidates a small, top-k attention-selected set of prior entries. We evaluate our Bottlenecked Transformer architecture on math reasoning benchmarks. Our model sees consistent performance gains over vanilla Transformers and pause-token augmented baselines, with gains of up to +6.6pp for selected tasks/backbones.
[787] Informed Forecasting: Leveraging Auxiliary Knowledge to Boost LLM Performance on Time Series Forecasting
Mohammadmahdi Ghasemloo, Alireza Moradi
Main category: cs.LG
TL;DR: A cross-domain knowledge transfer framework enhances LLMs for time series forecasting by infusing structured temporal information, significantly improving accuracy over uninformed baselines.
Details
Motivation: To establish best practices for using LLMs beyond traditional NLP tasks and bridge the gap between LLMs and domain-specific forecasting applications in energy, finance, and healthcare.
Method: Proposes a novel cross-domain knowledge transfer framework that systematically infuses LLMs with structured temporal information for time series forecasting.
Result: The knowledge-informed forecasting approach significantly outperforms the uninformed baseline in predictive accuracy and generalization on real-world time series datasets.
Conclusion: Knowledge transfer strategies have strong potential to bridge the gap between LLMs and domain-specific forecasting tasks, enabling better performance in time series applications.
Abstract: With the widespread adoption of Large Language Models (LLMs), there is a growing need to establish best practices for leveraging their capabilities beyond traditional natural language tasks. In this paper, a novel cross-domain knowledge transfer framework is proposed to enhance the performance of LLMs in time series forecasting – a task of increasing relevance in fields such as energy systems, finance, and healthcare. The approach systematically infuses LLMs with structured temporal information to improve their forecasting accuracy. This study evaluates the proposed method on a real-world time series dataset and compares it to a naive baseline where the LLM receives no auxiliary information. Results show that knowledge-informed forecasting significantly outperforms the uninformed baseline in terms of predictive accuracy and generalization. These findings highlight the potential of knowledge transfer strategies to bridge the gap between LLMs and domain-specific forecasting tasks.
[788] FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models
Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh
Main category: cs.LG
TL;DR: A computationally efficient low-rank optimization method for LLMs using Discrete Cosine Transform (DCT) to approximate SVD/QR-based gradient projections, achieving faster runtime and 25% memory reduction while matching performance.
Details
Motivation: To address the computational expense and memory costs of traditional SVD/QR-based low-rank optimization methods for large language models, which require expensive projections for each layer.
Method: A two-step procedure using predefined DCT orthogonal matrices: 1) compute projection via matmul with DCT matrix in O(n³) time, 2) lightweight sorting to select most relevant basis vectors. For large layers, uses FFT-based DCT computation in O(n² log(n)) time.
Result: Achieves rank-independent running time, matches SVD/QR performance on pre-training and fine-tuning tasks, with 25% memory reduction across different model sizes.
Conclusion: The DCT-based approach provides an efficient alternative to costly SVD/QR methods for low-rank optimization in LLMs, offering significant computational and memory benefits while maintaining performance.
Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul’s $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25\%$ across different model sizes. Our code is available at https://github.com/IST-DASLab/ISTA-DASLab-Optimizers/tree/main/ista_daslab_optimizers/fft_low_rank.
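The two-step procedure is easy to sketch with an off-the-shelf DCT. The column-scoring rule below (norm of the DCT coefficients of the gradient) is one plausible reading of "alignment"; the paper's exact selection criterion may differ.

```python
# Sketch: build the orthogonal DCT basis once, then keep the columns best
# aligned with a layer's gradient.
import numpy as np
from scipy.fft import dct

n, rank = 64, 8
D = dct(np.eye(n), axis=0, norm="ortho")       # orthogonal DCT-II basis, n x n

rng = np.random.default_rng(0)
G = rng.normal(size=(n, n))                    # stand-in for a layer gradient

scores = np.linalg.norm(D.T @ G, axis=1)       # alignment of each basis vector
top = np.argsort(scores)[-rank:]               # lightweight sorting step
P = D[:, top]                                  # effective projection matrix

G_low = P.T @ G                                # gradient in the low-rank space
print(G_low.shape, np.allclose(P.T @ P, np.eye(rank)))
```

Since `D` is fixed, it is computed once at the start of training; only the cheap scoring and sorting step is repeated per layer.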
[789] Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere
Li Ju, Max Andersson, Stina Fredriksson, Edward Glöckner, Andreas Hellander, Ekta Vats, Prashant Singh
Main category: cs.LG
TL;DR: AsymVLM addresses asymmetric uncertainty in vision-language models by building probabilistic embeddings on the unit hypersphere, enabling uncertainty quantification.
Details
Motivation: Deterministic VLMs fail to capture inherent ambiguity and uncertainty in natural language and visual data, and existing probabilistic methods don't account for asymmetric uncertainty structure and unit hypersphere constraints.
Method: Proposed AsymVLM builds probabilistic embeddings from pre-trained VLMs on the unit hypersphere, addressing asymmetric uncertainty structure in textual and visual data.
Result: Validated effectiveness on established benchmarks and demonstrated inherent asymmetry in uncertainty structure through comprehensive ablation studies.
Conclusion: AsymVLM successfully addresses asymmetric uncertainty in VLMs by creating probabilistic embeddings on the unit hypersphere, enabling better uncertainty quantification.
Abstract: Vision-language models (VLMs) as foundation models have significantly enhanced performance across a wide range of visual and textual tasks, without requiring large-scale training from scratch for downstream tasks. However, these deterministic VLMs fail to capture the inherent ambiguity and uncertainty in natural language and visual data. Recent probabilistic post-hoc adaptation methods address this by mapping deterministic embeddings onto probability distributions; however, existing approaches do not account for the asymmetric uncertainty structure of the modalities, and the constraint that meaningful deterministic embeddings reside on a unit hypersphere, potentially leading to suboptimal performance. In this paper, we address the asymmetric uncertainty structure inherent in textual and visual data, and propose AsymVLM to build probabilistic embeddings from pre-trained VLMs on the unit hypersphere, enabling uncertainty quantification. We validate the effectiveness of the probabilistic embeddings on established benchmarks, and present comprehensive ablation studies demonstrating the inherent nature of asymmetry in the uncertainty structure of textual and visual data.
[790] Can LLMs Alleviate Catastrophic Forgetting in Graph Continual Learning? A Systematic Study
Ziyang Cheng, Zhixun Li, Yuhan Li, Yixin Song, Kangyi Zhao, Dawei Cheng, Jia Li, Hong Cheng, Jeffrey Xu Yu
Main category: cs.LG
TL;DR: This paper investigates whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL), identifies flaws in current GCL experimental setups, and proposes a simple-yet-effective method called SimGCL that significantly outperforms previous GNN-based baselines.
Details
Motivation: Real-world graph data often arrives in a streaming manner, requiring learning systems to continuously acquire new knowledge without forgetting previous information. With the rise of pretrained models, the authors want to explore whether LLMs' strong generalization ability can help address catastrophic forgetting in GCL.
Method: The authors first identify flaws in current GCL experimental setups (task ID leakage), then evaluate LLMs in more realistic scenarios, and finally propose SimGCL, a simple-yet-effective method for GCL that works under rehearsal-free constraints.
Result: The proposed SimGCL method surpasses previous state-of-the-art GNN-based baselines by around 20% under rehearsal-free constraint. Minor modifications to LLMs can lead to outstanding results in GCL.
Conclusion: LLMs can effectively mitigate catastrophic forgetting in Graph Continual Learning, and the proposed SimGCL method provides significant improvements over existing approaches. The authors also provide an easy-to-use benchmark LLM4GCL for evaluating GCL methods.
Abstract: Nowadays, real-world data, including graph-structured data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although many existing works attempt to address catastrophic forgetting in graph machine learning, they are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: https://github.com/ZhixunLEE/LLM4GCL.
[791] Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation
Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang
Main category: cs.LG
TL;DR: PSOFT is a parameter-efficient fine-tuning method that confines orthogonal transformations to the principal subspace of pre-trained weights, achieving better semantic preservation, expressiveness, and efficiency than existing orthogonal fine-tuning approaches.
Details
Motivation: Existing orthogonal fine-tuning methods struggle to balance expressiveness and efficiency (parameter counts, memory, computation) while preserving semantic representations of pre-trained models.
Method: PSOFT constructs principal subspace via matrix decomposition, establishes theoretical conditions to maintain subspace geometry for semantic preservation, and uses efficient tunable vectors that gradually relax orthogonality during training.
Result: Extensive experiments on 35 NLP and CV tasks across four models show PSOFT achieves semantic preservation, expressiveness, and multi-dimensional efficiency simultaneously.
Conclusion: PSOFT provides a practical and scalable solution for parameter-efficient fine-tuning that overcomes limitations of existing orthogonal fine-tuning methods.
Abstract: Driven by the rapid growth of model parameters, parameter-efficient fine-tuning (PEFT) has become essential for adapting large models to diverse downstream tasks under constrained computational resources. Within this paradigm, orthogonal fine-tuning and its variants preserve semantic representations of pre-trained models, but struggle to achieve both expressiveness and efficiency in terms of parameter counts, memory, and computation. To overcome this limitation, we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the principal subspace of pre-trained weights. Specifically, PSOFT constructs this subspace via matrix decomposition to enable compatible transformations with higher effective rank, establishes a theoretical condition that strictly maintains the geometry of this subspace for essential semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training to enhance adaptability. Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT. The code is publicly available at https://github.com/fei407/PSOFT.
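A minimal sketch of orthogonal adaptation restricted to the principal subspace is given below, using a Cayley map for the orthogonal transform. The Cayley parameterization and the omission of PSOFT's tunable relaxation vectors are simplifying assumptions.

```python
# Sketch: orthogonal update confined to the principal subspace of W.
import torch

torch.manual_seed(0)
d_out, d_in, r = 64, 32, 8
W = torch.randn(d_out, d_in)                    # pre-trained weight

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
U_r = U[:, :r]                                  # principal subspace basis

A = torch.zeros(r, r, requires_grad=True)       # trainable; identity at init

def cayley(A):
    """Orthogonal r x r matrix from a skew-symmetric parameterization."""
    S_skew = A - A.T
    I = torch.eye(A.shape[0])
    return torch.linalg.solve(I + S_skew, I - S_skew)

R = cayley(A)                                   # orthogonal transform
# Rotate only the principal components; the complement of U_r is untouched.
W_adapted = W + U_r @ (R - torch.eye(r)) @ (U_r.T @ W)
print(W_adapted.shape)   # gradients flow into A during fine-tuning
```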
[792] HD-PiSSA: High-Rank Distributed Orthogonal Adaptation
Yiding Wang, Fauxu Meng, Xuefeng Zhang, Fan Jiang, Pingzhi Tang, Muhan Zhang
Main category: cs.LG
TL;DR: HD-PiSSA is a distributed PEFT method that assigns different principal components of pre-trained weights to each GPU, achieving higher effective update ranks and better performance on complex tasks compared to LoRA and PiSSA.
Details
Motivation: Existing PEFT methods like LoRA and PiSSA constrain model updates to low-rank subspaces, limiting expressiveness and leading to suboptimal performance on complex tasks.
Method: HD-PiSSA initializes orthogonal adapters across different devices and aggregates their delta updates collectively on W for fine-tuning, assigning different principal components to each GPU to expand update directions.
Result: HD-PiSSA achieves over 16x higher effective updated ranks than data-parallel LoRA/PiSSA on 8 GPUs. In multi-task learning, it gains 10.0 points (14.63%) over LoRA and 4.98 points (6.60%) over PiSSA across 12 benchmarks.
Conclusion: HD-PiSSA provides significant performance improvements on complex tasks by enabling higher-rank updates through distributed orthogonal adapters, demonstrating benefits from extra optimization flexibility.
Abstract: Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce High-rank Distributed PiSSA (HD-PiSSA), a distributed PEFT approach that initializes orthogonal adapters across different devices and aggregates their delta updates collectively on W for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16x higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, we evaluate HD-PiSSA across various challenging downstream tasks, including mathematics, code generation, and multi-task learning. In the multi-task setting, HD-PiSSA achieves average gains of 10.0 absolute points (14.63%) over LoRA and 4.98 points (6.60%) over PiSSA across 12 benchmarks, demonstrating its benefits from the extra optimization flexibility.
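The device assignment can be sketched in a few lines: each GPU takes a disjoint slice of the SVD spectrum, so the collective update spans rank n_gpus x r rather than r. Communication and the actual fine-tuning step are omitted; the slicing scheme is an illustrative reading of the paper's assignment.

```python
# Sketch: assign disjoint principal components of W to each device.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

n_gpus, r = 8, 4                                # per-device adapter rank
slices = [slice(i * r, (i + 1) * r) for i in range(n_gpus)]

# Each device initializes its adapter from a different block of components.
adapters = [(U[:, sl] * s[sl], Vt[sl, :]) for sl in slices]  # (A_i, B_i)

effective_rank = sum(a.shape[1] for a, _ in adapters)
print(f"effective update rank across devices: {effective_rank}")  # 32 vs 4
```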
[793] Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks
Francesco D’Amico, Dario Bocchi, Matteo Negri
Main category: cs.LG
TL;DR: The paper identifies two novel dynamical scaling laws that describe how model performance evolves during training, rather than just at convergence, and shows these laws apply across various architectures and datasets.
Details
Motivation: Most scaling law studies focus only on asymptotic behavior at training completion, but understanding the entire training dynamics could provide deeper insights into model behavior and interpretability.
Method: Analyzed training dynamics across CNNs, ResNets, and Vision Transformers on MNIST, CIFAR-10, and CIFAR-100, and provided analytical support using a single-layer perceptron with logistic loss to derive the dynamical scaling laws.
Result: Identified two novel dynamical scaling laws that govern performance evolution as functions of norm-based complexity measures, which together recover the well-known scaling for test error at convergence.
Conclusion: The findings reveal a richer picture of scaling laws throughout training, consistent across architectures and datasets, with analytical support explaining the phenomena through gradient-based training’s implicit bias.
Abstract: Scaling laws in deep learning – empirical power-law relationships linking model performance to resource growth – have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training. In this work, we describe a richer picture by analyzing the entire training dynamics: we identify two novel dynamical scaling laws that govern how performance evolves as a function of different norm-based complexity measures. Combined, our new laws recover the well-known scaling for test error at convergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100. Furthermore, we provide analytical support using a single-layer perceptron trained with logistic loss, where we derive the new dynamical scaling laws, and we explain them through the implicit bias induced by gradient-based training.
[794] Beyond the Proxy: Trajectory-Distilled Guidance for Offline GFlowNet Training
Ruishuo Chen, Xun Wang, Rui Hu, Zhuoran Li, Longbo Huang
Main category: cs.LG
TL;DR: TD-GFN is a proxy-free offline training framework for GFlowNets that uses trajectory distillation to learn dense edge rewards from offline data, preventing error propagation while enabling efficient exploration.
Details
Motivation: Existing GFlowNet training methods face challenges in offline settings: proxy-based methods suffer from error propagation, while proxy-free approaches use coarse constraints that limit exploration.
Method: TD-GFN learns transition-level edge rewards via inverse reinforcement learning from offline trajectories, then uses these rewards indirectly through DAG pruning and prioritized backward sampling to guide policy training without propagating errors.
Result: TD-GFN significantly outperforms existing baselines in both convergence speed and final sample quality across experiments.
Conclusion: TD-GFN establishes a more robust and efficient paradigm for offline GFlowNet training by combining dense structural guidance with error-free gradient updates.
Abstract: Generative Flow Networks (GFlowNets) are effective at sampling diverse, high-reward objects, but in many real-world settings where new reward queries are infeasible, they must be trained from offline datasets. The prevailing proxy-based training methods are susceptible to error propagation, while existing proxy-free approaches often use coarse constraints that limit exploration. To address these issues, we propose Trajectory-Distilled GFlowNet (TD-GFN), a novel proxy-free training framework. TD-GFN learns dense, transition-level edge rewards from offline trajectories via inverse reinforcement learning to provide rich structural guidance for efficient exploration. Crucially, to ensure robustness, these rewards are used indirectly to guide the policy through DAG pruning and prioritized backward sampling of training trajectories. This ensures that final gradient updates depend only on ground-truth terminal rewards from the dataset, thereby preventing error propagation. Experiments show that TD-GFN significantly outperforms a broad range of existing baselines in both convergence speed and final sample quality, establishing a more robust and efficient paradigm for offline GFlowNet training.
[795] Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
Main category: cs.LG
TL;DR: AdvCLIP-LoRA is the first method to enhance adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings, achieving state-of-the-art performance across multiple datasets and backbones.
Details
Motivation: Vision-Language Models like CLIP are vulnerable to adversarial attacks, and existing Parameter-Efficient Fine-Tuning methods lack robustness. Adversarial training is needed to improve model robustness in PEFT scenarios.
Method: Formulates training as a minimax optimization over low-rank adapters and adversarial perturbations, enabling robust adaptation with small trainable footprint in few-shot settings.
Result: Achieves state-of-the-art performance in few-shot classification, adversarial base-to-new generalization, and cross-dataset transfer across eight datasets and two backbones (ViT-B/16 and ViT-B/32), delivering higher adversarial robustness than prompt tuning baselines without sacrificing much clean accuracy.
Conclusion: AdvCLIP-LoRA is a practical approach for robust adaptation of Vision-Language Models in resource-constrained settings.
Abstract: Vision-Language Models (VLMs) such as CLIP have shown remarkable performance in cross-modal tasks through large-scale contrastive pre-training. To adapt these large transformer-based models efficiently for downstream tasks, Parameter-Efficient Fine-Tuning (PEFT) techniques like (Low-Rank Adaptation) LoRA have emerged as scalable alternatives to full fine-tuning, especially in few-shot scenarios. However, like traditional deep neural networks, VLMs are highly vulnerable to adversarial attacks, where imperceptible perturbations can significantly degrade model performance. Adversarial training remains the most effective strategy for improving model robustness in PEFT. In this work, we propose AdvCLIP-LoRA, to our knowledge the first method designed to enhance the adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings. Our method formulates training as a minimax optimization over low-rank adapters and adversarial perturbations, enabling robust adaptation with a small trainable footprint. Across eight datasets and two backbones (ViT-B/16 and ViT-B/32), AdvCLIP-LoRA achieves state-of-the-art performance in few-shot classification, adversarial base-to-new generalization, and cross-dataset transfer, delivering higher adversarial robustness than prompt tuning baselines without sacrificing much clean accuracy. These findings highlight AdvCLIP-LoRA as a practical approach for robust adaptation of VLMs in resource-constrained settings.
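A schematic of the minimax training step: an inner PGD loop crafts input perturbations, and only the adapter parameters take the outer descent step. The linear stand-in model, loss, and PGD hyperparameters below are toy assumptions, not CLIP or the paper's exact setup.

```python
# Schematic minimax step: PGD on inputs, descent on (stand-in) LoRA params.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(32, 4)                  # stand-in for a LoRA'd encoder
x, y = torch.randn(8, 32), torch.randint(0, 4, (8,))
loss_fn = torch.nn.CrossEntropyLoss()
eps, alpha_pgd = 0.1, 0.02

# Inner maximization: PGD on the inputs.
delta = torch.zeros_like(x, requires_grad=True)
for _ in range(5):
    loss = loss_fn(model(x + delta), y)
    grad, = torch.autograd.grad(loss, delta)
    delta = (delta + alpha_pgd * grad.sign()).clamp(-eps, eps).detach()
    delta.requires_grad_(True)

# Outer minimization: update only the trainable adapter parameters.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
opt.zero_grad()
loss_fn(model(x + delta), y).backward()
opt.step()
print("adversarial loss:", loss_fn(model(x + delta), y).item())
```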
[796] Spectral-inspired Operator Learning with Limited Data and Unknown Physics
Han Wan, Rui Zhang, Hao Sun
Main category: cs.LG
TL;DR: SINO is a neural operator that learns PDE dynamics from just 2-5 trajectories without requiring known physics, achieving state-of-the-art performance with 1-2 orders of magnitude accuracy improvement.
Details
Motivation: Existing neural PDE solvers require large datasets or rely on known physics (PDE residuals or handcrafted stencils), limiting their applicability for learning from limited data with unknown physics.
Method: SINO automatically captures local and global spatial derivatives from frequency indices, uses a Pi-block for multiplicative operations on spectral features to model nonlinear effects, and employs a low-pass filter to suppress aliasing.
Result: SINO achieves state-of-the-art performance on 2D and 3D PDE benchmarks with 1-2 orders of magnitude accuracy improvement. With only 5 training trajectories, it outperforms data-driven methods trained on 1000 trajectories and remains predictive on challenging out-of-distribution cases.
Conclusion: SINO enables effective learning of PDE dynamics from extremely limited data without requiring explicit PDE terms, demonstrating superior performance and generalization compared to existing methods.
Abstract: Learning PDE dynamics from limited data with unknown physics is challenging. Existing neural PDE solvers either require large datasets or rely on known physics (e.g., PDE residuals or handcrafted stencils), leading to limited applicability. To address these challenges, we propose Spectral-Inspired Neural Operator (SINO), which can model complex systems from just 2-5 trajectories, without requiring explicit PDE terms. Specifically, SINO automatically captures both local and global spatial derivatives from frequency indices, enabling a compact representation of the underlying differential operators in physics-agnostic regimes. To model nonlinear effects, it employs a Pi-block that performs multiplicative operations on spectral features, complemented by a low-pass filter to suppress aliasing. Extensive experiments on both 2D and 3D PDE benchmarks demonstrate that SINO achieves state-of-the-art performance, with improvements of 1-2 orders of magnitude in accuracy. Particularly, with only 5 training trajectories, SINO outperforms data-driven methods trained on 1000 trajectories and remains predictive on challenging out-of-distribution cases where other methods fail.
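Spectral derivatives from frequency indices and a multiplicative nonlinearity are easy to illustrate in 1D. The 2/3-rule cutoff below is a standard anti-aliasing choice assumed for illustration; SINO's learned Pi-block and filter differ in architecture.

```python
# Sketch: spectral derivatives and a filtered multiplicative nonlinearity.
import numpy as np

n, L = 128, 2 * np.pi
x = np.linspace(0.0, L, n, endpoint=False)
u = np.sin(3 * x) + 0.5 * np.cos(5 * x)

k = np.fft.fftfreq(n, d=L / n) * 2 * np.pi     # angular frequency indices
u_hat = np.fft.fft(u)
u_x = np.fft.ifft(1j * k * u_hat).real         # spectral first derivative
u_xx = np.fft.ifft(-(k ** 2) * u_hat).real     # spectral second derivative

# Multiplication of spectral features models nonlinear terms (e.g., u * u_x),
# followed by a low-pass filter to suppress aliasing.
nl_hat = np.fft.fft(u * u_x)
nl_hat[np.abs(k) > k.max() * 2 / 3] = 0.0      # 2/3-rule style cutoff
nonlinear_filtered = np.fft.ifft(nl_hat).real

print(np.allclose(u_x, 3 * np.cos(3 * x) - 2.5 * np.sin(5 * x), atol=1e-8))
```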
[797] Forward-only Diffusion Probabilistic Models
Ziwei Luo, Fredrik K. Gustafsson, Jens Sjölund, Thomas B. Schön
Main category: cs.LG
TL;DR: Forward-only diffusion (FoD) is a simple generative model that uses a single forward diffusion process with mean-reverting SDEs, achieving state-of-the-art performance on image restoration tasks without backward diffusion.
Details
Motivation: Traditional diffusion models require complex forward-backward diffusion schemes, which can be computationally expensive and complex to implement. FoD aims to simplify this by using only forward diffusion.
Method: Uses a state-dependent stochastic differential equation with mean-reverting terms in both drift and diffusion functions. Trained with stochastic flow matching objective, enabling few-step non-Markov chain sampling.
Result: Achieves state-of-the-art performance on various image restoration tasks and demonstrates strong qualitative results on image-to-image translation tasks.
Conclusion: FoD provides a simpler yet effective alternative to traditional diffusion models, offering analytical tractability and efficient sampling while maintaining competitive performance.
Abstract: This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves state-of-the-art performance on various image restoration tasks. Its general applicability on image-conditioned generation is also demonstrated via qualitative results on image-to-image translation. Our code is available at https://github.com/Algolzw/FoD.
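The mean-reverting forward process can be illustrated with a one-dimensional Euler-Maruyama simulation. The specific drift and state-dependent diffusion coefficients below are illustrative assumptions, chosen so that the noise vanishes as the state reaches the target, mirroring convergence to clean data.

```python
# Euler-Maruyama simulation of a mean-reverting, state-dependent SDE.
import numpy as np

rng = np.random.default_rng(0)
x_target = 2.0                       # stand-in for a clean data sample
x = rng.normal()                     # source sample (e.g., degraded input)

theta, sigma0 = 4.0, 0.8             # mean-reversion speed, noise scale
T, n_steps = 1.0, 200
dt = T / n_steps

for _ in range(n_steps):
    drift = theta * (x_target - x)                 # mean-reverting drift
    diffusion = sigma0 * abs(x_target - x) ** 0.5  # shrinks near the target
    x += drift * dt + diffusion * np.sqrt(dt) * rng.normal()

print(f"final state {x:.3f} vs target {x_target}")
```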
[798] SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion Training
Xiaomeng Yang, Zhiyu Tan, Junyan Wang, Zhijian Zhou, Hao Li
Main category: cs.LG
TL;DR: The paper addresses instability and off-policy bias in diffusion-based preference learning methods like Diffusion-DPO, proposing two solutions: DPO-C&M for practical stability improvement and SDPO for principled off-policy bias correction.
Details
Motivation: Existing diffusion preference learning methods suffer from timestep-dependent instability (due to mismatch between reverse/forward processes and high gradient variance in early timesteps) and off-policy bias from policy mismatch.
Method: Two approaches: 1) DPO-C&M, a practical strategy that clips and masks uninformative timesteps; 2) SDPO, a principled framework incorporating importance sampling to correct off-policy bias and emphasize informative updates.
Result: Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B show both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness.
Conclusion: Timestep-aware, distribution-corrected optimization is crucial for effective diffusion-based preference learning, with SDPO providing the most comprehensive solution.
Abstract: Preference learning has become a central technique for aligning generative models with human expectations. Recently, it has been extended to diffusion models through methods like Direct Preference Optimization (DPO). However, existing approaches such as Diffusion-DPO suffer from two key challenges: timestep-dependent instability, caused by a mismatch between the reverse and forward diffusion processes and by high gradient variance in early noisy timesteps, and off-policy bias arising from the mismatch between optimization and data collection policies. We begin by analyzing the reverse diffusion trajectory and observe that instability primarily occurs at early timesteps with low importance weights. To address these issues, we first propose DPO-C&M, a practical strategy that improves stability by clipping and masking uninformative timesteps while partially mitigating off-policy bias. Building on this, we introduce SDPO (Importance-Sampled Direct Preference Optimization), a principled framework that incorporates importance sampling into the objective to fully correct for off-policy bias and emphasize informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B demonstrate that both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness. These results highlight the importance of timestep-aware, distribution-corrected optimization in diffusion-based preference learning.
[799] SPAR: Self-supervised Placement-Aware Representation Learning for Distributed Sensing
Yizhuo Chen, Tianchen Wang, You Lyu, Yanlan Hu, Jinyang Li, Tomoyoshi Kimura, Hongjue Zhao, Yigong Hu, Denizhan Kara, Tarek Abdelzaher
Main category: cs.LG
TL;DR: SPAR is a self-supervised framework for placement-aware representation learning in distributed sensing, treating sensor placements as intrinsic to learning rather than auxiliary metadata.
Details
Motivation: Existing pretraining methods for distributed sensing remain largely placement-agnostic, failing to account for how sensor placements (spatial locations and structural roles) inseparably shape observed signals across applications like vehicle monitoring and earthquake localization.
Method: SPAR introduces spatial and structural positional embeddings with dual reconstruction objectives, explicitly modeling the duality between signals and positions: how observing positions and observed signals shape each other.
Result: Extensive experiments on three real-world datasets show SPAR achieves superior robustness and generalization across various modalities, placements, and downstream tasks.
Conclusion: SPAR successfully addresses the placement challenge in distributed sensing by treating placement as intrinsic to representation learning, with theoretical support from information theory and occlusion-invariant learning.
Abstract: We present SPAR, a framework for self-supervised placement-aware representation learning in distributed sensing. Distributed sensing spans applications where multiple spatially distributed and multimodal sensors jointly observe an environment, from vehicle monitoring to human activity recognition and earthquake localization. A central challenge shared by this wide spectrum of applications, is that observed signals are inseparably shaped by sensor placements, including their spatial locations and structural roles. However, existing pretraining methods remain largely placement-agnostic. SPAR addresses this gap through a unifying principle: the duality between signals and positions. Guided by this principle, SPAR introduces spatial and structural positional embeddings together with dual reconstruction objectives, explicitly modeling how observing positions and observed signals shape each other. Placement is thus treated not as auxiliary metadata but as intrinsic to representation learning. SPAR is theoretically supported by analyses from information theory and occlusion-invariant learning. Extensive experiments on three real-world datasets show that SPAR achieves superior robustness and generalization across various modalities, placements, and downstream tasks.
[800] Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration
Jingtong Gao, Ling Pan, Yejing Wang, Rui Zhong, Chi Lu, Qingpeng Cai, Peng Jiang, Xiangyu Zhao
Main category: cs.LG
TL;DR: i-MENTOR is a new RL method for LLM reasoning that addresses sparse reward limitations by providing dense rewards and enhanced exploration through trajectory-aware rewards, error-conditioned allocation, and advantage-preserving integration.
Details
Motivation: Current RL approaches like PPO and GRPO rely on sparse outcome-based rewards which provide insufficient feedback for challenging problems and create biases that prioritize exploitation over exploration, hindering performance in complex reasoning tasks.
Method: i-MENTOR introduces three key innovations: trajectory-aware exploration rewards to mitigate token-level strategy bias, error-conditioned reward allocation for efficient exploration on challenging samples, and advantage-preserving integration to maintain advantage distribution integrity.
Result: Experiments across 4 public datasets show i-MENTOR achieves significant improvements, including a 22.23% improvement on AIME 2024.
Conclusion: i-MENTOR effectively addresses the limitations of sparse rewards in RL for LLM reasoning by providing dense rewards and enhanced exploration mechanisms, leading to substantial performance gains in complex reasoning tasks.
Abstract: Reinforcement Learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). However, prevalent RL approaches such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. These limitations result in inefficient guidance for reasoning. Specifically, sparse rewards fail to deliver sufficient feedback, particularly for challenging problems. Furthermore, such rewards induce systematic biases that prioritize exploitation of familiar trajectories over novel solution discovery. These shortcomings critically hinder performance in complex reasoning tasks, which inherently demand iterative refinement across intermediate steps. To address these challenges, we propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), a method designed to deliver dense rewards and amplify exploration in the RL-based paradigm. i-MENTOR introduces three innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies while maintaining computational efficiency; error-conditioned reward allocation to ensure efficient exploration on challenging samples while intrinsically stabilizing training; and advantage-preserving integration that maintains advantage distribution integrity while incorporating exploratory guidance. Experiments across 4 public datasets demonstrate i-MENTOR’s effectiveness, achieving a 22.23% improvement on AIME 2024.
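To mirror the shape of a dense, trajectory-aware exploration reward, here is a toy count-based bonus assigned at the trajectory level and conditioned on correctness. Everything in it (the hashing, the inverse-square-root decay, the beta weight) is a hypothetical stand-in, not i-MENTOR's actual reward design.

```python
# Toy trajectory-level exploration bonus on top of a sparse outcome reward.
from collections import Counter

counts = Counter()

def shaped_reward(trajectory: str, correct: bool, beta: float = 0.2) -> float:
    key = hash(trajectory) % (1 << 20)   # trajectory-aware, not token-level
    counts[key] += 1
    novelty = counts[key] ** -0.5        # decays as a path is revisited
    # Error-conditioned allocation: explore harder when the answer is wrong.
    bonus = beta * novelty if not correct else 0.5 * beta * novelty
    return float(correct) + bonus

print(shaped_reward("step1 -> step2 -> answer", correct=False))
print(shaped_reward("step1 -> step2 -> answer", correct=False))  # smaller bonus
```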
[801] Model-Preserving Adaptive Rounding
Albert Tseng, Zhaofeng Sun, Christopher De Sa
Main category: cs.LG
TL;DR: YAQA is a quantization algorithm that directly minimizes end-to-end output error rather than layer-wise activation error, achieving 30% better performance than existing methods with no inference overhead.
Details
Motivation: Existing quantization methods minimize immediate activation error per layer, which ignores the effect of future layers and is a poor proxy for end-to-end model performance.
Method: Uses adaptive rounding with theoretical error bounds based on Hessian approximations, employing Kronecker-factored approximation with near-optimal Hessian sketches to directly optimize for end-to-end output distribution.
Result: YAQA reduces quantization error by ≈30% compared to GPTQ/LDLQ, achieves lower error than quantization aware training, and provides state-of-the-art performance on downstream tasks without adding inference overhead.
Conclusion: Directly optimizing for end-to-end error through adaptive rounding with proper Hessian approximations significantly outperforms layer-wise quantization approaches and even surpasses training-based quantization methods.
Abstract: The goal of quantization is to produce a compressed model whose output distribution is as close to the original model’s as possible. To do this tractably, most quantization algorithms minimize the immediate activation error of each layer as a proxy for the end-to-end error. However, this ignores the effect of future layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly considers the error at the network’s output. YAQA introduces a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation’s cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ over these methods. YAQA even achieves a lower error than quantization aware training. This translates to state-of-the-art performance on downstream tasks, all while adding no inference overhead.
[802] ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior
Florian Eichin, Yupei Du, Philipp Mondorf, Maria Matveev, Barbara Plank, Michael A. Hedderich
Main category: cs.LG
TL;DR: ExPLAIND is a unified interpretability framework that integrates model components, data, and training trajectory perspectives using gradient path kernels, providing theoretically grounded influence scores and insights into training dynamics like Grokking.
Details
Motivation: Existing post-hoc interpretability methods analyze model components, data, or training trajectory in isolation, leading to fragmented explanations that miss key interactions and lack theoretical support.
Method: Generalizes gradient path kernels to realistic settings (AdamW), validates CNN/Transformer replication, derives parameter- and step-wise influence scores from kernel feature maps, and jointly interprets model components and data over training.
Result: Successfully replicated CNN and Transformer models; influence scores perform comparably to existing methods for parameter pruning; Grokking analysis reveals refined final phase as alignment of embeddings and final layers around representation pipeline learned after memorization.
Conclusion: ExPLAIND provides a theoretically grounded, unified framework for interpreting model behavior and training dynamics, bridging previously isolated interpretability perspectives.
Abstract: Post-hoc interpretability methods typically attribute a model’s behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, such approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all these perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to realistic settings like AdamW. We empirically validate that a CNN and a Transformer are accurately replicated by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. Their effectiveness for parameter pruning is comparable to existing methods, demonstrating their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.
[803] Mamba Integrated with Physics Principles Masters Long-term Chaotic System Forecasting
Chang Liu, Bohao Zhao, Jingtao Ding, Huandong Wang, Yong Li
Main category: cs.LG
TL;DR: PhyxMamba integrates Mamba-based state-space models with physics-informed principles to forecast chaotic systems from short-term observations, using attractor reconstruction, generative training, and physical constraints for improved long-term accuracy.
Details
Motivation: Long-term forecasting of chaotic systems is challenging due to sensitivity to initial conditions and complex attractor geometry. Existing methods require extensive training data and struggle with predictive stability over extended horizons.
Method: Reconstructs attractor manifold with time-delay embeddings, uses Mamba-based state-space model with generative training scheme, and incorporates multi-patch prediction with attractor geometry regularization for physical constraints.
Result: Superior forecasting accuracy and faithful capture of essential statistics from short-term historical observations across simulated and real-world chaotic systems.
Conclusion: PhyxMamba effectively addresses long-term chaotic system forecasting by combining modern sequence modeling with physics-informed principles, enabling accurate predictions from limited observational data.
Abstract: Long-term forecasting of chaotic systems remains a fundamental challenge due to the intrinsic sensitivity to initial conditions and the complex geometry of strange attractors. Conventional approaches, such as reservoir computing, typically require training data that incorporates long-term continuous dynamical behavior to comprehensively capture system dynamics. While advanced deep sequence models can capture transient dynamics within the training data, they often struggle to maintain predictive stability and dynamical coherence over extended horizons. Here, we propose PhyxMamba, a framework that integrates a Mamba-based state-space model with physics-informed principles to forecast long-term behavior of chaotic systems given short-term historical observations on their state evolution. We first reconstruct the attractor manifold with time-delay embeddings to extract global dynamical features. After that, we introduce a generative training scheme that enables Mamba to replicate the physical process. It is further augmented by multi-patch prediction and attractor geometry regularization for physical constraints, enhancing predictive accuracy and preserving key statistical properties of systems. Extensive experiments on simulated and real-world chaotic systems demonstrate that PhyxMamba delivers superior forecasting accuracy and faithfully captures essential statistics from short-term historical observations.
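The first stage of the pipeline, attractor reconstruction via time-delay embeddings, is a standard Takens-style construction and easy to sketch. A minimal version (the embedding dimension `m` and delay `tau` below are illustrative choices, not values from the paper):

```python
import numpy as np

def time_delay_embedding(x, m=3, tau=5):
    """Lift a 1-D observation series into m-dimensional delay vectors.

    Each row is (x[t], x[t + tau], ..., x[t + (m-1)*tau]), the standard
    Takens-style reconstruction of the attractor manifold.
    """
    n = len(x) - (m - 1) * tau
    if n <= 0:
        raise ValueError("series too short for chosen m and tau")
    return np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)

# Example: reconstruct a noisy sine's limit cycle in delay coordinates.
t = np.linspace(0, 20 * np.pi, 2000)
x = np.sin(t) + 0.01 * np.random.randn(t.size)
emb = time_delay_embedding(x, m=3, tau=25)
print(emb.shape)  # (1950, 3)
```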
[804] Practical estimation of the optimal classification error with soft labels and calibration
Ryota Ushio, Takashi Ishida, Masashi Sugiyama
Main category: cs.LG
TL;DR: This paper provides practical methods for estimating the Bayes error (optimal error rate) in binary classification, extending previous work on soft label-based estimation with theoretical analysis of bias properties and handling corrupted soft labels.
Details
Motivation: While ML performance has improved significantly, there's little attention to fundamental limits of model improvement. The paper addresses estimating the Bayes error to understand the extent to which models can be improved.
Method: Extends previous soft label-based Bayes error estimation by: 1) Theoretical analysis of bias properties in hard-label estimators, 2) Handling corrupted soft labels using isotonic calibration for statistically consistent estimation without requiring input instances.
Result: Theoretical analysis reveals bias decay rate adapts to class separation and can be faster than previously suggested. Experiments with synthetic and real-world datasets validate the methods.
Conclusion: The proposed instance-free methods provide practical and theoretically supported approaches for estimating Bayes error, enabling assessment of fundamental model improvement limits in scenarios where input instances are unavailable due to privacy concerns.
Abstract: While the performance of machine learning systems has experienced significant improvement in recent years, relatively little attention has been paid to the fundamental question: to what extent can we improve our models? This paper provides a means of answering this question in the setting of binary classification, which is practical and theoretically supported. We extend a previous work that utilizes soft labels for estimating the Bayes error, the optimal error rate, in two important ways. First, we theoretically investigate the properties of the bias of the hard-label-based estimator discussed in the original work. We reveal that the decay rate of the bias is adaptive to how well the two class-conditional distributions are separated, and it can decay significantly faster than the previous result suggested as the number of hard labels per instance grows. Second, we tackle a more challenging problem setting: estimation with corrupted soft labels. One might be tempted to use calibrated soft labels instead of clean ones. However, we reveal that calibration guarantee is not enough, that is, even perfectly calibrated soft labels can result in a substantially inaccurate estimate. Then, we show that isotonic calibration can provide a statistically consistent estimator under an assumption weaker than that of the previous work. Our method is instance-free, i.e., we do not assume access to any input instances. This feature allows it to be adopted in practical scenarios where the instances are not available due to privacy issues. Experiments with synthetic and real-world datasets show the validity of our methods and theory.
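For context, the soft-label estimator this line of work builds on is essentially a one-liner: in binary classification the Bayes-optimal classifier errs on x with probability min(η(x), 1 − η(x)), so averaging that quantity over soft labels gives an instance-free plug-in estimate. A minimal sketch of the clean-soft-label case (the corrupted-label case is where the paper's isotonic-calibration machinery comes in):

```python
import numpy as np

def bayes_error_from_soft_labels(p):
    """Plug-in estimate of the binary Bayes error from soft labels.

    p[i] is an annotator-derived estimate of P(y=1 | x_i); the optimal
    classifier errs on x_i with probability min(p_i, 1 - p_i), so the
    Bayes error is the average of that quantity. Instance-free: no
    features are needed, only the soft labels themselves.
    """
    p = np.asarray(p, dtype=float)
    return np.mean(np.minimum(p, 1.0 - p))

# Well-separated classes -> near-zero estimate; ambiguous ones -> higher.
print(bayes_error_from_soft_labels([0.02, 0.97, 0.01, 0.97]))  # ~0.02
print(bayes_error_from_soft_labels([0.45, 0.55, 0.5]))         # ~0.47
```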
[805] AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification
Geonwoo Cho, Jaemoon Lee, Jaegyun Im, Subi Lee, Jihwan Lee, Sundong Kim
Main category: cs.LG
TL;DR: AMPED is a skill-based RL method that balances exploration and skill diversity through gradient surgery during pre-training and uses a skill selector for downstream task adaptation, achieving superior performance over baselines.
Details
Motivation: Existing skill-based RL methods struggle to simultaneously optimize for exploration and skill diversity, which are conflicting objectives but both crucial for effective skill learning.
Method: Uses gradient-surgery projection to balance exploration and diversity gradients during pre-training, and a skill selector that exploits learned diversity during fine-tuning for downstream tasks.
Result: Achieves performance surpassing SBRL baselines across various benchmarks, with ablation studies confirming each component’s contribution and theoretical/empirical evidence showing greater skill diversity reduces fine-tuning sample complexity.
Conclusion: Explicitly harmonizing exploration and diversity is important for robust and generalizable skill learning, and AMPED effectively enables this through its balanced approach.
Abstract: Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often face challenges in simultaneously optimizing for these two conflicting objectives. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both: during pre-training, a gradient-surgery projection balances the exploration and diversity gradients, and during fine-tuning, a skill selector exploits the learned diversity by choosing skills suited to downstream tasks. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. Through an extensive ablation study, we identify the role of each component and demonstrate that each element in AMPED is contributing to performance. We further provide theoretical and empirical evidence that, with a greedy skill selector, greater skill diversity reduces fine-tuning sample complexity. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning. Project Page: https://geonwoo.me/amped/
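The gradient-surgery projection can be pictured with a PCGrad-style operation: when the exploration and diversity gradients point in conflicting directions, each is projected off the other before combining. A sketch in that spirit (AMPED's actual adaptive projection may differ in detail):

```python
import numpy as np

def surgery_combine(g_explore, g_diverse):
    """PCGrad-style gradient surgery: when the two objectives' gradients
    conflict (negative inner product), project each off the other before
    summing, so neither update undoes the other's progress."""
    g1, g2 = g_explore.astype(float), g_diverse.astype(float)
    dot = g1 @ g2
    if dot < 0.0:  # conflicting objectives
        g1 = g1 - (dot / (g2 @ g2)) * g2
        g2 = g2 - (dot / (g_explore @ g_explore)) * g_explore
    return g1 + g2

g_e = np.array([1.0, 0.5])    # exploration gradient (toy)
g_d = np.array([-0.8, 1.0])   # diversity gradient, partially conflicting
print(surgery_combine(g_e, g_d))
```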
[806] Learnable Kernel Density Estimation for Graphs
Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
Main category: cs.LG
TL;DR: LGKDE is a framework that learns kernel density estimation for graphs using graph neural networks and maximum mean discrepancy, achieving superior performance in graph anomaly detection.
Details
Motivation: Standard graph density estimation methods combining graph kernels and KDE have unsatisfactory performance due to handcrafted and fixed kernel features.
Method: LGKDE represents graphs as discrete distributions using GNNs, uses MMD to learn graph metrics for multi-scale KDE, and learns parameters by maximizing density of graphs relative to perturbed counterparts through node feature and graph spectra perturbations.
Result: LGKDE demonstrates superior performance in recovering underlying density of synthetic graph distributions and graph anomaly detection across diverse benchmark datasets compared to state-of-the-art baselines.
Conclusion: The framework provides theoretical guarantees including consistency, convergence, MISE bounds, robustness, and generalization, while achieving effective graph density estimation.
Abstract: This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and generalization. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.
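The MMD at the heart of LGKDE's learned graph metric compares two graphs through the distributions of their node embeddings. A plain-numpy sketch of the (biased) empirical squared MMD with a Gaussian kernel, using random vectors as stand-ins for GNN-produced node embeddings:

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """Biased empirical squared MMD between two point sets, treating
    each graph as the discrete distribution of its node embeddings."""
    kxx = gaussian_kernel(X, X, gamma).mean()
    kyy = gaussian_kernel(Y, Y, gamma).mean()
    kxy = gaussian_kernel(X, Y, gamma).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
g1 = rng.normal(0.0, 1.0, size=(30, 8))   # node embeddings of graph 1
g2 = rng.normal(0.5, 1.0, size=(40, 8))   # node embeddings of graph 2
print(mmd2(g1, g2))  # larger when the embedding distributions differ
```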
[807] Domain-Aware Tensor Network Structure Search
Giorgos Iacovides, Wuyang Zhou, Chao Li, Qibin Zhao, Danilo Mandic
Main category: cs.LG
TL;DR: The paper proposes tnLLM, a novel tensor network structure search framework that uses large language models with domain-aware prompting to predict optimal tensor network structures, achieving comparable performance with fewer evaluations than state-of-the-art methods.
Details
Motivation: Current tensor network structure search methods treat the problem as purely numerical optimization, requiring extensive function evaluations and ignoring valuable domain information, while lacking transparency in identified structures.
Method: The tnLLM framework incorporates domain information and uses LLMs with domain-aware prompting to infer suitable tensor network structures based on real-world relationships between tensor modes, enabling iterative optimization and domain-aware explanations.
Result: tnLLM achieves comparable objective function values with significantly fewer function evaluations compared to state-of-the-art algorithms, and LLM-enabled domain information can accelerate convergence of sampling-based methods while preserving theoretical guarantees.
Conclusion: The proposed tnLLM framework successfully integrates domain information and LLM reasoning capabilities to solve tensor network structure search more efficiently and transparently than purely numerical optimization approaches.
Abstract: Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms solve TN-SS as a purely numerical optimization problem and require extensive function evaluations, which is prohibitive for real-world applications. In addition, existing methods ignore the valuable domain information inherent in real-world tensor data and lack transparency in their identified TN structures. To this end, we propose a novel TN-SS framework, termed the tnLLM, which incorporates domain information about the data and harnesses the reasoning capabilities of large language models (LLMs) to directly predict suitable TN structures. The proposed framework involves a domain-aware prompting pipeline which instructs the LLM to infer suitable TN structures based on the real-world relationships between tensor modes. In this way, our approach is capable of not only iteratively optimizing the objective function, but also generating domain-aware explanations for the identified structures. Experimental results demonstrate that tnLLM achieves comparable TN-SS objective function values with much fewer function evaluations compared to SOTA algorithms. Furthermore, we demonstrate that the LLM-enabled domain information can be used to find good initializations in the search space for sampling-based SOTA methods to accelerate their convergence while preserving theoretical performance guarantees.
[808] Exploiting Block Coordinate Descent for Cost-Effective LLM Model Training
Zeyu Liu, Yan Li, Yunquan Zhang, Boyang Zhang, Guoyong Jiang, Xin Zhang, Limin Xiao, Weifeng Zhang, Daning Cheng
Main category: cs.LG
TL;DR: A block coordinate descent (BCD) framework reduces training costs for large language models by 67% on A100/A800 and 97.4% on RTX 4090 compared to standard full-parameter training, enabling efficient training on cost-effective hardware.
Details
Motivation: Training large language models requires extensive GPU memory and substantial financial investment, creating barriers for small- to medium-sized teams who cannot afford expensive hardware.
Method: Proposes a full-parameter pre-training and fine-tuning framework based on block coordinate descent (BCD) enhanced with engineering optimizations for efficient training on RTX 4090, A100 and A800 GPU clusters.
Result: Reduces training cost of 7B model to 33% on A100/A800 and only 2.6% on RTX 4090 compared to standard full-parameter training. Enables training large models on RTX 4090 that were previously restricted to A100 clusters without performance degradation.
Conclusion: BCD achieves comparable or better accuracy than full-parameter and fine-tuning methods with lower GPU consumption and improved hardware utilization, making large model training more accessible.
Abstract: Training large language models typically demands extensive GPU memory and substantial financial investment, which poses a barrier for many small- to medium-sized teams. In this paper, we propose a full-parameter pre-training and fine-tuning framework based on block coordinate descent (BCD), enhanced with engineering optimizations, to enable efficient training of large-scale models on cost-effective RTX 4090, A100 and A800 GPU clusters. Under identical hardware configurations, we reduce the training cost of a 7B model to 33% on A100/A800 and only 2.6% on RTX 4090, compared to standard full-parameter training. It also enables large models previously restricted to A100 clusters to be trained on RTX 4090 without degrading performance. BCD achieves comparable or better accuracy than full-parameter and fine-tuning methods at most cases, with lower GPU consumption and improved hardware utilization.
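The memory saving follows from the structure of block coordinate descent itself: only one parameter block is active per step, so gradients and optimizer state need to exist only for that block. A minimal PyTorch sketch of cyclic BCD on a toy model (the per-layer block partition and hyperparameters are illustrative, not the paper's engineering setup):

```python
import torch
import torch.nn as nn

# Toy model partitioned into three coordinate blocks (one per linear layer).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 10))
blocks = [list(model[i].parameters()) for i in (0, 2, 4)]
opts = [torch.optim.AdamW(block, lr=1e-3) for block in blocks]
loss_fn = nn.CrossEntropyLoss()

for step in range(300):
    i = step % len(blocks)                 # cycle through the blocks
    for p in model.parameters():           # freeze everything ...
        p.requires_grad_(False)
    for p in blocks[i]:                    # ... except the active block
        p.requires_grad_(True)
    x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
    loss = loss_fn(model(x), y)
    opts[i].zero_grad()
    loss.backward()                        # gradients exist only for block i
    opts[i].step()
```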
[809] Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models
Liwen Sun, Hao-Ren Yao, Gary Gao, Ophir Frieder, Chenyan Xiong
Main category: cs.LG
TL;DR: CATCH-FM is a cancer pre-screening method that uses foundation models trained on medical code sequences from EHR data to identify high-risk patients for further screening, achieving 50% sensitivity at 99% specificity and outperforming existing models by up to 20% AUPRC.
Details
Motivation: Existing cancer screening techniques are expensive, intrusive, and not globally available, leading to many preventable cancer deaths. There's a need for accessible pre-screening methods using readily available medical records.
Method: Pretrained compute-optimal foundation models (up to 2.4B parameters) on millions of EHR medical code sequences, then fine-tuned on clinician-curated cancer risk prediction cohorts. Operates in ICD code space and uses historical medical records for prediction.
Result: Achieved 50% sensitivity at 99% specificity cutoff for first cancer risk prediction. Outperformed feature-based tree models and general/medical LLMs by up to 20% AUPRC. Achieved state-of-the-art pancreatic cancer risk prediction on EHRSHOT few-shot leaderboard despite demographic and system differences.
Conclusion: CATCH-FM demonstrates robust cancer risk prediction across various patient distributions, benefits from operating in ICD code space, captures non-trivial risk factors, and shows promise as an accessible pre-screening tool using existing medical records.
Abstract: Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising thirty thousand patients, CATCH-FM achieves strong efficacy, with 50% sensitivity in predicting first cancer risks at 99% specificity cutoff, and outperforming feature-based tree models and both general and medical LLMs by up to 20% AUPRC. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained using on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.
[810] Latent Concept Disentanglement in Transformer-based Language Models
Guan Zhe Hong, Bhavya Vasudeva, Vatsal Sharan, Cyrus Rashtchian, Prabhakar Raghavan, Rina Panigrahy
Main category: cs.LG
TL;DR: Transformers can identify and utilize latent concepts from in-context learning, performing step-by-step concept composition in reasoning tasks and representing numerical concepts in low-dimensional geometric subspaces.
Details
Motivation: To understand whether and how transformers represent latent structures when using in-context learning to solve new tasks, particularly how they infer latent concepts from demonstration examples.
Method: Used mechanistic interpretability to analyze controlled tasks including transitive reasoning with discrete latent concepts and tasks parameterized by latent numerical concepts, examining model representations and computation.
Result: Models successfully identify latent concepts and perform step-by-step concept composition in reasoning tasks. For numerical concepts, low-dimensional subspaces in representation space cleanly reflect the underlying parameterization geometry.
Conclusion: Both small and large language models can disentangle and utilize latent concepts learned in-context from limited demonstrations, demonstrating their ability to represent and reason with underlying structures.
Abstract: When large language models (LLMs) use in-context learning (ICL) to solve a new task, they must infer latent concepts from demonstration examples. This raises the question of whether and how transformers represent latent structures as part of their computation. Our work experiments with several controlled tasks, studying this question using mechanistic interpretability. First, we show that in transitive reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. This builds upon prior work that analyzes single-step reasoning. Then, we consider tasks parameterized by a latent numerical concept. We discover low-dimensional subspaces in the model’s representation space, where the geometry cleanly reflects the underlying parameterization. Overall, we show that small and large models can indeed disentangle and utilize latent concepts that they learn in-context from a handful of abbreviated demonstrations.
[811] RsGCN: Subgraph-Based Rescaling Enhances Generalization of GCNs for Solving Traveling Salesman Problems
Junquan Huang, Zong-Gan Chen, Yuncheng Jiang, Zhi-Hui Zhan
Main category: cs.LG
TL;DR: Proposes RsGCN with subgraph-based rescaling for TSP solvers to achieve cross-scale generalization and low training costs, combined with RBS for effective search.
Details
Motivation: Address poor cross-scale generalization and high training costs in GCN-based TSP solvers by focusing on scale-dependent features.
Method: Subgraph-based rescaling to normalize edge lengths, unified subgraph perspective for learning scale-generalizable representations, and Reconstruction-Based Search with adaptive weight to avoid local optima.
Result: Achieves remarkable generalization with only 3 training epochs on mixed-scale dataset (up to 100 nodes) and successfully generalizes to 10K-node instances without fine-tuning.
Conclusion: Demonstrates advanced performance across various scales and real-world instances while requiring fewest parameters and training epochs among neural competitors.
Abstract: GCN-based traveling salesman problem (TSP) solvers face two critical challenges: poor cross-scale generalization for TSPs and high training costs. To address these challenges, we propose a Subgraph-Based Rescaling Graph Convolutional Network (RsGCN). Focusing on the scale-dependent features (i.e., features varied with problem scales) related to nodes and edges, we design the subgraph-based rescaling to normalize edge lengths of subgraphs. Under a unified subgraph perspective, RsGCN can efficiently learn scale-generalizable representations from small-scale TSPs at low cost. To exploit and assess the heatmaps generated by RsGCN, we design a Reconstruction-Based Search (RBS), in which a reconstruction process based on adaptive weight is incorporated to help avoid local optima. Based on a combined architecture of RsGCN and RBS, our solver achieves remarkable generalization and low training cost: with only 3 epochs of training on a mixed-scale dataset containing instances with up to 100 nodes, it can be generalized successfully to 10K-node instances without any fine-tuning. Extensive experiments demonstrate our advanced performance across uniform-distribution instances of 9 different scales from 20 to 10K nodes and 78 real-world instances from TSPLIB, while requiring the fewest learnable parameters and training epochs among neural competitors.
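The core normalization idea, rescaling edge lengths within each subgraph so that instances of different spatial scales yield similar local inputs, can be sketched in a few lines (the k-NN subgraph construction and mean normalization here are illustrative simplifications, not the paper's exact scheme):

```python
import numpy as np

def rescaled_knn_edges(coords, node, k=5):
    """Edge lengths of a node's k-NN subgraph, divided by their subgraph
    mean so that instances of different scales produce similar features.
    An illustrative take on subgraph-based rescaling."""
    d = np.linalg.norm(coords - coords[node], axis=1)
    nbrs = np.argsort(d)[1:k + 1]            # skip the node itself
    lengths = d[nbrs]
    return nbrs, lengths / lengths.mean()

coords = np.random.rand(100, 2) * 1000.0     # a "large-scale" instance
print(rescaled_knn_edges(coords, node=0)[1])  # scale-free edge lengths
```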
[812] On the Necessity of Output Distribution Reweighting for Effective Class Unlearning
Ali Ebrahimpour-Boroojeny, Yian Wang, Hari Sundaram
Main category: cs.LG
TL;DR: The paper identifies a privacy vulnerability in class unlearning evaluations due to overlooked class geometry, proposes a membership-inference attack (MIA-NN) that exploits this weakness, and introduces Tilted ReWeighting (TRW) as a solution that matches or outperforms existing methods while reducing privacy leakage.
Details
Motivation: To address the significant privacy leakage in class unlearning evaluations caused by overlooking underlying class geometry, which makes existing methods vulnerable to membership-inference attacks.
Method: Introduces MIA-NN attack using nearest neighbors’ probabilities, then proposes TRW - a fine-tuning objective that approximates the distribution of retrained models for forget-class inputs by estimating inter-class similarity and tilting the target model’s distribution accordingly.
Result: TRW reduces privacy leakage significantly - on CIFAR-10, it reduces the gap with retrained models by 19% for U-LiRA and 46% for MIA-NN scores compared to state-of-the-art methods, while matching or surpassing existing unlearning methods on prior metrics.
Conclusion: Class geometry is crucial for secure unlearning, and TRW effectively mitigates privacy leakage while maintaining unlearning performance across multiple benchmarks.
Abstract: In this paper, we reveal a significant shortcoming in class unlearning evaluations: overlooking the underlying class geometry can cause privacy leakage. We further propose a simple yet effective solution to mitigate this issue. We introduce a membership-inference attack via nearest neighbors (MIA-NN) that uses the probabilities the model assigns to neighboring classes to detect unlearned samples. Our experiments show that existing unlearning methods are vulnerable to MIA-NN across multiple datasets. We then propose a new fine-tuning objective that mitigates this privacy leakage by approximating, for forget-class inputs, the distribution over the remaining classes that a retrained-from-scratch model would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model’s distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired distribution during fine-tuning. We also show that across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior unlearning metrics. More specifically, on CIFAR-10, it reduces the gap with retrained models by 19% and 46% for U-LiRA and MIA-NN scores, respectively, compared to the SOTA method for each category.
[813] WeightLoRA: Keep Only Necessary Adapters
Andrey Veprikov, Vladimir Solodkin, Alexander Zyl, Andrey Savchenko, Aleksandr Beznosikov
Main category: cs.LG
TL;DR: WeightLoRA is a novel parameter-efficient fine-tuning method that adaptively selects the most critical LoRA heads during optimization, significantly reducing trainable parameters while maintaining or improving performance.
Details
Motivation: Traditional LoRA requires significant memory for training large models and relies on intuition for adapter placement, creating efficiency and optimization challenges.
Method: Proposes WeightLoRA which uses adaptive selection of the most critical LoRA heads throughout the optimization process, with an enhanced version called WeightLoRA+.
Result: Experimental results on DeBERTa, BART, and Llama models show WeightLoRA significantly reduces parameters while maintaining or improving metric values, with WeightLoRA+ performing superior in almost all cases.
Conclusion: WeightLoRA effectively overcomes LoRA’s limitations by adaptive head selection, achieving better parameter efficiency and performance across various models and benchmarks.
Abstract: The widespread utilization of language models in modern applications is inconceivable without Parameter-Efficient Fine-Tuning techniques, such as low-rank adaptation ($\texttt{LoRA}$), which adds trainable adapters to selected layers. Although $\texttt{LoRA}$ may obtain accurate solutions, it requires significant memory to train large models and intuition on which layers to add adapters. In this paper, we propose a novel method, $\texttt{WeightLoRA}$, which overcomes this issue by adaptive selection of the most critical $\texttt{LoRA}$ heads throughout the optimization process. As a result, we can significantly reduce the number of trainable parameters while maintaining the capability to obtain consistent or even superior metric values. We conduct experiments for a series of competitive benchmarks and DeBERTa, BART, and Llama models, comparing our method with different adaptive approaches. The experimental results demonstrate the efficacy of $\texttt{WeightLoRA}$ and the superior performance of $\texttt{WeightLoRA+}$ in almost all cases.
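One way to picture adaptive head selection is to give each LoRA adapter a learnable scalar gate and keep only the heads whose gates remain large. A hypothetical minimal sketch along those lines (the gate construction and top-k criterion are illustrative assumptions, not the paper's exact selection rule):

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Frozen base layer plus a rank-r LoRA adapter scaled by a learnable
    scalar gate; heads whose gates shrink toward zero are candidates for
    removal (the adaptive-selection idea, simplified)."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.gate = nn.Parameter(torch.ones(()))

    def forward(self, x):
        return self.base(x) + self.gate * (x @ self.A.T @ self.B.T)

layers = [GatedLoRALinear(nn.Linear(32, 32)) for _ in range(6)]
# After some training, keep only the heads with the largest gates:
gates = torch.stack([layer.gate.detach().abs() for layer in layers])
print("heads kept:", gates.topk(k=2).indices.tolist())
```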
[814] Beyond Simple Graphs: Neural Multi-Objective Routing on Multigraphs
Filip Rydin, Attila Lischka, Jiaming Wu, Morteza Haghir Chehreghani, Balázs Kulcsár
Main category: cs.LG
TL;DR: Two GNN-based methods for multi-objective routing on multigraphs: one operates directly on multigraphs via autoregressive edge selection, while the other uses learned pruning to simplify multigraphs first.
Details
Motivation: Existing learning-based routing methods are unsuitable for multigraphs despite their strong relevance in real-world scenarios where multiple edges with distinct attributes exist between node pairs.
Method: Two GNN-based approaches: 1) Direct multigraph routing via autoregressive edge selection, 2) More scalable method that first simplifies multigraph using learned pruning strategy then performs autoregressive routing on resulting simple graph.
Result: Both models demonstrate competitive performance compared to strong heuristics and neural baselines across wide range of problems and graph distributions.
Conclusion: The proposed methods effectively address the gap in learning-based routing for multigraphs, offering both direct and scalable approaches with competitive performance.
Abstract: Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. Yet, existing methods are unsuitable for routing on multigraphs, which feature multiple edges with distinct attributes between node pairs, despite their strong relevance in real-world scenarios. In this paper, we propose two graph neural network-based methods to address multi-objective routing on multigraphs. Our first approach operates directly on the multigraph by autoregressively selecting edges until a tour is completed. The second model, which is more scalable, first simplifies the multigraph via a learned pruning strategy and then performs autoregressive routing on the resulting simple graph. We evaluate both models empirically, across a wide range of problems and graph distributions, and demonstrate their competitive performance compared to strong heuristics and neural baselines.
[815] Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization
Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov
Main category: cs.LG
TL;DR: The paper proposes deterministic Sign-SGD variants that automatically determine effective stepsizes, addressing a key limitation in training large language models efficiently.
Details
Motivation: Training large language models is extremely resource-intensive, and while Sign-SGD offers memory-efficient training and gradient compression, it lacks automatic stepsize determination which depends on inaccessible dataset parameters.
Method: Designed several variants of single-node deterministic Sign-SGD, extended to stochastic single-node and multi-node learning scenarios, and incorporated momentum methods.
Result: Extensive experiments on real machine learning problems demonstrated the practical applicability of the proposed approaches.
Conclusion: The proposed Sign-SGD variants successfully address the stepsize determination problem and show practical utility in various learning scenarios including distributed training.
Abstract: Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in the distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from the theoretical standpoint. Indeed, it depends on the parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
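The base update is simple enough to state in full: each coordinate moves a fixed step against the sign of its gradient, which is what makes the method memory-light and 1-bit-compressible in distributed settings. A minimal PyTorch sketch of the deterministic single-node step (the paper's contribution, the automatic stepsize, is omitted here):

```python
import torch

@torch.no_grad()
def sign_sgd_step(params, lr=1e-4):
    """Sign-SGD: step each coordinate by a fixed amount against the sign
    of its gradient; magnitudes are dropped, so each coordinate's update
    can be communicated with a single bit."""
    for p in params:
        if p.grad is not None:
            p.add_(torch.sign(p.grad), alpha=-lr)

w = torch.nn.Parameter(torch.tensor([3.0, -2.0, 0.5]))
(w ** 2).sum().backward()
sign_sgd_step([w], lr=0.1)
print(w)  # each coordinate moved by exactly 0.1 toward zero
```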
[816] Neural-Network solver of ideal MHD equilibria
Timo Thun, Andrea Merlo, Rory Conlin, Dario Panici, Daniel Böckenhoff
Main category: cs.LG
TL;DR: A novel neural network approach for computing 3D magnetohydrodynamic equilibria that achieves lower force residuals than conventional solvers.
Details
Motivation: To develop a more efficient and accurate method for solving complex 3D magnetohydrodynamic equilibrium problems using neural networks.
Method: Parametrize Fourier modes with artificial neural networks and minimize the full nonlinear global force residual using first-order optimizers.
Result: The neural network approach achieves competitive computational cost and establishes new lower bounds for force residuals compared to conventional solvers.
Conclusion: Neural networks show promise for solving single equilibria and creating continuous models of equilibrium distributions, with potential for significant future improvements.
Abstract: We present a novel approach to compute three-dimensional Magnetohydrodynamic equilibria by parametrizing Fourier modes with artificial neural networks and compare it to equilibria computed by conventional solvers. The full nonlinear global force residual across the volume in real space is then minimized with first order optimizers. Already, we observe competitive computational cost to arrive at the same minimum residuals computed by existing codes. With increased computational cost, lower minima of the residual are achieved by the neural networks, establishing a new lower bound for the force residual. We use minimally complex neural networks, and we expect significant improvements for solving not only single equilibria with neural networks, but also for computing neural network models valid over continuous distributions of equilibria.
[817] OrthoGrad Improves Neural Calibration
C. Evans Hedges
Main category: cs.LG
TL;DR: ⊥Grad is a geometry-aware optimization method that constrains gradient updates to be orthogonal to weight vectors, addressing overconfidence in neural networks without architectural changes.
Details
Motivation: Standard optimizers often lead to overconfidence in uncertainty-critical applications, which can be problematic for reliable predictions.
Method: Modifies gradient-based optimization by enforcing orthogonality between gradient updates and weight vectors, altering optimization trajectories while being optimizer-agnostic.
Result: On CIFAR-10 with 10% labeled data, matches SGD accuracy while significantly improving test loss (p=0.05), predictive entropy (p=0.001), and confidence measures across different corruption levels and architectures.
Conclusion: Geometric interventions in optimization like ⊥Grad can effectively improve predictive uncertainty estimates with minimal computational overhead and remain compatible with existing calibration techniques.
Abstract: We study $\perp$Grad, a geometry-aware modification to gradient-based optimization that constrains descent directions to address overconfidence, a key limitation of standard optimizers in uncertainty-critical applications. By enforcing orthogonality between gradient updates and weight vectors, $\perp$Grad alters optimization trajectories without architectural changes. On CIFAR-10 with 10% labeled data, $\perp$Grad matches SGD in accuracy while achieving statistically significant improvements in test loss ($p=0.05$), predictive entropy ($p=0.001$), and confidence measures. These effects show consistent trends across corruption levels and architectures. $\perp$Grad is optimizer-agnostic, incurs minimal overhead, and remains compatible with post-hoc calibration techniques. Theoretically, we characterize convergence and stationary points for a simplified $\perp$Grad variant, revealing that orthogonalization constrains loss reduction pathways to avoid confidence inflation and encourage decision-boundary improvements. Our findings suggest that geometric interventions in optimization can improve predictive uncertainty estimates at low computational cost.
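The core operation is a projection: remove from each gradient its component along the weight vector, so updates rotate weights rather than grow their norms (norm growth being a standard route to confidence inflation). A minimal sketch, assuming the projection is applied per parameter tensor:

```python
import torch

@torch.no_grad()
def orthograd_step(params, lr=0.1):
    """SGD step with each gradient projected orthogonal to its weight
    tensor, so the update changes the weight's direction rather than
    directly inflating its norm."""
    for p in params:
        if p.grad is None:
            continue
        w, g = p.flatten(), p.grad.flatten()
        g_perp = g - (torch.dot(g, w) / (torch.dot(w, w) + 1e-12)) * w
        p.add_(g_perp.view_as(p), alpha=-lr)

w = torch.nn.Parameter(torch.tensor([1.0, 0.0]))
(w.sum() * 2).backward()                     # grad = [2, 2]
orthograd_step([w], lr=0.5)
print(w)                                     # radial component of grad removed
```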
[818] Lightweight MSA Design Advances Protein Folding From Evolutionary Embeddings
Hanqun Cao, Xinyi Zhou, Zijun Gao, Chenyu Wang, Xin Gao, Zhi Zhang, Cesar de la Fuente-Nunez, Chunbin Gu, Ge Liu, Pheng-Ann Heng
Main category: cs.LG
TL;DR: PLAME is a lightweight MSA design framework that uses protein language model embeddings to generate better MSAs for protein structure prediction, particularly improving performance on low-homology and orphan proteins.
Details
Motivation: Traditional multiple sequence alignments (MSAs) underperform on low-homology and orphan proteins, limiting structure prediction accuracy for these challenging cases.
Method: PLAME leverages evolutionary embeddings from pretrained protein language models with a conservation-diversity loss function, plus MSA selection strategy and sequence-quality metrics.
Result: Achieves state-of-the-art improvements in structure accuracy (lDDT/TM-score) on AlphaFold2 low-homology/orphan benchmarks, with consistent gains when paired with AlphaFold3.
Conclusion: PLAME provides a practical solution for high-quality folding of proteins lacking strong evolutionary neighbors and can function as a lightweight adapter to improve ESMFold accuracy while maintaining speed.
Abstract: Protein structure prediction often hinges on multiple sequence alignments (MSAs), which underperform on low-homology and orphan proteins. We introduce PLAME, a lightweight MSA design framework that leverages evolutionary embeddings from pretrained protein language models to generate MSAs that better support downstream folding. PLAME couples these embeddings with a conservation–diversity loss that balances agreement on conserved positions with coverage of plausible sequence variation. Beyond generation, we develop (i) an MSA selection strategy to filter high-quality candidates and (ii) a sequence-quality metric that is complementary to depth-based measures and predictive of folding gains. On AlphaFold2 low-homology/orphan benchmarks, PLAME delivers state-of-the-art improvements in structure accuracy (e.g., lDDT/TM-score), with consistent gains when paired with AlphaFold3. Ablations isolate the benefits of the selection strategy, and case studies elucidate how MSA characteristics shape AlphaFold confidence and error modes. Finally, we show PLAME functions as a lightweight adapter, enabling ESMFold to approach AlphaFold2-level accuracy while retaining ESMFold-like inference speed. PLAME thus provides a practical path to high-quality folding for proteins lacking strong evolutionary neighbors.
[819] Spectral Graph Neural Networks are Incomplete on Graphs with a Simple Spectrum
Snir Hordan, Maya Bechler-Speicher, Gur Lifshitz, Nadav Dym
Main category: cs.LG
TL;DR: The paper analyzes the expressive power of spectrally-enhanced GNNs (SGNNs) and introduces a new expressivity hierarchy based on graph eigenvalue multiplicity, proving many SGNNs are incomplete even on graphs with distinct eigenvalues, and proposes a rotation equivariant method to improve SGNN expressivity.
Details
Motivation: Current frameworks for evaluating SGNN expressive power (k-WL hierarchy and homomorphism counting) poorly align with graph spectra, providing limited insight into SGNNs’ actual capabilities.
Method: Introduces an expressivity hierarchy based on classifying graphs by largest eigenvalue multiplicity, adapts rotation equivariant neural networks to graph spectra, and proposes a method to improve SGNN expressivity on simple spectrum graphs.
Result: Proves many SGNNs are incomplete even on graphs with distinct eigenvalues, and empirically verifies theoretical claims through MNIST Superpixel classification and eigenvector canonicalization on ZINC graphs.
Conclusion: The proposed rotation equivariant approach can provably improve SGNN expressivity, addressing limitations of current spectrally-enhanced GNNs.
Abstract: Spectral features are widely incorporated within Graph Neural Networks (GNNs) to improve their expressive power, or their ability to distinguish among non-isomorphic graphs. One popular example is the usage of graph Laplacian eigenvectors for positional encoding in MPNNs and Graph Transformers. The expressive power of such Spectrally-enhanced GNNs (SGNNs) is usually evaluated via the k-WL graph isomorphism test hierarchy and homomorphism counting. Yet, these frameworks align poorly with the graph spectra, yielding limited insight into SGNNs’ expressive power. We leverage a well-studied paradigm of classifying graphs by their largest eigenvalue multiplicity to introduce an expressivity hierarchy for SGNNs. We then prove that many SGNNs are incomplete even on graphs with distinct eigenvalues. To mitigate this deficiency, we adapt rotation equivariant neural networks to the graph spectra setting to propose a method to provably improve SGNNs’ expressivity on simple spectrum graphs. We empirically verify our theoretical claims via an image classification experiment on the MNIST Superpixel dataset and eigenvector canonicalization on graphs from ZINC.
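The hierarchy is indexed by the largest eigenvalue multiplicity; a graph whose (normalized-Laplacian) eigenvalues are all distinct has a simple spectrum, which is exactly the regime where the paper shows many SGNNs remain incomplete. A small numpy check (tolerance-based eigenvalue grouping; assumes no isolated nodes):

```python
import numpy as np

def max_eigenvalue_multiplicity(adj, tol=1e-8):
    """Largest multiplicity among normalized-Laplacian eigenvalues;
    1 means the graph has a simple spectrum. Assumes no isolated nodes."""
    deg = adj.sum(axis=1)
    d = deg ** -0.5
    lap = np.eye(len(adj)) - d[:, None] * adj * d[None, :]
    eig = np.sort(np.linalg.eigvalsh(lap))
    groups = np.cumsum(np.diff(eig, prepend=eig[0]) > tol)  # cluster near-equal
    return int(np.bincount(groups).max())

p3 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph P3
k4 = np.ones((4, 4)) - np.eye(4)                               # complete graph K4
print(max_eigenvalue_multiplicity(p3))  # 1 -> simple spectrum
print(max_eigenvalue_multiplicity(k4))  # 3 -> highly symmetric, repeated eigenvalues
```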
[820] Caterpillar GNN: Replacing Message Passing with Efficient Aggregation
Marek Černý
Main category: cs.LG
TL;DR: Caterpillar GNNs introduce walk incidence-based aggregation that trades some expressivity for stronger inductive bias, enabling scaling between message-passing and walk-based methods while achieving competitive performance with reduced computational complexity.
Details
Motivation: To address limitations of MPGNNs by introducing a more efficient aggregation method that provides structured inductive bias while maintaining expressive power.
Method: Proposed aggregation over walk incidence-based matrices, characterized expressive power using homomorphism counts over generalized caterpillar graphs, and developed Caterpillar GNNs with robust graph-level aggregation.
Result: Successfully tackled benchmark designed to challenge MPGNNs and achieved comparable predictive performance on real-world datasets while significantly reducing hidden layer nodes in computational graphs.
Conclusion: Walk incidence-based aggregation provides an effective alternative to traditional MPGNNs, offering better trade-offs between expressivity and computational efficiency.
Abstract: Message-passing graph neural networks (MPGNNs) dominate modern graph learning. Typical efforts enhance MPGNN’s expressive power by enriching the adjacency-based aggregation. In contrast, we introduce an efficient aggregation over walk incidence-based matrices that are constructed to deliberately trade off some expressivity for stronger and more structured inductive bias. Our approach allows for seamless scaling between classical message-passing and simpler methods based on walks. We rigorously characterize the expressive power at each intermediate step using homomorphism counts over a hierarchy of generalized caterpillar graphs. Based on this foundation, we propose Caterpillar GNNs, whose robust graph-level aggregation successfully tackles a benchmark specifically designed to challenge MPGNNs. Moreover, we demonstrate that, on real-world datasets, Caterpillar GNNs achieve comparable predictive performance while significantly reducing the number of nodes in the hidden layers of the computational graph.
[821] Aircraft Trajectory Dataset Augmentation in Latent Space
Seokbin Yoon, Keumjin Lee
Main category: cs.LG
TL;DR: ATRADA is a novel framework for aircraft trajectory dataset augmentation using Transformer encoder, PCA, GMM, and MLP to generate high-quality synthetic trajectory data.
Details
Motivation: Aircraft trajectory modeling is crucial for air traffic management, and dataset augmentation is needed to develop robust models and ensure sufficient, balanced datasets.
Method: Transformer encoder learns trajectory patterns and converts data to latent space, PCA reduces dimensions, GMM fits probability distribution, and MLP decodes generated samples back to original dimensions.
Result: The framework effectively generates new, high-quality synthetic aircraft trajectory data, outperforming several baseline methods in experiments.
Conclusion: ATRADA provides an effective solution for aircraft trajectory dataset augmentation, enabling better model training for air traffic management applications.
Abstract: Aircraft trajectory modeling plays a crucial role in air traffic management (ATM) and is important for various downstream tasks, including conflict detection and landing time prediction. Dataset augmentation by adding synthetically generated trajectory data is necessary to develop a more robust aircraft trajectory model and ensure that the trajectory dataset is sufficient and balanced. We propose a novel framework called ATRADA for aircraft trajectory dataset augmentation. In the proposed framework, a Transformer encoder learns the underlying patterns in the original trajectory dataset and converts each data point into a context vector in the learned latent space. The converted dataset is projected to reduced dimensions using principal component analysis (PCA), and a Gaussian mixture model (GMM) is applied to fit the probability distribution of the data points in the reduced-dimensional space. Finally, new samples are drawn from the fitted GMM, the dimension of the samples is reverted to the original dimension, and the samples are decoded with a multi-layer perceptron (MLP). Several experiments demonstrate that the framework effectively generates new, high-quality synthetic aircraft trajectory data, which were compared to the results of several baselines.
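Downstream of the Transformer encoder, the generative stages are standard and compose directly: PCA to reduce dimensions, a GMM fit in the reduced space, sampling, and inverse projection before the MLP decoder. A scikit-learn sketch of those middle stages, with random vectors standing in for the encoder's context vectors (the component counts are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 64))        # stand-in for encoder context vectors

pca = PCA(n_components=8).fit(latents)      # project to reduced dimensions
z = pca.transform(latents)

gmm = GaussianMixture(n_components=4, random_state=0).fit(z)
z_new, _ = gmm.sample(100)                  # draw synthetic latent samples

latents_new = pca.inverse_transform(z_new)  # revert to the original dimension
print(latents_new.shape)                    # (100, 64); decode with the MLP next
```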
[822] The Invisible Leash: Why RLVR May or May Not Escape Its Origin
Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
Main category: cs.LG
TL;DR: Current RLVR practice improves precision but may restrict reasoning boundaries by amplifying high-reward outputs the base model already knows, rather than discovering truly novel solutions.
Details
Motivation: To investigate whether RLVR truly expands model reasoning capabilities or just amplifies existing high-reward outputs, and understand potential limits of current RLVR practices.
Method: Empirical investigation examining RLVR as support-constrained optimization mechanism, analyzing entropy-reward trade-offs, and conducting extensive experiments on how RLVR affects model exploration and solution discovery.
Result: RLVR consistently improves pass@1, but the shrinkage of empirical support generally outweighs its expansion, failing to recover correct answers previously accessible to the base model. While token-level entropy increases, answer-level entropy declines, converging to smaller answer sets.
Conclusion: Current RLVR recipe has limits in extending reasoning horizons due to support constraints. Future innovations need explicit exploration mechanisms or hybrid strategies to seed probability mass into underrepresented solution regions.
Abstract: Recent advances in LLMs highlight RLVR as a promising method for enhancing AI’s capabilities, particularly in solving complex logical tasks. However, it remains unclear whether the current practice of RLVR truly expands a model’s reasoning boundary or mainly amplifies high-reward outputs that the base model already knows for improved precision. This study presents an empirical investigation that provides fresh insights into the potential limits of the common practice of RLVR. We examine how, under current training conditions, RLVR can operate as a support-constrained optimization mechanism that may restrict the discovery of entirely original solutions, remaining constrained by the base model’s initial distribution. We also identify an entropy-reward trade-off: while the current RLVR recipe reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while the current RLVR recipe consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy - resulting in greater uncertainty at each generation step - answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of the current RLVR recipe in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
[823] RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Rohan Gupta, Erik Jenner
Main category: cs.LG
TL;DR: LLMs can be trained to evade latent-space monitors while maintaining normal black-box behavior, with token-level monitors being vulnerable while holistic monitors remain robust.
Details
Motivation: To investigate whether models can learn to evade latent-space monitors that are used to detect undesirable behaviors, since these monitors may become training signals during retraining.
Method: Introduce RL-Obfuscation, which fine-tunes LLMs via reinforcement learning to evade latent-space monitors while preserving their black-box outputs. Tested on models from 7B to 14B parameters against various monitor types.
Result: Token-level monitors are highly vulnerable to evasion attacks, while holistic monitors (max-pooling, attention-based) remain robust. Models can generalize evasion to unseen monitors and conditionally bypass monitors on specific inputs by repurposing token representations.
Conclusion: Latent-space monitors face evasion risks, particularly token-level ones, highlighting the need for more robust monitoring approaches that can withstand adversarial training.
Abstract: Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions. However, these monitors may themselves become training signals, for example, by using problematic samples found in deployment to retrain models. This raises an important question: can models learn to evade such monitors? To evaluate this capability, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors while maintaining their blackbox behaviour. We apply RL-Obfuscation to Language Models ranging from 7B to 14B parameters and evaluate their Evasion Success Rate against a suite of monitors. We find that token-level monitors are highly vulnerable to this attack while more holistic monitors, such as max-pooling or attention-based probes, remain robust. Moreover, for these vulnerable monitors, models trained to evade a single static monitor can generalise to evade other unseen monitors. We also find that the models can be trained to conditionally bypass latent-space monitors on only certain inputs. Finally, we study how the models bypass these monitors and find that the model can learn to repurpose tokens to have different internal representations.
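The vulnerability gap has a simple structural reading: a token-level monitor scores each position independently (so the policy only needs to "clean" every token's representation), while a holistic monitor aggregates across the sequence before scoring. A hypothetical sketch contrasting the two probe shapes (illustrative, not the paper's actual monitors):

```python
import torch
import torch.nn as nn

class TokenLevelProbe(nn.Module):
    """Scores every position independently and flags the sequence if any
    per-token score crosses a threshold; evadable token-by-token."""
    def __init__(self, d):
        super().__init__()
        self.head = nn.Linear(d, 1)

    def forward(self, h):                      # h: (seq_len, d) hidden states
        return torch.sigmoid(self.head(h)).max()

class MaxPoolProbe(nn.Module):
    """Pools hidden states over the whole sequence before scoring, so
    evasion must alter the aggregate representation, not just each token."""
    def __init__(self, d):
        super().__init__()
        self.head = nn.Linear(d, 1)

    def forward(self, h):
        pooled = h.max(dim=0).values           # (d,)
        return torch.sigmoid(self.head(pooled))

h = torch.randn(20, 64)                        # hidden states for one sequence
print(float(TokenLevelProbe(64)(h)), float(MaxPoolProbe(64)(h)))
```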
[824] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang
Main category: cs.LG
TL;DR: R-Stitch is a training-free hybrid decoding framework that uses token-level entropy to delegate computation between small and large language models, achieving 3-4x speedups while maintaining accuracy comparable to full LLM decoding.
Details
Motivation: Chain-of-thought reasoning improves LLM problem-solving but incurs high inference costs due to long autoregressive trajectories. Existing acceleration methods have limitations - speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency.
Method: Uses token-level entropy as an uncertainty proxy to route computation between SLM and LLM. High-entropy tokens are delegated to LLM while low-entropy ones are handled by SLM. R-Stitch+ extends this with adaptive routing policy to dynamically adjust token budget beyond fixed thresholds.
Result: Achieves peak speedups of 3.00x on DeepSeek-R1-Distill-Qwen-7B, 3.85x on 14B, and 4.10x on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Enables adaptive efficiency-accuracy trade-offs without retraining.
Conclusion: R-Stitch substantially accelerates LLM inference by jointly reducing per-token decoding complexity and number of generated tokens, with negligible accuracy loss and natural support for adaptive efficiency-accuracy trade-offs.
Abstract: Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency–accuracy trade-offs that can be tailored to diverse computational budgets without retraining.
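The routing rule itself is lightweight: compute the SLM's predictive entropy at each decoding step and commit its token when the entropy is low, otherwise hand the step to the LLM. A sketch of that per-step decision (the threshold is a tunable knob, not a value from the paper):

```python
import torch

def route_next_token(slm_logits, threshold=2.0):
    """Entropy-gated routing for one decoding step: commit the small
    model's token when its predictive entropy is low, otherwise defer
    the step to the large model."""
    probs = torch.softmax(slm_logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    if entropy < threshold:
        return probs.argmax().item(), "slm"   # cheap path
    return None, "llm"                        # delegate the uncertain step

logits = torch.tensor([4.0, 1.0, 0.5, 0.1])   # a confident SLM step
print(route_next_token(logits))               # (0, 'slm')
```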
[825] SecP-Tuning: Efficient Privacy-Preserving Prompt Tuning for Large Language Models via MPC
Jinglong Luo, Zhuo Zhang, Yehong Zhang, Shiyu Liu, Ye Dong, Hui Wang, Yue Yu, Xun Zhou, Zenglin Xu
Main category: cs.LG
TL;DR: SecP-Tuning is a privacy-preserving prompt tuning framework for LLMs that uses Secure Multi-party Computation and Forward-only Tuning to achieve efficient fine-tuning while maintaining data privacy.
Details
Motivation: LLMs face adaptation challenges in privacy-sensitive domains like healthcare and finance due to data scarcity from privacy requirements. MPC-based approaches are limited to inference due to efficiency issues in fine-tuning operations.
Method: Integrates Forward-only Tuning through data owner-server interaction paradigm, eliminating backward propagation and optimization computations. Uses efficient privacy-preserving Random Feature Attention to replace softmax-based self-attention.
Result: Achieves 12x and 16x end-to-end acceleration compared to full-parameter SFT and gradient-based prompt tuning. Reduces communication overhead by 18x and 20x. Achieves average performance score of 82.45 vs 79.90 for SFT and 73.73 for prompt tuning in few-shot tasks.
Conclusion: SecP-Tuning provides efficient privacy-preserving fine-tuning for LLMs with significant speedup and communication reduction while maintaining performance and avoiding memory leakage risks from gradient/parameter transmission.
Abstract: Large Language Models (LLMs) have revolutionized numerous fields, yet their adaptation to specialized tasks in privacy-sensitive domains such as healthcare and finance remains constrained due to the scarcity of accessible training data caused by stringent privacy requirements. Secure Multi-party Computation (MPC)-based privacy-preserving machine learning provides theoretical guarantees for the privacy of model parameters and data. However, its application to LLMs has been predominantly limited to inference, as fine-tuning introduces significant efficiency challenges, particularly in backward propagation, optimizer, and self-attention operations. To address these challenges, we propose SecP-Tuning, the first MPC-based framework designed for efficient, privacy-preserving prompt tuning of LLMs. SecP-Tuning innovatively integrates Forward-only Tuning (FoT) through the "data owner-server interaction" paradigm, effectively removing the need for privacy-preserving computations in backward propagation and optimization processes. Furthermore, it devises an efficient privacy-preserving Random Feature Attention (RFA), effectively mitigating the computational complexity of softmax-based self-attention and circumventing MPC-incompatible nonlinear operations. Experimental results demonstrate that, compared to full-parameter Supervised Fine-Tuning (SFT) and gradient-based prompt tuning, SecP-Tuning achieves approximately 12 times and 16 times end-to-end acceleration, as well as 18 times and 20 times reductions in communication overhead, respectively. Moreover, in five few-shot tasks, it achieves an average performance score of 82.45, outperforming SFT's 79.90 and prompt tuning's 73.73. Additionally, the "black-box/API-style" privacy-preserving tuning paradigm of SecP-Tuning effectively avoids memory leakage risks caused by gradient/parameter transmission.
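The Random Feature Attention component can be pictured with a Performer-style positive-feature approximation of the softmax kernel, which replaces the MPC-unfriendly row-wise softmax with plain matrix products. This is a generic plaintext sketch under that assumption, not SecP-Tuning's secure protocol; shapes and `num_features` are illustrative.

```python
import torch

def random_feature_attention(Q, K, V, num_features=256):
    """Softmax-kernel attention approximated with positive random features
    (Performer-style), linear in sequence length and free of row-wise softmax.
    Shapes: Q, K (n, d), V (n, d_v)."""
    n, d = Q.shape
    Q, K = Q / d ** 0.25, K / d ** 0.25       # absorb the 1/sqrt(d) of softmax attention
    W = torch.randn(num_features, d)          # w ~ N(0, I) random projections
    def phi(X):
        proj = X @ W.T                        # (n, m)
        return torch.exp(proj - X.pow(2).sum(-1, keepdim=True) / 2) / num_features ** 0.5
    qf, kf = phi(Q), phi(K)                   # (n, m) positive features
    num = qf @ (kf.T @ V)                     # (n, d_v), computed in O(n m d_v)
    den = qf @ kf.sum(dim=0, keepdim=True).T  # (n, 1) normalizer
    return num / den.clamp_min(1e-6)
```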
[826] Online Multi-Agent Control with Adversarial Disturbances
Anas Barakat, John Lazarsfeld, Georgios Piliouras, Antonios Varvitsiotis
Main category: cs.LG
TL;DR: This paper studies online multi-agent control in linear dynamical systems with adversarial disturbances, where agents have competing time-varying objectives. It analyzes gradient-based controllers with local policy updates and proves sublinear regret bounds for individual agents.
Details
Motivation: Multi-agent control problems with adversarial disturbances are common in autonomous robotics, economics, and energy systems, but most prior work assumes noiseless or stochastic perturbations rather than adversarial disturbances.
Method: The authors use online gradient-based controllers with local policy updates under two feedback models, and analyze multi-agent linear dynamical systems where each agent minimizes its own sequence of convex losses.
Result: The paper proves per-agent regret bounds that are sublinear and near-optimal in the time horizon, with different scalings based on the number of agents. When objectives are aligned, it shows equilibrium tracking guarantees for time-varying potential games.
Conclusion: This work bridges online control with online learning in games, establishing robust individual and collective performance guarantees in dynamic continuous-state environments with adversarial disturbances.
Abstract: Online multi-agent control problems, where many agents pursue competing and time-varying objectives, are widespread in domains such as autonomous robotics, economics, and energy systems. In these settings, robustness to adversarial disturbances is critical. In this paper, we study online control in multi-agent linear dynamical systems subject to such disturbances. In contrast to most prior work in multi-agent control, which typically assumes noiseless or stochastically perturbed dynamics, we consider an online setting where disturbances can be adversarial, and where each agent seeks to minimize its own sequence of convex losses. Under two feedback models, we analyze online gradient-based controllers with local policy updates. We prove per-agent regret bounds that are sublinear and near-optimal in the time horizon and that highlight different scalings with the number of agents. When agents’ objectives are aligned, we further show that the multi-agent control problem induces a time-varying potential game for which we derive equilibrium tracking guarantees. Together, our results take a first step in bridging online control with online learning in games, establishing robust individual and collective performance guarantees in dynamic continuous-state environments.
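As a point of reference for the gradient-based controllers analyzed above, a single agent's local policy update can be sketched as standard online gradient descent, whose O(sqrt(T)) regret matches the kind of sublinear per-agent bound discussed; `grad_fn` and the step-size constants are hypothetical placeholders, not the paper's controller parameterization.

```python
import numpy as np

def ogd_policy_updates(grad_fn, K0, T, G=1.0, D=1.0):
    """Online gradient descent on controller parameters K against a stream of
    convex losses. With eta_t = D / (G * sqrt(t)), OGD attains O(sqrt(T)) regret.
    grad_fn(K, t) is a hypothetical oracle for the gradient of loss t at K."""
    K = np.asarray(K0, dtype=float).copy()
    trajectory = []
    for t in range(1, T + 1):
        K = K - (D / (G * np.sqrt(t))) * grad_fn(K, t)  # local policy update
        trajectory.append(K.copy())
    return trajectory
```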
[827] SpectrumWorld: Artificial Intelligence Foundation for Spectroscopy
Zhuo Yang, Jiaqing Xie, Shuaike Shen, Daolang Wang, Yeyun Chen, Ben Gao, Shuzhou Sun, Biqing Qi, Dongzhan Zhou, Lei Bai, Linjiang Chen, Shufei Zhang, Qinying Gu, Jun Jiang, Tianfan Fu, Yuqiang Li
Main category: cs.LG
TL;DR: SpectrumLab is a unified platform that systematizes deep learning research in spectroscopy through a Python library, SpectrumAnnotator for benchmark generation, and SpectrumBench with 14 tasks across 10+ spectrum types from 1.2M+ chemical substances. Evaluation of 18 multimodal LLMs reveals current limitations.
Details
Motivation: To address the lack of standardized formulations and accelerate deep learning research in spectroscopy, which currently suffers from inconsistent evaluation methods and limited benchmark resources.
Method: Developed SpectrumLab with three core components: 1) Python library with data processing and evaluation tools, 2) SpectrumAnnotator module for generating high-quality benchmarks from limited seed data, 3) SpectrumBench benchmark suite covering 14 spectroscopic tasks across 10+ spectrum types from 1.2M+ chemical substances.
Result: Empirical studies on SpectrumBench with 18 cutting-edge multimodal LLMs revealed critical limitations of current approaches in spectroscopy deep learning.
Conclusion: SpectrumLab serves as a crucial foundation for future advancements in deep learning-driven spectroscopy by providing standardized evaluation and comprehensive benchmarking capabilities.
Abstract: Deep learning holds immense promise for spectroscopy, yet research and evaluation in this emerging field often lack standardized formulations. To address this issue, we introduce SpectrumLab, a pioneering unified platform designed to systematize and accelerate deep learning research in spectroscopy. SpectrumLab integrates three core components: a comprehensive Python library featuring essential data processing and evaluation tools, along with leaderboards; an innovative SpectrumAnnotator module that generates high-quality benchmarks from limited seed data; and SpectrumBench, a multi-layered benchmark suite covering 14 spectroscopic tasks and over 10 spectrum types, featuring spectra curated from over 1.2 million distinct chemical substances. Thorough empirical studies on SpectrumBench with 18 cutting-edge multimodal LLMs reveal critical limitations of current approaches. We hope SpectrumLab will serve as a crucial foundation for future advancements in deep learning-driven spectroscopy.
[828] Whom to Trust? Adaptive Collaboration in Personalized Federated Learning
Amr Abourayya, Jens Kleesiek, Bharat Rao, Michael Kamp
Main category: cs.LG
TL;DR: FEDMOSAIC is a personalized federated learning method that addresses data heterogeneity through fine-grained trust at the example level, using federated semi-supervised learning with per-example agreement and confidence to outperform both local and centralized training.
Details
Motivation: Data heterogeneity in federated learning creates challenges, and many personalized FL methods fail to outperform basic baselines like local and centralized training, suggesting personalization only works in a narrow regime where global models are insufficient but collaboration still has value.
Method: FEDMOSAIC uses federated semi-supervised learning where clients exchange predictions over shared unlabeled data, enabling per-example reweighting of loss and pseudo-label contributions based on agreement and confidence, without sharing model parameters or raw data.
Result: FEDMOSAIC outperforms strong FL and PFL baselines across various non-IID settings and achieves better performance than both local and centralized training, with proven convergence under standard assumptions.
Conclusion: Federated personalization is effective when using fine-grained, trust-aware collaboration at the example level, and FEDMOSAIC demonstrates how adaptivity in collaboration enables successful personalization in the challenging regime between local and centralized training.
Abstract: Data heterogeneity poses a fundamental challenge in federated learning (FL), especially when clients differ not only in distribution but also in the reliability of their predictions across individual examples. While personalized FL (PFL) aims to address this, we observe that many PFL methods fail to outperform two necessary baselines, local training and centralized training. This suggests that meaningful personalization only emerges in a narrow regime, where global models are insufficient, but collaboration across clients still holds value. Our empirical findings point to two key ingredients for success in this regime: adaptivity in collaboration and fine-grained trust, at the level of individual examples. We show that these properties can be achieved within federated semi-supervised learning, where clients exchange predictions over a shared unlabeled dataset. This enables each client to align with public consensus when it is helpful, and disregard it when it is not, without sharing model parameters or raw data. As a concrete realization of this idea, we develop FEDMOSAIC, a personalized co-training method where clients reweight their loss and their contribution to pseudo-labels based on per-example agreement and confidence. FEDMOSAIC outperforms strong FL and PFL baselines across a range of non-IID settings, and we prove convergence under standard smoothness, bounded-variance, and drift assumptions. In contrast to many of these baselines, it also outperforms local and centralized training. These results clarify when federated personalization can be effective, and how fine-grained, trust-aware collaboration enables it.
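A minimal sketch of per-example, trust-aware reweighting in the spirit of FEDMOSAIC: each unlabeled example's pseudo-label loss is scaled by consensus confidence times local-consensus agreement. The exact weighting rule in the paper may differ; function and parameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def per_example_weights(local_logits, consensus_probs, temperature=1.0):
    """Weight each unlabeled example by (confidence of the public consensus)
    x (agreement between the local prediction and the consensus)."""
    local_probs = F.softmax(local_logits / temperature, dim=-1)
    confidence = consensus_probs.max(dim=-1).values           # peakedness of consensus
    agreement = (local_probs * consensus_probs).sum(dim=-1)   # prob. both pick same class
    return confidence * agreement                             # in [0, 1], per example

def pseudo_label_loss(local_logits, consensus_probs):
    """Cross-entropy to the consensus pseudo-label, downweighted per example."""
    w = per_example_weights(local_logits, consensus_probs).detach()
    ce = F.cross_entropy(local_logits, consensus_probs.argmax(-1), reduction="none")
    return (w * ce).mean()
```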
[829] Relative Entropy Pathwise Policy Optimization
Claas Voelcker, Axel Brunnbauer, Marcel Hussing, Michal Nauman, Pieter Abbeel, Eric Eaton, Radu Grosu, Amir-massoud Farahmand, Igor Gilitschenski
Main category: cs.LG
TL;DR: REPPO is an on-policy algorithm that combines pathwise policy gradients with Q-value models trained purely from on-policy data, achieving stable training and superior efficiency compared to state-of-the-art methods.
Details
Motivation: Score-function methods like REINFORCE and PPO suffer from high variance, while pathwise policy gradients require accurate action-conditioned value functions that typically need off-policy data from replay buffers.
Method: Uses stochastic policies for exploration with constrained updates, trains Q-value models purely from on-policy trajectories, and incorporates architectural components to stabilize value function learning.
Result: REPPO demonstrates strong empirical performance with superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness compared to state-of-the-art methods on standard benchmarks.
Conclusion: REPPO successfully combines the stability of pathwise policy gradients with the simplicity of on-policy learning, providing an efficient alternative to existing methods.
Abstract: Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e. computing a derivative by differentiating the objective function, alleviates the variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that combines the stability of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to state-of-the-art on two standard GPU-parallelized benchmarks, REPPO provides strong empirical performance at superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.
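The core estimator can be illustrated in a few lines: a reparameterized (pathwise) policy gradient that backpropagates through the Q-network instead of using the high-variance score function. This omits REPPO's relative-entropy constraint and value-learning details; `policy` and `q_net` are assumed interfaces.

```python
import torch

def pathwise_policy_loss(policy, q_net, states):
    """Pathwise policy gradient: differentiate Q through sampled actions
    a = mu(s) + sigma(s) * eps (the reparameterization trick)."""
    mu, log_std = policy(states)              # assumed to return Gaussian parameters
    eps = torch.randn_like(mu)
    actions = mu + log_std.exp() * eps        # gradient flows through mu and sigma
    return -q_net(states, actions).mean()     # ascend Q along the action path
```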
[830] Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning
Zian Zhai, Fan Li, Xingyu Tan, Xiaoyang Wang, Wenjie Zhang
Main category: cs.LG
TL;DR: RGVQ is a novel framework that addresses codebook collapse in graph vector quantization by using regularization based on graph topology and feature similarity, improving codebook utilization and token diversity.
Details
Motivation: Codebook collapse is a fundamental challenge in applying vector quantization to graph data, limiting the expressiveness and generalization of graph tokens, and existing mitigation strategies from vision or language domains are insufficient.
Method: RGVQ integrates graph topology and feature similarity as explicit regularization signals, uses soft assignments via Gumbel-Softmax reparameterization to ensure all codewords receive gradient updates, and incorporates structure-aware contrastive regularization to penalize token co-assignments among dissimilar node pairs.
Result: Extensive experiments show that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks.
Conclusion: RGVQ enables more expressive and transferable graph token representations by effectively addressing codebook collapse in graph vector quantization.
Abstract: Vector Quantization (VQ) has recently emerged as a promising approach for learning discrete representations of graph-structured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens. In this paper, we present the first empirical study showing that codebook collapse consistently occurs when applying VQ to graph data, even with mitigation strategies proposed in vision or language domains. To understand why graph VQ is particularly vulnerable to collapse, we provide a theoretical analysis and identify two key factors: early assignment imbalances caused by redundancy in graph features and structural patterns, and self-reinforcing optimization loops in deterministic VQ. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize the token co-assignments among dissimilar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.
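A sketch of the soft-assignment idea, assuming a distance-based logit and PyTorch's `gumbel_softmax`; RGVQ's full method additionally injects topology- and similarity-based regularization not shown here.

```python
import torch
import torch.nn.functional as F

def gumbel_vq_assign(z, codebook, tau=1.0):
    """Soft codeword assignment via Gumbel-Softmax so every codeword receives
    gradient, mitigating the self-reinforcing collapse of deterministic
    nearest-neighbor VQ. z: (n, d) node embeddings, codebook: (K, d)."""
    logits = -torch.cdist(z, codebook)                       # closer codeword -> higher logit
    assign = F.gumbel_softmax(logits, tau=tau, hard=False)   # (n, K), differentiable
    return assign @ codebook, assign                         # quantized embeddings, soft usage
```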
[831] Tricks and Plug-ins for Gradient Boosting with Transformers
Biyi Fang, Jean Utke, Truong Vo, Diego Klabjan
Main category: cs.LG
TL;DR: BoostTransformer integrates boosting principles into transformers via subgrid token selection and importance-weighted sampling for more efficient training and better performance.
Details
Motivation: To address the computational demands and complex hyperparameter tuning required by standard transformer architectures in NLP.
Method: Augments transformers with boosting principles using subgrid token selection, importance-weighted sampling, and a least square boosting objective integrated into the transformer pipeline.
Result: Demonstrates faster convergence and higher accuracy across multiple fine-grained text classification benchmarks compared to standard transformers, while minimizing architectural search overhead.
Conclusion: BoostTransformer provides an effective framework that surpasses standard transformers in both efficiency and performance, reducing computational requirements and hyperparameter tuning complexity.
Abstract: Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.
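The least-squares boosting objective itself is classical and easy to state in code. The sketch below uses a decision tree as a stand-in weak learner, whereas BoostTransformer integrates the same residual-fitting principle into the transformer pipeline; round count and learning rate are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def least_squares_boost(X, y, n_rounds=10, lr=0.5):
    """Least-squares boosting: each round fits a weak learner to the current
    residual y - F(X) and adds a damped correction to the ensemble."""
    pred = np.zeros_like(y, dtype=float)
    models = []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=3).fit(X, y - pred)
        pred = pred + lr * tree.predict(X)     # shrunken residual correction
        models.append(tree)
    return models, pred
```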
[832] Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning
Hector Vargas Alvarez, Dimitrios G. Patsatzis, Lucia Russo, Ioannis Kevrekidis, Constantinos Siettos
Main category: cs.LG
TL;DR: A manifold and machine learning approach to learn crowd dynamics from agent-based simulations by mapping microscopic data to latent spaces, learning reduced-order models, and reconstructing macroscopic density profiles.
Details
Motivation: To bridge the gap between microscopic and macroscopic modeling scales in crowd dynamics for systematic numerical analysis, optimization, and control.
Method: Four-stage approach: (1) derive continuous macroscopic fields from microscopic data using KDE, (2) construct latent space mapping via POD, (3) learn reduced-order models using LSTMs and MVARs, (4) reconstruct dynamics in high-dimensional space.
Result: High accuracy, robustness, and generalizability in modeling crowd dynamics from agent-based simulations. Linear MVAR models outperformed nonlinear LSTMs in predictive accuracy with lower complexity and better interpretability.
Conclusion: The proposed framework creates an effective solution operator for unavailable macroscopic PDEs, enabling fast and accurate modeling of crowd dynamics through latent space learning and reconstruction.
Abstract: Bridging the microscopic and the macroscopic modelling scales in crowd dynamics constitutes an important open challenge for systematic numerical analysis, optimization and control. We propose a combined manifold and machine learning approach to learn the discrete evolution operator for the emergent crowd dynamics in latent spaces from high-fidelity agent-based simulations. The proposed framework builds upon our previous works on next-generation Equation-free algorithms for learning surrogate models of high-dimensional multiscale systems. Our approach is a four-stage one, explicitly conserving the mass of the reconstructed dynamics in the high-dimensional space. In the first step, we derive continuous macroscopic fields (densities) from discrete microscopic data (pedestrians’ positions) using KDE. In the second step, based on manifold learning, we construct a map from the macroscopic ambient space into the latent space parametrized by a few coordinates based on POD of the corresponding density distribution. The third step involves learning surrogate reduced-order models (ROMs) in the latent space using machine learning techniques, particularly LSTM networks and MVARs. Finally, we reconstruct the crowd dynamics in the high-dimensional space in terms of macroscopic density profiles. With this “embed->learn in latent space->lift back to ambient space” pipeline, we create an effective solution operator of the unavailable macroscopic PDE for the density evolution. For our illustrations, we use the Social Force Model (SFM) to generate data in a corridor with an obstacle, imposing periodic boundary conditions. The numerical results demonstrate high accuracy, robustness, and generalizability, thus allowing for fast and accurate modelling of crowd dynamics from agent-based simulations. Notably, linear MVAR models surpass nonlinear LSTMs in predictive accuracy, while also offering significantly lower complexity and greater interpretability.
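A stripped-down version of the embed/learn/lift pipeline, assuming density snapshots have already been computed by KDE; matrix names and the one-step MVAR form are illustrative, not the paper's exact discretization.

```python
import numpy as np

def pod_basis(snapshots, r):
    """POD via truncated SVD. snapshots: (n_grid, n_time) density fields
    obtained by KDE over agent positions; returns the r leading spatial modes."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r]

def mvar_step(A, latent_history):
    """One-step MVAR prediction z_t = sum_k A[k] @ z_{t-k} in the latent space."""
    return sum(A[k] @ latent_history[-(k + 1)] for k in range(len(A)))

# usage sketch: Phi = pod_basis(D, r); z = Phi.T @ D   (embed)
#               z_next = mvar_step(A, history)         (learn/advance in latent space)
#               density_next = Phi @ z_next            (lift back to ambient space)
```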
[833] ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification
Xin Wu, Fei Teng, Ji Zhang, Xingwang Li, Yuxuan Liang
Main category: cs.LG
TL;DR: ERIS framework enables guided feature disentanglement for time series classification by combining energy-guided calibration, weight-level orthogonality, and adversarial generalization to improve out-of-distribution robustness.
Details
Motivation: Current time series classification models struggle with out-of-distribution data due to entangled domain-specific and label-relevant features, creating spurious correlations. Existing feature disentanglement methods lack semantic guidance for effective separation.
Method: Proposed ERIS framework with three key mechanisms: 1) Energy-guided calibration for semantic guidance in feature separation, 2) Weight-level orthogonality to enforce structural independence between domain-specific and label-relevant features, 3) Auxiliary adversarial generalization with structured perturbations for enhanced robustness.
Result: Experiments across four benchmarks show ERIS achieves statistically significant improvement over state-of-the-art baselines, consistently ranking top in performance.
Conclusion: ERIS demonstrates that effective feature disentanglement requires both mathematical constraints and semantic guidance, providing a reliable framework for shift-robust time series classification.
Abstract: An ideal time series classification (TSC) model should be able to capture invariant representations, but achieving reliable performance on out-of-distribution (OOD) data remains a core obstacle. This obstacle arises from the way models inherently entangle domain-specific and label-relevant features, resulting in spurious correlations. While feature disentanglement aims to solve this, current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. To address this, we propose an end-to-end Energy-Regularized Information for Shift-Robustness (ERIS) framework to enable guided and reliable feature disentanglement. The core idea is that effective disentanglement requires not only mathematical constraints but also semantic guidance to anchor the separation process. ERIS incorporates three key mechanisms to achieve this goal. Specifically, we first introduce an energy-guided calibration mechanism, which provides crucial semantic guidance for the separation, enabling the model to self-calibrate. Additionally, a weight-level orthogonality strategy enforces structural independence between domain-specific and label-relevant features, thereby mitigating their interference. Moreover, an auxiliary adversarial generalization mechanism enhances robustness by injecting structured perturbations. Experiments across four benchmarks demonstrate that ERIS achieves a statistically significant improvement over state-of-the-art baselines, consistently securing the top performance rank.
[834] DualNILM: Energy Injection Identification Enabled Disaggregation with Deep Multi-Task Learning
Xudong Wang, Guoming Tang, Junyu Xue, Srinivasan Keshav, Tongxin Li, Chris Ding
Main category: cs.LG
TL;DR: DualNILM is a deep multi-task learning framework using Transformer architecture to address NILM challenges caused by behind-the-meter energy sources like solar panels and batteries, enabling simultaneous appliance state recognition and injected energy identification.
Details
Motivation: Conventional NILM methods perform poorly when behind-the-meter energy sources inject power, obscuring appliance power signatures and reducing monitoring accuracy in modern energy systems with renewable penetration.
Method: Transformer-based multi-task learning framework integrating sequence-to-point and sequence-to-sequence strategies to capture multiscale temporal dependencies in aggregate power consumption patterns for dual tasks of appliance state recognition and injected energy identification.
Result: Extensive evaluation shows DualNILM maintains excellent performance for both tasks, significantly outperforming conventional NILM methods on self-collected and synthesized datasets.
Conclusion: DualNILM demonstrates strong potential for robust energy disaggregation in modern energy systems with renewable energy sources, and the synthetic photovoltaic datasets will be open-sourced.
Abstract: Non-Intrusive Load Monitoring (NILM) offers a cost-effective method to obtain fine-grained appliance-level energy consumption in smart homes and building applications. However, the increasing adoption of behind-the-meter (BTM) energy sources such as solar panels and battery storage poses new challenges for conventional NILM methods that rely solely on at-the-meter data. The energy injected from the BTM sources can obscure the power signatures of individual appliances, leading to a significant decrease in NILM performance. To address this challenge, we present DualNILM, a deep multi-task learning framework designed for the dual tasks of appliance state recognition and injected energy identification. Using a Transformer-based architecture that integrates sequence-to-point and sequence-to-sequence strategies, DualNILM effectively captures multiscale temporal dependencies in the aggregate power consumption patterns, allowing for accurate appliance state recognition and energy injection identification. Extensive evaluation on self-collected and synthesized datasets demonstrates that DualNILM maintains excellent performance on both tasks, substantially outperforming conventional methods. Our work underscores the framework’s potential for robust energy disaggregation in modern energy systems with renewable penetration. Synthetic photovoltaic augmented datasets with realistic injection simulation methodology will be open-sourced after review.
[835] Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
David Chanin, Adrià Garriga-Alonso
Main category: cs.LG
TL;DR: Sparse Autoencoders (SAE) require correct L0 hyperparameter setting - too low mixes correlated features, too high finds degenerate solutions. Authors propose a proxy metric to find optimal L0.
Details
Motivation: Existing work treats L0 as a free parameter affecting only reconstruction, but improper L0 causes SAEs to fail at feature disentanglement in LLMs.
Method: Study L0 effects on SAEs, develop proxy metric to find optimal L0, validate on toy models and LLM SAEs with sparse probing performance.
Result: Found most commonly used SAEs have L0 that is too low. Optimal L0 coincides with peak sparse probing performance in LLM SAEs.
Conclusion: L0 must be set correctly to train SAEs with correct features; improper L0 leads to feature mixing and failure of feature disentanglement.
Abstract: Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.
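For readers unfamiliar with how L0 is imposed in practice, one common construction is a TopK sparse autoencoder, where L0 is pinned to exactly k active features per token by design. This is a generic sketch of that construction, not necessarily the SAE variant studied in the paper.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal sparse autoencoder with L0 fixed via a TopK activation:
    exactly k features fire per token, making L0 an explicit design choice."""
    def __init__(self, d_model, d_sae, k):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        pre = self.enc(x)
        topk = torch.topk(pre, self.k, dim=-1)          # keep the k largest pre-activations
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values.relu())
        return self.dec(acts), acts                      # reconstruction, sparse codes
```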
[836] Multi-Channel Differential Transformer for Cross-Domain Sleep Stage Classification with Heterogeneous EEG and EOG
Benjamin Wei Hao Chin, Yuin Torng Yew, Haocheng Wu, Lanxin Liang, Chow Khuen Chan, Norita Mohd Zain, Siti Balqis Samdin, Sim Kuan Goh
Main category: cs.LG
TL;DR: SleepDIFFormer is a multi-channel differential transformer framework for sleep stage classification that addresses generalization challenges across diverse clinical EEG-EOG datasets through domain-invariant representation learning.
Details
Motivation: Manual sleep stage classification is time-consuming and error-prone, while existing machine learning methods struggle with generalization due to non-stationarity and variability of EEG-EOG signals across different clinical settings.
Method: Proposes SleepDIFFormer with Multi-channel Differential Transformer Architecture (MDTA) that processes raw EEG-EOG signals, incorporates cross-domain alignment, and mitigates spatial-temporal attention noise through feature distribution alignment across datasets.
Result: Achieved state-of-the-art performance on five diverse sleep staging datasets under domain generalization settings, with comprehensive ablation studies and attention weight interpretation showing relevance to characteristic sleep EEG patterns.
Conclusion: The framework advances automated sleep stage classification and shows potential for quantifying sleep architecture and detecting abnormalities that disrupt restorative rest, with publicly available source code.
Abstract: Classification of sleep stages is essential for assessing sleep quality and diagnosing sleep disorders. However, manual inspection of EEG characteristics for each stage is time-consuming and prone to human error. Although machine learning and deep learning methods have been actively developed, they continue to face challenges arising from the non-stationarity and variability of electroencephalography (EEG) and electrooculography (EOG) signals across diverse clinical configurations, often resulting in poor generalization. In this work, we propose SleepDIFFormer, a multi-channel differential transformer framework for heterogeneous EEG-EOG representation learning. SleepDIFFormer is trained across multiple sleep staging datasets, each treated as a source domain, with the goal of generalizing to unseen target domains. Specifically, it employs a Multi-channel Differential Transformer Architecture (MDTA) designed to process raw EEG and EOG signals while incorporating cross-domain alignment. Our approach mitigates spatial and temporal attention noise and learns a domain-invariant EEG-EOG representation through feature distribution alignment across datasets, thereby enhancing generalization to new domains. Empirically, we evaluated SleepDIFFormer on five diverse sleep staging datasets under domain generalization settings and benchmarked it against existing approaches, achieving state-of-the-art performance. We further conducted a comprehensive ablation study and interpreted the differential attention weights, demonstrating their relevance to characteristic sleep EEG patterns. These findings advance the development of automated sleep stage classification and highlight its potential in quantifying sleep architecture and detecting abnormalities that disrupt restorative rest. Our source code and checkpoint are made publicly available at https://github.com/Ben1001409/SleepDIFFormer
[837] In-Context Algorithm Emulation in Fixed-Weight Transformers
Jerry Yao-Chieh Hu, Hude Liu, Jennifer Yuntong Zhang, Han Liu
Main category: cs.LG
TL;DR: Minimal Transformers with frozen weights can emulate algorithms through in-context prompting, with task-specific mode reproducing functions like gradient descent and linear regression, and prompt-programmable mode achieving universality via prompting alone.
Details
Motivation: To understand how Transformers can emulate algorithms through in-context learning without weight updates, and establish a link between in-context learning and algorithmic emulation.
Method: Use single-head softmax attention layers with frozen weights, constructing prompts that encode algorithm parameters into token representations to create sharp dot-product gaps that force the attention to follow intended computations.
Result: Proved that Transformers can emulate a broad class of algorithms including gradient descent and linear regression, with numerical results supporting the theory.
Conclusion: Transformers can serve as prompt-programmable libraries of algorithms, enabling GPT-style models to swap algorithms via prompts alone, establishing algorithmic universality in modern Transformer models.
Abstract: We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. We formalize two modes of in-context algorithm emulation. In the task-specific mode, for any continuous function $f: \mathbb{R} \to \mathbb{R}$, we show the existence of a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision. This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression). In the prompt-programmable mode, we prove universality: a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via only prompting. Our key idea is to construct prompts that encode an algorithm’s parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. Numerical results corroborate our theory. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
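The "sharp dot-product gap" mechanism can be demonstrated numerically: with near-orthogonal prompt keys, scaling the logits drives softmax attention to an essentially exact lookup of the intended token. A toy illustration only, not the paper's construction.

```python
import torch

# Prompt tokens encode parameters as (nearly) orthogonal keys; a large scale on
# the logits makes softmax attention select exactly the intended value row,
# i.e. the prompt "programs" a lookup with frozen weights.
d, scale = 16, 50.0
keys = torch.eye(d)                        # prompt keys: one orthogonal slot per row
values = torch.randn(d, 4)                 # payload carried by each prompt token
query = keys[3]                            # query aligned with the intended slot
attn = torch.softmax(scale * keys @ query, dim=0)
out = attn @ values                        # ~= values[3]: attention follows the program
print(torch.allclose(out, values[3], atol=1e-3))   # True
```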
[838] Scalable Option Learning in High-Throughput Environments
Mikael Henaff, Scott Fujimoto, Michael Matthews, Michael Rabbat
Main category: cs.LG
TL;DR: SOL is a highly scalable hierarchical RL algorithm that achieves 35x higher throughput than existing methods, demonstrating strong performance on NetHack, MiniHack, and Mujoco environments.
Details
Motivation: To enable effective decision-making over long timescales through hierarchical RL and overcome the scaling limitations of existing approaches in high-throughput environments.
Method: Proposed Scalable Option Learning (SOL), a hierarchical RL algorithm designed for high scalability and throughput, trained on 30 billion frames of experience.
Result: Achieved ~35x higher throughput compared to existing hierarchical methods, significantly surpassed flat agents on NetHack, and showed positive scaling trends across multiple environments.
Conclusion: SOL successfully demonstrates the scalability and general applicability of hierarchical RL, with open-sourced implementation available for further research.
Abstract: Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL’s performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.
[839] Challenges in Non-Polymeric Crystal Structure Prediction: Why a Geometric, Permutation-Invariant Loss is Needed
Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin
Main category: cs.LG
TL;DR: A simple regression model with a better-formulated loss function outperforms state-of-the-art methods for molecular crystal structure prediction on the COD-Cluster17 benchmark.
Details
Motivation: Accurately predicting three-dimensional non-polymeric crystal structures remains challenging despite advances in computational materials science, and existing methods have ill-posed learning objectives.
Method: Proposed a better formulation with a loss function that captures key geometric molecular properties while ensuring permutation invariance, using a simple regression model within this framework.
Result: The simple regression model outperforms prior approaches including flow matching techniques on the COD-Cluster17 benchmark, a curated non-polymeric subset of the Crystallography Open Database.
Conclusion: A properly formulated learning objective with geometric constraints enables even simple models to achieve state-of-the-art performance in molecular crystal structure prediction.
Abstract: Crystalline structure prediction is an essential prerequisite for designing materials with targeted properties. Yet, it is still an open challenge in materials design and drug discovery. Despite recent advances in computational materials science, accurately predicting three-dimensional non-polymeric crystal structures remains elusive. In this work, we focus on the molecular assembly problem, where a set $\mathcal{S}$ of identical rigid molecules is packed to form a crystalline structure. Such a simplified formulation provides a useful approximation to the actual problem. However, while recent state-of-the-art methods have increasingly adopted sophisticated techniques, the underlying learning objective remains ill-posed. We propose a better formulation that introduces a loss function capturing key geometric molecular properties while ensuring permutation invariance over $\mathcal{S}$. Remarkably, we demonstrate that within this framework, a simple regression model already outperforms prior approaches, including flow matching techniques, on the COD-Cluster17 benchmark, a curated non-polymeric subset of the Crystallography Open Database (COD).
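One familiar instance of a geometric, permutation-invariant set loss is the Chamfer distance, sketched below for molecular center positions. The paper's loss captures richer molecular geometry, so treat this only as an illustration of the invariance requirement over the set $\mathcal{S}$ of identical molecules.

```python
import torch

def chamfer_loss(pred, target):
    """Permutation-invariant loss between two point sets (n, 3): each predicted
    point is matched to its nearest target and vice versa, so reordering the
    identical molecules leaves the loss unchanged."""
    d = torch.cdist(pred, target)              # (n, n) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```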
[840] Graph Random Features for Scalable Gaussian Processes
Matthew Zhang, Jihao Andreas Lin, Krzysztof Choromanski, Adrian Weller, Richard E. Turner, Isaac Reid
Main category: cs.LG
TL;DR: Graph random features enable scalable Gaussian processes on graphs with O(N^{3/2}) time complexity instead of O(N^3), allowing Bayesian optimization on graphs with over 1 million nodes.
Details
Motivation: To enable scalable Bayesian inference on large graphs where exact kernel methods have prohibitive O(N^3) computational complexity.
Method: Use graph random features (GRFs) as stochastic estimators of graph node kernels for Gaussian processes on discrete input spaces.
Result: Achieved substantial speedups and memory savings, enabling Bayesian optimization on graphs with over 1 million nodes on a single computer chip while maintaining competitive performance.
Conclusion: GRFs provide an efficient approach for scalable Bayesian inference on large graphs with proven computational advantages over exact kernel methods.
Abstract: We study the application of graph random features (GRFs) - a recently introduced stochastic estimator of graph node kernels - to scalable Gaussian processes on discrete input spaces. We prove that (under mild assumptions) Bayesian inference with GRFs enjoys $O(N^{3/2})$ time complexity with respect to the number of nodes $N$, compared to $O(N^3)$ for exact kernels. Substantial wall-clock speedups and memory savings unlock Bayesian optimisation on graphs with over $10^6$ nodes on a single computer chip, whilst preserving competitive performance.
[841] Towards a Physics Foundation Model
Florian Wiesner, Matthias Wessling, Stephen Baek
Main category: cs.LG
TL;DR: GPhyT is a General Physics Transformer trained on diverse simulation data that demonstrates foundation model capabilities for physics, enabling single-model simulation across multiple domains without retraining.
Details
Motivation: To create a Physics Foundation Model (PFM) that democratizes access to high-fidelity simulations, accelerates scientific discovery, and eliminates the need for specialized solver development for different physical systems.
Method: Trained a transformer model (GPhyT) on 1.8 TB of diverse simulation data, enabling it to learn governing dynamics from context and simulate various physical phenomena without being told the underlying equations.
Result: Achieved three breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by up to 29x, (2) zero-shot generalization to unseen systems through in-context learning, and (3) stable 50-timestep rollouts for long-term predictions.
Conclusion: This work establishes that a single model can learn generalizable physical principles from data alone, opening the path toward a universal PFM that could transform computational science and engineering.
Abstract: Foundation models have revolutionized natural language processing through a "train once, deploy anywhere" paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative – democratizing access to high-fidelity simulations, accelerating scientific discovery, and eliminating the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, that demonstrates foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by up to 29x, (2) zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) stable long-term predictions through 50-timestep rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.
[842] AI for Scientific Discovery is a Social Problem
Georgia Channing, Avijit Ghosh
Main category: cs.LG
TL;DR: AI for science faces social and institutional barriers, not just technical ones, requiring collaborative approaches and equitable infrastructure.
Details
Motivation: To address the uneven distribution of AI benefits in science, focusing on social and institutional challenges rather than just technical obstacles.
Method: Analysis of four interconnected challenges: community dysfunction, misaligned research priorities, data fragmentation, and infrastructure inequities.
Result: Identified cultural and organizational practices as root causes, requiring community-building, cross-disciplinary education, and shared infrastructure.
Conclusion: AI for science should be reframed as a collective social project where sustainable collaboration and equitable participation are prerequisites for progress.
Abstract: Artificial intelligence promises to accelerate scientific discovery, yet its benefits remain unevenly distributed. While technical obstacles such as scarce data, fragmented standards, and unequal access to computation are significant, we argue that the primary barriers are social and institutional. Narratives that defer progress to speculative “AI scientists,” the undervaluing of data and infrastructure contributions, misaligned incentives, and gaps between domain experts and machine learning researchers all constrain impact. We highlight four interconnected challenges: community dysfunction, research priorities misaligned with upstream needs, data fragmentation, and infrastructure inequities. We argue that their roots lie in cultural and organizational practices. Addressing them requires not only technical innovation but also intentional community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure. We call for reframing AI for science as a collective social project, where sustainable collaboration and equitable participation are treated as prerequisites for technical progress.
[843] A Variational Framework for Residual-Based Adaptivity in Neural PDE Solvers and Operator Learning
Juan Diego Toscano, Daniel T. Chen, Vivek Oommen, Jérôme Darbon, George Em Karniadakis
Main category: cs.LG
TL;DR: A variational framework formalizes residual-based adaptive strategies in scientific machine learning by integrating convex transformations of the residual, linking discretization choices to error metrics and improving performance.
Details
Motivation: To provide a theoretical foundation for residual-based adaptive strategies in scientific machine learning, which have been widely used but remain largely heuristic.
Method: Introduces a unifying variational framework that integrates convex transformations of the residual, where different transformations correspond to distinct objective functionals (exponential weights for uniform error, linear weights for quadratic error).
Result: The framework enables systematic design of adaptive schemes across norms, reduces discretization error through variance reduction, and enhances learning dynamics by improving gradient signal-to-noise ratio. Demonstrates substantial performance gains in operator learning.
Conclusion: Provides theoretical justification for residual-based adaptivity and establishes a foundation for principled discretization and training strategies in scientific machine learning.
Abstract: Residual-based adaptive strategies are widely used in scientific machine learning but remain largely heuristic. We introduce a unifying variational framework that formalizes these methods by integrating convex transformations of the residual. Different transformations correspond to distinct objective functionals: exponential weights target the minimization of uniform error, while linear weights recover the minimization of quadratic error. Within this perspective, adaptive weighting is equivalent to selecting sampling distributions that optimize the primal objective, thereby linking discretization choices directly to error metrics. This principled approach yields three benefits: (1) it enables systematic design of adaptive schemes across norms, (2) reduces discretization error through variance reduction of the loss estimator, and (3) enhances learning dynamics by improving the gradient signal-to-noise ratio. Extending the framework to operator learning, we demonstrate substantial performance gains across optimizers and architectures. Our results provide a theoretical justification of residual-based adaptivity and establish a foundation for principled discretization and training strategies.
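A minimal sketch of the two convex transformations named above, turning pointwise residuals into a sampling distribution over collocation points; the hyperparameter `beta` and the normalization are illustrative choices, not the paper's exact scheme.

```python
import torch

def residual_weights(residuals, mode="exponential", beta=1.0):
    """Convex transformations of the residual as sampling weights:
    'exponential' emphasizes worst-case points (uniform-error objective),
    'linear' recovers quadratic-error minimization."""
    r = residuals.abs().detach()
    if mode == "exponential":
        w = torch.exp(beta * (r - r.max()))    # shift by max for numerical stability
    else:
        w = r
    return w / w.sum()                          # distribution over collocation points

# usage: idx = torch.multinomial(residual_weights(r), num_samples, replacement=True)
```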
[844] Dendritic Resonate-and-Fire Neuron for Effective and Efficient Long Sequence Modeling
Dehao Zhang, Malu Zhang, Shuai Wang, Jingya Wang, Wenjie Wei, Zeyu Ma, Guoqing Wang, Yang Yang, Haizhou Li
Main category: cs.LG
TL;DR: The paper proposes a Dendritic Resonate-and-Fire (D-RF) model that improves long sequence modeling by incorporating multi-dendritic architecture and adaptive threshold mechanisms to enhance memory capacity and spike sparsity while maintaining computational efficiency.
Details
Motivation: To address the limitations of Resonate-and-Fire (RF) neurons in long sequence modeling, particularly their limited effective memory capacity and trade-off between energy efficiency and training speed on complex temporal tasks.
Method: Proposed a Dendritic Resonate-and-Fire (D-RF) model with multi-dendritic and soma architecture, where each dendritic branch encodes specific frequency bands using RF neuron oscillatory dynamics, and introduced an adaptive threshold mechanism in the soma that adjusts based on historical spiking activity.
Result: Extensive experiments show the method maintains competitive accuracy while substantially ensuring sparse spikes without compromising computational efficiency during training.
Conclusion: The D-RF model demonstrates potential as an effective and efficient solution for long sequence modeling on edge platforms, achieving better frequency representation and reduced redundant spikes while maintaining training efficiency.
Abstract: The explosive growth in sequence length has intensified the demand for effective and efficient long sequence modeling. Benefiting from intrinsic oscillatory membrane dynamics, Resonate-and-Fire (RF) neurons can efficiently extract frequency components from input signals and encode them into spatiotemporal spike trains, making them well-suited for long sequence modeling. However, RF neurons exhibit limited effective memory capacity and a trade-off between energy efficiency and training speed on complex temporal tasks. Inspired by the dendritic structure of biological neurons, we propose a Dendritic Resonate-and-Fire (D-RF) model, which explicitly incorporates a multi-dendritic and soma architecture. Each dendritic branch encodes specific frequency bands by utilizing the intrinsic oscillatory dynamics of RF neurons, thereby collectively achieving comprehensive frequency representation. Furthermore, we introduce an adaptive threshold mechanism into the soma structure that adjusts the threshold based on historical spiking activity, reducing redundant spikes while maintaining training efficiency in long sequence tasks. Extensive experiments demonstrate that our method maintains competitive accuracy while substantially ensuring sparse spikes without compromising computational efficiency during training. These results underscore its potential as an effective and efficient solution for long sequence modeling on edge platforms.
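The underlying resonate-and-fire dynamics are those of a damped complex oscillator; a minimal discretization is sketched below, with one frequency `omega` per dendritic branch in the D-RF picture. The constants, threshold test, and reset rule are illustrative assumptions, not the paper's formulation.

```python
import torch

def rf_neuron(inputs, omega=6.0, b=-0.1, dt=0.01, theta=1.0):
    """Discretized resonate-and-fire neuron: a damped complex oscillator
    z' = (b + i*omega) z + I(t) that emits a spike when the oscillation
    crosses the threshold, then resets."""
    decay = torch.exp(torch.tensor(complex(b, omega)) * dt)   # per-step rotation + damping
    z = torch.zeros((), dtype=torch.cfloat)
    spikes = []
    for I in inputs:                       # inputs: 1-D real tensor of drive values
        z = decay * z + I * dt
        spikes.append(float(z.imag > theta))
        if spikes[-1]:
            z = torch.zeros_like(z)        # reset membrane after a spike
    return torch.tensor(spikes)
```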
[845] RMT-KD: Random Matrix Theoretic Causal Knowledge Distillation
Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi
Main category: cs.LG
TL;DR: RMT-KD uses Random Matrix Theory for knowledge distillation to compress large models by preserving only informative spectral directions, achieving 80% parameter reduction with minimal accuracy loss.
Details
Motivation: Large models like BERT and ResNet are costly to deploy at the edge due to size and compute demands, requiring efficient compression methods.
Method: Layer-by-layer RMT-based causal reduction with self-distillation, using spectral properties of hidden representations to identify and preserve informative directions instead of pruning or heuristic rank selection.
Result: Achieves up to 80% parameter reduction with only 2% accuracy loss on GLUE, AG News, and CIFAR-10, delivering 2.8x faster inference and nearly halved power consumption.
Conclusion: RMT-KD establishes a mathematically grounded approach to network distillation that effectively balances compression and performance.
Abstract: Large deep learning models such as BERT and ResNet achieve state-of-the-art performance but are costly to deploy at the edge due to their size and compute demands. We present RMT-KD, a compression method that leverages Random Matrix Theory (RMT) for knowledge distillation to iteratively reduce network size. Instead of pruning or heuristic rank selection, RMT-KD preserves only informative directions identified via the spectral properties of hidden representations. RMT-based causal reduction is applied layer by layer with self-distillation to maintain stability and accuracy. On GLUE, AG News, and CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy loss, delivering 2.8x faster inference and nearly halved power consumption. These results establish RMT-KD as a mathematically grounded approach to network distillation.
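A sketch of RMT-style direction selection using the Marchenko-Pastur upper edge as the noise cutoff: eigenvalues below the edge are consistent with pure noise and can be discarded. RMT-KD's layer-wise causal reduction and self-distillation loop are not shown.

```python
import numpy as np

def informative_directions(H):
    """Keep only eigendirections of the activation covariance whose eigenvalues
    exceed the Marchenko-Pastur upper edge (1 + sqrt(d/n))^2 for standardized
    features. H: (n_samples, d) hidden activations from one layer."""
    n, d = H.shape
    Hc = (H - H.mean(0)) / H.std(0).clip(1e-8)     # standardize each feature
    cov = Hc.T @ Hc / n
    lam_plus = (1 + np.sqrt(d / n)) ** 2           # MP edge for unit-variance noise
    vals, vecs = np.linalg.eigh(cov)
    keep = vals > lam_plus                          # directions noise cannot explain
    return vecs[:, keep], vals[keep]                # projection basis, retained spectrum
```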
[846] TimeMosaic: Temporal Heterogeneity Guided Time Series Forecasting via Adaptive Granularity Patch and Segment-wise Decoding
Kuiye Ding, Fanda Fan, Chunyi Hou, Zheya Wang, Lei Wang, Zhengxin Yang, Jianfeng Zhan
Main category: cs.LG
TL;DR: TimeMosaic is a multivariate time series forecasting framework that addresses temporal heterogeneity through adaptive patch embedding and segment-wise decoding, achieving competitive performance with state-of-the-art models.
Details
Motivation: Existing patch-based methods use fixed-length segmentation, which overlooks heterogeneity in local temporal dynamics and forecasting decoding. This causes loss of details in information-dense regions, redundancy in stable segments, and failure to capture different complexities of short-term vs long-term horizons.
Method: TimeMosaic employs adaptive patch embedding to dynamically adjust granularity based on local information density, balancing motif reuse with structural clarity while maintaining temporal continuity. It also uses segment-wise decoding that treats each prediction horizon as a related subtask, adapting to horizon-specific difficulty and information requirements.
Result: Extensive evaluations on benchmark datasets show consistent improvements over existing methods. The model trained on a large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art time series foundation models (TSFMs).
Conclusion: TimeMosaic effectively addresses temporal heterogeneity in multivariate time series forecasting through its adaptive patch embedding and segment-wise decoding approach, demonstrating superior performance compared to traditional fixed-length segmentation methods.
Abstract: Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.
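A toy version of variance-driven adaptive patching: information-dense windows are split into finer patches while stable windows stay coarse. Thresholds, sizes, and the variance proxy are illustrative, not TimeMosaic's embedding.

```python
import torch

def adaptive_patches(x, base=16, fine=4, thresh=1.0):
    """Split a 1-D series into coarse patches, refining any window whose local
    variability (a crude proxy for information density) exceeds a threshold."""
    patches = []
    for i in range(0, len(x) - base + 1, base):
        seg = x[i:i + base]
        if seg.std() > thresh:                          # information-dense region
            patches += [seg[j:j + fine] for j in range(0, base, fine)]
        else:                                           # stable region stays coarse
            patches.append(seg)
    return patches
```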
[847] EigenTrack: Spectral Activation Feature Tracking for Hallucination and Out-of-Distribution Detection in LLMs and VLMs
Davide Ettori, Nastaran Darabi, Sina Tayebati, Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo, Amit Ranjan Trivedi
Main category: cs.LG
TL;DR: EigenTrack is a real-time detector that uses spectral geometry of hidden activations to identify hallucination and out-of-distribution errors in LLMs before surface errors appear.
Details
Motivation: Large language models are prone to hallucination and out-of-distribution errors, requiring effective detection methods that can identify these issues early.
Method: Uses spectral geometry of hidden activations, streaming covariance-spectrum statistics (entropy, eigenvalue gaps, KL divergence) into a lightweight recurrent classifier to track temporal shifts in representation structure.
Result: Can detect hallucination and OOD drift before surface errors appear, requires only single forward pass without resampling, preserves temporal context, and offers interpretable accuracy-latency trade-offs.
Conclusion: EigenTrack provides an interpretable, efficient method for real-time detection of LLM errors that outperforms existing black-box, grey-box, and white-box approaches.
Abstract: Large language models (LLMs) offer broad utility but remain prone to hallucination and out-of-distribution (OOD) errors. We propose EigenTrack, an interpretable real-time detector that uses the spectral geometry of hidden activations, a compact global signature of model dynamics. By streaming covariance-spectrum statistics such as entropy, eigenvalue gaps, and KL divergence from random baselines into a lightweight recurrent classifier, EigenTrack tracks temporal shifts in representation structure that signal hallucination and OOD drift before surface errors appear. Unlike black- and grey-box methods, it needs only a single forward pass without resampling. Unlike existing white-box detectors, it preserves temporal context, aggregates global signals, and offers interpretable accuracy-latency trade-offs.
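One of the covariance-spectrum statistics is straightforward to compute; a sketch of spectral entropy over a window of hidden states follows, with eigengap and KL-to-baseline statistics computed analogously. Windowing and normalization choices here are assumptions.

```python
import numpy as np

def spectral_entropy(H):
    """Entropy of the normalized covariance spectrum of hidden activations
    H (n_tokens, d): a compact global signature of representation structure."""
    Hc = H - H.mean(0)
    eig = np.linalg.eigvalsh(Hc.T @ Hc / len(H))   # covariance eigenvalues
    p = np.clip(eig, 1e-12, None)
    p = p / p.sum()                                 # treat spectrum as a distribution
    return float(-(p * np.log(p)).sum())

# Streamed per generation step, drift in this signature feeds the
# lightweight recurrent classifier described above.
```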
[848] GPU Temperature Simulation-Based Testing for In-Vehicle Deep Learning Frameworks
Yinglong Zou, Juan Zhai, Chunrong Fang, Zhenyu Chen
Main category: cs.LG
TL;DR: ThermalGuardian is a testing method for automotive deep learning frameworks that addresses quality issues caused by temperature variations in vehicular environments.
Details
Motivation: Automotive deep learning frameworks are deployed in temperature-varying environments (-40°C to 50°C) which affect GPU performance through frequency adjustments, causing quality issues that existing testing methods don't detect.
Method: Generates test input models using model mutation rules for temperature-sensitive operators, simulates GPU temperature fluctuations using Newton's law of cooling, and controls GPU frequency based on real-time temperature.
Result: The method can detect critical quality issues including delays/errors in compute-intensive operators, precision errors in high/mixed-precision operators, and synchronization issues in time-series operators.
Conclusion: ThermalGuardian is the first automotive deep learning framework testing method that considers temperature effects, bridging the gap in existing testing approaches for temperature-varying environments.
Abstract: Deep learning models play a vital role in autonomous driving systems, supporting critical functions such as environmental perception. To accelerate model inference, these deep learning models’ deployment relies on automotive deep learning frameworks, for example, PaddleInference in Apollo and TensorRT in AutoWare. However, unlike deploying deep learning models on the cloud, vehicular environments experience extreme ambient temperatures varying from -40°C to 50°C, significantly impacting GPU temperature. Additionally, heat generated during computation further increases GPU temperature. These temperature fluctuations lead to dynamic GPU frequency adjustments through mechanisms such as DVFS. However, automotive deep learning frameworks are designed without considering the impact of temperature-induced frequency variations. When deployed on temperature-varying GPUs, these frameworks suffer critical quality issues: compute-intensive operators face delays or errors, high/mixed-precision operators suffer from precision errors, and time-series operators suffer from synchronization issues. The above quality issues cannot be detected by existing deep learning framework testing methods because they ignore temperature’s effect on the deep learning framework quality. To bridge this gap, we propose ThermalGuardian, the first automotive deep learning framework testing method under temperature-varying environments. Specifically, ThermalGuardian generates test input models using model mutation rules targeting temperature-sensitive operators, simulates GPU temperature fluctuations based on Newton’s law of cooling, and controls GPU frequency based on real-time GPU temperature.
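The simulation component rests on a standard ODE, so a toy version is easy to sketch: GPU temperature follows Newton's law of cooling plus a load-dependent heating term, and a toy DVFS rule throttles clock frequency above a threshold. All constants below are illustrative, not the paper's.

```python
def simulate_gpu_temp(ambient_c, load, t0_c=40.0, dt=1.0, k=0.05,
                      heat_per_load=3.0):
    """Yield GPU temperature over time for a sequence of utilization values.

    ambient_c: ambient temperature in Celsius (anywhere in -40..50)
    load: iterable of GPU utilization values in [0, 1], one per time step
    """
    temp = t0_c
    for u in load:
        # Newton's law of cooling: dT/dt = -k * (T - T_ambient) + heating
        temp += dt * (-k * (temp - ambient_c) + heat_per_load * u)
        yield temp

def dvfs_frequency(temp_c, f_max=1.8e9, f_min=0.6e9, throttle_at=80.0):
    """Toy DVFS rule: linearly throttle clock frequency above a threshold."""
    if temp_c <= throttle_at:
        return f_max
    scale = max(0.0, 1.0 - (temp_c - throttle_at) / 20.0)
    return f_min + (f_max - f_min) * scale
```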
[849] Self-Supervised Learning of Graph Representations for Network Intrusion Detection
Lorenzo Guerra, Thomas Chapuis, Guillaume Duc, Pavlo Mozharovskyi, Van-Tam Nguyen
Main category: cs.LG
TL;DR: GraphIDS is a self-supervised intrusion detection model that unifies representation learning and anomaly detection using graph neural networks and masked autoencoders to identify network intrusions through reconstruction errors.
Details
Motivation: Existing graph neural network approaches for network intrusion detection often decouple representation learning from anomaly detection, limiting the utility of embeddings for identifying attacks.
Method: Uses an inductive graph neural network to embed network flows with local topological context, combined with a Transformer-based encoder-decoder that reconstructs embeddings to learn global co-occurrence patterns via self-attention without explicit positional information.
Result: Achieves up to 99.98% PR-AUC and 99.61% macro F1-score on diverse NetFlow benchmarks, outperforming baselines by 5-25 percentage points.
Conclusion: The end-to-end framework ensures embeddings are directly optimized for intrusion detection, facilitating effective recognition of malicious traffic through reconstruction errors.
Abstract: Detecting intrusions in network traffic is a challenging task, particularly under limited supervision and constantly evolving attack patterns. While recent works have leveraged graph neural networks for network intrusion detection, they often decouple representation learning from anomaly detection, limiting the utility of the embeddings for identifying attacks. We propose GraphIDS, a self-supervised intrusion detection model that unifies these two stages by learning local graph representations of normal communication patterns through a masked autoencoder. An inductive graph neural network embeds each flow with its local topological context to capture typical network behavior, while a Transformer-based encoder-decoder reconstructs these embeddings, implicitly learning global co-occurrence patterns via self-attention without requiring explicit positional information. During inference, flows with unusually high reconstruction errors are flagged as potential intrusions. This end-to-end framework ensures that embeddings are directly optimized for the downstream task, facilitating the recognition of malicious traffic. On diverse NetFlow benchmarks, GraphIDS achieves up to 99.98% PR-AUC and 99.61% macro F1-score, outperforming baselines by 5-25 percentage points.
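The inference rule described in the abstract (flag flows with unusually high reconstruction error) reduces to a few lines. The quantile-based threshold below is a common choice and an assumption, not necessarily the paper's.

```python
import numpy as np

def reconstruction_scores(embeddings, reconstructions):
    """Per-flow anomaly score: mean squared reconstruction error."""
    return ((embeddings - reconstructions) ** 2).mean(axis=1)

def flag_intrusions(train_scores, test_scores, quantile=0.99):
    """Flag flows whose error exceeds a quantile of benign training errors."""
    threshold = np.quantile(train_scores, quantile)
    return test_scores > threshold
```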
[850] Diffusion-Augmented Contrastive Learning: A Noise-Robust Encoder for Biosignal Representations
Rami Zewail
Main category: cs.LG
TL;DR: DACL is a hybrid framework combining diffusion models and supervised contrastive learning to create robust biosignal representations using diffusion-based data augmentation in latent space.
Details
Motivation: Traditional data augmentation methods fail to capture complex variations in physiological data, necessitating more effective approaches for robust representation learning.
Method: Uses VAE on Scattering Transformer features to create latent space, applies diffusion forward process for data augmentation, and trains U-Net encoder with supervised contrastive learning for noise-invariant embeddings.
Result: Achieved competitive AUROC of 0.7815 on PhysioNet 2017 ECG dataset, demonstrating effective class separability.
Conclusion: Establishes new paradigm using diffusion process to drive contrastive learning, creating noise-invariant embeddings with strong class discrimination foundation.
Abstract: Learning robust representations for biosignals is often hampered by the challenge of designing effective data augmentations. Traditional methods can fail to capture the complex variations inherent in physiological data. Within this context, we propose a novel hybrid framework, Diffusion-Augmented Contrastive Learning (DACL), that fuses concepts from diffusion models and supervised contrastive learning. The DACL framework operates on a latent space created by a lightweight Variational Autoencoder (VAE) trained on our novel Scattering Transformer (ST) features [12]. It utilizes the diffusion forward process as a principled data augmentation technique to generate multiple noisy views of these latent embeddings. A U-Net style encoder is then trained with a supervised contrastive objective to learn a representation that balances class discrimination with robustness to noise across various diffusion time steps. We evaluated this proof-of-concept method on the PhysioNet 2017 ECG dataset, achieving a competitive AUROC of 0.7815. This work establishes a new paradigm for representation learning by using the diffusion process itself to drive the contrastive objective, creating noise-invariant embeddings that demonstrate a strong foundation for class separability.
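The "diffusion forward process as augmentation" step has a standard closed form, sketched below for latent embeddings. The linear beta schedule is illustrative, and the supervised contrastive objective applied to the noisy views is omitted.

```python
import torch

def diffusion_augment(z0, t, alpha_bar):
    """Forward-diffusion 'augmentation' of latent embeddings.

    z0:        (batch, d) clean latent embeddings (e.g. from the VAE)
    t:         (batch,) integer diffusion steps
    alpha_bar: (T,) cumulative product of (1 - beta_t) for a noise schedule
    Returns noisy views z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps.
    """
    a = alpha_bar[t].unsqueeze(-1)
    eps = torch.randn_like(z0)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * eps

# Example linear schedule (illustrative, not the paper's):
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```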
[851] Aligning Inductive Bias for Data-Efficient Generalization in State Space Models
Qiyu Chen, Guozhang Chen
Main category: cs.LG
TL;DR: The paper introduces Task-Dependent Initialization (TDI) to improve data efficiency in State Space Models by aligning model inductive bias with task characteristics through power spectrum matching.
Details
Motivation: Large-scale models face data scarcity issues, and fixed inductive biases in State Space Models are inefficient when task structure doesn't match the model's prior assumptions.
Method: Formalizes SSM inductive bias via SSM-induced kernel analysis, then proposes TDI using power spectrum matching to align model frequency response with task spectral characteristics before training.
Result: Experiments on diverse real-world benchmarks show TDI significantly improves generalization and sample efficiency, especially in low-data regimes.
Conclusion: Provides theoretical and practical framework for creating more data-efficient models, addressing sustainable scaling challenges.
Abstract: The remarkable success of large-scale models is fundamentally tied to scaling laws, yet the finite nature of high-quality data presents a looming challenge. One of the next frontiers in modeling is data efficiency: the ability to learn more from less. A model’s inductive bias is a critical lever for this, but foundational sequence models like State Space Models (SSMs) rely on a fixed bias. This fixed prior is sample-inefficient when a task’s underlying structure does not match. In this work, we introduce a principled framework to solve this problem. We first formalize the inductive bias of linear time-invariant SSMs through an SSM-induced kernel, mathematically and empirically proving its spectrum is directly governed by the model’s frequency response. Further, we propose a method of Task-Dependent Initialization (TDI): power spectrum matching, a fast and efficient method that aligns the model’s inductive bias with the task’s spectral characteristics before large-scale training. Our experiments on a diverse set of real-world benchmarks show that TDI significantly improves generalization and sample efficiency, particularly in low-data regimes. This work provides a theoretical and practical tool to create more data-efficient models, a crucial step towards sustainable scaling.
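One plausible reading of "power spectrum matching", sketched below: estimate the task's dominant frequencies from training sequences, then place the poles of a diagonal SSM so its frequency response covers that band. The exact matching procedure is the paper's; everything here (function names, the decay constant, the top-k rule) is an assumption.

```python
import numpy as np

def dominant_frequencies(signals, sample_rate=1.0, top_k=16):
    """Estimate a task's dominant frequencies from its average power spectrum.

    signals: (n_series, length) array of training sequences.
    """
    spec = np.abs(np.fft.rfft(signals, axis=-1)) ** 2
    power = spec.mean(axis=0)                          # average power spectrum
    freqs = np.fft.rfftfreq(signals.shape[-1], d=1.0 / sample_rate)
    idx = np.argsort(power)[::-1][:top_k]
    return freqs[idx], power[idx]

def init_ssm_poles(freqs, decay=0.1):
    """Place diagonal-SSM poles at the task's dominant frequencies, so the
    model's frequency response covers the task's spectral mass."""
    return -decay + 2j * np.pi * np.asarray(freqs)     # continuous-time poles
```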
[852] FERD: Fairness-Enhanced Data-Free Robustness Distillation
Zhengxiao Li, Liming Lu, Xu Zheng, Siyuan Liang, Zhenghan Chen, Yongbin Zhou, Shuchao Pang
Main category: cs.LG
TL;DR: FERD is a fairness-enhanced data-free robustness distillation framework that addresses robust fairness issues by adjusting adversarial example proportions and distributions to improve worst-class robustness across categories.
Details
Motivation: Existing data-free robustness distillation methods overlook robust fairness issues, leading to severe disparity of robustness across different categories, with students showing different behavior across categories and unstable robustness across attack targets.
Method: FERD uses robustness-guided class reweighting to synthesize more samples for less robust categories, generates Fairness-Aware Examples (FAEs) with uniformity constraints on feature-level predictions, and constructs Uniform-Target Adversarial Examples (UTAEs) with uniform target class constraints.
Result: FERD achieves state-of-the-art worst-class robustness on three public datasets, improving worst-class robustness under FGSM and AutoAttack by 15.1% and 6.4% respectively using MobileNet-V2 on CIFAR-10.
Conclusion: FERD demonstrates superior performance in both robustness and fairness aspects by addressing robust fairness issues in data-free robustness distillation through balanced adversarial example generation.
Abstract: Data-Free Robustness Distillation (DFRD) aims to transfer the robustness from the teacher to the student without accessing the training data. While existing methods focus on overall robustness, they overlook robust fairness issues, leading to severe disparity of robustness across different categories. In this paper, we find two key problems: (1) the student model distilled with equal class-proportion data behaves significantly differently across distinct categories; and (2) the robustness of the student model is not stable across different attack targets. To bridge these gaps, we present the first Fairness-Enhanced data-free Robustness Distillation (FERD) framework to adjust the proportion and distribution of adversarial examples. For the proportion, FERD adopts a robustness-guided class reweighting strategy to synthesize more samples for the less robust categories, thereby improving their robustness. For the distribution, FERD generates complementary data samples for advanced robustness distillation. It generates Fairness-Aware Examples (FAEs) by enforcing a uniformity constraint on feature-level predictions, which suppresses the dominance of class-specific non-robust features, providing a more balanced representation across all categories. Then, FERD constructs Uniform-Target Adversarial Examples (UTAEs) from FAEs by applying a uniform target class constraint to avoid biased attack directions, which distributes the attack targets across all categories and prevents overfitting to specific vulnerable categories. Extensive experiments on three public datasets show that FERD achieves state-of-the-art worst-class robustness under all adversarial attacks (e.g., the worst-class robustness under FGSM and AutoAttack is improved by 15.1% and 6.4%, respectively, using MobileNet-V2 on CIFAR-10), demonstrating superior performance in both robustness and fairness aspects.
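A minimal sketch of robustness-guided class reweighting, assuming a softmax over per-class vulnerability; the paper's exact weighting scheme may differ.

```python
import numpy as np

def class_sampling_weights(per_class_robust_acc, temperature=1.0):
    """Synthesize more adversarial examples for less robust classes.

    per_class_robust_acc: (n_classes,) robust accuracy of the student per class.
    Returns sampling probabilities that upweight weak classes.
    """
    vulnerability = 1.0 - np.asarray(per_class_robust_acc)
    logits = vulnerability / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()

# A class at 20% robust accuracy is sampled far more often than one at 80%:
probs = class_sampling_weights([0.8, 0.2, 0.5])
```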
[853] FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu
Main category: cs.LG
TL;DR: The paper analyzes why visual jailbreaking attacks on multimodal LLMs have poor cross-model transferability, finding they reside in high-sharpness regions. It proposes the FORCE method to correct feature over-reliance and improve transferability.
Details
Motivation: Visual jailbreaking attacks can easily manipulate open-source MLLMs but fail to transfer to closed-source models, limiting vulnerability assessment capabilities.
Method: Analyzed the loss landscape and feature representations, then proposed the FORCE method, which guides attacks to explore broader feasible regions across layer features and rescales frequency feature influence based on semantic content.
Result: The FORCE method discovers flattened feasible regions for visual jailbreaking attacks, significantly improving cross-model transferability for red-teaming evaluations.
Conclusion: By eliminating non-generalizable reliance on layer and spectral features, the proposed approach enables more effective vulnerability assessment of closed-source MLLMs through improved attack transferability.
Abstract: The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
[854] Differential-Integral Neural Operator for Long-Term Turbulence Forecasting
Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu
Main category: cs.LG
TL;DR: DINO is a novel neural operator framework that decomposes turbulent dynamics into local differential and global integral operators, enabling stable long-term turbulence forecasting by suppressing error accumulation and maintaining physical fidelity.
Details
Motivation: Existing deep learning methods fail in long-term turbulence forecasting due to catastrophic error accumulation and inability to capture both local dissipative effects and global non-local interactions simultaneously.
Method: Proposes DINO framework with parallel branches: a constrained convolutional network for local differential operator (converging to derivative) and a Transformer for global integral operator (learning data-driven kernel).
Result: Significantly outperforms state-of-the-art models on 2D Kolmogorov flow benchmark, suppresses error accumulation over hundreds of timesteps, maintains high fidelity in vorticity fields and energy spectra.
Conclusion: DINO establishes a new benchmark for physically consistent, long-range turbulence forecasting through physics-based operator decomposition.
Abstract: Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the Differential-Integral Neural Operator (DINO), a novel framework designed from a first-principles approach of operator decomposition. DINO explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows DINO with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that DINO significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecasting.
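A zero-sum convolution stencil annihilates constants and acts like a finite-difference derivative (e.g. [-1, 0, 1] approximates a first derivative), which is one simple way to realize a constrained convolution that "converges to a derivative". The projection trick below is an illustrative reading of DINO's local branch, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class ZeroSumConv1d(nn.Module):
    """Convolution constrained so each kernel's taps sum to zero, making the
    layer behave like a local differential operator."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(channels, channels, kernel_size) * 0.1)

    def forward(self, x):
        # Project raw weights onto the zero-sum constraint set, then convolve.
        w = self.raw - self.raw.mean(dim=-1, keepdim=True)
        return nn.functional.conv1d(x, w, padding=self.raw.shape[-1] // 2)
```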
cs.MA
[855] Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen, Zhucun Xue, Jiangning Zhang, Yue Liao, Xiaobin Hu, Yu-Gang Jiang, Shuicheng Yan
Main category: cs.MA
TL;DR: The paper introduces ViF, a method to mitigate multi-agent visual hallucination snowballing in Visual Language Model systems by using visual flow and attention reallocation.
Details
Motivation: Multi-agent systems with Visual Language Models suffer from hallucination snowballing where visual errors propagate and amplify through textual communication between agents.
Method: ViF uses visual relay tokens and attention reallocation to preserve visual evidence across agent interactions, focusing on middle-layer vision tokens with unimodal attention peaks.
Result: The method significantly reduces hallucination snowballing and improves performance across eight benchmarks using four MAS structures and ten base models.
Conclusion: ViF effectively addresses visual hallucination propagation in multi-agent systems through visual flow communication and attention optimization.
Abstract: Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure mode, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by subsequent ones due to over-reliance on the textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experimental results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code will be available at: https://github.com/YU-deep/ViF.git.
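A rough sketch of relay-token selection as the abstract describes it: keep vision tokens whose attention mass peaks in the middle layers. The middle-layer band, the averaging convention, and the top-k rule are all assumptions.

```python
import torch

def select_relay_tokens(attn_per_layer, k=8):
    """Pick vision tokens whose attention mass peaks in middle layers.

    attn_per_layer: (n_layers, n_vision_tokens) attention each vision token
    receives, averaged over heads and query positions.
    """
    n_layers = attn_per_layer.shape[0]
    peak_layer = attn_per_layer.argmax(dim=0)                 # per-token peak
    mid = (peak_layer > n_layers // 4) & (peak_layer < 3 * n_layers // 4)
    scores = attn_per_layer.max(dim=0).values * mid.float()   # zero out others
    return scores.topk(k).indices
```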
[856] RobustFlow: Towards Robust Agentic Workflow Generation
Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, Min-Ling Zhang
Main category: cs.MA
TL;DR: The paper addresses the robustness issue in automated agentic workflow generation, showing that current methods produce inconsistent workflows for semantically identical but differently phrased instructions. The authors propose metrics to evaluate workflow consistency and introduce RobustFlow, a training framework that improves robustness scores to 70-90%.
Details
Motivation: Current agentic workflow generation methods are brittle and produce inconsistent results when given semantically identical but differently phrased instructions, which undermines their reliability for real-world applications.
Method: Proposed nodal and topological similarity metrics to evaluate workflow consistency, and developed RobustFlow, a training framework using preference optimization to teach models invariance to instruction variations by training on sets of synonymous task descriptions.
Result: RobustFlow significantly improves workflow robustness scores to 70-90%, representing a substantial improvement over existing approaches.
Conclusion: The proposed RobustFlow framework effectively addresses the robustness challenge in agentic workflow generation, making LLM-based workflow generation more reliable and trustworthy for practical applications.
Abstract: The automated generation of agentic workflows is a promising frontier for enabling large language models (LLMs) to solve complex tasks. However, our investigation reveals that the robustness of agentic workflow remains a critical, unaddressed challenge. Current methods often generate wildly inconsistent workflows when provided with instructions that are semantically identical but differently phrased. This brittleness severely undermines their reliability and trustworthiness for real-world applications. To quantitatively diagnose this instability, we propose metrics based on nodal and topological similarity to evaluate workflow consistency against common semantic variations such as paraphrasing and noise injection. Subsequently, we further propose a novel training framework, RobustFlow, that leverages preference optimization to teach models invariance to instruction variations. By training on sets of synonymous task descriptions, RobustFlow boosts workflow robustness scores to 70% - 90%, which is a substantial improvement over existing approaches. The code is publicly available at https://github.com/DEFENSE-SEU/RobustFlow.
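Treating a workflow as a labeled graph, nodal and topological similarity can be approximated with Jaccard overlaps over node and edge sets; the equal weighting below is an assumption, not the paper's exact metric.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def workflow_consistency(wf1, wf2):
    """Nodal + topological similarity of two workflows.

    Each workflow is (nodes, edges): nodes a set of agent/step labels,
    edges a set of (src, dst) pairs.
    """
    nodal = jaccard(wf1[0], wf2[0])
    topological = jaccard(wf1[1], wf2[1])
    return 0.5 * nodal + 0.5 * topological

# Two paraphrases of the same task should yield near-identical workflows:
wf_a = ({"search", "summarize", "review"},
        {("search", "summarize"), ("summarize", "review")})
wf_b = ({"search", "summarize"}, {("search", "summarize")})
score = workflow_consistency(wf_a, wf_b)
```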
[857] Multi-Agent Path Finding via Offline RL and LLM Collaboration
Merve Atasever, Matthew Hong, Mihir Nitin Kulkarni, Qingpei Li, Jyotirmoy V. Deshmukh
Main category: cs.MA
TL;DR: Proposes a decentralized MAPF framework using Decision Transformer with offline RL to reduce training time from weeks to hours, and integrates GPT-4o for dynamic environment adaptability.
Details
Motivation: Addresses challenges in decentralized MAPF including self-centered agent behaviors causing collisions, and long training times due to complex communication modules in traditional RL methods.
Method: Uses Decision Transformer with offline reinforcement learning for efficient decentralized planning, and integrates GPT-4o to dynamically guide agent policies in changing environments.
Result: Significantly reduces training duration from weeks to hours, handles long-horizon credit assignment, improves sparse/delayed reward performance, and enhances adaptability in dynamic environments.
Conclusion: The DT-based approach with GPT-4o augmentation effectively addresses MAPF challenges, improving both training efficiency and adaptability in static and dynamic environments.
Abstract: Multi-Agent Path Finding (MAPF) poses a significant and challenging problem critical for applications in robotics and logistics, particularly due to its combinatorial complexity and the partial observability inherent in realistic environments. Decentralized reinforcement learning methods commonly encounter two substantial difficulties: first, they often yield self-centered behaviors among agents, resulting in frequent collisions, and second, their reliance on complex communication modules leads to prolonged training times, sometimes spanning weeks. To address these challenges, we propose an efficient decentralized planning framework based on the Decision Transformer (DT), uniquely leveraging offline reinforcement learning to substantially reduce training durations from weeks to mere hours. Crucially, our approach effectively handles long-horizon credit assignment and significantly improves performance in scenarios with sparse and delayed rewards. Furthermore, to overcome adaptability limitations inherent in standard RL methods under dynamic environmental changes, we integrate a large language model (GPT-4o) to dynamically guide agent policies. Extensive experiments in both static and dynamically changing environments demonstrate that our DT-based approach, augmented briefly by GPT-4o, significantly enhances adaptability and performance.
[858] Impact of Collective Behaviors of Autonomous Vehicles on Urban Traffic Dynamics: A Multi-Agent Reinforcement Learning Approach
Ahmet Onur Akman, Anastasia Psarou, Zoltán György Varga, Grzegorz Jamróz, Rafał Kucharski
Main category: cs.MA
TL;DR: RL-enabled autonomous vehicles can optimize their travel times by up to 5% in mixed traffic environments, with varying impacts on human drivers depending on the AV behavior adopted.
Details
Motivation: To examine how reinforcement learning-enabled autonomous vehicles affect urban traffic flow in mixed traffic environments with different behavioral objectives.
Method: Used Deep Q-learning algorithm in a multi-agent setting where one-third of vehicles are converted to AVs with different behaviors (selfish, collaborative, competitive, social, altruistic, malicious) imposed through reward functions, simulated using PARCOUR framework.
Result: AVs achieved up to 5% travel time optimization, with self-serving behaviors consistently yielding shorter travel times than human drivers. Different behaviors showed varying complexity in learning tasks and impacts on human drivers.
Conclusion: Multi-agent RL is applicable for collective routing in traffic networks, but the impact on coexisting parties varies significantly with the adopted behaviors.
Abstract: This study examines the potential impact of reinforcement learning (RL)-enabled autonomous vehicles (AV) on urban traffic flow in a mixed traffic environment. We focus on a simplified day-to-day route choice problem in a multi-agent setting. We consider a city network where human drivers travel through their chosen routes to reach their destinations in minimum travel time. Then, we convert one-third of the population into AVs, which are RL agents employing Deep Q-learning algorithm. We define a set of optimization targets, or as we call them behaviors, namely selfish, collaborative, competitive, social, altruistic, and malicious. We impose a selected behavior on AVs through their rewards. We run our simulations using our in-house developed RL framework PARCOUR. Our simulations reveal that AVs optimize their travel times by up to 5%, with varying impacts on human drivers’ travel times depending on the AV behavior. In all cases where AVs adopt a self-serving behavior, they achieve shorter travel times than human drivers. Our findings highlight the complexity differences in learning tasks of each target behavior. We demonstrate that the multi-agent RL setting is applicable for collective routing on traffic networks, though their impact on coexisting parties greatly varies with the behaviors adopted.
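The behaviors are imposed purely through reward functions, so a toy encoding is straightforward. The weights below are illustrative readings of the behavior names, not the paper's definitions (in particular, the paper distinguishes collaborative, which concerns other AVs, from social, which concerns all drivers).

```python
def behavior_reward(own_time, others_mean_time, behavior):
    """Reward shaping (sketch) for imposing a behavior on an RL-driven AV.
    Rewards are negative travel times; weights are illustrative."""
    weights = {
        "selfish":       (1.0,  0.0),   # care only about own travel time
        "collaborative": (0.5,  0.5),   # weigh other AVs' time too
        "social":        (0.5,  0.5),   # weigh everyone's time
        "altruistic":    (0.0,  1.0),   # only others matter
        "competitive":   (1.0, -0.5),   # gain when others are slower
        "malicious":     (0.0, -1.0),   # maximize others' delay
    }
    w_own, w_others = weights[behavior]
    return -(w_own * own_time + w_others * others_mean_time)
```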
[859] VizGen: Data Exploration and Visualization from Natural Language via a Multi-Agent AI Architecture
Sandaru Fernando, Imasha Jayarathne, Sithumini Abeysekara, Shanuja Sithamparanthan, Thushari Silva, Deshan Jayawardana
Main category: cs.MA
TL;DR: VizGen is an AI-powered system that enables users to create data visualizations using natural language, translating queries into SQL and recommending graph types through a multi-agent architecture.
Details
Motivation: Traditional data visualization tools require technical expertise, limiting accessibility for non-technical users who want to interpret complex datasets.
Method: Leverages advanced NLP and LLMs (Claude 3.7 Sonnet, Gemini 2.0 Flash) with a multi-agent architecture that handles SQL generation, graph creation, customization, and insight extraction. Supports real-time interaction with SQL databases and conversational graph refinement.
Result: The system successfully translates natural language queries into visualizations, analyzes data for patterns and correlations, and provides contextual explanations by gathering information from the internet.
Conclusion: VizGen democratizes data visualization by bridging the gap between technical complexity and user-friendly design, making data analysis intuitive and accessible to non-technical users.
Abstract: Data visualization is essential for interpreting complex datasets, yet traditional tools often require technical expertise, limiting accessibility. VizGen is an AI-assisted graph generation system that empowers users to create meaningful visualizations using natural language. Leveraging advanced NLP and LLMs like Claude 3.7 Sonnet and Gemini 2.0 Flash, it translates user queries into SQL and recommends suitable graph types. Built on a multi-agent architecture, VizGen handles SQL generation, graph creation, customization, and insight extraction. Beyond visualization, it analyzes data for patterns, anomalies, and correlations, and enhances user understanding by providing explanations enriched with contextual information gathered from the internet. The system supports real-time interaction with SQL databases and allows conversational graph refinement, making data analysis intuitive and accessible. VizGen democratizes data visualization by bridging the gap between technical complexity and user-friendly design.
[860] Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives
Qixin Zhang, Yan Sun, Can Jin, Xikun Zhang, Yao Shu, Puning Zhao, Li Shen, Dacheng Tao
Main category: cs.MA
TL;DR: Two policy learning algorithms (MA-SPL and MA-MPL) for multi-agent online coordination with submodular and weakly submodular objectives, achieving optimal approximation guarantees with parameter-free capabilities.
Details
Motivation: To address the multi-agent online coordination problem with various submodular objectives and reduce reliance on unknown parameters in algorithm design.
Method: Proposed two algorithms: MA-SPL handles submodular, α-weakly DR-submodular, and (γ,β)-weakly submodular scenarios with optimal approximation guarantees; MA-MPL is parameter-free while maintaining same approximation ratio. Both use novel policy-based continuous extension technique.
Result: MA-SPL achieves optimal (1-c/e)-approximation for submodular objectives and handles weakly submodular scenarios. MA-MPL maintains same approximation ratio without requiring unknown parameters. Extensive simulations validate effectiveness.
Conclusion: The proposed algorithms effectively solve multi-agent online coordination problems with various submodular objectives, with MA-MPL providing parameter-free solution while maintaining performance guarantees.
Abstract: In this paper, we present two effective policy learning algorithms for the multi-agent online coordination (MA-OC) problem. The first one, MA-SPL, not only can achieve the optimal $(1-\frac{c}{e})$-approximation guarantee for the MA-OC problem with submodular objectives but also can handle the unexplored $\alpha$-weakly DR-submodular and $(\gamma,\beta)$-weakly submodular scenarios, where $c$ is the curvature of the investigated submodular functions, $\alpha$ denotes the diminishing-return (DR) ratio and the tuple $(\gamma,\beta)$ represents the submodularity ratios. Subsequently, in order to reduce the reliance on the unknown parameters $\alpha,\gamma,\beta$ inherent in the MA-SPL algorithm, we further introduce the second online algorithm named MA-MPL. This MA-MPL algorithm is entirely parameter-free and simultaneously can maintain the same approximation ratio as the first MA-SPL algorithm. The core of our MA-SPL and MA-MPL algorithms is a novel continuous-relaxation technique termed as policy-based continuous extension. Compared with the well-established multi-linear extension, a notable advantage of this new policy-based continuous extension is its ability to provide a lossless rounding scheme for any set function, thereby enabling us to tackle the challenging weakly submodular objectives. Finally, extensive simulations are conducted to validate the effectiveness of our proposed algorithms.
[861] Voting-Bloc Entropy: A New Metric for DAO Decentralization
Andrés Fábrega, Amy Zhao, Jay Yu, James Austgen, Sarah Allen, Kushal Babel, Mahimna Kelkar, Ari Juels
Main category: cs.MA
TL;DR: This paper proposes Voting-Bloc Entropy (VBE), a new framework for measuring DAO decentralization that models voters with aligned interests as centralizing forces, derived from first principles using reinforcement learning.
Details
Motivation: Existing definitions of decentralization in DAOs fail to capture key properties for diverse and equitable participation, requiring a more principled approach to measure decentralization.
Method: Developed VBE framework based on similarity of participants' utility functions across voting rounds, using a reinforcement learning-based conceptual model for voting that implies VBE.
Result: Proved theoretical results about (de)centralizing effects of vote delegation, proposal bundling, bribery, etc., and conducted empirical measurement studies and governance experiments using VBE.
Conclusion: VBE provides both theoretical and practical tools for enhancing DAO decentralization, with open-source artifacts made available for community use and future research.
Abstract: Decentralized Autonomous Organizations (DAOs) use smart contracts to foster communities working toward common goals. Existing definitions of decentralization, however – the ‘D’ in DAO – fall short of capturing the key properties characteristic of diverse and equitable participation. This work proposes a new framework for measuring DAO decentralization called Voting-Bloc Entropy (VBE, pronounced “vibe”). VBE is based on the idea that voters with closely aligned interests act as a centralizing force and should be modeled as such. VBE formalizes this notion by measuring the similarity of participants’ utility functions across a set of voting rounds. Unlike prior, ad hoc definitions of decentralization, VBE derives from first principles: We introduce a simple (yet powerful) reinforcement learning-based conceptual model for voting, that in turn implies VBE. We first show VBE’s utility as a theoretical tool. We prove a number of results about the (de)centralizing effects of vote delegation, proposal bundling, bribery, etc. that are overlooked in previous notions of DAO decentralization. Our results lead to practical suggestions for enhancing DAO decentralization. We also show how VBE can be used empirically by presenting measurement studies and VBE-based governance experiments. We make the tools we developed for these results available to the community in the form of open-source artifacts in order to facilitate future study of DAO decentralization.
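To make the idea concrete, here is a toy VBE-style computation: voters with nearly parallel utility vectors across rounds are merged into blocs, and decentralization is the entropy of the blocs' voting power. The greedy cosine clustering and threshold are illustrative stand-ins for the paper's actual construction.

```python
import numpy as np

def voting_bloc_entropy(utilities, voting_power, sim_threshold=0.95):
    """Toy VBE-style score: entropy over blocs of aligned voters.

    utilities:    (n_voters, n_rounds) utility of each voter per voting round.
    voting_power: (n_voters,) token/vote weight of each voter.
    """
    u = utilities / (np.linalg.norm(utilities, axis=1, keepdims=True) + 1e-12)
    blocs = []  # each bloc: [representative direction, accumulated power]
    for vec, power in zip(u, voting_power):
        for bloc in blocs:
            if vec @ bloc[0] >= sim_threshold:   # aligned interests: same bloc
                bloc[1] += power
                break
        else:
            blocs.append([vec, power])
    p = np.array([b[1] for b in blocs], dtype=float)
    p /= p.sum()
    return -(p * np.log(p)).sum()   # higher entropy = more decentralized
```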
[862] Neural Orchestration for Multi-Agent Systems: A Deep Learning Framework for Optimal Agent Selection in Multi-Domain Task Environments
Kushagra Agrawal, Nisharg Nargund
Main category: cs.MA
TL;DR: MetaOrch is a neural orchestration framework that uses supervised learning and fuzzy evaluation to dynamically select optimal agents for multi-domain tasks, achieving 86.3% selection accuracy.
Details
Motivation: Traditional multi-agent systems have rigid coordination mechanisms and struggle to adapt to dynamic tasks, requiring a more flexible and intelligent approach to agent selection.
Method: Uses supervised learning to model task context, agent histories, and expected response quality, with a fuzzy evaluation module that scores agent responses on completeness, relevance, and confidence dimensions.
Result: Achieved 86.3% selection accuracy in simulated environments, significantly outperforming baseline strategies like random selection and round-robin scheduling.
Conclusion: Neural orchestration provides a powerful approach to enhance autonomy, interpretability, and adaptability of multi-agent systems across diverse task domains.
Abstract: Multi-agent systems (MAS) are foundational in simulating complex real-world scenarios involving autonomous, interacting entities. However, traditional MAS architectures often suffer from rigid coordination mechanisms and difficulty adapting to dynamic tasks. We propose MetaOrch, a neural orchestration framework for optimal agent selection in multi-domain task environments. Our system implements a supervised learning approach that models task context, agent histories, and expected response quality to select the most appropriate agent for each task. A novel fuzzy evaluation module scores agent responses along completeness, relevance, and confidence dimensions, generating soft supervision labels for training the orchestrator. Unlike previous methods that hard-code agent-task mappings, MetaOrch dynamically predicts the most suitable agent while estimating selection confidence. Experiments in simulated environments with heterogeneous agents demonstrate that our approach achieves 86.3% selection accuracy, significantly outperforming baseline strategies including random selection and round-robin scheduling. The modular architecture emphasizes extensibility, allowing agents to be registered, updated, and queried independently. Results suggest that neural orchestration offers a powerful approach to enhancing the autonomy, interpretability, and adaptability of multi-agent systems across diverse task domains.
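The fuzzy evaluation module scores responses along three named dimensions; a minimal weighted-blend version is sketched below, with illustrative weights.

```python
def fuzzy_response_score(completeness, relevance, confidence,
                         weights=(0.4, 0.4, 0.2)):
    """Soft supervision label for the orchestrator: a weighted blend of the
    three evaluation dimensions named in the paper. Each input is in [0, 1];
    the weights are illustrative, not the paper's."""
    w_c, w_r, w_f = weights
    return w_c * completeness + w_r * relevance + w_f * confidence

# The orchestrator is then trained to predict these soft labels per
# (task, agent) pair and picks the argmax agent at inference time.
```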
[863] Constructive Conflict-Driven Multi-Agent Reinforcement Learning for Strategic Diversity
Yuxiang Mai, Qiyue Yin, Wancheng Ni, Pei Xu, Kaiqi Huang
Main category: cs.MA
TL;DR: CoDiCon introduces competitive incentives in cooperative MARL to foster strategic diversity through constructive conflict, using a centralized intrinsic reward mechanism that balances competition and cooperation.
Details
Motivation: Existing MARL diversity methods focus on individual agent characteristics but neglect agent interplay and mutual influence during policy formation, creating a gap in leveraging competitive dynamics for strategic diversity.
Method: Proposes Competitive Diversity through Constructive Conflict (CoDiCon) with intrinsic reward mechanism using ranking features, centralized reward module for varying reward distribution, and reformulated bilevel optimization to align with task objectives.
Result: CoDiCon achieves superior performance against state-of-the-art methods in SMAC and GRF environments, with competitive intrinsic rewards effectively promoting diverse and adaptive strategies among cooperative agents.
Conclusion: Incorporating competitive incentives through constructive conflict successfully enhances strategic diversity in cooperative MARL, demonstrating the value of balancing competition and cooperation for improved agent performance.
Abstract: In recent years, diversity has emerged as a useful mechanism to enhance the efficiency of multi-agent reinforcement learning (MARL). However, existing methods predominantly focus on designing policies based on individual agent characteristics, often neglecting the interplay and mutual influence among agents during policy formation. To address this gap, we propose Competitive Diversity through Constructive Conflict (CoDiCon), a novel approach that incorporates competitive incentives into cooperative scenarios to encourage policy exchange and foster strategic diversity among agents. Drawing inspiration from sociological research, which highlights the benefits of moderate competition and constructive conflict in group decision-making, we design an intrinsic reward mechanism using ranking features to introduce competitive motivations. A centralized intrinsic reward module generates and distributes varying reward values to agents, ensuring an effective balance between competition and cooperation. By optimizing the parameterized centralized reward module to maximize environmental rewards, we reformulate the constrained bilevel optimization problem to align with the original task objectives. We evaluate our algorithm against state-of-the-art methods in the SMAC and GRF environments. Experimental results demonstrate that CoDiCon achieves superior performance, with competitive intrinsic rewards effectively promoting diverse and adaptive strategies among cooperative agents.
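A toy version of a ranking-based intrinsic reward: rank agents by a per-episode score and hand out zero-mean bonuses by rank, keeping the competitive signal moderate. The zero-mean centering and scale are assumptions; in CoDiCon the centralized reward module is parameterized and optimized via the bilevel objective, not fixed like this.

```python
import numpy as np

def ranking_intrinsic_rewards(agent_scores, scale=0.1):
    """Competitive intrinsic reward (sketch) from per-episode agent scores.

    Returns a zero-mean bonus per agent, increasing with rank, so competition
    nudges policies apart without overwhelming the environmental reward."""
    ranks = np.argsort(np.argsort(agent_scores))        # 0 = worst score
    centered = ranks - ranks.mean()
    return scale * centered / max(ranks.max(), 1)
```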
cs.MM
[864] Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization
Songjun Tu, Qichao Zhang, Jingbo Sun, Yuqian Fu, Linjing Li, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Dongbin Zhao
Main category: cs.MM
TL;DR: CapPO is a reinforcement learning framework that addresses perception-induced errors in multimodal models by enforcing perceptual consistency through caption-based regularization and adaptive advantage estimation.
Details
Motivation: Multimodal LLMs suffer from perception-induced errors that propagate through reasoning chains, and current RL fine-tuning methods fail to address the misalignment between visual grounding and reasoning.
Method: Caption-Regularized Policy Optimization (CapPO) uses: 1) caption-based consistency regularization to minimize divergence between responses from raw images vs captions, and 2) KL-weighted advantage estimation to scale reinforcement signals for perceptually consistent trajectories.
Result: CapPO achieves +6.0% accuracy on math tasks and +2.4% on general reasoning tasks over Qwen2.5-VL-7B, with significant reduction in perception-related mistakes compared to baselines.
Conclusion: CapPO provides a simple yet effective framework for improving multimodal reasoning by explicitly enforcing perceptual consistency during policy optimization.
Abstract: While multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning, their performance is often undermined by a critical vulnerability: perception-induced errors that propagate through the reasoning chain. Current reinforcement learning (RL) fine-tuning methods, while enhancing reasoning abilities, largely fail to address the underlying misalignment between visual grounding and the subsequent reasoning process. To address this challenge, we propose Caption-Regularized Policy Optimization (CapPO), a novel RL framework that explicitly enforces perceptual consistency during policy optimization. CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, thereby anchoring reasoning to semantically faithful visual content; and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories while suppressing spurious correlations. Extensive experiments on five math-focused and five general reasoning benchmarks demonstrate that CapPO achieves competitive performance, yielding gains of +6.0% accuracy on math-related tasks and +2.4% on general reasoning tasks over the base Qwen2.5-VL-7B model. Moreover, ablation studies further confirm the effectiveness of each component, while error analysis reveals that CapPO significantly reduces perception-related mistakes compared with baselines. Overall, CapPO provides a simple yet effective framework for improving multimodal reasoning.
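The caption-based consistency term can be sketched as a KL divergence between the policy's token distributions under the two conditionings; the token-level mean KL below is an assumed form, not necessarily the paper's exact regularizer.

```python
import torch
import torch.nn.functional as F

def caption_consistency_kl(logits_image, logits_caption):
    """Caption-based consistency regularizer (sketch): KL between the policy's
    next-token distributions conditioned on the raw image vs. on its caption.

    logits_*: (batch, seq, vocab) logits from the same model under the two
    conditionings.
    """
    log_p_img = F.log_softmax(logits_image, dim=-1)
    p_cap = F.softmax(logits_caption, dim=-1)
    # KL(p_caption || p_image), averaged over tokens
    kl = (p_cap * (p_cap.clamp_min(1e-12).log() - log_p_img)).sum(-1)
    return kl.mean()
```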
[865] Small Stickers, Big Meanings: A Multilingual Sticker Semantic Understanding Dataset with a Gamified Approach
Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang
Main category: cs.MM
TL;DR: The paper introduces Sticktionary, a gamified framework for collecting high-quality sticker queries, and presents StickerQueries, a multilingual dataset that improves sticker retrieval through better query generation.
Details
Motivation: Sticker retrieval is underexplored due to challenges in creating high-quality query datasets, and current LLMs struggle with the nuanced nature of sticker query generation.
Method: Proposed a threefold solution: 1) Sticktionary gamified annotation framework, 2) StickerQueries multilingual dataset with 1,115 English and 615 Chinese queries from 60+ contributors, 3) Fine-tuned query generation models.
Result: The approach significantly enhances query generation quality, retrieval accuracy, and semantic understanding in the sticker domain through extensive quantitative and qualitative evaluation.
Conclusion: The publicly released multilingual dataset and fine-tuned models support future research in sticker retrieval, addressing the limitations of current LLMs in this domain.
Abstract: Stickers, though small, are a highly condensed form of visual expression, ubiquitous across messaging platforms and embraced by diverse cultures, genders, and age groups. Despite their popularity, sticker retrieval remains an underexplored task due to the significant human effort and subjectivity involved in constructing high-quality sticker query datasets. Although large language models (LLMs) excel at general NLP tasks, they falter when confronted with the nuanced, intangible, and highly specific nature of sticker query generation. To address this challenge, we propose a threefold solution. First, we introduce Sticktionary, a gamified annotation framework designed to gather diverse, high-quality, and contextually resonant sticker queries. Second, we present StickerQueries, a multilingual sticker query dataset containing 1,115 English and 615 Chinese queries, annotated by over 60 contributors across 60+ hours. Lastly, through extensive quantitative and qualitative evaluation, we demonstrate that our approach significantly enhances query generation quality, retrieval accuracy, and semantic understanding in the sticker domain. To support future research, we publicly release our multilingual dataset along with two fine-tuned query generation models.
[866] MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions
Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
Main category: cs.MM
TL;DR: MultiVox is the first benchmark for evaluating voice assistants’ ability to integrate spoken and visual cues with paralinguistic speech features for multimodal understanding.
Details
Motivation: Current benchmarks fail to comprehensively evaluate how well omni models generate context-aware responses by understanding fine-grained speech characteristics and aligning paralinguistic cues with visual signals.
Method: Created MultiVox benchmark with 1000 human-annotated and recorded speech dialogues encompassing diverse paralinguistic features and visual cues (images and videos), then evaluated 10 state-of-the-art models.
Result: Evaluation revealed that while humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
Conclusion: There is a significant gap between human performance and current model capabilities in integrating multimodal cues for context-aware responses, highlighting the need for improved multimodal understanding in voice assistants.
Abstract: The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
eess.AS
[867] Toward a Realistic Encoding Model of Auditory Affective Understanding in the Brain
Guandong Pan, Yaqian Yang, Shi Chen, Xin Wang, Longzhao Liu, Hongwei Zheng, Shaoting Tang
Main category: eess.AS
TL;DR: A computational framework models how the brain encodes naturalistic auditory inputs into emotional responses, revealing that high-level semantic features dominate emotion encoding and outperform low-level acoustic features.
Details
Motivation: Understanding how complex auditory stimuli drive emotion arousal dynamics remains unresolved in affective neuroscience and emotion-aware AI.
Method: Decompose audio into multilevel auditory features (through classical algorithms and wav2vec 2.0/Hubert) from the original and isolated human voice/background soundtrack elements, mapping them to emotion-related responses via cross-dataset analyses across SEED, LIRIS, and self-collected BAVE datasets.
Result: High-level semantic representations dominate emotion encoding, outperforming low-level features. Middle layers of wav2vec 2.0/Hubert surpass final layers in emotion induction. Human voices and soundtracks show dataset-dependent emotion-evoking biases aligned with stimulus energy distribution.
Conclusion: This work uncovers hierarchical mechanisms of auditory-emotion encoding, providing a foundation for adaptive emotion-aware systems and cross-disciplinary explorations of audio-affective interactions.
Abstract: In affective neuroscience and emotion-aware AI, understanding how complex auditory stimuli drive emotion arousal dynamics remains unresolved. This study introduces a computational framework to model the brain’s encoding of naturalistic auditory inputs into dynamic behavioral/neural responses across three datasets (SEED, LIRIS, self-collected BAVE). Guided by neurobiological principles of parallel auditory hierarchy, we decompose audio into multilevel auditory features (through classical algorithms and wav2vec 2.0/Hubert) from the original and isolated human voice/background soundtrack elements, mapping them to emotion-related responses via cross-dataset analyses. Our analysis reveals that high-level semantic representations (derived from the final layer of wav2vec 2.0/Hubert) exert a dominant role in emotion encoding, outperforming low-level acoustic features with significantly stronger mappings to behavioral annotations and dynamic neural synchrony across most brain regions ($p < 0.05$). Notably, middle layers of wav2vec 2.0/Hubert (balancing acoustic-semantic information) surpass the final layers in emotion induction across datasets. Moreover, human voices and soundtracks show dataset-dependent emotion-evoking biases aligned with stimulus energy distribution (e.g., LIRIS favors soundtracks due to higher background energy), with neural analyses indicating voices dominate prefrontal/temporal activity while soundtracks excel in limbic regions. By integrating affective computing and neuroscience, this work uncovers hierarchical mechanisms of auditory-emotion encoding, providing a foundation for adaptive emotion-aware systems and cross-disciplinary explorations of audio-affective interactions.
[868] Multi-Speaker DOA Estimation in Binaural Hearing Aids using Deep Learning and Speaker Count Fusion
Farnaz Jazaeri, Homayoun Kamkar-Parsi, François Grondin, Martin Bouchard
Main category: eess.AS
TL;DR: Adding source-count information improves DOA estimation for binaural hearing aids, with late fusion of ground-truth source count achieving 14% higher F1-scores than baseline CRNN.
Details
Motivation: Direction-of-arrival (DOA) estimation is crucial for binaural hearing aids in noisy multi-speaker environments, and source-count information could enhance this capability.
Method: Used dual-task training with joint multi-sources DOA estimation and source counting, and integrated source count as auxiliary feature in CRNN architecture through early, mid, and late fusion strategies.
Result: Dual-task training didn’t improve DOA estimation but helped source-count prediction. Ground-truth source count as auxiliary feature significantly enhanced DOA estimation, with late fusion yielding up to 14% higher average F1-scores.
Conclusion: Source-count estimation has potential for robust DOA estimation in binaural hearing aids, particularly when used as auxiliary feature rather than through dual-task training.
Abstract: For extracting a target speaker voice, direction-of-arrival (DOA) estimation is crucial for binaural hearing aids operating in noisy, multi-speaker environments. Among the solutions developed for this task, a deep learning convolutional recurrent neural network (CRNN) model leveraging spectral phase differences and magnitude ratios between microphone signals is a popular option. In this paper, we explore adding source-count information for multi-sources DOA estimation. The use of dual-task training with joint multi-sources DOA estimation and source counting is first considered. We then consider using the source count as an auxiliary feature in a standalone DOA estimation system, where the number of active sources (0, 1, or 2+) is integrated into the CRNN architecture through early, mid, and late fusion strategies. Experiments using real binaural recordings are performed. Results show that the dual-task training does not improve DOA estimation performance, although it benefits source-count prediction. However, a ground-truth (oracle) source count used as an auxiliary feature significantly enhances standalone DOA estimation performance, with late fusion yielding up to 14% higher average F1-scores over the baseline CRNN. This highlights the potential of using source-count estimation for robust DOA estimation in binaural hearing aids.
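Of the three fusion strategies, the best-performing late fusion is the simplest to sketch: concatenate a one-hot source count (0, 1, 2+) with the CRNN's final features before the DOA output layer. Layer sizes and the azimuth grid below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Late fusion (sketch): append a one-hot source count to the CRNN's
    final features just before the DOA classification layer."""
    def __init__(self, feat_dim=256, n_doa_classes=72, n_count_classes=3):
        super().__init__()
        self.n_count = n_count_classes
        self.fc = nn.Linear(feat_dim + n_count_classes, n_doa_classes)

    def forward(self, crnn_features, source_count):
        # source_count: (batch,) long tensor with values in {0, 1, 2}
        count = nn.functional.one_hot(source_count, self.n_count).float()
        return self.fc(torch.cat([crnn_features, count], dim=-1))
```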
[869] ARTI-6: Towards Six-dimensional Articulatory Speech Encoding
Jihwan Lee, Sean Foley, Thanathai Lertpetchpun, Kevin Huang, Yoonjeong Lee, Tiantian Feng, Louis Goldstein, Dani Byrd, Shrikanth Narayanan
Main category: eess.AS
TL;DR: ARTI-6 is a compact 6D articulatory speech encoding framework derived from real-time MRI data that captures key vocal tract regions for interpretable and efficient speech inversion and synthesis.
Details
Motivation: To create an interpretable, computationally efficient, and physiologically grounded framework for articulatory speech processing that captures crucial vocal tract regions including velum, tongue root, and larynx.Method: Three-component framework: (1) 6D articulatory feature set from real-time MRI data, (2) articulatory inversion model using speech foundation models to predict features from acoustics, (3) articulatory synthesis model to reconstruct speech from features.
Result: Achieved prediction correlation of 0.87 for articulatory inversion and demonstrated that low-dimensional representation can generate natural-sounding, intelligible speech.
Conclusion: ARTI-6 provides an effective framework for advancing articulatory inversion, synthesis, and broader speech technology applications with interpretable and physiologically grounded representations.
Abstract: We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which predicts articulatory features from speech acoustics leveraging speech foundation models, achieving a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, ARTI-6 provides an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
[870] Enhanced Generative Machine Listener
Vishnu Raj, Gouthaman KV, Shiv Gehlot, Lars Villemoes, Arijit Biswas
Main category: eess.AS
TL;DR: GMLv2 is a reference-based model that predicts subjective audio quality using MUSHRA scores, featuring a Beta distribution-based loss and enhanced generalization through neural audio coding datasets.
Details
Motivation: To provide a scalable and automated framework for perceptual audio quality evaluation that outperforms existing metrics like PEAQ and ViSQOL, accelerating research in audio coding technologies.Method: Uses a Beta distribution-based loss to model listener ratings and incorporates additional neural audio coding subjective datasets to improve generalization.
Result: GMLv2 consistently outperforms PEAQ and ViSQOL in correlation with subjective scores and reliably predicts scores across diverse content types and codec configurations.
Conclusion: GMLv2 offers an effective and scalable solution for automated perceptual audio quality evaluation, supporting faster development in modern audio coding technologies.
Abstract: We present GMLv2, a reference-based model designed for the prediction of subjective audio quality as measured by MUSHRA scores. GMLv2 introduces a Beta distribution-based loss to model the listener ratings and incorporates additional neural audio coding (NAC) subjective datasets to extend its generalization and applicability. Extensive evaluations on a diverse test set demonstrate that the proposed GMLv2 consistently outperforms widely used metrics, such as PEAQ and ViSQOL, both in terms of correlation with subjective scores and in reliably predicting these scores across diverse content types and codec configurations. Consequently, GMLv2 offers a scalable and automated framework for perceptual audio quality evaluation, poised to accelerate research and development in modern audio coding technologies.
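The Beta-likelihood idea can be sketched in a few lines: a quality head predicts two positive concentration parameters, and the loss is the negative log-likelihood of the normalized MUSHRA score. The parameterization and score normalization below are assumptions for illustration, not GMLv2's published recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaQualityHead(nn.Module):
    """Predicts a Beta distribution over normalized MUSHRA scores."""

    def __init__(self, d_in=256):
        super().__init__()
        self.proj = nn.Linear(d_in, 2)

    def forward(self, h):
        # softplus keeps both concentration parameters positive
        ab = F.softplus(self.proj(h)) + 1e-4
        return torch.distributions.Beta(ab[..., 0], ab[..., 1])

def beta_nll(dist, mushra):
    # map scores from [0, 100] into the open interval (0, 1)
    y = (mushra / 100.0).clamp(1e-4, 1 - 1e-4)
    return -dist.log_prob(y).mean()

head = BetaQualityHead()
h = torch.randn(8, 256)            # embeddings from the audio encoder
scores = torch.rand(8) * 100       # listener MUSHRA ratings
loss = beta_nll(head(h), scores)
loss.backward()
```

Modeling the full rating distribution rather than a point estimate lets the model express listener disagreement, which a plain MSE regression cannot.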
[871] AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk
Main category: eess.AS
TL;DR: This paper presents AUDDT, an open-source benchmarking toolkit for systematically evaluating audio deepfake detectors across 28 datasets, revealing generalization issues in current models.
Details
Motivation: Most audio deepfake detection models are evaluated on narrow datasets, leaving their real-world generalization uncertain. There's a need for systematic benchmarking across diverse datasets.Method: Developed AUDDT toolkit to automate evaluation of pretrained detectors across 28 audio deepfake datasets, analyzing in-domain and out-of-domain performance across different deepfake subgroups and manipulation types.
Result: Revealed notable performance differences across conditions and audio manipulation types, showing that current detectors struggle with generalization to real-world deployment scenarios.
Conclusion: Existing audio deepfake datasets have limitations and gaps relative to practical deployment, highlighting the need for more comprehensive benchmarking and improved dataset diversity.
Abstract: With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real-world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.
[872] HuLA: Prosody-Aware Anti-Spoofing with Multi-Task Learning for Expressive and Emotional Synthetic Speech
Aurosweta Mahapatra, Ismail Rasim Ulgen, Berrak Sisman
Main category: eess.AS
TL;DR: HuLA is a two-stage prosody-aware framework for synthetic speech detection that uses F0 prediction and voiced/unvoiced classification to improve robustness against expressive and emotional spoofing attacks.
Details
Motivation: Current anti-spoofing systems are vulnerable to expressive synthetic speech because they don't leverage prosodic cues like F0 patterns and voiced/unvoiced structure that humans naturally use to distinguish real from synthetic speech.Method: Two-stage multi-task learning: Stage 1 trains SSL backbone on real speech with F0 prediction and voiced/unvoiced classification; Stage 2 jointly optimizes for spoof detection and prosody tasks on both real and synthetic data.
Result: HuLA consistently outperforms strong baselines on challenging out-of-domain datasets including expressive, emotional, and cross-lingual attacks.
Conclusion: Explicit prosodic supervision combined with SSL embeddings substantially improves robustness against advanced synthetic speech attacks.
Abstract: Current anti-spoofing systems remain vulnerable to expressive and emotional synthetic speech, since they rarely leverage prosody as a discriminative cue. Prosody is central to human expressiveness and emotion, and humans instinctively use prosodic cues such as F0 patterns and voiced/unvoiced structure to distinguish natural from synthetic speech. In this paper, we propose HuLA, a two-stage prosody-aware multi-task learning framework for spoof detection. In Stage 1, a self-supervised learning (SSL) backbone is trained on real speech with auxiliary tasks of F0 prediction and voiced/unvoiced classification, enhancing its ability to capture natural prosodic variation similar to human perceptual learning. In Stage 2, the model is jointly optimized for spoof detection and prosody tasks on both real and synthetic data, leveraging prosodic awareness to detect mismatches between natural and expressive synthetic speech. Experiments show that HuLA consistently outperforms strong baselines on challenging out-of-domain datasets, including expressive, emotional, and cross-lingual attacks. These results demonstrate that explicit prosodic supervision, combined with SSL embeddings, substantially improves robustness against advanced synthetic speech attacks.
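A minimal sketch of the Stage 2 multi-task objective, assuming an L1 F0 regression restricted to voiced frames and placeholder task weights (the paper's actual losses and weights may differ):

```python
import torch
import torch.nn.functional as F

def hula_stage2_loss(spoof_logits, spoof_labels,
                     f0_pred, f0_target, vuv_logits, vuv_target,
                     w_f0=0.1, w_vuv=0.1):
    """Joint spoof-detection + prosody loss; weights are placeholders."""
    l_spoof = F.cross_entropy(spoof_logits, spoof_labels)
    voiced = vuv_target.bool()
    # F0 regression only where the frame is actually voiced
    l_f0 = (F.l1_loss(f0_pred[voiced], f0_target[voiced])
            if voiced.any() else f0_pred.sum() * 0)
    l_vuv = F.binary_cross_entropy_with_logits(vuv_logits, vuv_target.float())
    return l_spoof + w_f0 * l_f0 + w_vuv * l_vuv

# toy batch: 4 utterances, 100 frames each
spoof_logits = torch.randn(4, 2, requires_grad=True)
spoof_labels = torch.randint(0, 2, (4,))
f0_pred = torch.randn(4, 100, requires_grad=True)
f0_target = torch.rand(4, 100) * 300          # Hz
vuv_logits = torch.randn(4, 100, requires_grad=True)
vuv_target = torch.randint(0, 2, (4, 100))
print(hula_stage2_loss(spoof_logits, spoof_labels, f0_pred, f0_target,
                       vuv_logits, vuv_target))
```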
[873] FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement
Sunghwan Ahn, Jinmo Han, Beom Jun Woo, Nam Soo Kim
Main category: eess.AS
TL;DR: FastEnhancer is a streaming neural speech enhancement model designed to minimize real-world latency while maintaining state-of-the-art performance.
Details
Motivation: Current deep neural network-based speech enhancement models achieve good performance but have high computational demands and processing latency, making them unsuitable for real-time applications like online meetings and hearing aids.Method: Proposes FastEnhancer with a simple encoder-decoder structure using efficient RNNFormer blocks, specifically designed for streaming speech enhancement with minimal latency.
Result: FastEnhancer achieves state-of-the-art speech quality and intelligibility while demonstrating the fastest processing speed on a single CPU thread across various objective metrics.
Conclusion: The proposed FastEnhancer successfully balances high performance with low latency, making it suitable for real-time streaming applications.
Abstract: Streaming speech enhancement is a crucial task for real-time applications such as online meetings, smart home appliances, and hearing aids. Deep neural network-based approaches achieve exceptional performance while demanding substantial computational resources. Although recent neural speech enhancement models have succeeded in reducing the number of parameters and multiply-accumulate operations, their sophisticated architectures often introduce significant processing latency on common hardware. In this work, we propose FastEnhancer, a streaming neural speech enhancement model designed explicitly to minimize real-world latency. It features a simple encoder-decoder structure with efficient RNNFormer blocks. Evaluations on various objective metrics show that FastEnhancer achieves state-of-the-art speech quality and intelligibility while simultaneously demonstrating the fastest processing speed on a single CPU thread. Code and pre-trained weights are publicly available (https://github.com/aask1357/fastenhancer).
[874] IPDnet2: an efficient and improved inter-channel phase difference estimation network for sound source localization
Yabo Wang, Bing Yang, Xiaofei Li
Main category: eess.AS
TL;DR: IPDnet2 improves upon IPDnet by using oSpatialNet backbone for better spatial cues extraction and adding frequency-time pooling to reduce computation by over 98% while maintaining comparable localization performance.
Details
Motivation: IPDnet had high computational complexity from independent narrow-band processing, and the limited scalability of its LSTM layers constrained localization accuracy.Method: Extended IPDnet to IPDnet2 using the oSpatialNet backbone for enhanced spatial cue extraction and scalability, plus a frequency-time pooling mechanism to compress frequency/time resolutions and reduce computation.
Result: IPDnet2 achieves comparable localization performance to IPDnet with less than 2% of its computation cost, and achieves state-of-the-art SSL performance with scalable model size while maintaining low complexity.
Conclusion: IPDnet2 successfully addresses IPDnet’s limitations by improving both localization accuracy and efficiency through better architecture design and computational optimization.
Abstract: IPDnet is our recently proposed real-time sound source localization network. It employs alternating full-band and narrow-band (B)LSTMs to learn the full-band correlation and narrow-band extraction of DP-IPD, respectively, which achieves superior performance. However, processing each narrow band independently incurs high computational complexity, and the limited scalability of LSTM layers constrains localization accuracy. In this work, we extend IPDnet to IPDnet2, improving both localization accuracy and efficiency. IPDnet2 adopts oSpatialNet as the backbone to enhance spatial cue extraction and provide superior scalability. Additionally, a simple yet effective frequency-time pooling mechanism is proposed to compress frequency and time resolutions, reducing computational cost without losing localization capability. Experimental results show that IPDnet2 achieves localization performance comparable to IPDnet while requiring less than 2% of its computation cost. Moreover, the proposed network achieves state-of-the-art SSL performance by scaling up the model size while still maintaining relatively low complexity.
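The frequency-time pooling idea is straightforward to sketch: pool the feature map before the expensive narrow-band blocks and upsample afterwards, so those blocks see far fewer time-frequency positions. The pool factors and nearest-neighbour unpooling below are illustrative choices, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FreqTimePool(nn.Module):
    """Compress frequency/time resolution around an expensive sub-network,
    then restore it afterwards. Pool factors are illustrative."""

    def __init__(self, pool_f=4, pool_t=2):
        super().__init__()
        self.pool = nn.AvgPool2d((pool_t, pool_f))
        self.unpool = nn.Upsample(scale_factor=(pool_t, pool_f), mode="nearest")

    def forward(self, x, body):
        # x: (B, C, T, F) spatial-cue features
        h = self.pool(x)       # pool_t * pool_f fewer positions to process
        h = body(h)            # stand-in for the narrow-band network
        return self.unpool(h)

pool = FreqTimePool()
x = torch.randn(2, 16, 64, 256)
y = pool(x, nn.Conv2d(16, 16, 3, padding=1))
print(y.shape)  # torch.Size([2, 16, 64, 256])
```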
[875] AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook
Yushen Chen, Kai Hu, Long Zhou, Shulin Feng, Xusheng Yang, Hangting Chen, Xie Chen
Main category: eess.AS
TL;DR: AUV is a unified neural audio codec with a single codebook that achieves high-quality reconstruction of speech and general audio at 700 bps using nested domain-specific partitions and teacher distillation in single-stage training.
Details
Motivation: To create a universal audio codec capable of handling both speech and general audio (vocal, music, sound) with a single codebook, overcoming limitations of domain-specific codecs.Method: Uses a conformer-style encoder-decoder with STFT features, matryoshka codebook with nested domain-specific partitions, and teacher distillation in single-stage training.
Result: Achieves comparable audio reconstruction quality to state-of-the-art domain-specific single-layer quantizer codecs at 700 bps for 16kHz mixed-domain audio.
Conclusion: AUV demonstrates the potential of audio universal vector quantization with a single codebook, providing a unified solution for diverse audio types.
Abstract: We propose AUV, a unified neural audio codec with a single codebook, which enables favourable reconstruction of speech and further extends to general audio, including vocal, music, and sound. AUV is capable of tackling any 16 kHz mixed-domain audio segment at bit rates around 700 bps. To accomplish this, we guide the matryoshka codebook with nested domain-specific partitions, each assigned a corresponding teacher model to perform distillation, all in a single training stage. A conformer-style encoder-decoder architecture with STFT features as the audio representation is employed, yielding better audio quality. Comprehensive evaluations demonstrate that AUV exhibits audio reconstruction ability comparable to state-of-the-art domain-specific single-layer quantizer codecs, showcasing the potential of audio universal vector quantization with a single codebook. The pre-trained model and demo samples are available at https://swivid.github.io/AUV/.
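One way to read the nested-codebook design is that each domain quantizes against a prefix of a single shared codebook. The sketch below implements that reading with a straight-through estimator; the partition sizes and the exact nesting scheme are assumptions, not AUV's configuration:

```python
import torch
import torch.nn as nn

class NestedCodebook(nn.Module):
    """Single codebook with nested domain partitions: speech tokens draw from
    a prefix of the codebook, general audio from the full table."""

    def __init__(self, dim=128, total=8192, sizes=None):
        super().__init__()
        self.codebook = nn.Embedding(total, dim)
        self.sizes = sizes or {"speech": 2048, "audio": total}

    def forward(self, z, domain):
        # z: (B, T, dim) encoder output; quantize against the nested prefix
        k = self.sizes[domain]
        cb = self.codebook.weight[:k]
        d = torch.cdist(z, cb.unsqueeze(0))      # (B, T, k) distances
        idx = d.argmin(-1)                       # (B, T) token ids, all < k
        zq = self.codebook(idx)
        # straight-through estimator so gradients reach the encoder
        return z + (zq - z).detach(), idx

vq = NestedCodebook()
zq, ids = vq(torch.randn(2, 50, 128), "speech")
print(zq.shape, int(ids.max()) < 2048)          # torch.Size([2, 50, 128]) True
```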
[876] Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias
Shree Harsha Bokkahalli Satish, Harm Lameris, Olivier Perrotin, Gustav Eje Henter, Éva Székely
Main category: eess.AS
TL;DR: This paper presents the first systematic evaluation of bias in Speech Continuation (SC), examining how gender and phonation type affect continuation behavior in speech foundation models.
Details
Motivation: Speech Continuation offers a more direct setting for probing biases in speech foundation models than dialogue, as it's constrained to a single audio stream. The researchers aim to investigate socially relevant representational biases.Method: Evaluated three recent models (SpiritLM, VAE-GSLM, and SpeechGPT) across speaker similarity, voice quality preservation, and text-based bias metrics. Investigated gender and phonation type (breathy, creaky, end-creak) effects.
Result: Results show challenges in speaker similarity and coherence. Textual evaluations reveal significant model and gender interactions: gender effects emerge on text-metrics like agency and sentence polarity once coherence is high. Continuations revert toward modal phonation more strongly for female prompts, revealing systematic voice-quality bias.
Conclusion: Speech Continuation serves as a controlled probe for socially relevant representational biases in speech foundation models, and will become increasingly informative as continuation quality improves.
Abstract: Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models: SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT across speaker similarity, voice quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects emerge on text-metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.
[877] Speaker Anonymisation for Speech-based Suicide Risk Detection
Ziyun Cui, Sike Jia, Yang Lin, Yinan Duan, Diyang Qu, Runsen Chen, Chao Zhang, Chang Lei, Wen Wu
Main category: eess.AS
TL;DR: First systematic study of speaker anonymization for speech-based suicide risk detection, showing that combining anonymization methods can protect speaker identity while maintaining detection performance comparable to original speech.
Details
Motivation: Adolescent suicide is a critical global health issue, and speech provides a cost-effective modality for automatic detection. Protecting speaker identity is crucial for vulnerable populations since speech can reveal personally identifiable information if data is leaked or exploited.Method: Investigates a broad range of anonymization methods including traditional signal processing, neural voice conversion, and speech synthesis. Builds a comprehensive evaluation framework to assess trade-off between speaker identity protection and preservation of suicide risk detection information.
Result: Combining anonymization methods that retain complementary information yields detection performance comparable to original speech while achieving protection of speaker identity for vulnerable populations.
Conclusion: Effective speaker anonymization is achievable for speech-based suicide risk detection, balancing privacy protection with clinical utility through complementary anonymization approaches.
Abstract: Adolescent suicide is a critical global health issue, and speech provides a cost-effective modality for automatic suicide risk detection. Given the vulnerable population, protecting speaker identity is particularly important, as speech itself can reveal personally identifiable information if the data is leaked or maliciously exploited. This work presents the first systematic study of speaker anonymisation for speech-based suicide risk detection. A broad range of anonymisation methods are investigated, including techniques based on traditional signal processing, neural voice conversion, and speech synthesis. A comprehensive evaluation framework is built to assess the trade-off between protecting speaker identity and preserving information essential for suicide risk detection. Results show that combining anonymisation methods that retain complementary information yields detection performance comparable to that of original speech, while achieving protection of speaker identity for vulnerable populations.
[878] Towards Cross-Task Suicide Risk Detection via Speech LLM
Jialun Li, Weitao Jiang, Ziyun Cui, Yinan Duan, Diyang Qu, Chao Zhang, Runsen Chen, Chang Lei, Wen Wu
Main category: eess.AS
TL;DR: This paper proposes a cross-task approach using a speech large language model with mixture of DoRA experts (MoDE) to unify diverse speech suicide risk assessment tasks, achieving higher accuracy and better calibration than single-task models.
Details
Motivation: Suicide risk among adolescents is a critical public health concern, and speech provides a non-invasive, scalable detection approach. Existing methods focus on single tasks, but cross-task approaches could capture complementary cues across diverse assessments.Method: Leverage a speech large language model as the backbone and incorporate a mixture of DoRA experts (MoDE) to dynamically capture complementary cues across diverse speech suicide risk assessment tasks.
Result: Tested on 1,223 participants across ten spontaneous speech tasks. MoDE achieves higher detection accuracy than single-task specialized models and conventional joint-tuning approaches, with better confidence calibration.
Conclusion: The proposed cross-task MoDE approach effectively unifies diverse speech suicide risk assessment tasks, providing superior performance and calibration for medical detection applications.
Abstract: Suicide risk among adolescents remains a critical public health concern, and speech provides a non-invasive and scalable approach for its detection. Existing approaches, however, typically focus on one single speech assessment task at a time. This paper, for the first time, investigates cross-task approaches that unify diverse speech suicide risk assessment tasks within a single model. Specifically, we leverage a speech large language model as the backbone and incorporate a mixture of DoRA experts (MoDE) approach to capture complementary cues across diverse assessments dynamically. The proposed approach was tested on 1,223 participants across ten spontaneous speech tasks. Results demonstrate that MoDE not only achieves higher detection accuracy than both single-task specialised models and conventional joint-tuning approaches, but also provides better confidence calibration, which is especially important for medical detection tasks.
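As a simplified illustration of routing among adapter experts, the sketch below mixes LoRA-style low-rank updates on a frozen linear layer with a softmax router; full DoRA additionally decomposes the weight into magnitude and direction, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLoRAExperts(nn.Module):
    """Mixture of low-rank adapter experts on a frozen linear layer.
    A simplified stand-in for MoDE: LoRA-style deltas plus a soft router."""

    def __init__(self, base: nn.Linear, n_experts=4, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):
        # x: (B, d_in); route each example softly over the experts
        g = F.softmax(self.router(x), dim=-1)            # (B, E)
        delta = torch.einsum("erd,bd->ebr", self.A, x)   # (E, B, rank)
        delta = torch.einsum("eor,ebr->ebo", self.B, delta)  # (E, B, d_out)
        mix = torch.einsum("be,ebo->bo", g, delta)
        return self.base(x) + mix

layer = MoLoRAExperts(nn.Linear(512, 512))
print(layer(torch.randn(3, 512)).shape)  # torch.Size([3, 512])
```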
[879] Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
Zhikang Niu, Shujie Hu, Jeongsoo Choi, Yushen Chen, Peining Chen, Pengcheng Zhu, Yunting Yang, Bowen Zhang, Jian Zhao, Chunhui Wang, Xie Chen
Main category: eess.AS
TL;DR: Semantic-VAE overcomes the VAE optimization dilemma in zero-shot TTS by using semantic alignment regularization, improving both intelligibility and reconstruction quality compared to mel-spectrograms and vanilla VAEs.
Details
Motivation: Current VAE-based latent representations in zero-shot TTS face a fundamental trade-off: higher dimensions improve reconstruction but degrade intelligibility, while lower dimensions improve intelligibility but sacrifice reconstruction fidelity.Method: Proposed Semantic-VAE framework that utilizes semantic alignment regularization in the latent space to capture semantic structure in high-dimensional representations, alleviating the reconstruction-generation trade-off.
Result: Achieved 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59).
Conclusion: Semantic-VAE significantly improves synthesis quality and training efficiency in zero-shot TTS by addressing the fundamental optimization dilemma of VAE-based latent representations.
Abstract: While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamental optimization dilemma: higher-dimensional latent spaces improve reconstruction quality and speaker similarity, but degrade intelligibility, while lower-dimensional spaces improve intelligibility at the expense of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, a novel VAE framework that utilizes semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. Extensive experiments demonstrate that Semantic-VAE significantly improves synthesis quality and training efficiency. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59). We also release the code and models to facilitate further research.
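A minimal sketch of such a training objective, assuming the semantic alignment term is a cosine distance between projected latents and frozen semantic features (e.g. SSL/ASR embeddings); the loss form and weights are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_vae_loss(x_hat, x, mu, logvar, z, sem_emb, proj,
                      w_kl=1e-2, w_sem=1.0):
    """VAE objective plus semantic alignment: project latents and pull them
    toward frozen semantic features. Weights are illustrative."""
    recon = F.l1_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    align = 1 - F.cosine_similarity(proj(z), sem_emb, dim=-1).mean()
    return recon + w_kl * kl + w_sem * align

proj = nn.Linear(64, 768)                  # latent dim -> semantic dim
x = torch.randn(2, 80, 100)                # mel target
x_hat = torch.randn(2, 80, 100, requires_grad=True)
mu, logvar = torch.randn(2, 100, 64), torch.randn(2, 100, 64)
z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # reparameterization
sem = torch.randn(2, 100, 768)             # frozen semantic features
semantic_vae_loss(x_hat, x, mu, logvar, z, sem, proj).backward()
```

The alignment term is what lets the latent space stay high-dimensional (good reconstruction) without becoming semantically opaque (bad intelligibility).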
[880] ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism
Hsing-Hang Chou, Yun-Shao Lin, Ching-Chin Sung, Yu Tsao, Chi-Chun Lee
Main category: eess.AS
TL;DR: A novel diffusion framework for emotional voice conversion that addresses emotion accuracy and speech distortion issues, particularly in zero-shot scenarios with unseen speakers.
Details
Motivation: Existing EVC methods struggle with emotion accuracy and speech distortion, and the zero-shot scenario for unseen speakers remains underexplored despite its importance for practical applications.Method: A diffusion framework with disentangled mechanisms and expressive guidance, trained on a large emotional speech dataset and evaluated on unseen speakers across in-domain and out-of-domain datasets.
Result: The method produces expressive speech with high emotional accuracy, naturalness, and quality, demonstrating strong performance on both in-domain and out-of-domain datasets.
Conclusion: The proposed framework shows potential for broader EVC applications by effectively handling zero-shot scenarios and producing high-quality emotional voice conversions.
Abstract: The human voice conveys not just words but also emotional states and individuality. Emotional voice conversion (EVC) modifies emotional expressions while preserving linguistic content and speaker identity, improving applications like human-machine interaction. While deep learning has advanced EVC models for specific target speakers on well-crafted emotional datasets, existing methods often face issues with emotion accuracy and speech distortion. In addition, the zero-shot scenario, in which emotion conversion is applied to unseen speakers, remains underexplored. This work introduces a novel diffusion framework with disentangled mechanisms and expressive guidance, trained on a large emotional speech dataset and evaluated on unseen speakers across in-domain and out-of-domain datasets. Experimental results show that our method produces expressive speech with high emotional accuracy, naturalness, and quality, showcasing its potential for broader EVC applications.
[881] On the Within-class Variation Issue in Alzheimer’s Disease Detection
Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng
Main category: eess.AS
TL;DR: The paper addresses within-class variation and instance-level imbalance in Alzheimer’s Disease detection by proposing Soft Target Distillation and Instance-level Re-balancing methods that improve classification performance.
Details
Motivation: Conventional binary classification for AD detection overlooks within-class heterogeneity (spectrum of cognitive impairments in AD patients) and instance-level imbalance, which are critical challenges in this domain.Method: Proposed two methods: 1) Soft Target Distillation (SoTD) that uses sample score estimators to generate sample-specific soft scores aligned with cognitive scores, and 2) Instance-level Re-balancing (InRe) to address instance-level imbalance.
Result: Demonstrated and analyzed advantages of the proposed approaches on ADReSS and CU-MARVEL corpora, showing improved detection performance.
Conclusion: The findings provide insights for developing robust and reliable AD detection models by addressing within-class variation and instance-level imbalance challenges.
Abstract: Alzheimer’s Disease (AD) detection employs machine learning classification models to distinguish between individuals with AD and those without. Different from conventional classification tasks, we identify within-class variation as a critical challenge in AD detection: individuals with AD exhibit a spectrum of cognitive impairments. Therefore, simplistic binary AD classification may overlook two crucial aspects: within-class heterogeneity and instance-level imbalance. In this work, we found that a sample score estimator can generate sample-specific soft scores that align with cognitive scores. We subsequently propose two simple yet effective methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), targeting the two problems respectively. Based on the ADReSS and CU-MARVEL corpora, we demonstrated and analyzed the advantages of the proposed approaches in detection performance. These findings provide insights for developing robust and reliable AD detection models.
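The SoTD idea can be sketched as training against a blend of hard labels and estimator-derived soft scores; the blending rule below is one plausible instantiation, not the paper's exact method:

```python
import torch
import torch.nn.functional as F

def sotd_loss(logits, hard_labels, soft_scores, alpha=0.5):
    """Soft Target Distillation sketch: blend hard AD labels with
    sample-specific soft scores (e.g. normalized cognitive-score estimates).
    The blend weight alpha is an assumption."""
    target = alpha * hard_labels.float() + (1 - alpha) * soft_scores
    return F.binary_cross_entropy_with_logits(logits, target)

logits = torch.randn(8, requires_grad=True)
hard = torch.randint(0, 2, (8,))
soft = torch.rand(8)            # estimator output in [0, 1]
sotd_loss(logits, hard, soft).backward()
```

The soft target lets a mildly impaired speaker contribute a less extreme training signal than a severely impaired one, which is the point of modeling within-class variation.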
[882] CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak
Main category: eess.AS
TL;DR: CapSpeech is a new benchmark for style-captioned text-to-speech (CapTTS) tasks, featuring over 10M machine-annotated and 0.36M human-annotated audio-caption pairs, with comprehensive experiments showing high-fidelity speech synthesis.
Details
Motivation: Address the lack of standardized datasets and limited research on downstream tasks in CapTTS, which hinders real-world applications of style-captioned text-to-speech synthesis.Method: Introduce CapSpeech benchmark with multiple CapTTS tasks (CapTTS-SE, AccCapTTS, EmoCapTTS, AgentTTS), create large-scale datasets with professional recordings, and conduct experiments using both autoregressive and non-autoregressive models.
Result: Achieved high-fidelity and highly intelligible speech synthesis across diverse speaking styles. CapSpeech is the largest available dataset with comprehensive annotations for CapTTS-related tasks.
Conclusion: CapSpeech provides valuable insights into CapTTS system development challenges and serves as a comprehensive benchmark for advancing style-captioned text-to-speech research and applications.
Abstract: Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agent (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
[883] MMedFD: A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
Hongzhao Chen, XiaoYang Wang, Jing Lan, Hexiao Ding, Yufeng Jiang, MingHui Yang, DanHui Xu, Jun Luo, Nga-Chun Ng, Gerald W. Y. Cheng, Yunlin Mao, Jung Sun Yoo
Main category: eess.AS
TL;DR: MMedFD is the first real-world Chinese healthcare ASR corpus for multi-turn, full-duplex settings, featuring 5,805 annotated sessions with streaming segmentation and speaker attribution pipeline.
Details
Motivation: There is a scarcity of open benchmarks for clinical dialogue ASR that can handle full-duplex interaction, speaker overlap, and low-latency constraints in real healthcare deployment scenarios.Method: Created MMedFD corpus from deployed AI assistant with synchronized user/mixed-channel views, timing data, and role labels. Developed model-agnostic pipeline for streaming segmentation and speaker attribution. Fine-tuned Whisper-small on role-concatenated audio for long-context recognition.
Result: Established comprehensive ASR evaluation metrics including WER, CER, and HC-WER (concept-level accuracy). Used LLM-generated responses assessed via rubric-based and pairwise protocols. Created reproducible framework for benchmarking streaming ASR and duplex agents.
Conclusion: MMedFD provides the first public benchmark for streaming ASR and end-to-end duplex agents in healthcare, with dataset and resources publicly available to advance clinical dialogue systems.
Abstract: Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, full-duplex settings. Captured from a deployed AI assistant, the dataset comprises 5,805 annotated sessions with synchronized user and mixed-channel views, RTTM/CTM timing, and role labels. We introduce a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, and fine-tune Whisper-small on role-concatenated audio for long-context recognition. ASR evaluation includes WER, CER, and HC-WER, which measures concept-level accuracy across healthcare settings. LLM-generated responses are assessed using rubric-based and pairwise protocols. MMedFD establishes a reproducible framework for benchmarking streaming ASR and end-to-end duplex agents in healthcare deployment. The dataset and related resources are publicly available at https://github.com/Kinetics-JOJO/MMedFD
[884] Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction
Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong
Main category: eess.AS
TL;DR: Phoenix-VAD is an LLM-based streaming semantic endpoint detection model that enables plug-and-play full-duplex prediction for spoken dialogue systems.
Details
Motivation: Current spoken dialogue models lack plug-and-play full-duplex prediction modules for semantic endpoint detection, hindering seamless audio interactions.Method: Leverages LLM’s semantic comprehension capability with sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference.
Result: Achieves excellent and competitive performance on both semantically complete and incomplete speech scenarios.
Conclusion: Enables independent optimization of full-duplex prediction module, providing more reliable and flexible support for next-generation human-computer interaction.
Abstract: Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.
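A sliding-window streaming loop of the kind described might look like the following sketch, where a detector scores each incoming window for semantic completeness and fires after a few consecutive positive decisions; the window, hop, threshold, and patience values are illustrative:

```python
import torch

def stream_endpoint(model, frames, win=50, hop=10, thresh=0.5, patience=3):
    """Sliding-window streaming inference: score each new window for
    semantic completeness; fire after `patience` consecutive hits."""
    hits = 0
    for end in range(win, frames.size(0) + 1, hop):
        window = frames[end - win:end].unsqueeze(0)     # (1, win, feat)
        p = torch.sigmoid(model(window)).item()
        hits = hits + 1 if p > thresh else 0
        if hits >= patience:
            return end                                   # endpoint frame index
    return None                                          # still speaking

class TinyDetector(torch.nn.Module):
    """Stand-in for the LLM-based detector, just to make the loop runnable."""
    def __init__(self, d=40):
        super().__init__()
        self.rnn = torch.nn.GRU(d, 32, batch_first=True)
        self.out = torch.nn.Linear(32, 1)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h[:, -1]).squeeze(-1)

print(stream_endpoint(TinyDetector(), torch.randn(300, 40)))
```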
[885] SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung
Main category: eess.AS
TL;DR: SPADE is a framework that combines structured pruning and adaptive distillation to create efficient LLM-based text-to-speech models, reducing model size and latency while maintaining quality.
Details
Motivation: Recent LLM-TTS systems have strong controllability and zero-shot generalization but suffer from large parameter counts and high latency that limit real-world deployment.Method: Combines (i) pruning guided by word-error-rate-based layer importance index to remove non-essential Transformer layers, and (ii) multi-level knowledge distillation to restore autoregressive coherence.
Result: Preserves near-parity perceptual quality while halving Transformer depth, reduces VRAM usage by up to 20%, achieves up to 1.7x faster real-time factor with less than 5% of original training data.
Conclusion: Compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation.
Abstract: The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.
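The WER-based layer importance index can be sketched as a leave-one-layer-out sweep: skip each Transformer layer in turn, measure the WER degradation of the resulting speech, and prune the layers whose removal hurts least. `evaluate_wer` below is a hypothetical stand-in for the TTS-then-ASR evaluation loop:

```python
def layer_importance(model_layers, evaluate_wer):
    """Importance of layer i = WER degradation when that layer is skipped.
    `evaluate_wer(skip)` is assumed to run synthesis with the given layer
    indices bypassed and return the ASR word error rate."""
    base = evaluate_wer(skip=set())
    scores = {}
    for i in range(len(model_layers)):
        scores[i] = evaluate_wer(skip={i}) - base   # larger = more important
    return sorted(scores, key=scores.get)           # least important first

# toy stand-in: pretend even-numbered layers barely matter
layers = list(range(12))
wer = lambda skip: 5.0 + sum(0.1 if i % 2 == 0 else 1.0 for i in skip)
prune_order = layer_importance(layers, wer)
print(prune_order[:4])   # candidates to remove first
```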
[886] TF-Restormer: Complex Spectral Prediction for Speech Restoration
Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong, Hyung-Min Park
Main category: eess.AS
TL;DR: TF-Restormer is an encoder-decoder architecture for universal speech restoration that handles arbitrary input-output sampling rates without redundant resampling, supports streaming, and achieves superior performance across various distortions.
Details
Motivation: Existing speech restoration systems have limitations including sacrificed signal fidelity in vocoder-based approaches, impracticality of diffusion models for streaming, and redundant computations from fixed target sampling rates requiring external resampling.Method: Uses a time-frequency dual-path encoder for input-bandwidth analysis and a light decoder with frequency extension queries to reconstruct missing high-frequency bands. Includes a shared SFI STFT discriminator for adversarial training across rates, causal time module for streaming, spectral inductive bias for robustness, and scaled log-spectral loss for optimization stability.
Result: TF-Restormer consistently outperforms prior systems across sampling rates, achieving balanced gains in signal fidelity and perceptual quality, with streaming mode maintaining competitive effectiveness for real-time applications.
Conclusion: TF-Restormer provides an efficient and universal solution for speech restoration that handles arbitrary input-output rates without redundant computations, supports streaming, and delivers superior performance across various real-world distortions.
Abstract: Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fidelity, while diffusion models remain impractical for streaming. Moreover, most assume a fixed target sampling rate, requiring external resampling that leads to redundant computations. We present TF-Restormer, an encoder-decoder architecture that concentrates analysis on the input bandwidth with a time-frequency dual-path encoder and reconstructs missing high-frequency bands through a light decoder with frequency extension queries. It enables efficient and universal restoration across arbitrary input-output rates without redundant resampling. To support adversarial training across diverse rates, we introduce a shared sampling-frequency-independent (SFI) STFT discriminator. TF-Restormer further supports streaming with a causal time module, and improves robustness under extreme degradations by injecting spectral inductive bias into the frequency module. Finally, we propose a scaled log-spectral loss that stabilizes optimization under severe conditions while emphasizing well-predicted spectral details. As a single model across sampling rates, TF-Restormer consistently outperforms prior systems, achieving balanced gains in signal fidelity and perceptual quality, while its streaming mode maintains competitive effectiveness for real-time applications. Code and demos are available at https://tf-restormer.github.io/demo.
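For reference, a plain (unscaled) log-spectral L1 loss looks as follows; TF-Restormer's scaled variant additionally re-weights the terms toward well-predicted spectral details, a rule not reproduced here:

```python
import torch

def log_spectral_loss(pred_mag, target_mag, eps=1e-5):
    """Baseline log-spectral L1 distance between STFT magnitudes (B, F, T).
    The paper's scaled variant modifies the per-bin weighting."""
    return (torch.log(pred_mag + eps) - torch.log(target_mag + eps)).abs().mean()

p = torch.rand(2, 257, 100, requires_grad=True)
t = torch.rand(2, 257, 100) + 0.1
log_spectral_loss(p, t).backward()
```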
[887] Measuring Audio’s Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Wei-Long Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, Qiuqiang Kong
Main category: eess.AS
TL;DR: The paper introduces AudioMCQ, a large-scale audio multiple-choice question dataset, and addresses the zero audio-contribution problem in Large Audio Language Models (LALMs) by proposing data filtering and effective multi-stage post-training strategies that achieve state-of-the-art performance.
Details
Motivation: To improve Large Audio Language Models (LALMs) by addressing the suboptimal performance of multi-stage post-training approaches and the lack of large-scale, high-quality datasets for audio tasks, particularly focusing on the zero audio-contribution phenomenon where models ignore audio content.Method: 1) Created AudioMCQ dataset with 571k samples and chain-of-thought annotations. 2) Proposed Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. 3) Developed two post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data).
Result: Achieved first place in DCASE 2025 Audio-Question-Answering challenge and established new state-of-the-art performance: 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU.
Conclusion: The proposed AudioMCQ dataset and multi-stage post-training strategies effectively address the zero audio-contribution problem in LALMs and significantly improve model performance across multiple audio benchmarks, demonstrating the importance of proper data allocation across training stages.
Abstract: Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance across these benchmarks.
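Audio-Contribution Filtering can be sketched as a text-only probe: if a model answers the multiple-choice question correctly without hearing the audio, the sample's audio contribution is weak. The voting rule and the `text_only_answer` call below are hypothetical, not the paper's exact procedure:

```python
def audio_contribution_split(samples, text_only_answer, n_votes=3):
    """Partition MCQ samples by whether a text-only model solves them.
    `text_only_answer(q, choices)` is a hypothetical text-only LLM call."""
    weak, strong = [], []
    for s in samples:
        votes = [text_only_answer(s["question"], s["choices"])
                 for _ in range(n_votes)]
        correct = sum(v == s["answer"] for v in votes)
        # unanimously solvable from text alone -> weak audio contribution
        (weak if correct == n_votes else strong).append(s)
    return weak, strong   # e.g. weak -> SFT stage, strong -> RL stage

sample = {"question": "What animal is heard?",
          "choices": ["dog", "cat", "rain"], "answer": "dog"}
weak, strong = audio_contribution_split([sample], lambda q, c: c[0])
print(len(weak), len(strong))   # 1 0  (text-only model guessed it)
```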
eess.IV
[888] Patch-Based Diffusion for Data-Efficient, Radiologist-Preferred MRI Reconstruction
Rohan Sanda, Asad Aali, Andrew Johnston, Eduardo Reis, Jonathan Singh, Gordon Wetzstein, Sara Fridovich-Keil
Main category: eess.IV
TL;DR: Patch-based diffusion models (PaDIS-MRI) enable high-quality MRI reconstruction from severely undersampled data using small training datasets, outperforming state-of-the-art methods in image quality and diagnostic confidence.
Details
Motivation: MRI requires long acquisition times, increasing costs and motion artifacts. Existing diffusion models need large datasets, which are expensive to collect in clinical settings.Method: Extends Patch-based Diffusion Inverse Solver (PaDIS) to complex-valued, multi-coil MRI reconstruction, using patch-based diffusion priors trained on small datasets (as few as 25 k-space images).
Result: PaDIS-MRI outperforms FastMRI-EDM on image quality metrics (PSNR, SSIM, NRMSE), uncertainty, cross-contrast generalization, and robustness to severe undersampling. In blinded radiologist study, chosen as diagnostically superior in 91.7% of cases.
Conclusion: Patch-based diffusion priors show strong potential for high-fidelity MRI reconstruction in data-scarce clinical settings where diagnostic confidence is critical.
Abstract: Magnetic resonance imaging (MRI) requires long acquisition times, raising costs, reducing accessibility, and making scans more susceptible to motion artifacts. Diffusion probabilistic models that learn data-driven priors can potentially assist in reducing acquisition time. However, they typically require large training datasets that can be prohibitively expensive to collect. Patch-based diffusion models have shown promise in learning effective data-driven priors over small real-valued datasets, but have not yet demonstrated clinical value in MRI. We extend the Patch-based Diffusion Inverse Solver (PaDIS) to complex-valued, multi-coil MRI reconstruction, and compare it against a state-of-the-art whole-image diffusion baseline (FastMRI-EDM) for 7x undersampled MRI reconstruction on the FastMRI brain dataset. We show that PaDIS-MRI models trained on small datasets of as few as 25 k-space images outperform FastMRI-EDM on image quality metrics (PSNR, SSIM, NRMSE), pixel-level uncertainty, cross-contrast generalization, and robustness to severe k-space undersampling. In a blinded study with three radiologists, PaDIS-MRI reconstructions were chosen as diagnostically superior in 91.7% of cases, compared to baselines (i) FastMRI-EDM and (ii) classical convex reconstruction with wavelet sparsity. These findings highlight the potential of patch-based diffusion priors for high-fidelity MRI reconstruction in data-scarce clinical settings where diagnostic confidence matters.
[889] Transabdominal Fetal Oximetry via Diffuse Optics: Principled Analysis and Demonstration in Pregnant Ovine Models
Weitai Qian, Rishad Raiyan Joarder, Randall Fowler, Begum Kasap, Mahya Saffarpour, Kourosh Vali, Tailai Lihe, Aijun Wang, Diana Farmer, Soheil Ghiasi
Main category: eess.IV
TL;DR: The paper introduces a novel method for non-invasive fetal blood oxygen saturation (fSpO2) monitoring using diffuse optics, achieving improved accuracy through machine learning and a new feature called Exponential Pulsation Ratio.
Details
Motivation: To advance fetal health monitoring by enabling continuous measurement of fetal blood oxygen saturation through diffuse optics, addressing the need for more accurate and non-invasive intrapartum monitoring technologies.Method: Developed a theoretical derivation and comprehensive pipeline using diffuse light intensity values, introduced Exponential Pulsation Ratio (EPR) as a key feature, and employed machine-learning models to fuse information from multiple detectors across simulations and in-vivo experiments.
Result: Achieved Mean Absolute Error of 4.81% (simulated) and 6.85% (in-vivo) with Pearson’s r correlations of 0.81 and 0.71 respectively, outperforming existing approaches in both datasets.
Conclusion: The proposed method demonstrates viability as a supplemental technology for intrapartum fetal monitoring, significantly enhancing fSpO2 estimation accuracy compared to current approaches.
Abstract: Diffuse optics has the potential to offer a substantial advancement in fetal health monitoring via enabling continuous measurement of fetal blood oxygen saturation (fSpO$_2$). Aiming to enhance the sensing accuracy and to elucidate the foundational limits of Transabdominal Fetal Oximetry (TFO) via diffuse optics, we introduce a theoretical derivation and a comprehensive pipeline for fSpO$_2$ estimation from non-invasively sensed diffuse light intensity values, which are leveraged to analyze datasets obtained through both simulations and in-vivo experiments in a gold-standard large animal model of pregnancy. We propose the Exponential Pulsation Ratio (EPR) as a key feature, and develop machine-learning models to fuse the information collected across multiple detectors. Our proposed method demonstrates a Mean Absolute Error (MAE) of 4.81% and 6.85% with a Pearson’s r correlation of 0.81 (p<0.001) and 0.71 (p<0.001) for estimation of fSpO$_2$ on the simulated and in-vivo datasets, respectively. Across both datasets, our method outperforms existing approaches, enhancing the accuracy of fSpO$_2$ estimation and demonstrating its viability as a supplemental technology for intrapartum fetal monitoring.
[890] Multicollinearity-Aware Parameter-Free Strategy for Hyperspectral Band Selection: A Dependence Measures-Based Approach
Dibyabha Deb, Ujjwal Verma
Main category: eess.IV
TL;DR: A parameter-free band selection method for hyperspectral images using VIF, ABC, and MI to reduce dimensionality while maintaining classification performance.
Details
Motivation: Hyperspectral images have high dimensionality causing computational challenges. Existing band selection methods suffer from sensitivity to initialization, parameter tuning, and high computational costs.Method: Combines three dependence measures: Average Band Correlation (ABC) for linear correlations, Mutual Information (MI) for uncertainty reduction relative to labels, and Variance Inflation Factor (VIF) for multicollinearity reduction. Uses VIF-based pre-selection followed by clustering on ABC and MI values.
Result: Evaluated on four benchmark datasets with significant overlap with other methods’ selected bands. SVM classification shows VIF-driven pruning enhances classification by minimizing multicollinearity. Ablation studies confirm ABC+MI combination yields robust band subsets.
Conclusion: The proposed parameter-free approach effectively captures relevant spectral features and enhances classification performance while eliminating the need for optimal parameter estimation.
Abstract: Hyperspectral bands offer rich spectral and spatial information; however, their high dimensionality poses challenges for efficient processing. Band selection (BS) methods aim to extract a smaller subset of bands to reduce spectral redundancy. Existing approaches, such as ranking-based, clustering-based, and iterative methods, often suffer from issues like sensitivity to initialization, parameter tuning, and high computational cost. This work introduces a BS strategy integrating three dependence measures: Average Band Correlation (ABC), Mutual Information (MI), and Variance Inflation Factor (VIF). ABC quantifies linear correlations between spectral bands, while MI measures uncertainty reduction relative to ground truth labels. To address multicollinearity and reduce the search space, the approach first applies a VIF-based pre-selection of spectral bands. Subsequently, a clustering algorithm is used to identify the optimal subset of bands based on the ABC and MI values. Unlike previous methods, this approach is completely parameter-free for hyperspectral band selection, eliminating the need for optimal parameter estimation. The proposed method is evaluated on four standard benchmark datasets: WHU-Hi-LongKou, Pavia University, Salinas, and Oil Spill, and is compared to existing state-of-the-art approaches. There is significant overlap between the bands identified by our proposed method and those selected by other methods, indicating that our approach effectively captures the most relevant spectral features. Further, support vector machine (SVM) classification validates that VIF-driven pruning enhances classification by minimizing multicollinearity. Ablation studies confirm that combining ABC with MI yields robust, discriminative band subsets.
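A parameter-light sketch of the pipeline: VIF-based pre-selection drops multicollinear bands, then the survivors are clustered on (ABC, MI) and one representative per cluster is kept. The VIF threshold, cluster count, and per-cluster pick rule are assumptions, not the paper's exact recipe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import mutual_info_classif
from sklearn.cluster import KMeans

def select_bands(X, y, vif_max=10.0, n_bands=8):
    """X: (pixels, bands) reflectances, y: (pixels,) class labels."""
    n = X.shape[1]
    keep = []
    for i in range(n):                       # VIF pre-selection
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        if 1.0 / max(1 - r2, 1e-6) < vif_max:
            keep.append(i)
    keep = np.array(keep if keep else range(n))
    corr = np.corrcoef(X[:, keep], rowvar=False)
    abc = np.abs(corr).mean(axis=0)          # average band correlation
    mi = mutual_info_classif(X[:, keep], y, random_state=0)
    feats = np.stack([abc, mi], axis=1)
    labels = KMeans(n_clusters=min(n_bands, len(keep)), n_init=10,
                    random_state=0).fit_predict(feats)
    # from each cluster, keep the band with the highest MI
    return sorted(keep[np.argmax(np.where(labels == c, mi, -np.inf))]
                  for c in np.unique(labels))

X = np.random.rand(500, 30); y = np.random.randint(0, 4, 500)
print(select_bands(X, y))
```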
[891] Comparative Analysis of GAN and Diffusion for MRI-to-CT translation
Emily Honey, Anders Helbo, Jens Petersen
Main category: eess.IV
TL;DR: This paper compares cGAN (Pix2Pix) and cDDPM (Palette) architectures for MRI-to-CT translation, finding cDDPM with multi-channel conditioning performs best.
Details
Motivation: CT scans are essential for treatment but sometimes unavailable, making synthetic CT generation from MRI valuable. Need to establish which strategies are most effective for MRI-to-CT translation.Method: Compared cGAN (Pix2Pix) and cDDPM (Palette) architectures. Reduced 3D translation to 2D slices on transverse plane to lower computational cost. Investigated single vs multi-slice conditioning. Used novel SIMOS metric to assess slice continuity.
Result: MRI-to-CT generative models benefit from multi-channel conditional input and using cDDPM architecture. The 2D slice approach proved viable for reducing computational complexity.
Conclusion: cDDPM with multi-slice conditioning is the most effective approach for MRI-to-CT translation, and the 2D slice strategy successfully reduces computational requirements while maintaining quality.
Abstract: Computed tomography (CT) is essential for treatment and diagnostics; when CT scans are missing or otherwise difficult to obtain, methods for generating synthetic CT (sCT) images from magnetic resonance imaging (MRI) are sought after. Therefore, it is valuable to establish a reference for which strategies are most effective for MRI-to-CT translation. In this paper, we compare the performance of two frequently used architectures for MRI-to-CT translation: a conditional generative adversarial network (cGAN) and a conditional denoising diffusion probabilistic model (cDDPM). We chose well-established implementations to represent each architecture: Pix2Pix for cGAN, and Palette for cDDPM. We separate the classical 3D translation problem into a sequence of 2D translations on the transverse plane, to investigate the viability of a strategy that reduces the computational cost. We also investigate the impact of conditioning the generative process on a single MRI image/slice and on multiple MRI slices. The performance is assessed using a thorough evaluation protocol, including a novel slice-wise metric, Similarity Of Slices (SIMOS), which measures the continuity between transverse slices when compiling the sCTs into 3D format. Our comparative analysis revealed that MRI-to-CT generative models benefit from multi-channel conditional input and from using cDDPM as the architecture.
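In the spirit of the SIMOS metric, slice continuity can be approximated by averaging a similarity measure over adjacent transverse slices; the SSIM-based variant below is one natural choice and may differ from the paper's exact definition:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def simos(volume):
    """Average similarity of adjacent transverse slices in a stacked sCT.
    volume: (n_slices, H, W) array; higher = smoother slice transitions."""
    rng = float(volume.max() - volume.min()) or 1.0
    scores = [ssim(volume[i], volume[i + 1], data_range=rng)
              for i in range(volume.shape[0] - 1)]
    return float(np.mean(scores))

sct = np.random.rand(16, 64, 64).astype(np.float32)
print(simos(sct))   # near 0 for noise; close to 1 for smooth volumes
```

A slice-wise translation pipeline can score well on per-slice metrics yet produce striping along the stacking axis, which is exactly the failure mode such a continuity metric is designed to expose.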
[892] Fifty Years of SAR Automatic Target Recognition: The Road Forward
Jie Zhou, Yongxiang Liu, Li Liu, Weijie Li, Bowen Peng, Yafei Song, Gangyao Kuang, Xiang Li
Main category: eess.IV
TL;DR: A comprehensive 50-year review of SAR automatic target recognition development, analyzing the evolution from traditional methods to modern deep learning approaches, with organized datasets and future directions.
Details
Motivation: To document the technical evolution of SAR ATR over 50 years, distinguish solved vs. emerging challenges, and provide practical resources for researchers.Method: Systematic review and synthesis of literature, analysis of inheritance from traditional methods (statistical modeling, scattering center analysis, feature engineering) to deep learning frameworks, and compilation of public datasets.
Result: First comprehensive review of 50 years of SAR ATR development, identification of challenges mitigated by deep learning vs. new obstacles, organized compilation of all public SAR datasets with direct links.
Conclusion: The survey provides historical documentation, practical resources, and forward-looking insights for SAR ATR research, with open-source literature, code, and datasets available for reproducibility and future development toward more generalizable and physically-consistent systems.
Abstract: This paper provides the first comprehensive review of fifty years of synthetic aperture radar automatic target recognition (SAR ATR) development, tracing its evolution from inception to the present day. Central to our analysis is the inheritance and refinement of traditional methods, such as statistical modeling, scattering center analysis, and feature engineering, within modern deep learning frameworks. The survey clearly distinguishes long-standing challenges that have been substantially mitigated by deep learning from newly emerging obstacles. We synthesize recent advances in physics-guided deep learning and propose future directions toward more generalizable and physically-consistent SAR ATR. Additionally, we provide a systematically organized compilation of all publicly available SAR datasets, complete with direct links to support reproducibility and benchmarking. This work not only documents the technical evolution of the field but also offers practical resources and forward-looking insights for researchers and practitioners. A systematic summary of existing literature, code, and datasets is open-sourced at https://github.com/JoyeZLearning/SAR-ATR-From-Beginning-to-Present.
[893] COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics
Matt Y. Cheung, Ashok Veeraraghavan, Guha Balakrishnan
Main category: eess.IV
TL;DR: COMPASS is a conformal prediction framework that generates efficient uncertainty intervals for downstream metrics in medical image segmentation by leveraging deep neural network representations, producing tighter intervals than traditional methods.
Details
Motivation: In clinical applications, segmentation model utility depends on derived metrics like organ size rather than pixel-level accuracy, making uncertainty quantification crucial for decision-making. Traditional conformal prediction applied to final metrics is inefficient.Method: COMPASS performs calibration directly in the model’s representation space by perturbing intermediate features along low-dimensional subspaces maximally sensitive to the target metric, leveraging deep neural network inductive biases.
Result: COMPASS produces significantly tighter intervals than traditional CP baselines on four medical image segmentation tasks for area estimation. It also recovers target coverage under covariate shifts using learned internal features for importance weighting.
Conclusion: COMPASS provides practical, metric-based uncertainty quantification for medical image segmentation by efficiently leveraging model representations, paving the way for reliable clinical decision-making.
Abstract: In clinical applications, the utility of segmentation models is often judged by the accuracy of derived downstream metrics, such as organ size, rather than by the pixel-level accuracy of the segmentation masks themselves. Thus, uncertainty quantification for such metrics is crucial for decision-making. Conformal prediction (CP) is a popular framework for deriving such principled uncertainty guarantees, but applying CP naively to the final scalar metric is inefficient because it treats the complex, non-linear segmentation-to-metric pipeline as a black box. We introduce COMPASS, a practical framework that generates efficient, metric-based CP intervals for image segmentation models by leveraging the inductive biases of their underlying deep neural networks. COMPASS performs calibration directly in the model's representation space by perturbing intermediate features along low-dimensional subspaces maximally sensitive to the target metric. We prove that COMPASS achieves valid marginal coverage under exchangeability and nestedness assumptions. Empirically, we demonstrate that COMPASS produces significantly tighter intervals than traditional CP baselines on four medical image segmentation tasks for area estimation of skin lesions and anatomical structures. Furthermore, we show that leveraging learned internal features to estimate importance weights allows COMPASS to recover target coverage under covariate shifts. COMPASS paves the way for practical, metric-based uncertainty quantification for medical image segmentation.
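For contrast, the "traditional CP baseline" that COMPASS tightens can be sketched as split conformal prediction applied directly to the scalar metric. The following illustrates that baseline only, not the COMPASS feature-space procedure (variable names are assumptions):

```python
import numpy as np

def split_cp_interval(metric_pred_cal, metric_true_cal, metric_pred_test, alpha=0.1):
    """Split conformal intervals for a scalar metric (e.g., predicted organ area).

    Nonconformity score: absolute residual on a held-out calibration set.
    Gives ~(1 - alpha) marginal coverage under exchangeability.
    """
    scores = np.abs(metric_pred_cal - metric_true_cal)
    n = len(scores)
    # finite-sample-corrected quantile of the calibration scores
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    return metric_pred_test - q, metric_pred_test + q
```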
[894] Deep Learning-Based Cross-Anatomy CT Synthesis Using Adapted nnResU-Net with Anatomical Feature Prioritized Loss
Javier Sequeiro González, Arthur Longuefosse, Miguel Díaz Benito, Álvaro García Martín, Fabien Baldacci
Main category: eess.IV
TL;DR: A patch-based 3D nnUNet adaptation for MR-to-CT and CBCT-to-CT image translation using the SynthRAD2025 dataset, featuring two network configurations (standard UNet and residual UNet) with Anatomical Feature-Prioritized (AFP) loss for enhanced clinical structure reconstruction.
Details
Motivation: To develop a stable solution for cross-modality medical image synthesis that improves anatomical fidelity, particularly for bone structures in MR-to-CT and lesions in CBCT-to-CT translations, leveraging the multicenter SynthRAD2025 dataset.Method: Adapted nnUNet with two configurations (standard UNet and residual UNet), introduced AFP loss using features from a compact segmentation network trained on TotalSegmentator labels, used 3D patches tailored to anatomical regions, trained for 1000-1500 epochs with AFP fine-tuning for 500 epochs using L1+AFP objective.
Result: Residual networks with AFP yielded sharper reconstructions and improved anatomical fidelity, especially for bone structures and lesions, while L1-only networks achieved slightly better intensity-based metrics.
Conclusion: The combination of automatic nnUNet pipeline with residual learning and anatomically guided feature losses provides an effective methodology for cross-modality medical image synthesis.
Abstract: We present a patch-based 3D nnUNet adaptation for MR-to-CT and CBCT-to-CT image translation using the multicenter SynthRAD2025 dataset, covering head and neck (HN), thorax (TH), and abdomen (AB) regions. Our approach leverages two main network configurations: a standard UNet and a residual UNet, both adapted from nnUNet for image synthesis. We introduce the Anatomical Feature-Prioritized (AFP) loss, which compares multilayer features extracted from a compact segmentation network trained on TotalSegmentator labels, enhancing the reconstruction of clinically relevant structures. Input volumes were normalized per case using z-score normalization for MRIs, and clipping plus dataset-level z-score normalization for CBCT and CT. Training used 3D patches tailored to each anatomical region, without additional data augmentation. Models were trained for 1000 and 1500 epochs, with AFP fine-tuning performed for 500 epochs using a combined L1+AFP objective. During inference, overlapping patches were aggregated via mean averaging with a step size of 0.3, and postprocessing included reverse z-score normalization. Both network configurations were applied across all regions, allowing a consistent model design while capturing local adaptations through residual learning and the AFP loss. Qualitative and quantitative evaluation revealed that residual networks combined with AFP yielded sharper reconstructions and improved anatomical fidelity, particularly for bone structures in MR-to-CT and lesions in CBCT-to-CT, while L1-only networks achieved slightly better intensity-based metrics. This methodology provides a stable solution for cross-modality medical image synthesis, demonstrating the effectiveness of combining the automatic nnUNet pipeline with residual learning and anatomically guided feature losses.
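The AFP loss is a feature-matching objective computed with a frozen segmentation network. A minimal PyTorch sketch of that idea follows; the layer granularity, weighting, and class names are assumptions for illustration, not the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFPLoss(nn.Module):
    """Anatomical Feature-Prioritized loss: L1 between multilayer features of a
    frozen segmentation encoder applied to the synthetic and the real CT."""

    def __init__(self, seg_encoder_stages):
        super().__init__()
        self.stages = nn.ModuleList(seg_encoder_stages).eval()
        for p in self.stages.parameters():
            p.requires_grad_(False)  # the segmentation network stays frozen

    def forward(self, sct, ct):
        loss, f_s, f_r = 0.0, sct, ct
        for stage in self.stages:            # accumulate L1 over each depth
            f_s, f_r = stage(f_s), stage(f_r)
            loss = loss + F.l1_loss(f_s, f_r)
        return loss

# Fine-tuning objective, with an assumed weight lambda_afp:
# total = F.l1_loss(sct, ct) + lambda_afp * afp_loss(sct, ct)
```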
[895] Surgical Vision World Model
Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra P. K. Poudel, Binod Bhattarai
Main category: eess.IV
TL;DR: Proposes the first surgical vision world model that generates action-controllable surgical data from unlabeled videos, enabling realistic surgical simulation for training without expensive action annotations.
Details
Motivation: Enable realistic surgical simulation for medical training and autonomous agent development without requiring expensive action-labeled data, addressing limitations of current simplified simulations.Method: Leverages unlabeled surgical video data (SurgToolLoc-2022) to infer latent actions and generate action-controllable surgical data, inspired by Genie’s approach for video games.
Result: Successfully developed a surgical vision world model capable of generating action-controllable surgical data, with architecture validated through extensive experiments.
Conclusion: The proposed model enables realistic surgical simulation and autonomous agent training using unlabeled video data, overcoming the cost barrier of action annotation in surgical domains.
Abstract: Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations and lack realism. Furthermore, the existing world-model literature has predominantly dealt with action-labeled data, limiting its applicability to real-world surgical data, where obtaining action annotations is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data, and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Code and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
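Genie-style latent-action learning, which this work adapts to surgery, trains an encoder to infer a discrete latent action from a pair of consecutive frames, while a decoder must predict the next frame from the current frame plus that action. The sketch below is a heavily simplified illustration of this principle, not the paper's architecture; every module shape and the Gumbel-softmax quantizer are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Infer a discrete latent action from (frame_t, frame_t+1) and use it to
    predict frame_t+1 from frame_t alone, making the action a controllable knob."""

    def __init__(self, n_actions=8, dim=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(6, dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, n_actions))
        self.actions = nn.Embedding(n_actions, dim)
        self.decode = nn.Sequential(
            nn.Conv2d(3 + dim, dim, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(dim, 3, 3, 1, 1))

    def forward(self, frame_t, frame_t1):
        logits = self.encode(torch.cat([frame_t, frame_t1], 1))
        onehot = F.gumbel_softmax(logits, hard=True)   # straight-through quantization
        a = (onehot @ self.actions.weight)[..., None, None]
        a = a.expand(-1, -1, *frame_t.shape[2:])       # broadcast action spatially
        return self.decode(torch.cat([frame_t, a], 1)) # trained with reconstruction loss
```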
[896] Distillation-Enabled Knowledge Alignment Protocol for Semantic Communication in AI Agent Networks
Jingzhi Hu, Geoffrey Ye Li
Main category: eess.IV
TL;DR: Proposes DeKAP protocol for knowledge alignment in semantic communication networks by distilling expert knowledge into parameter-efficient matrices and optimizing resource allocation.
Details
Motivation: Future networks need to connect AI agents for collaboration, but semantic communication requires knowledge alignment while agents have distinct expert knowledge for their individual tasks.Method: Distills expert knowledge into parameter-efficient low-rank matrices, allocates them across the network, formulates joint optimization as large-scale integer linear programming, and develops efficient greedy algorithm.
Result: DeKAP achieves knowledge alignment with the lowest communication and computation resources compared to conventional approaches in computer simulations.
Conclusion: The proposed DeKAP protocol effectively enables knowledge alignment among AI agents while minimizing resource consumption for semantic communication networks.
Abstract: Future networks are envisioned to connect massive numbers of artificial intelligence (AI) agents, enabling their extensive collaboration on diverse tasks. Compared to traditional entities, these agents naturally suit semantic communication (SC), which can significantly enhance bandwidth efficiency. Nevertheless, SC requires the knowledge among agents to be aligned, while in practice agents have distinct expert knowledge for their individual tasks. In this paper, we propose a distillation-enabled knowledge alignment protocol (DeKAP), which distills the expert knowledge of each agent into parameter-efficient low-rank matrices, allocates them across the network, and allows agents to simultaneously maintain aligned knowledge for multiple tasks. We formulate the joint minimization of alignment loss, communication overhead, and storage cost as a large-scale integer linear programming problem and develop a highly efficient greedy algorithm. In computer simulations, DeKAP establishes knowledge alignment with the lowest communication and computation resource usage among the compared approaches.
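The "parameter-efficient low-rank matrices" are in the spirit of LoRA-style adapters: each task's distilled knowledge lives in a pair of low-rank factors on top of a frozen shared model, so only those factors need to be stored or transmitted for alignment. A minimal sketch of that building block (rank and names are assumptions; the allocation ILP and greedy solver are omitted):

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """A rank-r update W + B @ A carrying one task's distilled knowledge.

    Only A and B (r * (d_in + d_out) parameters) need to be exchanged to
    align another agent, instead of the full d_out x d_in weight matrix.
    """

    def __init__(self, base_linear: nn.Linear, r=8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # the shared base model stays frozen
        self.A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, r))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T
```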
[897] SNR and Resource Adaptive Deep JSCC for Distributed IoT Image Classification
Ali Waqas, Sinem Coleri
Main category: eess.IV
TL;DR: A novel SNR- and computation-adaptive distributed CNN framework for wireless image classification using learning-assisted intelligent Genetic Algorithm (LAIGA) to optimize network configurations under FLOPs constraints and SNR conditions.
Details
Motivation: Sensor-based IoT devices face computational limitations and noisy wireless channels, requiring efficient split-network DNN approaches. Existing methods lack adaptability to varying computational budgets and channel conditions.Method: Proposed LAIGA (learning-assisted intelligent Genetic Algorithm) that explores CNN hyperparameter space while discarding infeasible configurations and using Random Forests for learning assistance to avoid exhaustive search.
Result: Achieves 10% increase in classification accuracy compared to existing JSCC-based SNR-adaptive methods at low SNR (-10dB) across computational budgets from 1M to 70M FLOPs.
Conclusion: The framework outperforms fixed-split architectures and existing SNR-adaptive methods, particularly under low SNR and limited computational resources.
Abstract: Sensor-based local inference at IoT devices faces severe computational limitations, often requiring data transmission over noisy wireless channels for server-side processing. To address this, split-network Deep Neural Network (DNN) based Joint Source-Channel Coding (JSCC) schemes are used to extract and transmit relevant features instead of raw data. However, most existing methods rely on fixed network splits and static configurations, lacking adaptability to varying computational budgets and channel conditions. In this paper, we propose a novel SNR- and computation-adaptive distributed CNN framework for wireless image classification across IoT devices and edge servers. We introduce a learning-assisted intelligent Genetic Algorithm (LAIGA) that efficiently explores the CNN hyperparameter space to optimize the network configuration under given FLOPs constraints and a given SNR. LAIGA discards infeasible network configurations that exceed the computational budget at the IoT device, and it exploits Random Forest-based learning assistance to avoid an exhaustive exploration of the hyperparameter space and to induce application-specific bias in the candidate optimal configurations. Experimental results demonstrate that the proposed framework outperforms fixed-split architectures and existing SNR-adaptive methods, especially under low SNR and limited computational resources. We achieve a 10% increase in classification accuracy compared to an existing JSCC-based SNR-adaptive multilayer framework at an SNR as low as -10 dB, across a range of computational budgets (1M to 70M FLOPs) at the IoT device.
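The LAIGA loop can be sketched as a genetic algorithm with a feasibility filter and a Random Forest surrogate that pre-ranks offspring before costly evaluation. This is one plausible reading of the learning assistance described above, with all helper names assumed:

```python
import random
from sklearn.ensemble import RandomForestRegressor

def laiga_sketch(sample_config, flops_of, evaluate, budget, generations=20, pop=32):
    """GA over numeric CNN hyperparameter dicts (consistent key order).

    sample_config() -> dict; flops_of(cfg) -> float; evaluate(cfg) -> accuracy.
    A Random Forest fitted on evaluated configs pre-ranks offspring so that
    only promising, feasible candidates reach expensive evaluation.
    """
    population = [c for c in (sample_config() for _ in range(pop * 4))
                  if flops_of(c) <= budget][:pop]            # feasibility filter
    seen_x, seen_y = [], []
    for _ in range(generations):
        scores = [evaluate(c) for c in population]           # true fitness
        seen_x += [list(c.values()) for c in population]
        seen_y += scores
        surrogate = RandomForestRegressor(n_estimators=50).fit(seen_x, seen_y)
        elite = [c for _, c in sorted(zip(scores, population),
                                      key=lambda t: -t[0])][:pop // 2]
        children = []
        while len(children) < pop * 2:   # uniform crossover (mutation omitted)
            a, b = random.sample(elite, 2)
            child = {k: random.choice([a[k], b[k]]) for k in a}
            if flops_of(child) <= budget:                    # discard infeasible
                children.append(child)
        ranked = sorted(children,
                        key=lambda c: -surrogate.predict([list(c.values())])[0])
        population = elite + ranked[:pop - len(elite)]       # surrogate pre-selection
    return max(population, key=evaluate)
```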
[898] Adiabatic Capacitive Neuron: An Energy-Efficient Functional Unit for Artificial Neural Networks
Sachin Maheshwari, Mike Smart, Himadri Singh Raghav, Themis Prodromakis, Alexander Serb
Main category: eess.IV
TL;DR: This paper presents an Adiabatic Capacitive Neuron (ACN) hardware implementation with 12-bit precision, featuring improved energy efficiency, accuracy, and robustness over conventional designs. The proposed design achieves over 90% energy savings compared to non-adiabatic CMOS Capacitive Neurons.
Details
Motivation: To develop a more energy-efficient and robust artificial neuron hardware implementation that addresses the limitations of conventional designs, particularly in terms of energy consumption, accuracy, and process/temperature variations.Method: Implemented a 12-bit single neuron with positive/negative weight support in 0.18μm CMOS technology. Designed a new Threshold Logic (TL) for binary activation function with low symmetrical offset. Used post-layout simulations and 1000-sample Monte Carlo analysis to validate performance across process corners and temperatures.
Result: The proposed TL design achieved maximum offset voltage of 9mV (vs 27mV/5mV in conventional TL), 1.5-2.3% energy reduction in TL, and over 90% total synapse energy savings (12x improvement) compared to non-adiabatic CCN across 500kHz-100MHz frequency range. Monte Carlo confirmed worst-case energy savings >90%.
Conclusion: The ACN implementation successfully demonstrates significant energy efficiency improvements (>90% savings) while maintaining functionality, accuracy, and robustness across process variations and temperature ranges, making it suitable for energy-constrained neural network applications.
Abstract: This paper introduces a new, highly energy-efficient, Adiabatic Capacitive Neuron (ACN) hardware implementation of an Artificial Neuron (AN) with improved functionality, accuracy, robustness and scalability over previous work. The paper describes the implementation of a 12-bit single neuron, with positive and negative weight support, in a 0.18 µm CMOS technology. The paper also presents a new Threshold Logic (TL) design for a binary AN activation function that generates a low symmetrical offset across three process corners and five temperatures between −55°C and 125°C. Post-layout simulations demonstrate a maximum rising and falling offset voltage of 9 mV, compared to conventional TL, which has rising and falling offset voltages of 27 mV and 5 mV respectively, across temperature and process. Moreover, the proposed TL design shows a decrease in average energy of 1.5% at the SS corner and 2.3% at the FF corner compared to the conventional TL design. The total synapse energy saving for the proposed ACN was above 90% (over a 12× improvement) compared to a non-adiabatic CMOS Capacitive Neuron (CCN) benchmark for frequencies ranging from 500 kHz to 100 MHz. A 1000-sample Monte Carlo simulation including process variation and mismatch confirms worst-case energy savings of >90% compared to the CCN in the synapse energy profile. Finally, supply voltage scaling shows consistent energy savings above 90% (except for all-zero inputs) without loss of functionality.
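As background for why adiabatic operation saves synapse energy (a standard first-order argument, not a derivation from the paper): abruptly charging a capacitance C to voltage V dissipates a fixed ½CV², whereas ramping the supply over a time T much longer than the RC constant dissipates only a fraction of that:

```latex
E_{\text{conv}} = \tfrac{1}{2} C V^{2},
\qquad
E_{\text{adia}} \approx \frac{RC}{T}\, C V^{2} \ll \tfrac{1}{2} C V^{2}
\quad \text{for } T \gg RC .
```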
[899] A Two-Stage Strategy for Mitosis Detection Using Improved YOLO11x Proposals and ConvNeXt Classification
Jie Xiao, Mengye Lyu, Shaojun Liu
Main category: eess.IV
TL;DR: A two-stage framework for mitosis detection in whole-slide images that combines improved YOLO11x for candidate generation and ConvNeXt-Tiny classifier for false positive filtering, achieving improved F1-score.
Details
Motivation: Mitosis detection in complex WSIs with non-tumor, inflamed, and necrotic regions suffers from false positives and negatives due to heterogeneous context and artifacts, degrading F1-score.Method: Two-stage approach: 1) Improved YOLO11x with EMA attention and LSConv generates mitosis candidates using low confidence threshold for high recall; 2) ConvNeXt-Tiny classifier filters false positives to ensure precision.
Result: Achieved F1-score of 0.882 on fused dataset (0.035 higher than baseline), with precision improvement from 0.762 to 0.839 while maintaining comparable recall. Scored 0.7587 F1 on MIDOG 2025 Track 1 preliminary test set.
Conclusion: The proposed two-stage framework effectively addresses false positive and negative issues in mitosis detection, significantly improving precision and overall F1-score in complex WSI contexts.
Abstract: MIDOG 2025 Track 1 requires mitosis detection in whole-slide images (WSIs) containing non-tumor, inflamed, and necrotic regions. Due to the complicated and heterogeneous context, as well as possible artifacts, there are often false positives and false negatives, degrading the detection F1-score. To address this problem, we propose a two-stage framework. First, an improved YOLO11x, integrated with EMA attention and LSConv, is employed to generate mitosis candidates. We use a low confidence threshold to generate as many proposals as possible, ensuring detection recall. Then, a ConvNeXt-Tiny classifier is employed to filter out the false positives, ensuring detection precision. Consequently, the proposed two-stage framework achieves a high detection F1-score. Evaluated on a fused dataset comprising MIDOG++, MITOS_WSI_CCMCT, and MITOS_WSI_CMC, our framework achieves an F1-score of 0.882, which is 0.035 higher than the single-stage YOLO11x baseline. This performance gain comes from a significant precision improvement, from 0.762 to 0.839, with comparable recall. On the MIDOG 2025 Track 1 preliminary test set, the algorithm scores an F1-score of 0.7587. The code is available at https://github.com/xxiao0304/MIDOG-2025-Track-1-of-SZTU.
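The two-stage recipe (a high-recall detector followed by a precision-restoring classifier) can be sketched as follows; the weight paths, thresholds, and patch size are assumptions for illustration, not the authors' released code:

```python
import torch
from torchvision.models import convnext_tiny
from ultralytics import YOLO

detector = YOLO("yolo11x_mitosis.pt")        # assumed fine-tuned weights
classifier = convnext_tiny(num_classes=2)    # assumed mitosis/non-mitosis head
classifier.load_state_dict(torch.load("convnext_mitosis.pt"))
classifier.eval()

def detect_mitoses(image, det_conf=0.05, cls_thresh=0.5, patch=64):
    """Stage 1: low-confidence YOLO proposals for high recall.
    Stage 2: ConvNeXt-Tiny filters false positives for precision.
    image: HxWx3 uint8 numpy array (a WSI tile)."""
    results = detector.predict(image, conf=det_conf)[0]
    kept = []
    for x1, y1, x2, y2 in results.boxes.xyxy.tolist():
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        crop = image[max(cy - patch // 2, 0):cy + patch // 2,
                     max(cx - patch // 2, 0):cx + patch // 2]
        t = torch.from_numpy(crop).permute(2, 0, 1).float().unsqueeze(0) / 255
        with torch.no_grad():
            p = torch.softmax(classifier(t), dim=1)[0, 1].item()
        if p >= cls_thresh:                  # keep only classifier-confirmed boxes
            kept.append((x1, y1, x2, y2, p))
    return kept
```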
[900] Recent Advancements in Microscopy Image Enhancement using Deep Learning: A Survey
Debasish Dutta, Neeharika Sonowal, Risheraj Barauh, Deepjyoti Chetia, Sanjib Kr Kalita
Main category: eess.IV
TL;DR: This survey paper provides a comprehensive overview of deep learning methods for microscopy image enhancement, covering super-resolution, reconstruction, and denoising applications.
Details
Motivation: Microscopy image enhancement is crucial for understanding biological cells and materials at microscopic scales, and there has been significant recent advancement in deep learning methods for this purpose.Method: The paper conducts a systematic survey of state-of-the-art deep learning methods for microscopy image enhancement, analyzing their evolution, applications, and current trends across three key domains.
Result: The survey provides a snapshot of rapidly growing deep learning approaches in microscopy image enhancement, documenting current trends and practical utilities across super-resolution, reconstruction, and denoising domains.
Conclusion: The paper serves as a comprehensive reference for researchers working on microscopy image enhancement using deep learning, highlighting challenges and future directions in this evolving field.
Abstract: Microscopy image enhancement plays a pivotal role in understanding the details of biological cells and materials at microscopic scales. In recent years, there has been significant progress in microscopy image enhancement, driven largely by deep learning methods. This survey aims to provide a snapshot of this rapidly growing field, focusing on its evolution, applications, challenges, and future directions. The core discussion centers on the key domains of microscopy image enhancement: super-resolution, reconstruction, and denoising, with each domain explored in terms of its current trends and the practical utility of deep learning.
[901] Investigation of ArUco Marker Placement for Planar Indoor Localization
Sven Hinderer, Martina Scheffler, Bin Yang
Main category: eess.IV
TL;DR: Analysis of ArUco fiducial marker system for indoor robot localization, focusing on marker placement effects and proposing a Kalman filter with adaptive noise for real-time tracking.
Details
Motivation: To enable scalable and cost-effective indoor localization for autonomous mobile robots using simple camera systems and printable fiducial markers.Method: Investigated ArUco marker system behavior regarding marker quantity, orientation, and camera distance; proposed Kalman filter with adaptive measurement noise variances.
Result: Characterized localization performance based on marker placement parameters; developed adaptive filtering approach for improved real-time tracking.
Conclusion: Fiducial marker systems provide scalable indoor localization solution; proper marker placement and adaptive filtering enhance localization accuracy and robustness.
Abstract: Indoor localization of autonomous mobile robots (AMRs) can be realized with fiducial markers. Such systems require only a simple monocular camera as the sensor, and fiducial markers as passive, identifiable position references that can be printed on a piece of paper and distributed in the area of interest. Thus, fiducial marker systems can be scaled to large areas with a minor increase in system complexity and cost. We investigate the localization behavior of the fiducial marker framework ArUco with respect to marker placement, including the number of markers, their orientation relative to the camera, and the camera-marker distance. In addition, we propose a simple Kalman filter with adaptive measurement noise variances for real-time AMR tracking.
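OpenCV ships the ArUco framework studied here. A minimal detection-plus-tracking sketch follows; the distance-scaled measurement noise is one plausible reading of "adaptive measurement noise variances", and the intrinsics, marker size, and noise model are assumptions:

```python
import cv2
import numpy as np

# Marker detection with OpenCV's ArUco module (OpenCV >= 4.7 API)
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

def measure_position(gray, K, dist_coeffs, marker_len=0.10):
    """Return a planar (x, y) measurement and camera-marker range from one marker."""
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is None:
        return None, None
    obj = np.array([[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]],
                   dtype=np.float32) * marker_len / 2
    ok, rvec, tvec = cv2.solvePnP(obj, corners[0].reshape(-1, 2), K, dist_coeffs)
    return tvec[:2].ravel(), float(np.linalg.norm(tvec))

# Kalman filter whose measurement noise grows with camera-marker distance
x = np.zeros(4)                          # state: [px, py, vx, vy]
P = np.eye(4)
F = np.eye(4); F[0, 2] = F[1, 3] = 1.0   # constant-velocity model, dt = 1
H = np.eye(2, 4)                         # we observe position only
Q = 0.01 * np.eye(4)

def kf_step(z, rng):
    """One predict/update cycle; z is the (x, y) measurement or None."""
    global x, P
    x = F @ x
    P = F @ P @ F.T + Q                                  # predict
    if z is not None:
        R = (0.002 * rng) ** 2 * np.eye(2)               # adaptive noise (assumed model)
        S = H @ P @ H.T + R
        Kk = P @ H.T @ np.linalg.inv(S)
        x = x + Kk @ (z - H @ x)
        P = (np.eye(4) - Kk @ H) @ P                     # update
    return x[:2]
```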