Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 353]
cs.CV [Total: 514]
cs.AI [Total: 195]
cs.SD [Total: 42]
cs.LG [Total: 522]
cs.MA [Total: 11]
cs.MM [Total: 4]
eess.AS [Total: 25]
eess.IV [Total: 23]

cs.CL

[1] Are you sure? Measuring models bias in content moderation through uncertainty

Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Antonio Stranisci

Main category: cs.CL

TL;DR: This paper presents an unsupervised approach using conformal prediction to measure bias in language models for content moderation by analyzing uncertainty in classifying messages from vulnerable groups.

Details

Motivation: Language model-based classifiers for content moderation perpetuate racial and social biases, and current methods for measuring fairness remain inadequate despite existing resources and benchmarks.

Method: Uses conformal prediction technique to compute uncertainty as a proxy for bias analysis, benchmarking 11 models against women and non-white annotators, comparing uncertainty metrics with performance-based metrics like F1 score.

Result: Some pre-trained models show high accuracy in predicting labels from minority groups but with low confidence in predictions, revealing that performance metrics alone don’t capture bias effectively.

Conclusion: Measuring model confidence through uncertainty analysis helps identify which annotator groups are better represented in pre-trained models, enabling debiasing before deployment.

Abstract: Automatic content moderation is crucial to ensuring safety in social media. Language Model-based classifiers are being increasingly adopted for this task, but it has been shown that they perpetuate racial and social biases. Even if several resources and benchmark corpora have been developed to challenge this issue, measuring the fairness of models in content moderation remains an open issue. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models against women and non-white annotators and observe to what extent it diverges from metrics based on performance, such as the $F_1$ score. The results show that some pre-trained models predict with high accuracy the labels coming from minority groups, even if the confidence in their prediction is low. Therefore, by measuring the confidence of models, we are able to see which groups of annotators are better represented in pre-trained models and lead the debiasing process of these models before their effective use.

[2] AccessEval: Benchmarking Disability Bias in Large Language Models

Srikant Panda, Amit Agarwal, Hitesh Laxmichand Patel

Main category: cs.CL

TL;DR: AccessEval benchmark evaluates 21 LLMs across 6 domains and 9 disability types using paired Neutral and Disability-Aware Queries, revealing systematic biases against disabled users.

Details

Motivation: To systematically investigate disparities in how LLMs handle real-life queries across various disability contexts, as these models are increasingly deployed across diverse domains.

Method: Introduced AccessEval benchmark evaluating 21 closed- and open-source LLMs across 6 real-world domains and 9 disability types using paired Neutral and Disability-Aware Queries, with metrics for sentiment, social perception, and factual accuracy.

Result: Responses to disability-aware queries have more negative tone, increased stereotyping, and higher factual error compared to neutral queries, with notable variation by domain and disability type. Disabilities affecting hearing, speech, and mobility are disproportionately impacted.

Conclusion: These disparities reflect persistent ableism embedded in model behavior and demonstrate how such biases can translate into tangible harms for disabled users, reinforcing the importance of bias mitigation in day-to-day applications.

Abstract: Large Language Models (LLMs) are increasingly deployed across diverse domains but often exhibit disparities in how they handle real-life queries. To systematically investigate these effects within various disability contexts, we introduce \textbf{AccessEval (Accessibility Evaluation)}, a benchmark evaluating 21 closed- and open-source LLMs across 6 real-world domains and 9 disability types using paired Neutral and Disability-Aware Queries. We evaluated model outputs with metrics for sentiment, social perception, and factual accuracy. Our analysis reveals that responses to disability-aware queries tend to have a more negative tone, increased stereotyping, and higher factual error compared to neutral queries. These effects show notable variation by domain and disability type, with disabilities affecting hearing, speech, and mobility disproportionately impacted. These disparities reflect persistent forms of ableism embedded in model behavior. By examining model performance in real-world decision-making contexts, we better illuminate how such biases can translate into tangible harms for disabled users. This framing helps bridges the gap between technical evaluation and user impact, reinforcing importance of bias mitigation in day-to-day applications. Our dataset is publicly available at: https://huggingface.co/datasets/Srikant86/AccessEval

[3] RAR$^2$: Retrieval-Augmented Medical Reasoning via Thought-Driven Retrieval

Kaishuai Xu, Wenjun Hou, Yi Cheng, Wenjie Li

Main category: cs.CL

TL;DR: RAR² is a joint learning framework that improves both Reasoning-Augmented Retrieval and Retrieval-Augmented Reasoning for medical question answering, addressing limitations of standard RAG in handling complex medical reasoning tasks.

Details

Motivation: Standard RAG struggles with complex medical questions requiring intensive reasoning, as surface-level inputs fail to reflect true knowledge needs. Existing methods focus on query refinement without explicitly modeling reasoning processes.

Method: RAR² constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. It uses Direct Preference Optimization (DPO) on mixed preference pairs and includes test-time scaling strategies.

Result: Experiments show RAR² outperforms RAG baselines with or without fine-tuning across several biomedical question answering datasets.

Conclusion: The joint learning framework effectively addresses reasoning-intensive medical tasks by explicitly modeling the reasoning process to guide both retrieval and generation.

Abstract: Large Language Models (LLMs) have shown promising performance on diverse medical benchmarks, highlighting their potential in supporting real-world clinical tasks. Retrieval-Augmented Generation (RAG) has emerged as a key approach for mitigating knowledge gaps and hallucinations by incorporating external medical information. However, RAG still struggles with complex medical questions that require intensive reasoning, as surface-level input often fails to reflect the true knowledge needs of the task. Existing methods typically focus on refining queries without explicitly modeling the reasoning process, limiting their ability to retrieve and integrate clinically relevant knowledge. In this work, we propose RAR$^2$, a joint learning framework that improves both Reasoning-Augmented Retrieval and Retrieval-Augmented Reasoning. RAR$^2$ constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. We build a training dataset of mixed preference pairs and apply Direct Preference Optimization (DPO) to train the model. Moreover, we design two test-time scaling strategies to explore the boundaries of our framework. Experiments demonstrate the effectiveness of RAR$^2$ across several biomedical question answering datasets, outperforming RAG baselines with or without fine-tuning.

[4] TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?

Jiho Park, Jongyoon Song, Minjin Choi, Kyuho Heo, Taehun Huh, Ji Won Kim

Main category: cs.CL

TL;DR: TRUEBench is a new benchmark that addresses limitations in existing LLM evaluation by incorporating multilingual prompts, implicit constraints, and multi-turn dialogues to provide more realistic assessment of LLM productivity assistants.

Details

Motivation: Existing benchmarks fail to properly evaluate LLMs' real-world instruction-following capabilities due to lack of multilinguality, inability to capture implicit constraints, and overlooking multi-turn dialogue complexities.

Method: Created TRUEBench with 12 languages, intra-instance multilingual instructions, rigorous evaluation criteria for explicit/implicit constraints, complex multi-turn dialogues with accumulating constraints and context switches, and LLM-based constraint validation.

Result: TRUEBench is significantly more challenging than existing benchmarks - OpenAI o1 achieved only 69.07% overall pass rate, demonstrating the benchmark’s demanding nature.

Conclusion: TRUEBench provides a realistic and comprehensive assessment of LLMs in practical productivity settings, effectively highlighting both their capabilities and limitations.

Abstract: Large language models (LLMs) are increasingly integral as productivity assistants, but existing benchmarks fall short in rigorously evaluating their real-world instruction-following capabilities. Current benchmarks often (i) lack sufficient multilinguality, (ii) fail to capture the implicit constraints inherent in user requests, and (iii) overlook the complexities of multi-turn dialogue. To address these critical gaps and provide a more realistic assessment, we introduce TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark)1, a novel benchmark specifically designed for LLM-based productivity assistants. TRUEBench distinguishes itself by featuring input prompts across 12 languages, incorporating intra-instance multilingual instructions, employing rigorous evaluation criteria to capture both explicit and implicit constraints, and including complex multi-turn dialogue scenarios with both accumulating constraints and context switches. Furthermore, to ensure reliability in evaluation, we refined constraints using an LLM validator. Extensive experiments demonstrate that TRUEBench presents significantly greater challenges than existing benchmarks; for instance, a strong model like OpenAI o1 achieved only a 69.07% overall pass rate. TRUEBench offers a demanding and realistic assessment of LLMs in practical productivity settings, highlighting their capabilities and limitations.

Sadia Abdulhalim, Muaz Albaghdadi, Moshiur Farazi

Main category: cs.CL

TL;DR: DAF is a lightweight multimodal framework that combines text and acoustic features using dynamic attention, outperforming unimodal and static fusion methods without finetuning encoders.

Details

Motivation: Traditional sentiment analysis relying only on text overlooks important non-verbal cues like vocal tone and prosody that are essential for capturing true emotional intent.

Method: Dynamic Attention Fusion (DAF) combines frozen text embeddings from a pretrained language model with acoustic features from a speech encoder, using an adaptive attention mechanism to weight each modality per utterance.

Result: DAF consistently outperforms both static fusion and unimodal baselines on a large multimodal benchmark, with notable gains in F1-score and reductions in prediction error.

Conclusion: Dynamic weighting strategy is crucial for modeling emotionally complex inputs, and effectively integrating verbal and non-verbal information offers a more robust foundation for sentiment prediction and affective computing applications.

Abstract: Traditional sentiment analysis has long been a unimodal task, relying solely on text. This approach overlooks non-verbal cues such as vocal tone and prosody that are essential for capturing true emotional intent. We introduce Dynamic Attention Fusion (DAF), a lightweight framework that combines frozen text embeddings from a pretrained language model with acoustic features from a speech encoder, using an adaptive attention mechanism to weight each modality per utterance. Without any finetuning of the underlying encoders, our proposed DAF model consistently outperforms both static fusion and unimodal baselines on a large multimodal benchmark. We report notable gains in F1-score and reductions in prediction error and perform a variety of ablation studies that support our hypothesis that the dynamic weighting strategy is crucial for modeling emotionally complex inputs. By effectively integrating verbal and non-verbal information, our approach offers a more robust foundation for sentiment prediction and carries broader impact for affective computing applications – from emotion recognition and mental health assessment to more natural human computer interaction.

[6] Enabling Approximate Joint Sampling in Diffusion LMs

Parikshit Bansal, Sujay Sanghavi

Main category: cs.CL

TL;DR: A method for approximate joint sampling in masked diffusion language models that allows unmasking multiple tokens in parallel while maintaining distributional accuracy, using a lightweight sampler layer on top of existing diffusion LMs.

Details

Motivation: Autoregressive LMs sample tokens sequentially from the correct joint distribution, while masked diffusion LMs unmask tokens out-of-order in parallel, causing deviation from the true joint distribution when multiple tokens are unmasked simultaneously. This creates a trade-off between speed (more parallel tokens) and accuracy (closer to true joint distribution).

Method: Develop a lightweight single-layer ‘sampler’ on top of existing large diffusion LMs. One full-model forward pass is followed by multiple forward passes of only this sampler layer to yield multiple unmasked tokens. The sampler is trained to mimic exact joint sampling from the frozen full model.

Result: When four tokens are unmasked per full-model denoising step, the method achieves MAUVE score of 0.87 (vs marginal baseline of 0.31) with respect to the true joint distribution. Shows effectiveness on both pretrained-only (Dream-7B-Base) and instruction-tuned (Dream-7B-Instruct) models for language modeling and math & coding tasks.

Conclusion: The proposed approximate joint sampling method successfully bridges the speed-accuracy trade-off in masked diffusion language models, enabling parallel token generation while maintaining distributional fidelity to the true joint distribution.

Abstract: In autoregressive language models, each token is sampled by conditioning on all the past tokens; the overall string has thus been sampled from the correct underlying joint distribution represented by the model. In contrast, masked diffusion language models generate text by unmasking tokens out of order and potentially in parallel. Generating an overall string sampled from the correct underlying joint distribution would (again) require exactly one token unmasking in every full-model forward pass. The more tokens unmasked in parallel, the further away the string is from the true joint; this can be seen in the resulting drop in accuracy (but, increase in speed). In this paper we devise a way to {\em approximately} sample multiple tokens from the joint distribution in a single full-model forward pass; we do so by developing a new lightweight single-layer ``sampler" on top of an existing large diffusion LM. One forward pass of the full model can now be followed by multiple forward passes of only this sampler layer, to yield multiple unmasked tokens. Our sampler is trained to mimic exact joint sampling from the (frozen) full model. We show the effectiveness of our approximate joint sampling for both pretrained-only (Dream-7B-Base) and instruction-tuned (Dream-7B-Instruct) models on language modeling and math & coding tasks. When four tokens are unmasked for each full-model denoising step, our sampling algorithm achieves a MAUVE score of 0.87 (vs marginal baseline of 0.31) with respect to the true joint distribution.

[7] Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

Sasha Cui, Zhongren Chen

Main category: cs.CL

TL;DR: PAS (Painless Activation Steering) is a fully automated activation steering method that improves LM behavior tasks without manual prompt construction or feature labeling, achieving significant gains on bias, morality, and alignment tasks.

Details

Motivation: Current activation steering methods require hand-crafted prompts or labor-intensive feature annotation, making them less convenient than plug-and-play methods like RL and SFT. PAS aims to provide a fully automated alternative.

Method: PAS is a family of automated activation steering methods that work with any labeled dataset without manual intervention. It constructs fast, lightweight activation vectors that can be cheaply trained, easily stored, and activated on demand.

Result: PAS reliably improves performance on behavior tasks (10.1% on Bias, 5.2% on Morality, 34.8% on Alignment) but not on intelligence-oriented tasks. It also delivers additional gains on top of ICL and SFT.

Conclusion: PAS provides a practical, automated LM post-training option that characterizes where activation steering helps and where it fails, offering a fast, lightweight alternative to traditional methods.

Abstract: Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them more inconvenient than the plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.

[8] MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions

Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella, Akriti Jain, Nedim Lipka, Hwanhee Lee

Main category: cs.CL

TL;DR: MIRAGE benchmark introduces ambiguous multi-hop QA where each reasoning step contains ambiguity, challenging LLMs’ ability to resolve layered ambiguity during multi-step inference.

Details

Motivation: Real-world multi-hop QA involves inherent ambiguity that requires resolving multiple ambiguous reasoning paths simultaneously, which current LLMs struggle with.

Method: Created MIRAGE benchmark with 1,142 ambiguous multi-hop questions categorized by ambiguity types, and proposed CLARION multi-agent framework for handling layered ambiguity.

Result: State-of-the-art models perform poorly on MIRAGE, confirming that ambiguity resolution combined with multi-step reasoning is a significant challenge. CLARION framework outperforms existing approaches.

Conclusion: Multi-hop reasoning with ambiguity requires specialized approaches like CLARION, paving the way for more adaptive and robust reasoning systems that can handle layered ambiguity.

Abstract: Real-world Multi-hop Question Answering (QA) often involves ambiguity that is inseparable from the reasoning process itself. This ambiguity creates a distinct challenge, where multiple reasoning paths emerge from a single question, each requiring independent resolution. Since each sub-question is ambiguous, the model must resolve ambiguity at every step. Thus, answering a single question requires handling multiple layers of ambiguity throughout the reasoning chain. We find that current Large Language Models (LLMs) struggle in this setting, typically exploring wrong reasoning paths and producing incomplete answers. To facilitate research on multi-hop ambiguity, we introduce MultI-hop Reasoning with AmbiGuity Evaluation for Illusory Questions (MIRAGE), a benchmark designed to analyze and evaluate this challenging intersection of ambiguity interpretation and multi-hop reasoning. MIRAGE contains 1,142 high-quality examples of ambiguous multi-hop questions, categorized under a taxonomy of syntactic, general, and semantic ambiguity, and curated through a rigorous multi-LLM verification pipeline. Our experiments reveal that even state-of-the-art models struggle on MIRAGE, confirming that resolving ambiguity combined with multi-step inference is a distinct and significant challenge. To establish a robust baseline, we propose CLarifying Ambiguity with a Reasoning and InstructiON (CLARION), a multi-agent framework that significantly outperforms existing approaches on MIRAGE, paving the way for more adaptive and robust reasoning systems.

[9] ML2B: Multi-Lingual ML Benchmark For AutoML

Ekaterina Trofimova, Zosia Shamina, Maria Selifanova, Artem Zaitsev, Remi Savchuk, Maxim Minets, Daria Ozerova, Emil Sataev, Denis Zuenko, Andrey E. Ustyuzhanin

Main category: cs.CL

TL;DR: ML2B is the first benchmark for evaluating multilingual ML code generation, covering 30 Kaggle competitions translated into 13 languages, revealing 15-45% performance degradation on non-English tasks.

Details

Motivation: Existing ML code generation benchmarks are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice.

Method: Created ML2B benchmark with 30 Kaggle competitions translated into 13 languages, covering tabular, text, and image data types. Used AIDE framework for automated end-to-end assessment of data science pipelines.

Result: Substantial 15-45% performance degradation on non-English tasks compared to English, highlighting challenges in multilingual representation learning for code generation.

Conclusion: The benchmark and evaluation framework are made publicly available to facilitate future research in multilingual ML code generation, addressing critical gaps in current evaluation methodologies.

Abstract: Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.

[10] ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

Mohamed Maged, Alhassan Ehab, Ali Mekky, Besher Hassan, Shady Shehata

Main category: cs.CL

TL;DR: This paper introduces the first multi-dialect Arabic spoofed speech dataset and evaluates various TTS models to determine which produces the most challenging synthetic speech for spoof detection.

Details

Motivation: With the rise of generative text-to-speech models, distinguishing real from synthetic speech has become challenging, especially for Arabic which has received limited research attention compared to English.

Method: Created a multi-dialect Arabic spoofed speech dataset and evaluated TTS models using an evaluation pipeline that included: embedding-based classifiers with classifier heads, classical ML algorithms on MFCC features, RawNet2 architecture, Mean Opinion Score from human ratings, and Word Error Rate from ASR processing.

Result: FishSpeech outperformed other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples.

Conclusion: While FishSpeech produces the most challenging synthetic samples, relying on a single TTS model for dataset creation may limit generalizability.

Abstract: With the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic that have received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To evaluate the difficulty of the synthesized audio from each model and determine which produces the most challenging samples, we aimed to guide the construction of our final dataset either by merging audios from multiple models or by selecting the best-performing model, we conducted an evaluation pipeline that included training classifiers using two approaches: modern embedding-based methods combined with classifier heads; classical machine learning algorithms applied to MFCC features; and the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.

[11] EditGRPO: Reinforcement Learning with Post -Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

Kai Zhang, Christopher Malon, Lichao Sun, Martin Renqiang Min

Main category: cs.CL

TL;DR: EditGRPO is a mixed-policy RL algorithm that optimizes radiology report generation using clinically motivated rewards, outperforming SFT and vanilla GRPO baselines with improved generalization.

Details

Motivation: Current MLLMs for radiology report generation use SFT objectives not explicitly aligned with clinical efficacy, requiring better optimization methods.

Method: EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level corrections during training rollouts, using a mixed-policy RL approach.

Result: Applied to Qwen2.5-VL-3B MLLM, EditGRPO achieved 3.4% average improvement in CheXbert, GREEN, Radgraph, and RATEScore metrics across four chest X-ray datasets, with 5.9% gain on unseen datasets.

Conclusion: EditGRPO effectively addresses RL exploration challenges and improves clinical report generation performance and generalization.

Abstract: Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models (MLLMs), have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning (RL) algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B MLLM initialized with supervised fine-tuning (SFT), EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in CheXbert, GREEN, Radgraph, and RATEScore metrics across four major chest X-ray report generation datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.

[12] Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen

Main category: cs.CL

TL;DR: Critique Reinforcement Learning (CRL) enhances standard RL by training models to generate critiques, improving both code generation and general reasoning abilities.

Details

Motivation: Standard RL focuses on generating responses but lacks mechanisms for explicit critique or reflection. Recent studies show benefits of teaching LLMs to critique, motivating the development of CRL.

Method: Propose CRL where models generate critiques for (question, solution) pairs, with rewards based on alignment between generated and ground-truth judgment labels. Implement Critique-Coder by replacing 20% of standard RL data with CRL data.

Result: Critique-Coder consistently outperforms RL-only baselines across benchmarks, achieving over 60% on LiveCodeBench (v5) and better performance on logic reasoning tasks from BBEH dataset.

Conclusion: CRL serves as an effective complement to standard RL for LLM reasoning, enhancing both coding performance and transferable general reasoning abilities.

Abstract: Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD) have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in {\texttt{True}, \texttt{False}}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce \textsc{Critique-Coder}, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (\textsc{Critique-Coder}) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that \textsc{Critique-Coder} consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our \textsc{Critique-Coder-8B} can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, \textsc{Critique-Coder} also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.

[13] ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang, Yonghyun Jun, Hwanhee Lee

Main category: cs.CL

TL;DR: ChatInject is a new attack method that exploits LLM agents’ vulnerability to structured chat templates and multi-turn dialogues, achieving significantly higher success rates than traditional prompt injection attacks.

Details

Motivation: To address the underexplored vulnerability of LLMs' dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive dialogues, which creates new attack surfaces for adversarial manipulation.

Method: Developed ChatInject attack that formats malicious payloads to mimic native chat templates, and a persuasion-driven Multi-turn variant that primes agents across conversational turns to accept suspicious actions.

Result: ChatInject achieved significantly higher attack success rates (32.05% on AgentDojo, 45.90% on InjecAgent) compared to traditional methods (5.18%, 15.13%), with multi-turn variants reaching 52.33% success rate. The attack shows strong transferability across models and bypasses existing defenses.

Conclusion: Current agent systems have critical vulnerabilities to chat-template-based attacks, and existing prompt-based defenses are largely ineffective, especially against multi-turn variants.

Abstract: The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs’ dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model’s inherent instruction-following tendencies. Building on this foundation, we develop a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.

[14] Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems

Kai Hua, Zhiyuan Feng, Chongyang Tao, Rui Yan, Lu Zhang

Main category: cs.CL

TL;DR: The paper proposes RSM-DCK, a multi-turn response selection model that detects relevant parts of context and knowledge to improve response matching in dialogue systems.

Details

Motivation: Existing retrieval-based dialogue systems use all context and knowledge content for response matching, but much of this information is irrelevant due to topic shifts, which negatively impacts performance.

Method: RSM-DCK uses recent context as a query to pre-select relevant context and knowledge at word-level and utterance-level semantics, then interacts the response candidate with selected content, and finally uses fused context-response representation to post-select knowledge for confident matching.

Result: The model achieves better performance than existing methods on two benchmark datasets and effectively detects relevant context and knowledge for response selection.

Conclusion: The proposed RSM-DCK model successfully addresses the problem of irrelevant information in context and knowledge by detecting relevant parts, leading to improved response selection performance in knowledge-grounded conversations.

Abstract: Recently, knowledge-grounded conversations in the open domain gain great attention from researchers. Existing works on retrieval-based dialogue systems have paid tremendous efforts to utilize neural networks to build a matching model, where all of the context and knowledge contents are used to match the response candidate with various representation methods. Actually, different parts of the context and knowledge are differentially important for recognizing the proper response candidate, as many utterances are useless due to the topic shift. Those excessive useless information in the context and knowledge can influence the matching process and leads to inferior performance. To address this problem, we propose a multi-turn \textbf{R}esponse \textbf{S}election \textbf{M}odel that can \textbf{D}etect the relevant parts of the \textbf{C}ontext and \textbf{K}nowledge collection (\textbf{RSM-DCK}). Our model first uses the recent context as a query to pre-select relevant parts of the context and knowledge collection at the word-level and utterance-level semantics. Further, the response candidate interacts with the selected context and knowledge collection respectively. In the end, The fused representation of the context and response candidate is utilized to post-select the relevant parts of the knowledge collection more confidently for matching. We test our proposed model on two benchmark datasets. Evaluation results indicate that our model achieves better performance than the existing methods, and can effectively detect the relevant context and knowledge for response selection.

[15] Towards Generalizable Implicit In-Context Learning with Attention Routing

Jiaqian Li, Yanshu Li, Ligong Han, Ruixiang Tang, Wenya Wang

Main category: cs.CL

TL;DR: ICR is a novel implicit in-context learning method that internalizes generalizable ICL patterns at the attention logits level, enabling train-once-and-reuse framework without task-specific retrieval or training.

Details

Motivation: Existing implicit ICL methods rely on injecting shift vectors from labeled demonstrations or task-specific alignment, which don't utilize structural ICL mechanisms and have limited generalizability.

Method: Extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly.

Result: Outperforms prior implicit ICL methods on 12 real-world datasets across diverse domains and multiple LLMs, with robust generalization to out-of-domain tasks.

Conclusion: ICR pushes the boundary of ICL’s practical value by enabling generalizable implicit ICL without task-specific requirements.

Abstract: Implicit in-context learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of Large Language Models (LLMs), aiming to attain few-shot performance at zero-shot cost. However, existing approaches largely rely on injecting shift vectors into residual flows, which are typically constructed from labeled demonstrations or task-specific alignment. Such designs fall short of utilizing the structural mechanisms underlying ICL and suffer from limited generalizability. To address this, we propose In-Context Routing (ICR), a novel implicit ICL method that internalizes generalizable ICL patterns at the attention logits level. It extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly, enabling a train-once-and-reuse framework. We evaluate ICR on 12 real-world datasets spanning diverse domains and multiple LLMs. The results show that ICR consistently outperforms prior implicit ICL methods that require task-specific retrieval or training, while demonstrating robust generalization to out-of-domain tasks where existing methods struggle. These findings position ICR to push the boundary of ICL’s practical value.

[16] The Bias is in the Details: An Assessment of Cognitive Bias in LLMs

R. Alexander Knipper, Charles S. Knipper, Kaiqi Zhang, Valerie Sims, Clint Bowers, Santu Karmaker

Main category: cs.CL

TL;DR: Large-scale evaluation of 8 cognitive biases across 45 LLMs shows they exhibit bias-consistent behavior in 17.8-57.3% of cases, with model size and prompt specificity significantly affecting bias susceptibility.

Details

Motivation: As LLMs are increasingly used in real-world decision-making, it's crucial to examine their cognitive biases, which are systematic distortions extensively studied in human psychology.

Method: Introduced novel evaluation framework using multiple-choice tasks, curated 220 decision scenarios with psychologists, and generated diverse prompts from human-authored templates to analyze over 2.8 million LLM responses.

Result: LLMs showed bias-consistent behavior across anchoring, availability, confirmation, framing, interpretation, overattribution, prospect theory, and representativeness biases. Larger models (>32B parameters) reduced bias in 39.5% of cases, while detailed prompts reduced most biases by up to 14.9% (except overattribution, which increased by 8.8%).

Conclusion: LLMs systematically exhibit cognitive biases, with model size and prompt specificity being key factors in bias susceptibility, highlighting the need for careful consideration when deploying LLMs in decision-making contexts.

Abstract: As Large Language Models (LLMs) are increasingly embedded in real-world decision-making processes, it becomes crucial to examine the extent to which they exhibit cognitive biases. Extensively studied in the field of psychology, cognitive biases appear as systematic distortions commonly observed in human judgments. This paper presents a large-scale evaluation of eight well-established cognitive biases across 45 LLMs, analyzing over 2.8 million LLM responses generated through controlled prompt variations. To achieve this, we introduce a novel evaluation framework based on multiple-choice tasks, hand-curate a dataset of 220 decision scenarios targeting fundamental cognitive biases in collaboration with psychologists, and propose a scalable approach for generating diverse prompts from human-authored scenario templates. Our analysis shows that LLMs exhibit bias-consistent behavior in 17.8-57.3% of instances across a range of judgment and decision-making contexts targeting anchoring, availability, confirmation, framing, interpretation, overattribution, prospect theory, and representativeness biases. We find that both model size and prompt specificity play a significant role on bias susceptibility as follows: larger size (>32B parameters) can reduce bias in 39.5% of cases, while higher prompt detail reduces most biases by up to 14.9%, except in one case (Overattribution), which is exacerbated by up to 8.8%.

[17] Lexicon-Enriched Graph Modeling for Arabic Document Readability Prediction

Passant Elchafei, Mayar Osama, Mohamed Rageh, Mervat Abuelkheir

Main category: cs.CL

TL;DR: A graph-based approach with lexicon enrichment for Arabic document readability prediction, combining GNN and transformer models via late fusion.

Details

Motivation: To develop an effective method for predicting document-level readability in Arabic by leveraging linguistic relationships and lexical features.

Method: Model documents as sentence-level graphs with nodes for sentences/lemmas and edges for linguistic relationships. Use SAMER lexicon features and Arabic transformer embeddings. Train GNN and transformer branches independently, combine via late fusion, and aggregate sentence-level predictions with max pooling.

Result: The hybrid fusion method outperforms standalone GNN or transformer branches across multiple readability metrics at document level, while GNN-only approach works better for sentence-level prediction.

Conclusion: Fusion offers advantages for document-level readability prediction, but GNN-only approach remains stronger for precise sentence-level readability assessment.

Abstract: We present a graph-based approach enriched with lexicons to predict document-level readability in Arabic, developed as part of the Constrained Track of the BAREC Shared Task 2025. Our system models each document as a sentence-level graph, where nodes represent sentences and lemmas, and edges capture linguistic relationships such as lexical co-occurrence and class membership. Sentence nodes are enriched with features from the SAMER lexicon as well as contextual embeddings from the Arabic transformer model. The graph neural network (GNN) and transformer sentence encoder are trained as two independent branches, and their predictions are combined via late fusion at inference. For document-level prediction, sentence-level outputs are aggregated using max pooling to reflect the most difficult sentence. Experimental results show that this hybrid method outperforms standalone GNN or transformer branches across multiple readability metrics. Overall, the findings highlight that fusion offers advantages at the document level, but the GNN-only approach remains stronger for precise prediction of sentence-level readability.

[18] HEART: Emotionally-driven test-time scaling of Language Models

Gabriela Pinto, Palash Goyal, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Tomas Pfister, Hamid Palangi

Main category: cs.CL

TL;DR: HEART is a test-time scaling framework that uses emotionally-driven prompts for iterative self-correction, leveraging affective feedback based on six universal emotions to improve reasoning performance.

Details

Motivation: Current self-reflection strategies focus on logical refinement but don't leverage affective feedback, despite psychological research showing emotions can modulate cognitive performance.

Method: HEART provides feedback on incorrect responses using curated emotionally charged phrases based on Ekman’s six universal emotions, systematically varying emotional tone across iterations to guide models away from flawed reasoning paths.

Result: When guided by an oracle verifier, HEART unlocks significantly deeper reasoning with consistent substantial accuracy increases over state-of-the-art baselines. However, in verifier-free settings it struggles to consistently harness these gains.

Conclusion: The next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the ‘HEART’ of models through affective iteration protocols.

Abstract: Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies such as self-reflection primarily focus on logical or structural refinement. They do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART–a novel framework that uses emotionally-driven prompts for iterative self-correction. HEART provides feedback on a model’s incorrect response using a curated set of concise, emotionally charged phrases based on the six universal emotions categorized by Dr. Paul Ekman. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity’s Last Exam, and SimpleQA. Our results reveal a significant new phenomenon: when guided by an oracle verifier, this affective iteration protocol unlocks significantly deeper reasoning, leading to consistent and substantial increases in accuracy over state-of-the-art baselines with the same verifier. However, we also identify a critical bottleneck for practical deployment. In a verifier-free setting, it struggles to harness these gains consistently, highlighting as a key challenge for future work. Our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the `HEART’ of the models.

[19] Infusing Theory of Mind into Socially Intelligent LLM Agents

EunJeong Hwang, Yuwei Yin, Giuseppe Carenini, Peter West, Vered Shwartz

Main category: cs.CL

TL;DR: LLMs that explicitly use Theory of Mind (ToM) achieve better dialogue performance and goal achievement. The proposed ToMAgent method combines ToM with dialogue lookahead to generate maximally useful mental states.

Details

Motivation: Current chatbots and LLM-based social agents lack Theory of Mind understanding, which is crucial for human social intelligence. Integrating ToM can improve dialogue effectiveness and goal achievement.

Method: Introduces ToMAgent (ToMA), which pairs Theory of Mind with dialogue lookahead to produce mental states that are maximally useful for achieving dialogue goals. The method involves prompting models to generate mental states between dialogue turns.

Result: Experiments on Sotopia benchmark show ToMA outperforms baselines, exhibiting more strategic, goal-oriented reasoning behaviors, enabling long-horizon adaptation while maintaining better partner relationships.

Conclusion: Explicit integration of Theory of Mind represents a step forward in building socially intelligent LLM agents, improving dialogue performance and goal achievement.

Abstract: Theory of Mind (ToM)-an understanding of the mental states of others-is a key aspect of human social intelligence, yet, chatbots and LLM-based social agents do not typically integrate it. In this work, we demonstrate that LLMs that explicitly use ToM get better at dialogue, achieving goals more effectively. After showing that simply prompting models to generate mental states between dialogue turns already provides significant benefit, we further introduce ToMAgent (ToMA), a ToM-focused dialogue agent. ToMA is trained by pairing ToM with dialogue lookahead to produce mental states that are maximally useful for achieving dialogue goals. Experiments on the Sotopia interactive social evaluation benchmark demonstrate the effectiveness of our method over a range of baselines. Comprehensive analysis shows that ToMA exhibits more strategic, goal-oriented reasoning behaviors, which enable long-horizon adaptation, while maintaining better relationships with their partners. Our results suggest a step forward in integrating ToM for building socially intelligent LLM agents.

[20] Extract-0: A Specialized Language Model for Document Information Extraction

Henrique Godoy

Main category: cs.CL

TL;DR: Extract-0 is a 7B parameter model optimized for document information extraction that outperforms larger models like GPT-4.1 using synthetic data generation, LoRA fine-tuning, and GRPO reinforcement learning.

Details

Motivation: To create a specialized model for document information extraction that achieves superior performance compared to general-purpose large models while using significantly fewer computational resources.

Method: Combines synthetic data generation (280,128 examples), supervised fine-tuning with LoRA (modifying only 0.53% of weights), and reinforcement learning with GRPO using a novel semantic similarity-based reward function.

Result: Achieves mean reward of 0.573 on 1,000 document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459).

Conclusion: Task-specific optimization can produce models that surpass general-purpose systems while requiring substantially fewer computational resources.

Abstract: This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameterefficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resource.

[21] HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition

Gio Paik, Yongbeom Kim, Soungmin Lee, Sangmin Ahn, Chanwoo Kim

Main category: cs.CL

TL;DR: HiKE is the first Korean-English code-switching benchmark providing hierarchical CS-level labels and loanword annotations to systematically evaluate multilingual ASR models’ code-switching capabilities.

Details

Motivation: Code-switching remains a severely underexplored challenge in multilingual ASR despite being common in daily speech, and there's no globally accessible evaluation framework specifically for Korean-English CS.

Method: Created HiKE benchmark with high-quality natural CS data across various topics, meticulous loanword labels, and hierarchical CS-level labeling scheme (word, phrase, sentence) for systematic evaluation.

Result: Most multilingual ASR models initially struggle with CS-ASR, but this capability can be enabled through fine-tuning with CS data.

Conclusion: HiKE provides the first comprehensive evaluation framework for Korean-English code-switching that enables systematic assessment and improvement of multilingual ASR models’ CS handling capabilities.

Abstract: Despite advances in multilingual automatic speech recognition (ASR), code-switching (CS), the mixing of languages within an utterance common in daily speech, remains a severely underexplored challenge. In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. The proposed framework not only consists of high-quality, natural CS data across various topics, but also provides meticulous loanword labels and a hierarchical CS-level labeling scheme (word, phrase, and sentence) that together enable a systematic evaluation of a model’s ability to handle each distinct level of code-switching. Through evaluations of diverse multilingual ASR models and fine-tuning experiments, this paper demonstrates that while most multilingual ASR models initially struggle with CS-ASR, this capability can be enabled through fine-tuning with CS data. HiKE will be available at https://github.com/ThetaOne-AI/HiKE.

[22] Large language models management of medications: three performance analyses

Kelli Henry, Steven Xu, Kaitlin Blotske, Moriah Cargile, Erin F. Barreto, Brian Murray, Susan Smith, Seth R. Bauer, Yanjun Gao, Tianming Liu, Andrea Sikora

Main category: cs.CL

TL;DR: GPT-4o performs poorly on medical medication tasks including drug-formulation matching, drug-drug interaction identification, and medication order preparation, highlighting the need for domain-specific training and better evaluation frameworks.

Details

Motivation: To evaluate GPT-4o's consistency in recommending appropriate medication regimens, as few studies have assessed LLM performance on medication benchmarking tests despite their potential utility in medical diagnosis.

Method: Three experiments using GPT-4o: drug-formulation matching, drug-drug interaction identification (with and without web search), and medication order sentence preparation. Evaluation used cosine similarity, Levenshtein similarity, ROUGE scores, and clinician manual assessment.

Result: Poor performance across all tests: 49% correct drug-formulation matching with frequent omissions and hallucinations; inconsistent drug-drug interaction identification (54.7% vs 69.2% accuracy with/without search); only 65.8% of medication orders contained no errors.

Conclusion: GPT-4o’s overall poor performance on medication-related tasks indicates the need for domain-specific training using clinician-annotated datasets and comprehensive evaluation frameworks for benchmarking LLM performance in medical applications.

Abstract: Background: Large language models (LLMs) can be useful in diagnosing medical conditions, but few studies have evaluated their consistency in recommending appropriate medication regimens. The purpose of this evaluation was to test GPT-4o on three medication benchmarking tests including mapping a drug name to its correct formulation, identifying drug-drug interactions using both its internal knowledge and using a web search, and preparing a medication order sentence after being given the medication name. Methods: Using GTP-4o three experiments were completed. Accuracy was quantified by computing cosine similarity on TF-IDF vectors, normalized Levenshtein similarity, and ROUGE-1/ROUGE-L F1 between each response and its reference string or by manual evaluation by clinicians. Results: GPT-4o performed poorly on drug-formulation matching, with frequent omissions of available drug formulations (mean 1.23 per medication) and hallucinations of formulations that do not exist (mean 1.14 per medication). Only 49% of tested medications were correctly matched to all available formulations. Accuracy was decreased for medications with more formulations (p<0.0001). GPT-4o was also inconsistent at identifying drug-drug-interactions, although it had better performance with the search-augmented assessment compared to its internal knowledge (54.7% vs. 69.2%, p=0.013). However, allowing a web-search worsened performance when there was no drug-drug interaction (median % correct 100% vs. 40%, p<0.001). Finally, GPT-4o performed moderately with preparing a medication order sentence, with only 65.8% of medication order sentences containing no medication or abbreviation errors. Conclusions: Model performance was overall poor for all tests. This highlights the need for domain-specific training through clinician-annotated datasets and a comprehensive evaluation framework for benchmarking performance.

[23] LLMs Behind the Scenes: Enabling Narrative Scene Illustration

Melissa Roemmele, John Joon Young Chung, Taewook Kim, Yuqian Sun, Alex Calderwood, Max Kreminski

Main category: cs.CL

TL;DR: The paper introduces SceneIllustrations dataset for narrative scene illustration using LLMs to prompt text-to-image models, showing LLMs can effectively verbalize scene knowledge from story text.

Details

Motivation: To leverage generative AI for transforming text stories into visual illustrations, specifically focusing on narrative scene illustration to illuminate stories through images.

Method: A pipeline using LLMs as interface for prompting text-to-image models to generate scene illustrations from raw story text, applied to a story corpus with human annotation for quality judgments.

Result: Created SceneIllustrations dataset with pairwise quality judgments, demonstrating LLMs can effectively verbalize scene knowledge from story text for illustration generation and evaluation.

Conclusion: LLMs are impactful for generating and evaluating illustrations by effectively extracting and verbalizing implicit scene knowledge from narrative text.

Abstract: Generative AI has established the opportunity to readily transform content from one medium to another. This capability is especially powerful for storytelling, where visual illustrations can illuminate a story originally expressed in text. In this paper, we focus on the task of narrative scene illustration, which involves automatically generating an image depicting a scene in a story. Motivated by recent progress on text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models to generate scene illustrations given raw story text. We apply variations of this pipeline to a prominent story corpus in order to synthesize illustrations for scenes in these stories. We conduct a human annotation task to obtain pairwise quality judgments for these illustrations. The outcome of this process is the SceneIllustrations dataset, which we release as a new resource for future work on cross-modal narrative transformation. Through our analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize scene knowledge implicitly evoked by story text. Moreover, this capability is impactful for generating and evaluating illustrations.

[24] What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?

Mohammed Sabry, Anya Belz

Main category: cs.CL

TL;DR: Explicitly exercising induction circuits during pretraining accelerates induction-head emergence but doesn’t consistently improve in-context learning performance compared to natural text training under iso-FLOPs.

Details

Motivation: To test whether targeted synthetic data can accelerate induction-head emergence and enhance in-context learning, challenging the assumption that early induction circuit activation directly improves ICL.

Method: Introduced Bi-Induct curriculum that injects forward-copy (Induction), backward-copy (Anti), or balanced mix into pretraining stream. Trained models from 0.13B to 1B parameters under iso-FLOPs, evaluating few-shot ICL benchmarks, head-level telemetry, and language modeling perplexity.

Result: Bi-Induct accelerates induction-head emergence at small scales but doesn’t yield stronger generalization. Natural-only training performs best on function-style ICL probes. Larger natural-only models develop broader, earlier induction heads without explicit patterns. Perplexity penalties from synthetic data shrink with scale.

Conclusion: Inducing activation is not sufficient for ICL gains - circuits must become functionally necessary. Results emphasize mechanism-aware pretraining diagnostics and data mixtures that foster load-bearing, not merely present, structure.

Abstract: Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate induction-head emergence and enhance ICL, we introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity. Our findings challenge the assumption that early induction circuit activation directly improves ICL. While Bi-Induct accelerates induction-head emergence at small scales, this does not consistently yield stronger generalization. On standard LM benchmarks, Bi-Induct matches natural-only training; on function-style ICL probes, the 1B natural-only performs best. Stress tests (e.g., label permutation, HITS@1 vs. HITS@3, 1 vs. 10 shots) preserve these trends. Telemetry shows larger natural-only models develop broader, earlier induction heads without explicit induction patterns. Anti-induction data fails to elicit meaningful activation. Perplexity penalties from synthetic data shrink with scale, suggesting larger models can absorb non-natural patterns with minimal cost. Crucially, ablating the top 2% of induction heads degrades ICL more than random ablations, especially for natural-only models, indicating more centralized, load-bearing circuits. Bi-Induct variants exhibit more redundant induction activity, implying different circuit utilization. Overall, inducing activation is not sufficient: ICL gains depend on these circuits becoming functionally necessary. These results underscore mechanism-aware pretraining diagnostics and data mixtures that foster load-bearing, not merely present, structure.

[25] Emergent morpho-phonological representations in self-supervised speech models

Jon Gauthier, Canaan Breiss, Matthew Leonard, Edward F. Chang

Main category: cs.CL

TL;DR: Self-supervised speech models for word recognition develop linear geometric representations that capture regular distributional relationships between English nouns/verbs and their inflected forms, rather than directly tracking phonological or morphological units.

Details

Motivation: To understand what types of linguistic representations self-supervised speech models use for word recognition in noisy environments, specifically how they represent phonological and morphological phenomena in English noun and verb inflections.

Method: Study S3M variants optimized for word recognition by analyzing how they represent frequent English noun and verb inflections, examining the geometric structure of their representations.

Result: The models develop representations with a global linear geometry that can link English nouns and verbs to their regular inflected forms. This structure tracks regular distributional relationships between word pairs in the English lexicon, often (but not always) due to morphological inflection.

Conclusion: These findings challenge the presumed necessity of distinct linguistic representations for phonology and morphology in human spoken word recognition, suggesting alternative representational strategies based on distributional relationships.

Abstract: Self-supervised speech models can be trained to efficiently recognize spoken words in naturalistic, noisy environments. However, we do not understand the types of linguistic representations these models use to accomplish this task. To address this question, we study how S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections. We find that their representations exhibit a global linear geometry which can be used to link English nouns and verbs to their regular inflected forms. This geometric structure does not directly track phonological or morphological units. Instead, it tracks the regular distributional relationships linking many word pairs in the English lexicon – often, but not always, due to morphological inflection. These findings point to candidate representational strategies that may support human spoken word recognition, challenging the presumed necessity of distinct linguistic representations of phonology and morphology.

[26] Same Content, Different Representations: A Controlled Study for Table QA

Yue Zhang, Seiji Maekawa, Nikita Bhutani

Main category: cs.CL

TL;DR: First controlled study examining how table representation affects Table QA performance, showing consistent trade-offs between SQL-based methods, LLMs, and hybrid approaches across different data formats.

Details

Motivation: Real-world Table QA must handle both structured databases and semi-structured tables, but existing benchmarks don't systematically examine how representation affects model performance.

Method: Created paired structured and semi-structured tables using verbalization pipeline, introduced diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality.

Result: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data; LLMs show flexibility but reduced precision; hybrid approaches balance performance, especially under noisy schemas.

Conclusion: No single method excels across all conditions; representation plays central role in Table QA performance; findings provide insights for model selection and design of robust hybrid approaches.

Abstract: Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.

[27] ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning

Jasin Cekinmez, Omid Ghahroodi, Saad Fowad Chandle, Dhiman Gupta, Ehsaneddin Asgari

Main category: cs.CL

TL;DR: ADAM is a framework for evaluating and improving multimodal LLMs in biographical reasoning, featuring a multilingual dataset (AdamDB), cognitive evaluations (AdamBench), and a retrieval-augmented system (AdamRAG) to reduce hallucinations.

Details

Motivation: Biography is a critical yet underexplored dimension of factual knowledge in LLMs, with current models lacking systematic evaluation and suffering from hallucinations, especially for lesser-known individuals.

Method: Created AdamDB dataset covering 4M+ individuals across geography/time/profession, AdamBench evaluations based on Bloom’s taxonomy across 6 reasoning levels, and AdamRAG retrieval-augmented generation system for biographical contexts.

Result: AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with largest gains on lower-order reasoning. Popularity strongly mediates accuracy, and multimodal input via face images offers smaller improvements than retrieval.

Conclusion: ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing development of multilingual, accurate, and hallucination-resistant MLLMs.

Abstract: We introduce ADAM (A Diverse Archive of Mankind), a framework for evaluating and improving multimodal large language models (MLLMs) in biographical reasoning. To the best of our knowledge, this is the first work to systematically examine LLM capabilities in biography, a critical yet underexplored dimension of factual knowledge. At its core, AdamDB is a multilingual and multimodal dataset covering over 4 million individuals across geography, time, and profession, while AdamBench provides cognitively structured evaluations based on Bloom’s taxonomy, spanning six reasoning levels in both English and native languages. To address hallucinations, particularly for lesser-known individuals, we propose AdamRAG, a retrieval-augmented generation system tailored to biographical contexts. Experiments show that AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning. Popularity strongly mediates accuracy, and multimodal input via face images offers smaller, less consistent improvements than retrieval. ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing the development of multilingual, accurate, and hallucination-resistant MLLMs.

[28] DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Md Mofijul Islam, Amin Ahsan Ali

Main category: cs.CL

TL;DR: Proposes DM-Codec, a novel speech tokenization model that integrates acoustic, semantic, and contextual representations through LM and SM-guided distillation, achieving state-of-the-art performance on speech transcription and quality metrics.

Details

Motivation: Existing speech tokenization methods overlook contextual representation, leading to poor transcription performance with high WER and WIL scores. There's a need to unify acoustic, semantic, and contextual information for comprehensive speech modeling.

Method: Two distillation approaches: (1) LM-guided distillation for contextual information, and (2) combined LM and SM-guided distillation for multimodal representations. Uses encoder-decoder framework with RVQ, incorporating LM and SM during training.

Result: DM-Codec significantly outperforms SOTA models, reducing WER by 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on LibriSpeech.

Conclusion: The proposed DM-Codec successfully integrates contextual information with acoustic and semantic representations, demonstrating substantial improvements in speech tokenization and synthesis performance.

Abstract: Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. Code, samples, and checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.

[29] AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts

Jiří Milička, Anna Marklová, Václav Cvrček

Main category: cs.CL

TL;DR: This paper presents two LLM-generated corpora (English and Czech) that replicate human reference corpora for linguistic comparison between human-written and AI-generated texts.

Details

Motivation: To create resources for comparing human-written texts with LLM-generated text linguistically, ensuring multi-genre coverage with diverse topics, authors, and text types while maintaining comparability with existing human corpora.

Method: Generated corpora using multiple LLMs (OpenAI, Anthropic, Alphabet, Meta, DeepSeek) from GPT-3 to GPT-4.5, replicating BE21 and Koditex reference human corpora. The corpora are tagged with Universal Dependencies standard for tokenization, lemmatization, and morphological/syntactical annotation.

Result: Created English corpus with average 864k tokens per model (27M total) and Czech corpus with average 768k tokens per model (21.5M total). The corpora are freely available under CC BY 4.0 license and accessible through Czech National Corpus search interface.

Conclusion: Successfully developed comprehensive LLM-generated corpora for linguistic research, providing valuable resources for comparing human and AI-generated text across multiple languages and genres.

Abstract: This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed on ensuring these resources are multi-genre and rich in terms of topics, authors, and text types, while maintaining comparability with existing human-created corpora. These generated corpora replicate reference human corpora: BE21 by Paul Baker, which is a modern version of the original Brown Corpus, and Koditex corpus that also follows the Brown Corpus tradition but in Czech. The new corpora were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., they are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (the English part contains on average 864k tokens per model, 27M tokens altogether, the Czech partcontains on average 768k tokens per model, 21.5M tokens altogether). The corpora are freely available for download under the CC BY 4.0 license (the annotated data are under CC BY-NC-SA 4.0 licence) and are also accessible through the search interface of the Czech National Corpus.

[30] Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang

Main category: cs.CL

TL;DR: ReMemR1 is a memory-augmented agent with callback-enhanced memory that enables selective retrieval from full memory history and non-linear reasoning, combined with multi-level reinforcement learning for improved long-context question answering.

Details

Motivation: Existing 'memorize while reading' methods for long-context QA suffer from irreversible forward-only processing, information loss through overwriting, and sparse reinforcement learning signals.

Method: Proposes ReMemR1 with callback-enhanced memory allowing selective retrieval from entire history and non-linear reasoning, plus Reinforcement Learning with Multi-Level Rewards (RLMLR) combining final-answer rewards with dense step-level signals.

Result: Experiments on long-document QA show significant gains over existing memory-based approaches.

Conclusion: ReMemR1 effectively mitigates information degradation, improves supervision, and supports multi-hop memory utilization for long-context reasoning agents.

Abstract: Large language models face challenges in long-context question answering, where key evidence of a query may be dispersed across millions of tokens. Existing works equip large language models with a memory corpus that is dynamically updated during a single-pass document scan, also known as the “memorize while reading” methods. While this approach scales efficiently, it suffers from irreversible forward-only processing, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, a memory-augmented agent with callback-enhanced memory that allows selective retrieval from the entire memory history and allows non-linear reasoning and revisiting of early evidence. To further strengthen training, we propose Reinforcement Learning with Multi-Level Rewards (RLMLR), which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support multi-hop memory utilizing. Experiments on long-document QA show significant gains over existing memory-based approaches, which validates ReMemR1 as an effective solution for long-context reasoning agents.

[31] Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate

Binwei Yao, Chao Shang, Wanyu Du, Jianfeng He, Ruixue Lian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi

Main category: cs.CL

TL;DR: LLMs’ sycophancy (excessive agreeability) undermines multi-agent debate systems by causing premature consensus and disagreement collapse, leading to worse performance than single agents. The paper introduces the first framework to define, measure, and analyze inter-agent sycophancy in debate settings.

Details

Motivation: LLMs' inherent sycophancy poses significant challenges for multi-agent debating systems (MADS) by collapsing debates into premature consensus, undermining the benefits of productive disagreement. Prior studies focused on user-LLM sycophancy, leaving inter-agent sycophancy in debates poorly understood.

Method: Introduced the first operational framework that: (1) formally defines sycophancy specific to MADS settings, (2) develops new metrics to evaluate agent sycophancy level and its impact on information exchange, (3) systematically investigates how varying sycophancy levels across agent roles (debaters and judges) affects outcomes in decentralized and centralized debate frameworks.

Result: Sycophancy is a core failure mode that amplifies disagreement collapse before reaching correct conclusions, yields lower accuracy than single-agent baselines, and arises from distinct debater-driven and judge-driven failure modes.

Conclusion: Proposed actionable design principles for MADS that effectively balance productive disagreement with cooperation in agent interactions, addressing the sycophancy problem in multi-agent debates.

Abstract: Large language models (LLMs) often display sycophancy, a tendency toward excessive agreeability. This behavior poses significant challenges for multi-agent debating systems (MADS) that rely on productive disagreement to refine arguments and foster innovative thinking. LLMs’ inherent sycophancy can collapse debates into premature consensus, potentially undermining the benefits of multi-agent debate. While prior studies focus on user–LLM sycophancy, the impact of inter-agent sycophancy in debate remains poorly understood. To address this gap, we introduce the first operational framework that (1) proposes a formal definition of sycophancy specific to MADS settings, (2) develops new metrics to evaluate the agent sycophancy level and its impact on information exchange in MADS, and (3) systematically investigates how varying levels of sycophancy across agent roles (debaters and judges) affects outcomes in both decentralized and centralized debate frameworks. Our findings reveal that sycophancy is a core failure mode that amplifies disagreement collapse before reaching a correct conclusion in multi-agent debates, yields lower accuracy than single-agent baselines, and arises from distinct debater-driven and judge-driven failure modes. Building on these findings, we propose actionable design principles for MADS, effectively balancing productive disagreement with cooperation in agent interactions.

[32] Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

Wenyu Zhang, Yingxu He, Geyu Lin, Zhuohan Liu, Shuo Sun, Bin Wang, Xunlong Zou, Jeremy H. M. Wong, Qiongqiong Wang, Hardik B. Sailor, Nancy F. Chen, Ai Ti Aw

Main category: cs.CL

TL;DR: The paper introduces emotion reasoning for AudioLLMs, using generative capabilities to produce evidence-grounded explanations that improve emotion recognition accuracy and response quality.

Details

Motivation: AudioLLMs excel at semantic tasks but struggle with paralinguistic cues like emotion. Existing emotion classification approaches lack insight into prediction rationales.

Method: Proposed unified framework with reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training for multitask AudioLLMs.

Result: Experiments on IEMOCAP and MELD show improved emotion prediction accuracy and more coherent, evidence-grounded responses. Out-of-domain tests demonstrate good generalization.

Conclusion: Emotion reasoning enhances AudioLLMs’ emotion understanding by providing semantically aligned explanations, improving both accuracy and interpretability.

Abstract: Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses. Experiments on two out-of-domain datasets demonstrate the generalization capabilities of the resulting model.

[33] Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks

Chunyang Jiang, Yonggang Zhang, Yiyang Cai, Chi-Min Chan, Yulong Liu, Mingming Chen, Wei Xue, Yike Guo

Main category: cs.CL

TL;DR: Proposes semantic voting as a self-evaluation-free approach for unverifiable tasks, using lightweight sentence embeddings for semantic similarity instead of exact matching, achieving better efficiency and performance than self-evaluation methods.

Details

Motivation: Address limitations of self-evaluation methods for unverifiable tasks (e.g., translation) which suffer from high computational overhead and overconfidence issues due to LLM biases, while majority voting only works for verifiable tasks.

Method: Semantic voting mechanism that replaces hard matching (exact matching) with soft matching using semantic similarity measured by lightweight sentence embedding models, avoiding reliance on LLM self-evaluation.

Result: Achieves substantial gains in computational efficiency and overall better performance than self-evaluation methods across diverse model architectures and tasks.

Conclusion: Semantic voting provides an effective self-evaluation-free approach for unverifiable tasks that is both computationally efficient and avoids the biases of LLM-based self-evaluation.

Abstract: The rising cost of acquiring supervised data has driven significant interest in self-improvement for large language models (LLMs). Straightforward unsupervised signals like majority voting have proven effective in generating pseudo-labels for verifiable tasks, while their applicability to unverifiable tasks (e.g., translation) is limited by the open-ended character of responses. As a result, self-evaluation mechanisms (e.g., self-judging and entropy minimization) are predominantly used to derive pseudo-labels. However, self-evaluation relying on LLMs typically incurs high computational overhead and introduces overconfidence issues due to intrinsic biases. To address these challenges, we propose a novel self-evaluation-free approach for unverifiable tasks, designed for lightweight yet effective self-improvement. Inspired by majority voting commonly employed in verifiable tasks, we propose semantic voting as a novel mechanism that relaxes the principle of hard matching (i.e., exact matching) toward soft matching (i.e., semantic similarity). Soft matching is achieved by leveraging a lightweight sentence embedding model to quantify semantic similarity, thereby mitigating excessive computational burden and intrinsic bias-associated limitations of self-evaluation. Comprehensive experiments demonstrate that our method achieves substantial gains in computational efficiency and overall better performance than self-evaluation methods across diverse model architectures and tasks.

[34] From Evidence to Trajectory: Abductive Reasoning Path Synthesis for Training Retrieval-Augmented Generation Agents

Muzhi Li, Jinhu Qi, Yihong Wu, Minghao Zhao, Liheng Ma, Yifan Li, Xinyu Wang, Yingxue Zhang, Ho-fung Leung, Irwin King

Main category: cs.CL

TL;DR: EviPath is a new method for creating training data that helps RAG agents learn complex reasoning and tool-use capabilities through evidence-anchored reasoning path synthesis, achieving significant performance improvements.

Details

Motivation: Current RAG agent development lacks process-level supervision for guiding agentic capabilities like task decomposition and stepwise decision-making. Existing data synthesis methods only produce chain-of-thought rationales without modeling environmental interactions.

Method: EviPath has three components: (1) Abductive Subtask Planning for problem decomposition and optimal path planning, (2) Faithful Sub-question Answering using evidence to generate reasoning, and (3) Conversational Fine-Tuning to format interactions for supervised fine-tuning.

Result: An 8B parameter model trained with EviPath-synthesized data significantly outperforms state-of-the-art baselines with a 14.7% absolute EM gain in open-domain question answering.

Conclusion: EviPath enables LLMs to learn complex reasoning and tool-use capabilities directly from synthesized data, providing an effective solution for RAG agent development.

Abstract: Retrieval-augmented generation agents development is hindered by the lack of process-level supervision to effectively guide agentic capabilities like task decomposition, retriever invocation, and stepwise decision-making. While reinforcement learning offers a potential solution, it suffers from sparse rewards and the limited reasoning capabilities of large language models (LLMs). Meanwhile, existing data synthesis methods only produce chain-of-thought rationales and fail to model environmental interactions. In this paper, we propose EviPath, an evidence-anchored reasoning path synthesis paradigm for RAG agent development. EviPath comprises: (i) Abductive Subtask Planning, which decomposes the problem into sub-questions and iteratively plans an optimal solution path based on the dependencies between them; (ii) Faithful Sub-question Answering, which uses supporting evidence to construct a proxy environment to generate reasoning thoughts and answers for each sub-question; and (iii) Conversational Fine-Tuning, which formats the complete agent-environment interaction trajectory into a dialogue format suitable for Supervised Fine-Tuning. EviPath allows LLMs to learn complex reasoning and tool-use capabilities directly from synthesized data. Extensive experiments on widely-used question-answering benchmarks show that an 8B parameter model trained with EviPath-synthesized data significantly and consistently outperforms state-of-the-art baselines with a double-digit absolute EM gain of 14.7% in open-domain question answering.

[35] The Geometry of Creative Variability: How Credal Sets Expose Calibration Gaps in Language Models

Esteban Garces Arias, Julian Rodemann, Christian Heumann

Main category: cs.CL

TL;DR: A geometric framework using credal sets quantifies uncertainty in neural text generation, revealing gaps in capturing human creative variation and showing decoding strategy choice significantly contributes to epistemic uncertainty.

Details

Motivation: Understanding uncertainty in large language models for creative tasks where multiple valid outputs exist is a fundamental challenge.

Method: Used credal sets (convex hulls of probability distributions) to analyze 500 creative writing prompts from WritingPrompts dataset with 10 human continuations each, evaluating 4 language models across 5 decoding strategies (100,000 stories total).

Result: Substantial gaps in capturing human creative variation, with best model-human calibration reaching only 0.434. Decoding strategy contributes 39.4% to 72.0% of total epistemic uncertainty. Model scale shows weak correlation with calibration quality.

Conclusion: The geometric framework provides actionable insights for improving generation systems for human-AI creative alignment.

Abstract: Understanding uncertainty in large language models remains a fundamental challenge, particularly in creative tasks where multiple valid outputs exist. We present a geometric framework using credal sets - convex hulls of probability distributions - to quantify and decompose uncertainty in neural text generation, calibrated against human creative variation. Analyzing 500 creative writing prompts from the WritingPrompts dataset with 10 unique human continuations each, we evaluate four language models across five decoding strategies, generating 100,000 stories. Our credal set analysis reveals substantial gaps in capturing human creative variation, with the best model-human calibration reaching only 0.434 (Gemma-2B with temperature 0.7). We decompose total uncertainty into epistemic and aleatoric components, finding that the choice of decoding strategy contributes 39.4% to 72.0% of total epistemic uncertainty. Model scale shows weak correlation with calibration quality and no significant difference exists between base and instruction-tuned models in calibration quality. Our geometric framework provides actionable insights for improving generation systems for human-AI creative alignment. We release our complete experimental framework.

[36] d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang

Main category: cs.CL

TL;DR: d²Cache is a training-free KV cache framework that accelerates diffusion-based LLM inference through adaptive token selection and caching, achieving both speed improvements and better generation quality.

Details

Motivation: Diffusion-based LLMs suffer from poor inference efficiency due to bidirectional attention that prevents standard KV cache usage, unlike autoregressive models.

Method: Two-stage fine-grained selection strategy to identify and update KV states of selected tokens while caching others for reuse, enabling quasi left-to-right generation.

Result: Substantial inference speedups and consistent improvements in generation quality on LLaDA and Dream models.

Conclusion: d²Cache effectively addresses dLLM inference inefficiency without requiring training, providing both performance acceleration and enhanced decoding reliability.

Abstract: Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce \textit{Dual aDaptive Cache} (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (\ie, LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.

[37] How to Make Large Language Models Generate 100% Valid Molecules?

Wen Tao, Jing Tang, Alvin Chan, Bryan Hooi, Baolong Bi, Nanyun Peng, Yuansheng Liu, Yiwei Wang

Main category: cs.CL

TL;DR: SmiSelf is a cross-chemical language framework that converts invalid SMILES to SELFIES using grammatical rules, ensuring 100% valid molecule generation while preserving molecular characteristics and maintaining performance on other metrics.

Details

Motivation: Molecule generation is crucial for drug discovery and materials science, but LLMs struggle to generate valid molecules using SMILES representations in few-shot settings. The goal is to enable LLMs to generate 100% valid molecules.

Method: The authors explore LLMs’ ability with SELFIES representation, examine their capacity to correct invalid SMILES, and introduce SmiSelf - a framework that converts invalid SMILES to SELFIES using grammatical rules to leverage SELFIES’ mechanisms for correction.

Result: SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. LLMs perform worse with SELFIES than with SMILES, and their capacity to correct invalid SMILES is limited.

Conclusion: SmiSelf helps expand LLMs’ practical applications in biomedicine and is compatible with all SMILES-based generative models, providing a solution for valid molecule generation in few-shot settings.

Abstract: Molecule generation is key to drug discovery and materials science, enabling the design of novel compounds with specific properties. Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. In this work, we explore how LLMs can generate 100% valid molecules. We evaluate whether LLMs can use SELFIES, a representation where every string corresponds to a valid molecule, for valid molecule generation but find that LLMs perform worse with SELFIES than with SMILES. We then examine LLMs’ ability to correct invalid SMILES and find their capacity limited. Finally, we introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction. SmiSelf converts invalid SMILES to SELFIES using grammatical rules, leveraging SELFIES’ mechanisms to correct the invalid SMILES. Experiments show that SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. SmiSelf helps expand LLMs’ practical applications in biomedicine and is compatible with all SMILES-based generative models. Code is available at https://github.com/wentao228/SmiSelf.

[38] Non-Collaborative User Simulators for Tool Agents

Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo

Main category: cs.CL

TL;DR: A non-collaborative user simulation method for tool agents that simulates challenging real-world user behaviors like requesting unavailable services, digressing, expressing impatience, and providing incomplete utterances.

Details

Motivation: Existing user simulators for tool agents are too agent-friendly and cooperative, failing to train and test agents against non-collaborative users encountered in real-world scenarios.

Method: Proposed a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances.

Result: Experiments on MultiWOZ and τ-bench showed significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users, with issues like escalated hallucinations and dialogue breakdowns.

Conclusion: Provides an easily extensible user simulation framework to help develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.

Abstract: Non-Collaborative User Simulators for Tool Agents Download PDF Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo 19 Sept 2025 (modified: 25 Sept 2025)ICLR 2026 Conference SubmissionConference, AuthorsRevisionsCC BY 4.0 Keywords: Tool Agent, User Simulator, Non-collaborative User, Dialogue Simulation TL;DR: A non-collaborative user simulation method for tool agent. Abstract: Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents’ weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.

[39] Tagging the Thought: Unlocking Personalization Reasoning via Reinforcement Learning

Song Jin, Juntian Zhang, Yong Liu, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan

Main category: cs.CL

TL;DR: TagPR is a training framework that enhances LLMs’ personalization reasoning through tagged reasoning chains and a multi-stage training approach combining SFT and RL with composite rewards.

Details

Motivation: LLMs struggle with personalization reasoning - analyzing user history, inferring preferences, and generating tailored responses, despite having strong general reasoning capabilities.

Method: Develops a data-driven pipeline to generate semantically labeled reasoning chains, then uses SFT followed by multi-stage RL guided by composite rewards including tag-based constraints and a Personalization Reward Model with User Embeddings (PRMU).

Result: Achieves state-of-the-art results on LaMP benchmark and self-constructed dataset with 32.65% average improvement over base model across all tasks.

Conclusion: Structured, interpretable reasoning is an effective pathway to unlocking genuine personalization capabilities in LLMs.

Abstract: Recent advancements have endowed Large Language Models (LLMs) with impressive general reasoning capabilities, yet they often struggle with personalization reasoning - the crucial ability to analyze user history, infer unique preferences, and generate tailored responses. To address this limitation, we introduce TagPR, a novel training framework that significantly enhances an LLM’s intrinsic capacity for personalization reasoning through a tagging the thought approach. Our method first develops a data-driven pipeline to automatically generate and semantically label reasoning chains, creating a structured dataset that fosters interpretable reasoning. We then propose a synergistic training strategy that begins with Supervised Fine-Tuning (SFT) on this tagged data to establish foundational reasoning patterns, followed by a multi-stage reinforcement learning (RL) process. This RL phase is guided by a unique composite reward signal, which integrates tag-based constraints and a novel Personalization Reward Model with User Embeddings (PRMU) to achieve fine-grained alignment with user-specific logic. Extensive experiments on the public LaMP benchmark and a self-constructed dataset demonstrate that our approach achieves state-of-the-art results, delivering an average improvement of 32.65% over the base model across all tasks. Our work validates that structured, interpretable reasoning is a highly effective pathway to unlocking genuine personalization capabilities in LLMs.

[40] Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models

Zichao Yu, Ming Li, Wenyi Zhang, Weiguo Gao

Main category: cs.CL

TL;DR: TReASURe is a tree-search test-time alignment method for Masked Diffusion Language Models that addresses correlation and variance issues through UnmaskBranch diversification and ResubstituteScore pruning.

Details

Motivation: Tree search for aligning generative models faces challenges with correlated branches from parallel unmasking and high-variance reward estimates from sampled completions, limiting exploration and pruning stability.

Method: Proposes UnmaskBranch for diversified token content and reveal order with single model calls, and ResubstituteScore for low-variance deterministic scoring of partially masked sequences.

Result: Achieves state-of-the-art results on perplexity, linguistic acceptability, and control of sentiment/toxicity, outperforming prior methods under matched compute budgets, especially in low-NFE regimes.

Conclusion: TReASURe effectively addresses key challenges in tree search for masked diffusion models, demonstrating theoretical efficiency gains and empirical superiority across multiple metrics.

Abstract: Tree search has recently emerged as a powerful framework for aligning generative models with task-specific rewards at test time. Applying tree search to Masked Diffusion Language Models, however, introduces two key challenges: (i) parallel unmasking yields highly correlated branches, limiting exploration, and (ii) reward evaluation via sampled completions produces high-variance estimates, making pruning unstable. We propose TReASURe, a tree-search test-time alignment method that addresses these issues. It introduces (i) UnmaskBranch, a branching strategy based on first-hitting unmasking that diversifies both token content and reveal order with a single model call per parent node, and (ii) ResubstituteScore, a pruning rule that uses deterministic resubstitution to score partially masked sequences with low-variance proxy completions. Theoretically, we quantify branching efficiency gains in NFEs (number of function evaluations), show that the scoring rule approximates the true reward with error bounded by predictive uncertainty, and prove improvements with larger tree widths. Empirically, TReASURe achieves state-of-the-art results on perplexity, linguistic acceptability, and control of sentiment and toxicity, outperforming prior methods under matched compute budgets, with especially strong gains in low-NFE regimes.

[41] Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu

Main category: cs.CL

TL;DR: The paper proposes T2PAM, a new paradigm for adapting LLMs during multi-turn interactions using real-time user feedback, and introduces ROSA, a lightweight algorithm that enables efficient in-conversation self-correction with theoretical convergence guarantees.

Details

Motivation: LLMs perform poorly in extended interactions because they are trained on static, single-turn data and cannot adapt to real-time user feedback, limiting their effectiveness in complex multi-turn tasks.

Method: Proposes Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM) paradigm and Optimum-Referenced One-Step Adaptation (ROSA) algorithm that uses user feedback as reward signal to estimate optimal policy and updates a small parameter subset in a single efficient step.

Result: Extensive experiments show ROSA achieves significant improvements in both task effectiveness and efficiency on challenging benchmarks, with theoretical guarantees of convergence to user preferences.

Conclusion: ROSA enables efficient in-conversation self-correction for LLMs through lightweight policy adaptation, addressing the limitations of static training and improving performance in extended multi-turn interactions.

Abstract: Large Language Models (LLMs) employ multi-turn interaction as a fundamental paradigm for completing complex tasks. However, their performance often degrades in extended interactions, as they are typically trained on static, single-turn data, which hinders their ability to adapt to real-time user feedback. To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, ultimately enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward a theoretical optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that the policy of ROSA converges to the preference of user as the number of interactions increases. Extensive experiments on challenging benchmark demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.

[42] Pretraining LLM with Latent Thoughts in Continuous Space

Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He, Xinbing Wang, Zhouhan Lin

Main category: cs.CL

TL;DR: Pretraining Language Models with Latent Thoughts - a method that adds intermediate latent thought generation before token prediction, improving performance without increasing inference cost.

Details

Motivation: Inspired by Chain-of-Thought's success in scaling generation steps at test-time, the authors explore whether similar computational step scaling during pretraining can improve individual token generation.

Method: Pretrain LM to first generate intermediate latent thoughts (last hidden state of current position), then use these as input to predict subsequent tokens, enabling refinement in continuous space.

Result: At identical inference cost, LM with one additional latent thought per token outperforms standard model with double the parameters. 1.4B model surpasses vanilla 2.8B model on language modeling and downstream tasks.

Conclusion: Generating latent thoughts before each token (forming a chain analogous to CoT) consistently improves model performance, demonstrating the effectiveness of computational step scaling during pretraining.

Abstract: The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation of each individual token? To address this, we propose a novel pre-training methodology: Pretraining Language Models with Latent Thoughts. Our approach pretrains a language model (LM) to first generate an intermediate latent thought-the last hidden state of the current position-which is then used as input to predict the actual subsequent token. This additional computational step enables the LM to refine its prediction within unconstrained continuous space. Our experiments demonstrate that, at an identical inference cost, a LM that generates one additional latent thought per token outperforms a standard model with double the parameters. For instance, ours-1.4B (Pythia Arch), pretrained on 300B tokens from the Pile, significantly surpasses the vanilla Pythia-2.8B trained on the same data on both language modeling and a range of general downstream tasks. Furthermore, increasing the number of latent thoughts generated before each actual token-forming a chain analogous to CoT-consistently improves the model’s performance.

[43] Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

Guancheng Wan, Leixin Sun, Longxu Dou, Zitong Shi, Fang Wu, Eric Hanchen Jiang, Wenke Huang, Guibin Zhang, Hejia Geng, Xiangru Tang, Zhenfei Yin, Yizhou Sun, Wei Wang

Main category: cs.CL

TL;DR: A framework to diagnose, localize, and align LLM-powered multi-agent systems to address hierarchical compliance failures under instruction conflicts.

Details

Motivation: LLM-powered multi-agent systems suffer from hierarchical compliance failures where agents misprioritize system-level rules when faced with competing demands, and current metrics don't reveal these micro-level violations.

Method: Three-stage framework: (1) CRAS metric for role adherence measurement, (2) attention drift analysis to localize conflict resolution in middle layers, (3) SAIL method using LoRA on focal layers with token-weighted DPO optimization.

Result: Improved instruction hierarchy compliance by +5.60% on MedQA with AutoGen without full-model finetuning across standard benchmarks.

Conclusion: The surgical alignment approach effectively addresses compliance failures in multi-agent systems by targeting specific layers and optimizing with attention-aware objectives.

Abstract: Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.

[44] Estimating the strength and timing of syntactic structure building in naturalistic reading

Nan Wang, Jiaxuan Li

Main category: cs.CL

TL;DR: The study uses EEG and eye-tracking data to show that phrase structure construction precedes syntactic category detection during natural reading, challenging traditional models of sentence processing.

Details

Motivation: To disentangle syntactic category detection and phrase structure construction processes in sentence processing, which are typically conflated in violation paradigms and assumed to follow a specific temporal order.

Method: Used co-registered EEG and eye-tracking data from the ZuCo corpus, analyzed gaze transitions, applied Bayesian network modeling to examine structural depth effects, and measured fixation-related potentials for syntactic surprisal.

Result: Readers preferentially moved between syntactic heads, structural depth was the strongest driver of reading deviations, and syntactic surprisal influenced neural activity before word onset and during early integration.

Conclusion: Phrase structure construction can precede category detection and dominate lexical influences, supporting a predictive ’tree-scaffolding’ account of comprehension that extends current models of syntactic timing.

Abstract: A central question in psycholinguistics is the timing of syntax in sentence processing. Much of the existing evidence comes from violation paradigms, which conflate two separable processes - syntactic category detection and phrase structure construction - and implicitly assume that phrase structure follows category detection. In this study, we use co-registered EEG and eye-tracking data from the ZuCo corpus to disentangle these processes and test their temporal order under naturalistic reading conditions. Analyses of gaze transitions showed that readers preferentially moved between syntactic heads, suggesting that phrase structures, rather than serial word order, organize scanpaths. Bayesian network modeling further revealed that structural depth was the strongest driver of deviations from linear reading, outweighing lexical familiarity and surprisal. Finally, fixation-related potentials demonstrated that syntactic surprisal influences neural activity before word onset (-184 to -10 ms) and during early integration (48 to 300 ms). These findings extend current models of syntactic timing by showing that phrase structure construction can precede category detection and dominate lexical influences, supporting a predictive “tree-scaffolding” account of comprehension.

[45] From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

Haonan Wang, Weida Liang, Zihang Fu, Nie Zheng, Yifan Zhang, Yao Tong, Tongyao Zhu, Hao Jiang, Chuang Li, Jiaying Wu, Kenji Kawaguchi

Main category: cs.CL

TL;DR: RLMs perform worse with few-shot CoT than direct answering. The paper introduces Insight-to-Solve (I2S), a method that converts demonstrations into explicit insights and generates target-specific reasoning, improving performance across various models.

Details

Motivation: Recent reasoning LLMs trained with verifier-based reinforcement learning paradoxically perform worse with few-shot Chain-of-Thought (CoT) than with direct answering, even with optimal demonstrations.

Method: Introduced Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives target-specific reasoning traces. Optionally includes self-refinement for coherence and correctness (I2S+).

Result: I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across diverse benchmarks and models. GPT-4.1 improved by +14.0% on AIME'25, and o1-mini improved by +2.7% on AIME and +1.7% on GPQA.

Conclusion: In-context demonstrations can be effectively harnessed through the insight-refine-solve framework, overcoming the limitations of traditional few-shot CoT approaches.

Abstract: Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via insight-refine-solve framework.

[46] Global Beats, Local Tongue: Studying Code Switching in K-pop Hits on Billboard Charts

Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi

Main category: cs.CL

TL;DR: Analysis of linguistic strategies in globally charting K-pop songs shows English dominates lyrics, with high code-switching between Korean and English, no significant gender differences, and higher English usage potentially more important for US chart success.

Details

Motivation: To investigate how code-switching and English lyric usage in K-pop songs contribute to global chart success, reflecting both aesthetic choices and global market strategies.

Method: Compiled dataset of K-pop songs on Billboard Hot 100 and Global 200 charts (2017-2025), analyzed English/Korean proportions, code-switching frequency, performed statistical tests for gender differences, and conducted classification task using multilingual embeddings and handcrafted features.

Result: English dominates globally charting K-pop songs with high code-switching; no significant gender-based differences found; female solo artists tend to use more English; classification achieved 0.76 F1 score for gender prediction; higher English usage may be more critical for US Hot 100 success.

Conclusion: Linguistic choices in K-pop lyrics are shaped by global market pressures, with code-switching and English usage reflecting performer identity and chart context, serving as strategic tools for international success.

Abstract: Code switching, particularly between Korean and English, has become a defining feature of modern K-pop, reflecting both aesthetic choices and global market strategies. This paper is a primary investigation into the linguistic strategies employed in K-pop songs that achieve global chart success, with a focus on the role of code-switching and English lyric usage. A dataset of K-pop songs that appeared on the Billboard Hot 100 and Global 200 charts from 2017 to 2025, spanning 14 groups and 8 solo artists, was compiled. Using this dataset, the proportion of English and Korean lyrics, the frequency of code-switching, and other stylistic features were analysed. It was found that English dominates the linguistic landscape of globally charting K-pop songs, with both male and female performers exhibiting high degrees of code-switching and English usage. Statistical tests indicated no significant gender-based differences, although female solo artists tend to favour English more consistently. A classification task was also performed to predict performer gender from lyrics, achieving macro F1 scores up to 0.76 using multilingual embeddings and handcrafted features. Finally, differences between songs charting on the Hot 100 versus the Global 200 were examined, suggesting that, while there is no significant gender difference in English, higher English usage may be more critical for success in the US-focused Hot 100. The findings highlight how linguistic choices in K-pop lyrics are shaped by global market pressures and reveal stylistic patterns that reflect performer identity and chart context.

[47] Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2

Stefan Arnold, René Gröbner

Main category: cs.CL

TL;DR: The paper investigates how language models resolve prepositional phrase ambiguity between instrumental adjuncts and attributive modifiers, revealing Gemma-2’s preference for instrumental readings and demonstrating control over this preference through attention head manipulation.

Details

Motivation: To understand the internal mechanisms that resolve the split decision between instrumental adjuncts and attributive modifiers in prepositional phrase generation, which remains poorly understood in language models.

Method: Used a prompt suite with with-headed prepositional phrases in ambiguous contexts, projected activations into vocabulary space to identify attention heads, and scaled value vectors of specific attention heads to control functional role distribution.

Result: Found a 3:4 preference for instrumental readings, and by scaling a single attention head’s value vector, shifted the distribution to 33% instrumental and 36% attributive complements.

Conclusion: Individual attention heads play a crucial role in determining prepositional phrase function, and their manipulation enables controlled shifting between instrumental and attributive interpretations.

Abstract: Language Models, when generating prepositional phrases, must often decide for whether their complements functions as an instrumental adjunct (describing the verb adverbially) or an attributive modifier (enriching the noun adjectivally), yet the internal mechanisms that resolve this split decision remain poorly understood. In this study, we conduct a targeted investigation into Gemma-2 to uncover and control the generation of prepositional complements. We assemble a prompt suite containing with-headed prepositional phrases whose contexts equally accommodate either an instrumental or attributive continuation, revealing a strong preference for an instrumental reading at a ratio of 3:4. To pinpoint individual attention heads that favor instrumental over attributive complements, we project activations into the vocabulary space. By scaling the value vector of a single attention head, we can shift the distribution of functional roles of complements, attenuating instruments to 33% while elevating attributes to 36%.

[48] PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

Huacan Chai, Zijie Cao, Maolin Ran, Yingxuan Yang, Jianghao Lin, pengxin, Hairui Wang, Renjie Ding, Ziyu Wan, Muning Wen, Weiwen Liu, Weinan Zhang, Fei Huang, Ying Wen

Main category: cs.CL

TL;DR: PARL-MT introduces progress awareness into LLM training for multi-turn function calling, combining automatic dataset generation with guided reinforcement learning to improve long-horizon task execution.

Details

Motivation: Real-world applications require multi-turn conversations where LLMs need progress awareness to maintain coherence across interactions, but existing approaches either neglect task-level planning or struggle with redundancy in RL training.

Method: PARL-MT framework includes: (1) Progress Awareness Generation pipeline for automatic dataset construction, and (2) Progress Awareness-Guided RL algorithm that integrates progress awareness into training to reduce redundancy and align local actions with global tasks.

Result: Empirical results on two public benchmarks show PARL-MT significantly outperforms existing methods in multi-turn function calling.

Conclusion: Progress awareness is effective for enabling robust and efficient multi-turn function calling in LLMs.

Abstract: Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL-MT, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. PARL-MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL-MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling.

[49] A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks

Haorui Yu, Ramon Ruiz-Dolz, Qiufeng Yi

Main category: cs.CL

TL;DR: This paper evaluates Visual Language Models’ ability to critique traditional Chinese paintings using a quantitative framework based on expert critiques and persona-guided prompting.

Details

Motivation: To assess the capabilities and characteristics of current VLMs in generating critiques for traditional Chinese painting, which requires complex semantic understanding and cultural knowledge.

Method: Developed a quantitative framework by extracting multi-dimensional evaluative features from human expert critiques using zero-shot classification, then tested VLMs like Llama, Qwen, and Gemini through persona-guided prompting.

Result: Revealed current performance levels, strengths, and areas for improvement of VLMs in art critique, showing their potential and limitations in complex semantic understanding tasks.

Conclusion: The study provides insights into VLMs’ capabilities in art critique and offers a framework for evaluating their performance in culturally complex domains like traditional Chinese painting.

Abstract: This study aims to test and evaluate the capabilities and characteristics of current mainstream Visual Language Models (VLMs) in generating critiques for traditional Chinese painting. To achieve this, we first developed a quantitative framework for Chinese painting critique. This framework was constructed by extracting multi-dimensional evaluative features covering evaluative stance, feature focus, and commentary quality from human expert critiques using a zero-shot classification model. Based on these features, several representative critic personas were defined and quantified. This framework was then employed to evaluate selected VLMs such as Llama, Qwen, or Gemini. The experimental design involved persona-guided prompting to assess the VLM’s ability to generate critiques from diverse perspectives. Our findings reveal the current performance levels, strengths, and areas for improvement of VLMs in the domain of art critique, offering insights into their potential and limitations in complex semantic understanding and content generation tasks. The code used for our experiments can be publicly accessed at: https://github.com/yha9806/VULCA-EMNLP2025.

[50] Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

Sina J. Semnani, Jirayu Burapacheep, Arpandeep Khatua, Thanawan Atchariyachanvanit, Zheng Wang, Monica S. Lam

Main category: cs.CL

TL;DR: CLAIRE is an agentic system that detects inconsistencies in Wikipedia using LLM reasoning and retrieval, helping editors identify contradictions more efficiently.

Details

Motivation: Wikipedia is a critical knowledge resource used for training LLMs and RAG systems, but its accuracy needs improvement. The paper focuses on detecting factual inconsistencies as a specific type of inaccuracy.

Method: Developed CLAIRE - an agentic system combining LLM reasoning with retrieval to surface potentially inconsistent claims with contextual evidence for human review.

Result: In user studies, editors using CLAIRE identified 64.7% more inconsistencies with 87.5% higher confidence. Found at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies affecting 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Best automated system achieved only 75.1% AUROC.

Conclusion: Contradictions are a measurable component of Wikipedia, and LLM-based systems like CLAIRE provide practical tools to help editors improve knowledge consistency at scale.

Abstract: Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.

[51] Fin-ExBERT: User Intent based Text Extraction in Financial Context using Graph-Augmented BERT and trainable Plugin

Soumick Sarker, Abhijit Kumar Rai

Main category: cs.CL

TL;DR: Fin-ExBERT is a lightweight framework using domain-adapted BERT with LoRA adapters for extracting intent-relevant sentences from financial service call transcripts, achieving strong performance with efficient fine-tuning.

Details

Motivation: Financial dialogue transcripts present challenges for information extraction due to informal structure, domain-specific vocabulary, and variable intent density, requiring specialized approaches.

Method: Two-stage training with progressive unfreezing: first training classifier head with frozen backbone, then fine-tuning entire model with differential learning rates. Uses dynamic thresholding based on probability curvature instead of fixed cutoffs.

Result: Strong precision and F1 performance on real-world transcripts, with interpretable output suitable for downstream auditing and question-answering workflows.

Conclusion: Fin-ExBERT offers a deployable solution for financial dialogue mining with batched evaluation, visualization, and calibrated export capabilities.

Abstract: Financial dialogue transcripts pose a unique challenge for sentence-level information extraction due to their informal structure, domain-specific vocabulary, and variable intent density. We introduce Fin-ExBERT, a lightweight and modular framework for extracting user intent-relevant sentences from annotated financial service calls. Our approach builds on a domain-adapted BERT (Bidirectional Encoder Representations from Transformers) backbone enhanced with LoRA (Low-Rank Adaptation) adapters, enabling efficient fine-tuning using limited labeled data. We propose a two-stage training strategy with progressive unfreezing: initially training a classifier head while freezing the backbone, followed by gradual fine-tuning of the entire model with differential learning rates. To ensure robust extraction under uncertainty, we adopt a dynamic thresholding strategy based on probability curvature (elbow detection), avoiding fixed cutoff heuristics. Empirical results show strong precision and F1 performance on real-world transcripts, with interpretable output suitable for downstream auditing and question-answering workflows. The full framework supports batched evaluation, visualization, and calibrated export, offering a deployable solution for financial dialogue mining.

[52] A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

Wonje Jeung, Sangyeon Yoon, Yoonjun Cho, Dongjae Jeon, Sangwoo Shin, Hyesoo Hong, Albert No

Main category: cs.CL

TL;DR: A2D is a token-level alignment method that defends diffusion LLMs against any-order generation attacks by making them emit [EOS] refusal signals when harmful content appears.

Details

Motivation: Diffusion LLMs' any-order generation flexibility creates security vulnerabilities where harmful content can appear at arbitrary positions and template-based prefilling attacks can bypass response-level refusals.

Method: Aligns dLLMs at token-level under randomized masking to emit [EOS] refusal signals whenever harmful content arises, enabling robustness to any-decoding-order and any-step prefilling attacks.

Result: Reduces DIJA attack success rates from over 80% to near-zero (1.3% on LLaDA-8B-Instruct, 0.0% on Dream-v0-Instruct-7B), enables real-time monitoring and early rejection with up to 19.3x faster safe termination.

Conclusion: A2D provides effective defense against any-order generation attacks in diffusion LLMs through token-level safety alignment and enables practical real-time safety monitoring.

Abstract: Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks such as DIJA bypass response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that aligns dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at the token-level under randomized masking, A2D achieves robustness to both any-decoding-order and any-step prefilling attacks under various conditions. It also enables real-time monitoring: dLLMs may begin a response but automatically terminate if unsafe continuation emerges. On safety benchmarks, A2D consistently prevents the generation of harmful outputs, slashing DIJA success rates from over 80% to near-zero (1.3% on LLaDA-8B-Instruct, 0.0% on Dream-v0-Instruct-7B), and thresholded [EOS] probabilities allow early rejection, yielding up to 19.3x faster safe termination.

[53] Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces

Joseph Marvin Imperial, Harish Tayyar Madabushi

Main category: cs.CL

TL;DR: Policy Reasoning Traces (PRT) are specialized reasoning chains that improve LLM performance in policy compliance assessment, achieving state-of-the-art results for HIPAA and GDPR policies.

Details

Motivation: Human experts perform policy compliance assessment through systematic, step-by-step reasoning processes, but documenting these gold-standard reasoning processes is costly to acquire.

Method: Introduce Policy Reasoning Traces (PRT) as specialized generated reasoning chains that serve as a reasoning bridge to enhance LLM’s policy compliance assessment capabilities.

Result: PRTs significantly enhance performance of both open-weight and commercial models in policy compliance assessment, setting new state-of-the-art for HIPAA and GDPR policies. They also improve LLM’s ability to accurately cite policy clauses and influence compliance decisions.

Conclusion: Policy Reasoning Traces effectively bridge the reasoning gap in policy compliance assessment, providing substantial improvements in model performance and citation accuracy across different policy domains.

Abstract: Policy compliance assessment is a fundamental task of evaluating whether an input case strictly complies with a set of human-defined rules, more generally known as policies. In practice, human experts follow a systematic, step-by-step process to identify violations with respect to specific stipulations outlined in the policy. However, such documentation of gold-standard, expert-level reasoning processes is costly to acquire. In this paper, we introduce Policy Reasoning Traces (PRT), a form of specialized generated reasoning chains that serve as a reasoning bridge to improve an LLM’s policy compliance assessment capabilities. Our empirical evaluations demonstrate that the use of PRTs for both inference-time and training-time scenarios significantly enhances the performance of open-weight and commercial models, setting a new state-of-the-art for HIPAA and GDPR policies. Beyond accuracy gains, we also highlight how PRTs can improve an LLM’s ability to accurately cite policy clauses, as well as influence compliance decisions through their high utilization from the raw chains of thought.

[54] Learning to Reason in Structured In-context Environments with Reinforcement Learning

Peng Yu, Zeyuan Zhao, Shao Zhang, Luoyi Fu, Xinbing Wang, Ying Wen

Main category: cs.CL

TL;DR: The paper introduces SIE (Structured In-context Environment), a framework that automatically creates reasoning environments from structured data to address limitations in existing RL environments for LLMs.

Details

Motivation: Existing reasoning environments for LLMs have limitations: mathematical/coding environments are hard to scale due to expert annotation requirements, while game-based environments produce skills that don't generalize well.

Method: SIE automatically constructs reasoning environments from large-scale structured data, leveraging rich compositional patterns for generalizable reasoning and using explicit schemas for rule-based verifiability.

Result: SIE achieves substantial improvements in in-domain structured reasoning and enables learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. LLMs can also infer missing information in partial SIEs.

Conclusion: SIE provides a scalable, generalizable, and verifiable framework for RL finetuning of LLMs that bridges the gap between existing environment types and enables robust reasoning improvements.

Abstract: Large language models (LLMs) have achieved significant advancements in reasoning capabilities through reinforcement learning (RL) via environmental exploration. As the intrinsic properties of the environment determine the abilities that LLMs can learn, the environment plays a important role in the RL finetuning process. An ideal LLM reasoning environment should possess three core characteristics: scalability, generalizable reasoning, and verifiability. However, existing mathematical and coding environments are difficult to scale due to heavy reliance on expert annotation, while the skills learned in game-based environments are too specialized to generalize. To bridge this gap, we introduce the \textbf{S}tructured \textbf{I}n-context \textbf{E}nvironment (SIE) framework. SIE achieves scalability by automatically constructing reasoning environments from large-scale structured data, where the rich compositional patterns naturally support generalizable reasoning. Moreover, the explicit schemas and reasoning chains in structured data provide a foundation for rule-based verifiability. Experimental results show that SIE framework not only achieves substantial improvements in in-domain structured reasoning, but also enables the learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. We further explored learning in information-limited partial SIEs and found that LLMs can infer the missing information through exploring the environment, leading to robust reasoning improvements and generalization performance.

[55] C-Evolve: Consensus-based Evolution for Prompt Groups

Tiancheng Li, Yuhang Wang, Zhiyang Chen, Zijun Wang, Liyuan Ma, Guo-jun Qi

Main category: cs.CL

TL;DR: C-Evolve is an evolutionary algorithm that discovers groups of prompts whose aggregated outputs via majority voting achieve optimal performance, using a voting-based fitness score instead of individual performance.

Details

Motivation: Few works explore whether aggregating results from multiple prompts to reach a consensus can further advance AI system capabilities beyond single prompt evolution.

Method: Uses an island-based evolutionary algorithm to maintain population diversity, forms groups from distinct islands, and employs voting score as fitness metric to evaluate each prompt’s contribution within groups.

Result: Achieves SOTA performance: 70.67% on HotpotQA and 43.88% on IFBench with Qwen3-8B (4.95% and 2.73% higher than GEPA), and 47.96% on IFBench and 95.33% on MATH with GPT-4.1-mini.

Conclusion: C-Evolve demonstrates competitive performance across diverse tasks by evolving prompts that work well together in consensus groups rather than individually.

Abstract: Prompt evolution algorithms offer a powerful paradigm for enhancing AI systems based on closed-source models, while few work explores whether aggregating results from multiple prompts to reach a consensus can further advance the system capability boundary. In this paper, we introduce Consensus-Evolve (C-Evolve), an evolutionary algorithm that discovers a group of prompts whose aggregated outputs after majority voting achieve optimal performance. More specifically, C-Evolve employs an island-based evolutionary algorithm to maintain population diversity, and prompts from distinct islands are selected to form groups to aggregate their outputs. The key difference from single individual evolution is a voting score, which evaluates each individual prompt’s contribution within groups. We take this as the fitness score for evolution instead of individual performance. Consequently, C-Evolve is more likely to produce and maintain prompts with higher potential to form a high-performing group and eliminate low-performing ones, gradually improving the group performance after reaching consensus. Our method achieves state-of-the-art performance across a wide range of tasks, including both open-ended tasks like HotpotQA and closed-ended tasks like MATH. On Qwen3-8B, C-Evolve achieves 70.67% on HotpotQA and 43.88% on IFBench, which are 4.95% and 2.73% higher than GEPA, respectively. For GPT-4.1-mini, the accuracy on IFBench is further improved to 47.96% and reaches 95.33% in the MATH benchmark. These results demonstrate the C-Evolve’s competitive performance.

[56] Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Han Yan, Zheyuan Liu, Meng Jiang

Main category: cs.CL

TL;DR: PRISM is a unified framework for machine unlearning that enforces dual-space smoothness in representation and parameter spaces to improve robustness and balance unlearning metrics against attacks.

Details

Motivation: Address challenges in machine unlearning including catastrophic forgetting, metric imbalance, and vulnerability to relearn and jailbreak attacks that plague current state-of-the-art methods.

Method: Two-stage smoothness optimization: (1) representation space stage with robustly trained probe to defend against jailbreak attacks, (2) parameter-space stage that decouples retain-forget gradient conflicts and smooths parameter space to mitigate relearning attacks.

Result: Extensive experiments on WMDP and MUSE datasets across conversational-dialogue and continuous-text settings show PRISM outperforms SOTA baselines under multiple attacks while achieving better balance among key metrics.

Conclusion: PRISM provides an effective solution for robust machine unlearning by enforcing dual-space smoothness, addressing current limitations in balancing unlearning effectiveness, utility preservation, and privacy protection.

Abstract: With the rapid advancement of large language models, Machine Unlearning has emerged to address growing concerns around user privacy, copyright infringement, and overall safety. Yet state-of-the-art (SOTA) unlearning methods often suffer from catastrophic forgetting and metric imbalance, for example by over-optimizing one objective (e.g., unlearning effectiveness, utility preservation, or privacy protection) at the expense of others. In addition, small perturbations in the representation or parameter space can be exploited by relearn and jailbreak attacks. To address these challenges, we propose PRISM, a unified framework that enforces dual-space smoothness in representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation space stage that employs a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Extensive experiments on WMDP and MUSE, across conversational-dialogue and continuous-text settings, show that PRISM outperforms SOTA baselines under multiple attacks while achieving a better balance among key metrics.

[57] MedCritical: Enhancing Medical Reasoning in Small Language Models via Self-Collaborative Correction

Xinchun Su, Chunxu Luo, Yixuan Li, Weidong Yang, Lipeng Ma

Main category: cs.CL

TL;DR: MedCritical is a two-stage framework that enables small language models to achieve complex medical reasoning comparable to large models through self-iteration and direct preference optimization, without expensive teacher model guidance.

Details

Motivation: Small language models underperform in complex medical reasoning tasks compared to large models like GPT-4, and traditional knowledge distillation methods using teacher models are costly and inefficient.

Method: Two-stage framework: 1) Extract thought templates from teacher model to guide student model, 2) Use direct preference optimization through model self-iteration collaboration where student plays against its own correction trajectory.

Result: MedCritical 7B model outperforms Taiyi and Huatuo-o1-7B models by 3.04% and 10.12% respectively on CMExam benchmark, achieving new SOTA performance among 7B-class small models.

Conclusion: The proposed self-learning DPO approach enables small models to achieve comparable results to traditional knowledge distillation at lower cost, demonstrating effective complex medical reasoning capabilities.

Abstract: In the field of medicine, complex reasoning tasks such as clinical diagnosis, treatment planning, and medical knowledge integration pose significant challenges, where small language models often underperform compared to large language models like GPT-4 and Deepseek. Recent knowledge distillation-based methods aim to address these issues through teacher-guided error correction, but this LLM as judge approach remains challenging in terms of cost, time, and efficiency. To circumvent this issue, we propose a novel two-stage framework, MedCritical, which uses a small language model fine-tuned by a large teacher model to play against itself. In the first stage, we extract high-level and detailed long-chain thought templates from the teacher model to guide the student model to generate more complex reasoning thoughts. In the second stage, we introduce direct preference optimization (DPO) through model self-iteration collaboration to enhance the reasoning ability of the student model by playing against the correction trajectory of the fine-tuned model during training. This model self-learning DPO approach teaches the student model to use its own error-driven insights to consolidate its skills and knowledge to solve complex problems, and achieves comparable results to traditional knowledge distillation methods using teacher models at a lower cost. Notably, our MedCritical 7B model outperforms the Taiyi and Huatuo-o1-7B models by 3.04% and 10.12% respectively on the CMExam benchmark, achieving new SOTA performance among 7B-class small models.

[58] Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng

Main category: cs.CL

TL;DR: MetaAPO is a novel preference optimization framework that dynamically couples data generation with model training using a meta-learner to balance online and offline data, reducing annotation costs by 42% while outperforming existing methods.

Details

Motivation: To address the distribution mismatch between pre-collected offline preference data and evolving model policies in LLM alignment, as existing static heuristics and decoupled online sampling strategies fail to adapt to the model's dynamic learning state.

Method: Uses a lightweight meta-learner as an ‘alignment gap estimator’ to evaluate benefits of on-policy sampling vs offline data, guiding targeted online generation and assigning sample-wise meta-weights to dynamically balance quality and distribution of online and offline data.

Result: Consistently outperforms existing preference optimization approaches on AlpacaEval 2, Arena-Hard and MT-Bench benchmarks across various settings.

Conclusion: MetaAPO effectively bridges the distribution gap in preference optimization while significantly reducing online annotation costs, demonstrating superior performance over existing methods.

Abstract: Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model’s dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner, as an “alignment gap estimator”, to evaluate the potential benefits of on-policy sampling in relation to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing 42% in online annotation costs.

[59] CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Main category: cs.CL

TL;DR: CCD is a training-free framework that reduces medical hallucinations in radiology MLLMs by integrating clinical signals from expert models through dual-stage contrastive decoding.

Details

Motivation: Multimodal LLMs in radiology often generate clinically unsupported descriptions (medical hallucinations) due to over-sensitivity to clinical sections, posing serious risks in medical applications requiring accuracy.

Method: Clinical Contrastive Decoding (CCD) - a training-free, retrieval-free inference framework that integrates structured clinical signals from radiology expert models using a dual-stage contrastive mechanism to refine token-level logits during generation.

Result: CCD consistently improves performance on radiology report generation across three datasets and multiple models. On MIMIC-CXR, it achieves up to 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models.

Conclusion: CCD provides a lightweight, generalizable solution for mitigating medical hallucinations by effectively bridging expert models and MLLMs in radiology without modifying the base MLLM.

Abstract: Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Cecoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.

[60] Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT

Wonhyuk Lee, Youngchol Kim, Yunjin Park, Junhyung Moon, Dongyoung Jeong, Wanjin Park

Main category: cs.CL

TL;DR: Guard Vector is a safety task vector created from parameter differences between guardrail and pretrained models. When combined with target models, it improves safety classification, enables multilingual support without training, and works across different model architectures while optimizing for streaming applications.

Details

Motivation: To create a more efficient and portable safety mechanism for language models that works across different languages and model architectures without requiring additional training or language-specific labels.

Method: Compute Guard Vector as parameter difference between guardrail and pretrained models, compose with target models to create Target Guard Models, use prefix-based training with single-token classifier output for streaming optimization.

Result: Improves classification quality over existing guard models, enables multilingual support (Chinese, Japanese, Korean) without additional training, demonstrates portability across Llama and Gemma architectures, maintains performance under streaming with reduced latency and increased throughput.

Conclusion: Guard Vector provides an efficient, portable safety solution that reduces computational requirements while enabling multilingual safety capabilities and promoting responsible AI practices through streaming-aware evaluation.

Abstract: We introduce Guard Vector, a safety task vector computed as the parameter difference between a guardrail model (Guard Model) and a same-architecture pretrained language model. Composing this vector with a target language model yields a Target Guard Model (TGM). We then adapt TGM with a streaming-aware approach that combines prefix-based training and evaluation with a classifier that produces a single-token output. With this composition alone, TGM improves classification quality over established Guard Models across standard safety suites and enables language extensibility to Chinese, Japanese, and Korean, requiring neither additional training nor target language labels. It also demonstrates model portability across two widely used public guardrail backbones, Llama and Gemma. With prefix SFT (supervised fine-tuning), TGM preserves classification quality under streaming by aligning the behavior between prefix inputs and full-text inputs. The single-token output design increases throughput and reduces latency. Together, these components reduce data and compute requirements while promoting streaming-aware evaluation practices, thereby contributing to a more responsible AI ecosystem.

[61] Train Once, Answer All: Many Pretraining Experiments for the Cost of One

Sebastian Bordt, Martin Pawelczyk

Main category: cs.CL

TL;DR: Proposes conducting multiple pretraining experiments simultaneously in a single training run to reduce computational costs while enabling rigorous scientific experimentation with large language models.

Details

Motivation: The computational cost of pretraining large language models presents a significant constraint for controlled experiments, limiting the ability to understand learning, reasoning, and memorization.

Method: Conduct ten different experiments during a single training run of a 1.5B parameter model on 210B tokens, testing for interactions between experiments through continual pretraining.

Result: Successfully replicated results from previous works on data contamination, poisoning, and memorization, and conducted novel investigations into knowledge acquisition, mathematical reasoning, and watermarking with minimal impact on training dynamics and overall performance.

Conclusion: Performing multiple pretraining experiments in a single training run is feasible and can enable rigorous scientific experimentation with large models on a limited compute budget, though interactions between experiments should be tested.

Abstract: Recent work has demonstrated that controlled pretraining experiments are a powerful tool for understanding learning, reasoning, and memorization in large language models (LLMs). However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose to conduct multiple pretraining experiments simultaneously during a single training run. We demonstrate the feasibility of this approach by conducting ten experiments during the training of a 1.5B parameter model on 210B tokens. Although we only train a single model, we can replicate the results from multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until the model acquires a particular piece of knowledge. Remarkably, the influence of the ten experiments on the model’s training dynamics and overall performance is minimal. However, interactions between different experiments may act as a potential confounder in our approach. We propose to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our findings suggest that performing multiple pretraining experiments in a single training run can enable rigorous scientific experimentation with large models on a compute budget.

Wenhang Shi, Yiren Chen, Shuqing Bian, Xinyi Zhang, Kai Tang, Pengfei Hu, Zhe Zhao, Wei Lu, Xiaoyong Du

Main category: cs.CL

TL;DR: GRACE is a prompt optimization framework that uses gated refinement and adaptive compression to efficiently improve LLM prompts, achieving significant performance gains with 75% less computational budget than prior methods.

Details

Motivation: Automatic prompt optimization struggles with stability, efficiency, and getting trapped in local optima, while manual design is costly and unscalable.

Method: Combines gated refinement (feedback regulation gate and update rejection gate) for stable improvements, and adaptive compression to escape local optima by distilling core concepts when optimization stagnates.

Result: Achieved average relative improvements of 4.7%, 4.4%, and 2.7% over SOTA methods on BBH, domain-specific, and general NLP tasks respectively, using only 25% of the prompt generation budget.

Conclusion: GRACE demonstrates that strategic information loss through refinement and compression enables efficient and effective prompt optimization with substantial performance gains and reduced computational overhead.

Abstract: Prompt engineering is crucial for leveraging the full potential of large language models (LLMs). While automatic prompt optimization offers a scalable alternative to costly manual design, generating effective prompts remains challenging. Existing methods often struggle to stably generate improved prompts, leading to low efficiency, and overlook that prompt optimization easily gets trapped in local optima. Addressing this, we propose GRACE, a framework that integrates two synergistic strategies: Gated Refinement and Adaptive Compression, achieving Efficient prompt optimization. The gated refinement strategy introduces a feedback regulation gate and an update rejection gate, which refine update signals to produce stable and effective prompt improvements. When optimization stagnates, the adaptive compression strategy distills the prompt’s core concepts, restructuring the optimization trace and opening new paths. By strategically introducing information loss through refinement and compression, GRACE delivers substantial gains in performance and efficiency. In extensive experiments on 11 tasks across three practical domains, including BIG-Bench Hard (BBH), domain-specific, and general NLP tasks, GRACE achieves significant average relative performance improvements of 4.7%, 4.4% and 2.7% over state-of-the-art methods, respectively. Further analysis shows that GRACE achieves these gains using only 25% of the prompt generation budget required by prior methods, highlighting its high optimization efficiency and low computational overhead. Our code is available at https://github.com/Eric8932/GRACE.

[63] Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation

Sherrie Shen, Weixuan Wang, Alexandra Birch

Main category: cs.CL

TL;DR: This paper introduces paratextual explicitation for machine translation, using footnotes/endnotes to explain culture-bound terms, and evaluates LLMs on this task using a dataset from Chinese-to-English translations.

Details

Motivation: Current MT systems struggle with culture-bound terms that resist direct translation, and existing approaches overlook the paratextual explanations (footnotes/endnotes) used by professional translators.

Method: Formalized Genette’s paratext theory, created a dataset of 560 expert-aligned paratexts from English translations of Liaozhai, and evaluated LLMs with/without reasoning traces on explicitation choice and content using intrinsic prompting and agentic retrieval methods.

Result: LLM-generated paratexts improve audience comprehension but are less effective than translator-authored ones. Statistical analysis shows wide variation in professional translators’ paratext usage, indicating cultural mediation is open-ended rather than prescriptive.

Conclusion: Paratextual explicitation has potential to advance MT beyond linguistic equivalence, with extensions to monolingual explanation and personalized adaptation.

Abstract: The faithful transfer of contextually-embedded meaning continues to challenge contemporary machine translation (MT), particularly in the rendering of culture-bound terms–expressions or concepts rooted in specific languages or cultures, resisting direct linguistic transfer. Existing computational approaches to explicitating these terms have focused exclusively on in-text solutions, overlooking paratextual apparatus in the footnotes and endnotes employed by professional translators. In this paper, we formalize Genette’s (1987) theory of paratexts from literary and translation studies to introduce the task of paratextual explicitation for MT. We construct a dataset of 560 expert-aligned paratexts from four English translations of the classical Chinese short story collection Liaozhai and evaluate LLMs with and without reasoning traces on choice and content of explicitation. Experiments across intrinsic prompting and agentic retrieval methods establish the difficulty of this task, with human evaluation showing that LLM-generated paratexts improve audience comprehension, though remain considerably less effective than translator-authored ones. Beyond model performance, statistical analysis reveals that even professional translators vary widely in their use of paratexts, suggesting that cultural mediation is inherently open-ended rather than prescriptive. Our findings demonstrate the potential of paratextual explicitation in advancing MT beyond linguistic equivalence, with promising extensions to monolingual explanation and personalized adaptation.

[64] Comparison of Scoring Rationales Between Large Language Models and Human Raters

Haowei Hua, Hong Jiao, Dan Song

Main category: cs.CL

TL;DR: This study compares human and LLM raters’ scoring rationales to understand scoring inconsistency causes, using GPT-4o and Gemini on large-scale test essays.

Details

Motivation: To understand the reasoning behind both human and LLM scoring by evaluating their rationales, helping identify causes of scoring inconsistency and improve automated scoring systems.

Method: Used essays from large-scale tests, analyzed scoring accuracy with quadratic weighted kappa and normalized mutual information, evaluated rationale similarity with cosine similarity, and explored clustering patterns using PCA on rationale embeddings.

Result: The study provides insights into LLM scoring accuracy and “thinking” processes, revealing patterns in how LLMs and humans provide rationales for their scores.

Conclusion: The findings help improve understanding of rationales behind both human scoring and LLM-based automated scoring, contributing to better automated scoring systems.

Abstract: Advances in automated scoring are closely aligned with advances in machine-learning and natural-language-processing techniques. With recent progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude, and other generative-AI chatbots for automated scoring has been explored. Given their strong reasoning capabilities, LLMs can also produce rationales to support the scores they assign. Thus, evaluating the rationales provided by both human and LLM raters can help improve the understanding of the reasoning that each type of rater applies when assigning a score. This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa and normalized mutual information. Cosine similarity is used to evaluate the similarity of the rationales provided. In addition, clustering patterns in rationales are explored using principal component analysis based on the embeddings of the rationales. The findings of this study provide insights into the accuracy and ``thinking’’ of LLMs in automated scoring, helping to improve the understanding of the rationales behind both human scoring and LLM-based automated scoring.

[65] Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models

Rajaa El Hamdani, Samy Haffoudhi, Nils Holzenberger, Fabian Suchanek, Thomas Bonald, Fragkiskos D. Malliaros

Main category: cs.CL

TL;DR: RCD decoding strategy improves LM factual accuracy by constraining outputs to unique surface forms, showing standard evaluation underestimates model knowledge.

Details

Motivation: Standard evaluation methods often dismiss correct LM answers expressed in alternative surface forms, leading to underestimation of parametric knowledge.

Method: Proposed Retrieval-Constrained Decoding (RCD) that restricts model outputs to unique surface forms, tested on YAGO-QA dataset with LMs from 135M to 70B parameters.

Result: RCD significantly improves performance: Llama-3.1-70B increased from 32.3% to 46.0% F1, and Llama-3.1-8B reached 33.0% with RCD, outperforming larger model under standard decoding.

Conclusion: Current evaluation methods underestimate LM knowledge, and RCD decoding strategy better reveals models’ true parametric knowledge capabilities.

Abstract: Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models’ parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at https://github.com/Rajjaa/disambiguated-LLM.

Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li

Main category: cs.CL

TL;DR: CooT is a decoding-time framework that adds explicit cognitive self-monitoring to LLMs using a Perceiver module that detects misalignments and intervenes by rolling back generation when violations occur.

Details

Motivation: Current alignment strategies embed safety into model weights, making controls implicit, static, and difficult to modify. There's a need for more explicit, dynamic, and auditable alignment processes.

Method: CooT couples a text Generator with a cognitive Perceiver that continuously monitors generation using a structured hierarchy of principles. When violations are detected, it rolls back generation and regenerates with injected guidance combining universal social priors and context-specific warnings.

Result: Extensive experiments across multiple benchmarks and model families confirm that CooT consistently improves safety and social reasoning performance.

Conclusion: CooT transforms alignment from a fixed property into an explicit, dynamic, and auditable process during inference, allowing flexible policy updates without model retraining.

Abstract: Large language models (LLMs) excel at complex reasoning but can still exhibit harmful behaviors. Current alignment strategies typically embed safety into model weights, making these controls implicit, static, and difficult to modify. This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop. CooT couples a standard text Generator with a cognitive Perceiver that continuously monitors the unfolding sequence. The Perceiver uses a structured, precedence-based hierarchy of principles (e.g., safety over obedience) to detect potential misalignments as they arise. When violations are flagged, CooT intervenes by rolling back the generation to the point of error and regenerating under injected guidance that combines universal social priors with context-specific warnings. CooT thus transforms alignment from a fixed property into an explicit, dynamic, and auditable process active during inference, allowing for flexible policy updates without retraining the model. Extensive experiments across multiple benchmarks and model families confirm that CooT consistently improves safety and social reasoning performance.

[67] Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review

Sydney Peters, Nan Zhang, Hong Jiao, Ming Li, Tianyi Zhou, Robert Lissitz

Main category: cs.CL

TL;DR: This paper reviews 37 studies on automated item difficulty prediction using text-based machine learning approaches, finding that transformer-based language models can effectively predict difficulty without manual feature engineering.

Details

Motivation: Traditional item difficulty modeling through field testing and classical methods is time-consuming and costly. Text-based ML approaches offer promising alternatives for large-scale assessments.

Method: Systematic review of 37 articles analyzing datasets, difficulty parameters, subject domains, item types, features, models, and evaluation criteria for automated difficulty prediction.

Result: State-of-the-art language models (including transformers) can predict item difficulty with RMSE as low as 0.165, Pearson correlation up to 0.87, and accuracy up to 0.806, outperforming classic ML models while eliminating manual feature engineering.

Conclusion: Text-based methods show strong potential for automated item difficulty modeling, with performance benchmarks established for future research. The review discusses practical implications and future research directions.

Abstract: Item difficulty plays a crucial role in test performance, interpretability of scores, and equity for all test-takers, especially in large-scale assessments. Traditional approaches to item difficulty modeling rely on field testing and classical test theory (CTT)-based item analysis or item response theory (IRT) calibration, which can be time-consuming and costly. To overcome these challenges, text-based approaches leveraging machine learning and language models, have emerged as promising alternatives. This paper reviews and synthesizes 37 articles on automated item difficulty prediction in large-scale assessment settings published through May 2025. For each study, we delineate the dataset, difficulty parameter, subject domain, item type, number of items, training and test data split, input, features, model, evaluation criteria, and model performance outcomes. Results showed that although classic machine learning models remain relevant due to their interpretability, state-of-the-art language models, using both small and large transformer-based architectures, can capture syntactic and semantic patterns without the need for manual feature engineering. Uniquely, model performance outcomes were summarized to serve as a benchmark for future research and overall, text-based methods have the potential to predict item difficulty with root mean square error (RMSE) as low as 0.165, Pearson correlation as high as 0.87, and accuracy as high as 0.806. The review concludes by discussing implications for practice and outlining future research directions for automated item difficulty modeling.

[68] The Impact of Role Design in In-Context Learning for Large Language Models

Hamidreza Rouzegar, Masoud Makrehchi

Main category: cs.CL

TL;DR: Role design in prompts can enhance LLM performance in zero-shot and few-shot learning across various tasks.

Details

Motivation: The impact of role design within prompts for in-context learning is underexplored compared to general prompt engineering.

Method: Evaluated role configurations in zero-shot and few-shot learning using GPT-3.5, GPT-4o, Llama2-7b, and Llama2-13b across sentiment analysis, text classification, question answering, and math reasoning tasks.

Result: Role-based prompt structuring shows potential to enhance LLM performance.

Conclusion: Role design is a promising approach for improving in-context learning in LLMs.

Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to generate predictions based on prompts without additional fine-tuning. While prompt engineering has been widely studied, the impact of role design within prompts remains underexplored. This study examines the influence of role configurations in zero-shot and few-shot learning scenarios using GPT-3.5 and GPT-4o from OpenAI and Llama2-7b and Llama2-13b from Meta. We evaluate the models’ performance across datasets, focusing on tasks like sentiment analysis, text classification, question answering, and math reasoning. Our findings suggest the potential of role-based prompt structuring to enhance LLM performance.

[69] AraS2P: Arabic Speech-to-Phonemes System

Bassam Matar, Mohamed Fayed, Ayman Khalafallah

Main category: cs.CL

TL;DR: AraS2P speech-to-phonemes system won first place in Iqra’Eval 2025 using Wav2Vec2-BERT with two-stage training: phoneme-aware pretraining on Arabic speech-phonemes data and fine-tuning with targeted augmentation.

Details

Motivation: To develop an effective speech-to-phonemes system for Arabic that can handle phoneme-level mispronunciation detection, addressing the challenge of accurate phoneme recognition in Arabic speech.

Method: Two-stage training: 1) Task-adaptive pretraining on large-scale Arabic speech-phonemes datasets generated using MSA Phonetiser, 2) Fine-tuning on shared task data with augmentation from XTTS-v2-synthesized recitations featuring varied Ayat segments, speaker embeddings, and textual perturbations.

Result: The system ranked first on the official Iqra’Eval 2025 Shared Task leaderboard, demonstrating superior performance in phoneme-level mispronunciation detection.

Conclusion: Phoneme-aware pretraining combined with targeted augmentation yields strong performance in Arabic phoneme recognition and mispronunciation detection tasks.

Abstract: This paper describes AraS2P, our speech-to-phonemes system submitted to the Iqra’Eval 2025 Shared Task. We adapted Wav2Vec2-BERT via Two-Stage training strategy. In the first stage, task-adaptive continue pretraining was performed on large-scale Arabic speech-phonemes datasets, which were generated by converting the Arabic text using the MSA Phonetiser. In the second stage, the model was fine-tuned on the official shared task data, with additional augmentation from XTTS-v2-synthesized recitations featuring varied Ayat segments, speaker embeddings, and textual perturbations to simulate possible human errors. The system ranked first on the official leaderboard, demonstrating that phoneme-aware pretraining combined with targeted augmentation yields strong performance in phoneme-level mispronunciation detection.

[70] From Human Annotation to Automation: LLM-in-the-Loop Active Learning for Arabic Sentiment Analysis

Dania Refai, Alaa Dalaq, Doaa Dalaq, Irfan Ahmad

Main category: cs.CL

TL;DR: Active learning framework for Arabic sentiment analysis using LLM-assisted labeling achieves competitive performance with reduced annotation costs compared to human labeling.

Details

Motivation: Arabic sentiment analysis is limited by lack of large labeled datasets. Active learning can reduce annotation efforts, and LLMs' potential for assisting annotation in Arabic context remains unexplored.

Method: Proposed active learning framework using multiple deep learning architectures (LSTM, GRU, RNN) across three Arabic datasets. Compared human labeling vs LLM-assisted labeling using five LLMs (GPT-4o, Claude 3 Sonnet, Gemini 2.5 Pro, DeepSeek Chat, LLaMA 3 70B Instruct).

Result: LLM-assisted active learning achieved competitive or superior performance to human labeling. LSTM with GPT-4o labels reached 93% accuracy with only 450 samples on Hunger Station dataset. DeepSeek Chat achieved 82% accuracy with 650 samples on MASAC dataset, matching human labeling performance.

Conclusion: LLM-assisted active learning is effective for Arabic sentiment analysis, reducing annotation costs while maintaining high accuracy across different datasets and language variations.

Abstract: Natural language processing (NLP), particularly sentiment analysis, plays a vital role in areas like marketing, customer service, and social media monitoring by providing insights into user opinions and emotions. However, progress in Arabic sentiment analysis remains limited due to the lack of large, high-quality labeled datasets. While active learning has proven effective in reducing annotation efforts in other languages, few studies have explored it in Arabic sentiment tasks. Likewise, the use of large language models (LLMs) for assisting annotation and comparing their performance to human labeling is still largely unexplored in the Arabic context. In this paper, we propose an active learning framework for Arabic sentiment analysis designed to reduce annotation costs while maintaining high performance. We evaluate multiple deep learning architectures: Specifically, long short-term memory (LSTM), gated recurrent units (GRU), and recurrent neural networks (RNN), across three benchmark datasets: Hunger Station, AJGT, and MASAC, encompassing both modern standard Arabic and dialectal variations. Additionally, two annotation strategies are compared: Human labeling and LLM-assisted labeling. Five LLMs are evaluated as annotators: GPT-4o, Claude 3 Sonnet, Gemini 2.5 Pro, DeepSeek Chat, and LLaMA 3 70B Instruct. For each dataset, the best-performing LLM was used: GPT-4o for Hunger Station, Claude 3 Sonnet for AJGT, and DeepSeek Chat for MASAC. Our results show that LLM-assisted active learning achieves competitive or superior performance compared to human labeling. For example, on the Hunger Station dataset, the LSTM model achieved 93% accuracy with only 450 labeled samples using GPT-4o-generated labels, while on the MASAC dataset, DeepSeek Chat reached 82% accuracy with 650 labeled samples, matching the accuracy obtained through human labeling.

[71] On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty

Main category: cs.CL

TL;DR: The paper analyzes practical deployment concerns for finetuned LLM judges, focusing on future-proofing, backward compatibility, and question generalization in the math domain.

Details

Motivation: Standard evaluation ignores practical deployment concerns for finetuned judges, particularly their shelf life when dealing with future/past model responses and unseen questions.

Method: Study three aspects (future-proofing, backward compatibility, question generalization) in math domain using unified framework with varying train/test distributions, three SFT- and DPO-based finetuning algorithms, and three base models.

Result: Future-proofing is challenging, backward compatibility is relatively easy (DPO-trained models improve performance), continual learning provides balanced adaptation, and all models show performance degradation on unseen questions.

Conclusion: Current judges don’t fully generalize to unseen questions, and findings provide insights for developing judge models that can handle ever-changing generators.

Abstract: The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and finetuning. Recently, finetuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of finetuned judges regarding their real world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future proofing and backward compatibility – how well judges finetuned on responses by today’s generator models perform on responses by future models or past models, as well as question generalization – how well judges generalize to unseen questions at test time. We study these three aspects in the math domain under a unified framework with varying train and test distributions, three SFT- and DPO-based finetuning algorithms and three different base models. Experiments suggest that future-proofing is challenging for most models, while backward compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models observe certain degrees of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.

[72] Automatic Speech Recognition for Greek Medical Dictation

Vardis Georgilas, Themos Stafylakis

Main category: cs.CL

TL;DR: Development of a domain-specific Greek medical dictation system combining automatic speech recognition with text correction to handle medical terminology and improve transcription accuracy for healthcare professionals.

Details

Motivation: To assist Greek healthcare professionals by reducing manual documentation overload and improving workflow efficiency through accurate speech-to-text conversion for medical dictations.

Method: Combines automatic speech recognition techniques with text correction models, leveraging both acoustic and textual modeling. Uses domain-specific fine-tuning to adapt existing technologies to Greek medical context, addressing complex terminology and linguistic variations.

Result: The system achieves more accurate and coherent transcriptions through domain-specific adaptation, better handling Greek medical terminology and linguistic inconsistencies.

Conclusion: The developed system contributes to practical language technologies for Greek healthcare by providing reliable medical speech transcription, reducing documentation burden on healthcare professionals.

Abstract: Medical dictation systems are essential tools in modern healthcare, enabling accurate and efficient conversion of speech into written medical documentation. The main objective of this paper is to create a domain-specific system for Greek medical speech transcriptions. The ultimate goal is to assist healthcare professionals by reducing the overload of manual documentation and improving workflow efficiency. Towards this goal, we develop a system that combines automatic speech recognition techniques with text correction model, allowing better handling of domain-specific terminology and linguistic variations in Greek. Our approach leverages both acoustic and textual modeling to create more realistic and reliable transcriptions. We focused on adapting existing language and speech technologies to the Greek medical context, addressing challenges such as complex medical terminology and linguistic inconsistencies. Through domain-specific fine-tuning, our system achieves more accurate and coherent transcriptions, contributing to the development of practical language technologies for the Greek healthcare sector.

[73] Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales

Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Yang Xiang, Buzhou Tang

Main category: cs.CL

TL;DR: MoRSD improves chain-of-thought distillation by selecting high-quality rationales using a Rationale Difficulty metric, achieving 4.6% average improvement on seven datasets with fewer but better rationales.

Details

Motivation: Existing CoT distillation methods focus on data quantity over quality, potentially transferring noisy or incorrect information to student models.

Method: Proposed Model-Oriented Rationale Selection Distillation (MoRSD) with Rationale Difficulty metric to select high-quality rationales based on accuracy, diversity, and difficulty.

Result: Achieved 4.6% average improvement on seven datasets across three tasks using fewer rationales than baseline methods.

Conclusion: High-quality rationales are more important than quantity for effective CoT distillation, and MoRSD provides an efficient solution for reasoning capability transfer.

Abstract: Chain-of-thought (CoT) distillation aims to enhance small language models’ (SLMs) reasoning by transferring multi-step reasoning capability from the larger teacher models. However, existing work underestimates rationale quality, focusing primarily on data quantity, which may transfer noisy or incorrect information to the student model. To address the above issues, we proposed \textbf{M}odel-\textbf{O}riented \textbf{R}ationale \textbf{S}election \textbf{D}istillation (MoRSD), which can discern and select high quality rationales for distillation to improve performance further. We further propose a Rationale Difficulty (RD) metric to measure the ability of the student model to generate the correct answer under a given rationale. Compared to the baseline, we achieved 4.6$%$ average improvement on seven datasets over three tasks, using fewer rationales by controlling their accuracy, diversity, and difficulty. Our results reveal that a small portion of the high quality rationales can enhance the reasoning ability of student models than the entire dataset. Our method promises to be a possible solution for efficient CoT distillation. Our code will be released in https://github.com/Leon221220/MoRSD.

[74] Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks

Kevin Frank, Anmol Gulati, Elias Lumer, Sindy Campagna, Vamse Kumar Subbiah

Main category: cs.CL

TL;DR: Jackal is a large-scale benchmark for evaluating LLMs’ ability to translate natural language queries to JQL, featuring 100,000 text-to-JQL pairs with execution-based validation on a live Jira instance.

Details

Motivation: There is no existing open, real-world, execution-based benchmark for mapping natural language queries to JQL, despite enterprise teams' heavy reliance on JQL for retrieving Jira issues.

Method: Created Jackal benchmark with 100,000 natural language requests paired with validated JQL queries, executed on a live Jira instance with 200,000+ issues. Includes four user request types: Long NL, Short NL, Semantically Similar, and Semantically Exact.

Result: Evaluated 23 LLMs on Jackal-5K subset. Best model (Gemini 2.5 Pro) achieved only 60.3% execution accuracy overall, with significant variation across request types: Long NL (86.0%), Short NL (35.7%), Semantically Similar (22.7%), and Semantically Exact (99.3%).

Conclusion: Current state-of-the-art LLMs have significant limitations in producing correct and executable JQL queries, particularly for short and semantically similar natural language requests, highlighting the need for improved text-to-JQL capabilities.

Abstract: Enterprise teams rely on the Jira Query Language (JQL) to retrieve and filter issues from Jira. Yet, to our knowledge, there is no open, real-world, execution-based benchmark for mapping natural language queries to JQL. We introduce Jackal, a novel, large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect real-world usage, each JQL query is associated with four types of user requests: (i) Long NL, (ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. We release Jackal, a corpus of 100,000 text-to-JQL pairs, together with an execution-based scoring toolkit, and a static snapshot of the evaluated Jira instance for reproducibility. We report text-to-JQL results on 23 Large Language Models (LLMs) spanning parameter sizes, open and closed source models, across execution accuracy, exact match, and canonical exact match. In this paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy averaged equally across four user request types. Performance varies significantly across user request types: (i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv) Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct and executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and sets a new, execution-based challenge for future research in Jira enterprise data.

[75] LLM Hallucination Detection: HSAD

JinXin Li, Gang Tu, JunJie Hu

Main category: cs.CL

TL;DR: HSAD is a novel hallucination detection method that analyzes hidden layer temporal signals in the frequency domain using Fast Fourier Transform to identify reasoning anomalies in LLMs.

Details

Motivation: Current hallucination detection methods are limited by knowledge coverage constraints and inability to capture reasoning biases during inference, creating barriers for LLM deployment in critical applications.

Method: Model LLM reasoning as a cognitive journey over time, apply FFT to map hidden layer temporal signals to frequency domain, construct spectral features to capture reasoning anomalies, and design detection algorithm based on these features.

Result: The method effectively captures anomalies during reasoning process and demonstrates higher detection accuracy and robustness compared to existing approaches.

Conclusion: HSAD overcomes limitations of existing methods by combining reasoning process modeling with frequency-domain feature extraction, enabling better hallucination detection in LLMs.

Abstract: Although Large Language Models have demonstrated powerful capabilities in a wide range of tasks such as language understanding and code generation, the frequent occurrence of hallucinations during the generation process has become a significant impediment to their deployment in critical application scenarios. Current mainstream hallucination detection methods rely on factual consistency verification or static hidden layer features. The former is constrained by the scope of knowledge coverage, while the latter struggles to capture reasoning biases during the inference process. To address these issues, and inspired by signal analysis methods in cognitive neuroscience, this paper proposes a hallucination detection method based on the frequency-domain analysis of hidden layer temporal signals, named HSAD (\textbf{H}idden \textbf{S}ignal \textbf{A}nalysis-based \textbf{D}etection). First, by treating the LLM’s reasoning process as a cognitive journey that unfolds over time, we propose modeling and simulating the human process of signal perception and discrimination in a deception-detection scenario through hidden layer temporal signals. Next, The Fast Fourier Transform is applied to map these temporal signals into the frequency domain to construct spectral features, which are used to capture anomalies that arise during the reasoning process; analysis experiments on these spectral features have proven the effectiveness of this approach. Finally, a hallucination detection algorithm is designed based on these spectral features to identify hallucinations in the generated content. By effectively combining the modeling of the reasoning process with frequency-domain feature extraction, the HSAD method overcomes the limitations of existing approaches in terms of knowledge coverage and the detection of reasoning biases, demonstrating higher detection accuracy and robustness.

[76] Timber: Training-free Instruct Model Refining with Base via Effective Rank

Taiqiang Wu, Runming Yang, Tao Liu, Jiahao Wang, Zenan Xu, Ngai Wong

Main category: cs.CL

TL;DR: Timber is a training-free method that enhances exploration capability of Instruct models by partially reverting them towards their Base models through targeted refinement of weight deltas, improving performance particularly on Pass@k metrics.

Details

Motivation: Post-training that converts Base models to Instruct models is considered superficial and creates a trade-off - improving exploitation capabilities while limiting exploration capabilities.

Method: Timber partially reverts Instruct models towards their paired Base models by making subtle, targeted refinements to the weight deltas between Base and Instruct models, without requiring additional training.

Result: Extensive experiments on Llama and Qwen series show Timber consistently improves vanilla Instruct models, particularly enhancing Pass@k performance.

Conclusion: The findings provide new insights into post-training at the weight level and offer practical strategies to refine Instruct models without additional training, addressing the exploration-exploitation trade-off.

Abstract: Post-training, which elicits a pretrained Base model into the corresponding Instruct model, is widely considered to be superficial. In this work, we first reinforce this hypothesis by providing novel quantitative evidence from the weight level that the effective rank (eRank) remains negligibly changed. However, this superficiality also suffers a critical trade-off, improving the exploitation capabilities at the cost of limiting its exploration. To tackle this issue, we propose Timber, a simple yet effective training-free method that enhances the exploration capability of the Instruct model while preserving its exploitation. The key insight is to partially revert Instruct towards the paired Base model by subtle yet targeted refinement of the weight deltas. Extensive experiments on Llama and Qwen series demonstrate that Timber consistently improves vanilla Instruct models, particularly on Pass@k performance. Our findings offer new insights into the post-training stage at the weight level and practical strategies to refine the Instruct model without training.

[77] Fast Thinking for Large Language Models

Haoyu Zheng, Zhuonan Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Zheqi Lv, Juncheng Li, Siliang Tang, Yueting Zhuang, Hongyang He

Main category: cs.CL

TL;DR: Latent Codebooks for Fast Thinking uses concise CoT sketches during training to learn discrete strategy priors, enabling efficient inference with continuous thinking vectors and adaptive routing to reduce token generation.

Details

Motivation: Traditional reasoning LLMs rely on explicit step-by-step token generation, which is inefficient due to long reasoning traces that increase latency and token usage.

Method: Learn a codebook of discrete strategy priors using concise CoT sketches during training, then use continuous thinking vectors from the codebook at inference with GainRouter for adaptive switching between fast and slow reasoning.

Result: Achieves competitive or superior accuracy across multiple reasoning benchmarks while substantially lowering inference cost.

Conclusion: Offers a practical path toward efficient and controllable reasoning in large language models by reducing unnecessary token generation and suppressing overthinking.

Abstract: Reasoning-oriented Large Language Models (LLMs) often rely on generating explicit tokens step by step, and their effectiveness typically hinges on large-scale supervised fine-tuning or reinforcement learning. While Chain-of-Thought (CoT) techniques substantially enhance performance on complex reasoning tasks, they remain inefficient, requiring long reasoning traces that increase latency and token usage. In this work, we introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking vectors distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens. To complement this design, we propose GainRouter, a lightweight routing mechanism that adaptively switches between fast codebook guided inference and slow explicit reasoning, thereby suppressing overthinking and reducing unnecessary token generation. Experiments across multiple reasoning benchmarks show that our approach achieves competitive or superior accuracy while substantially lowering inference cost, offering a practical path toward efficient and controllable reasoning in large language models.

[78] Don’t Settle Too Early: Self-Reflective Remasking for Diffusion Language Models

Zemin Huang, Yuhang Wang, Zhiyang Chen, Guo-Jun Qi

Main category: cs.CL

TL;DR: RemeDi is a mask-based diffusion language model that introduces remasking to enable flexible text refinement by predicting token distributions and confidence scores, allowing it to identify and resample low-quality tokens.

Details

Motivation: Mask-based DLMs struggle to revise incorrect tokens once generated, with the key challenge being error identification in inputs. Current models lack flexibility in text refinement.

Method: Proposes remasking mechanism that jointly predicts token distributions and per-token confidence scores. Uses remask-aware training pipeline with supervised fine-tuning to detect/remask incorrect tokens and reinforcement learning to optimize generation trajectories.

Result: Achieves state-of-the-art results among open-source DLMs on multiple datasets.

Conclusion: RemeDi successfully enables more flexible text refinement in diffusion-based text generation through the novel remasking mechanism and confidence-based token revision.

Abstract: Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose \emph{\underline{Rem}asking-\underline{e}nabled \underline{Di}ffusion Language Model (RemeDi}, a mask-based DLM that introduces \emph{remasking} as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens to be unmasked after the current step, allowing the model to identify tokens with low quality and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning which teaches the model to detect and remask incorrect tokens in addition to predict mask tokens, and reinforcement learning which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves the state-of-the-art results among open-source DLMs on multiple datasets.

[79] Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs

Shulin Huang, Yiran Ding, Junshu Pan, Yue Zhang

Main category: cs.CL

TL;DR: RL training shows superior cross-lingual reasoning generalization compared to SFT, with non-English training data yielding better performance than English data.

Details

Motivation: To investigate the impact of reinforcement learning vs supervised fine-tuning on cross-lingual reasoning generalization, which remains unexplored despite RL's superior performance in complex reasoning.

Method: Systematic investigation using Qwen2.5-3B-Base model on diverse multilingual reasoning benchmarks (math, commonsense, scientific reasoning) with comprehensive mechanistic analyses.

Result: RL achieves higher accuracy and substantially stronger cross-lingual generalization than SFT. RL training on non-English data yields better overall performance and generalization than English data, unlike SFT.

Conclusion: RL enables more robust reasoning strategies, providing crucial guidance for equitable and effective multilingual reasoning.

Abstract: Enhancing the complex reasoning capabilities of Large Language Models (LLMs) attracts widespread attention. While reinforcement learning (RL) has shown superior performance for improving complex reasoning, its impact on cross-lingual generalization compared to Supervised Fine-Tuning (SFT) remains unexplored. We present the first systematic investigation into cross-lingual reasoning generalization of RL and SFT. Using Qwen2.5-3B-Base as our foundation model, we conduct experiments on diverse multilingual reasoning benchmarks, including math reasoning, commonsense reasoning, and scientific reasoning. Our investigation yields two significant findings: (1) Tuning with RL not only achieves higher accuracy but also demonstrates substantially stronger cross-lingual generalization capabilities compared to SFT. (2) RL training on non-English data yields better overall performance and generalization than training on English data, which is not observed with SFT. Furthermore, through comprehensive mechanistic analyses, we explore the underlying factors of RL’s superiority and generalization across languages. Our results provide compelling evidence that RL enables the model with more robust reasoning strategies, offering crucial guidance for more equitable and effective multilingual reasoning.

[80] Aligning LLMs for Multilingual Consistency in Enterprise Applications

Amit Agarwal, Hansa Meghwani, Hitesh Laxmichand Patel, Tao Sheng, Sujith Ravi, Dan Roth

Main category: cs.CL

TL;DR: A batch-wise alignment strategy for fine-tuning LLMs using multilingual data to reduce performance gaps between English and non-English languages, improving non-English accuracy by up to 23.9% without compromising English performance.

Details

Motivation: LLMs are unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings.

Method: A practical, batch-wise alignment strategy for fine-tuning LLMs that leverages semantically equivalent multilingual data in each training batch to directly align model outputs across languages.

Result: Improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Reduces the accuracy drop from 29% to much lower levels.

Conclusion: The proposed method is simple to implement, scalable, and integrates seamlessly with existing LLM training & deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.

Abstract: Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to an 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training & deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.

[81] TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F

Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, Hao Chen

Main category: cs.CL

TL;DR: TF-Bench is a new benchmark for evaluating LLM reasoning capabilities using type inference in System F, with a pure semantics-driven variant that removes natural language cues to test genuine program understanding.

Details

Motivation: Current benchmarks lack formal deductive frameworks for evaluating program semantics reasoning and cannot distinguish between genuine reasoning and superficial pattern matching of natural language-code associations.

Method: Developed TF-Bench based on type inference in System F, with verified transformations to create TF-Bench_pure that removes semantically irrelevant natural language, plus novel metrics for robustness and test-time reasoning effectiveness.

Result: State-of-the-art LLMs show substantial limitations, with Claude-3.7-sonnet achieving only 55.85% accuracy on TF-Bench_pure, revealing critical gaps in genuine program semantics reasoning.

Conclusion: Current LLMs have significant limitations in program semantics reasoning, highlighting the need for improved benchmarks and future research directions in this area.

Abstract: Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.

[82] VIVA+: Human-Centered Situational Decision-Making

Zhe Hu, Yixiao Ren, Guanzhong Liu, Jing Li, Yu Yin

Main category: cs.CL

TL;DR: VIVA+ is a cognitively grounded benchmark with 1,317 real-world situations and 6,373 multiple-choice questions to evaluate MLLMs’ reasoning and decision-making across three core abilities: situation comprehension, action justification, and reflective reasoning.

Details

Motivation: Existing evaluation methods struggle to assess MLLMs' capacity for nuanced, human-like reasoning and decision-making in complex, human-centered environments, creating a need for more systematic evaluation frameworks.

Method: Developed VIVA+ benchmark with three core dimensions: Foundational Situation Comprehension, Context-Driven Action Justification, and Reflective Reasoning. Evaluated latest commercial and open-source MLLMs, and explored targeted training and multi-step reasoning strategies.

Result: Revealed distinct performance patterns and significant challenges in current MLLMs. Targeted training and multi-step reasoning strategies yielded consistent performance improvements across models.

Conclusion: Current MLLMs have limitations in robust, context-aware, and socially adept decision-making. The analysis provides actionable insights for advancing MLLMs toward better real-world performance in human-centered environments.

Abstract: Multimodal Large Language Models (MLLMs) show promising results for embodied agents in operating meaningfully in complex, human-centered environments. Yet, evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. In this work, we introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a systematic framework for assessing a model’s ability to perceive, reason, and act in socially meaningful ways. We evaluate the latest commercial and open-source models on VIVA+, where we reveal distinct performance patterns and highlight significant challenges. We further explore targeted training and multi-step reasoning strategies, which yield consistent performance improvements. Finally, our in-depth analysis highlights current model limitations and provides actionable insights for advancing MLLMs toward more robust, context-aware, and socially adept decision-making in real-world settings.

Zhiqiang Liu, Yichi Zhang, Mengshu Sun, Lei Liang, Wen Zhang

Main category: cs.CL

TL;DR: M-Hyper is a novel multi-modal knowledge graph completion method that uses quaternion algebra to enable both fused and independent modality representations, achieving state-of-the-art performance.

Details

Motivation: Existing MMKGC methods have limitations: fusion-based methods lose modality-specific information and lack flexibility, while ensemble-based methods fail to capture nuanced cross-modal interactions.

Method: Proposes M-Hyper using quaternion algebra with four orthogonal bases to represent independent modalities. Uses Hamilton product for modality interactions, with FERF module for entity representation factorization and R2MF module for relation-aware modality fusion.

Result: Extensive experiments show state-of-the-art performance, robustness, and computational efficiency compared to existing methods.

Conclusion: M-Hyper successfully integrates strengths of both fusion and ensemble paradigms, enabling effective cross-modal interactions while preserving modality-specific information through quaternion-based representation.

Abstract: Multi-modal knowledge graph completion (MMKGC) aims to discover missing facts in multi-modal knowledge graphs (MMKGs) by leveraging both structural relationships and diverse modality information of entities. Existing MMKGC methods follow two multi-modal paradigms: fusion-based and ensemble-based. Fusion-based methods employ fixed fusion strategies, which inevitably leads to the loss of modality-specific information and a lack of flexibility to adapt to varying modality relevance across contexts. In contrast, ensemble-based methods retain modality independence through dedicated sub-models but struggle to capture the nuanced, context-dependent semantic interplay between modalities. To overcome these dual limitations, we propose a novel MMKGC method M-Hyper, which achieves the coexistence and collaboration of fused and independent modality representations. Our method integrates the strengths of both paradigms, enabling effective cross-modal interactions while maintaining modality-specific information. Inspired by ``quaternion’’ algebra, we utilize its four orthogonal bases to represent multiple independent modalities and employ the Hamilton product to efficiently model pair-wise interactions among them. Specifically, we introduce a Fine-grained Entity Representation Factorization (FERF) module and a Robust Relation-aware Modality Fusion (R2MF) module to obtain robust representations for three independent modalities and one fused modality. The resulting four modality representations are then mapped to the four orthogonal bases of a biquaternion (a hypercomplex extension of quaternion) for comprehensive modality interaction. Extensive experiments indicate its state-of-the-art performance, robustness, and computational efficiency.

[84] Do LLMs Understand Romanian Driving Laws? A Study on Multimodal and Fine-Tuned Question Answering

Eduard Barbu, Adrian Marius Dumitran

Main category: cs.CL

TL;DR: This paper evaluates LLMs on Romanian driving-law QA with explanations, using a 1,208-question dataset. Fine-tuned 8B models compete with SOTA, textual image descriptions outperform visual input, and LLM-as-a-Judge reveals self-preference bias.

Details

Motivation: Ensuring drivers master traffic rules is critical for road safety, particularly for Romanian driving laws where explainable QA systems are needed for less-resourced languages.

Method: Created a 1,208-question dataset (387 multimodal), compared text-only and multimodal SOTA systems, fine-tuned Llama 3.1-8B-Instruct and RoLlama 3.1-8B-Instruct, and used LLM-as-a-Judge to assess explanation quality.

Result: SOTA models perform well but fine-tuned 8B models are competitive. Textual descriptions of images outperform direct visual input. LLM-as-a-Judge assessment reveals self-preference bias in explanation evaluation.

Conclusion: The study provides insights for developing explainable QA systems for less-resourced languages like Romanian, showing the effectiveness of fine-tuned smaller models and the limitations of visual input compared to textual descriptions.

Abstract: Ensuring that both new and experienced drivers master current traffic rules is critical to road safety. This paper evaluates Large Language Models (LLMs) on Romanian driving-law QA with explanation generation. We release a 1{,}208-question dataset (387 multimodal) and compare text-only and multimodal SOTA systems, then measure the impact of domain-specific fine-tuning for Llama 3.1-8B-Instruct and RoLlama 3.1-8B-Instruct. SOTA models perform well, but fine-tuned 8B models are competitive. Textual descriptions of images outperform direct visual input. Finally, an LLM-as-a-Judge assesses explanation quality, revealing self-preference bias. The study informs explainable QA for less-resourced languages.

[85] Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan

Main category: cs.CL

TL;DR: MLLMs struggle with cross-modal reasoning due to integration failures rather than perception issues, with performance degrading when modalities provide redundant or chained information rather than independent reasoning paths.

Details

Motivation: To address inconsistencies in understanding whether additional modalities help or harm reasoning in multimodal LLMs, and to systematically analyze when and why modality interactions support or undermine reasoning.

Method: Developed a logic-grounded evaluation framework categorizing multimodal reasoning into six interaction patterns, analyzed attention patterns, and tested two-step prompting (recognize then reason) and early fusion modifications.

Result: Additional modalities only enhance reasoning when providing independent and sufficient reasoning paths. Performance degrades due to three systematic failures: weaker modalities dragging down performance, modality conflicts creating bias, and ineffective integration of joint signals.

Conclusion: Integration, not perception, is the main barrier to multimodal reasoning. Composition-aware training and early fusion control are promising directions to address task-composition and fusion bottlenecks.

Abstract: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models’ internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.

[86] Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis

Chao Wang, Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Main category: cs.CL

TL;DR: Speech integration in LLMs weakens textual competence. The paper analyzes this issue via encoder-adaptor paradigm, identifies parameter importance distribution shift, and proposes layer-wise learning rate and LoRA to preserve textual knowledge.

Details

Motivation: Speech-enabled LLMs experience degradation in core textual competence, limiting their ability to leverage pre-trained text knowledge effectively.

Method: Proposed analytical framework using parameter importance estimation to identify textual importance distribution shift. Investigated two mitigation strategies: layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA).

Result: Both approaches better maintain textual competence than full fine-tuning while improving spoken question answering performance.

Conclusion: The analysis provides principled explanation for mitigation strategies’ effectiveness, linking benefits to structural properties of textual knowledge in LLMs.

Abstract: The integration of speech into Large Language Models (LLMs) has substantially expanded their capabilities, but often at the cost of weakening their core textual competence. This degradation limits the ability of speech-enabled LLMs to fully exploit their pre-trained text-based knowledge. In this work, we analyze the underlying mechanisms of this issue through a focused study of the widely used encoder-adaptor paradigm. We propose an analytical framework based on parameter importance estimation, which reveals that fine-tuning for speech introduces a textual importance distribution shift: the layer-wise allocation of parameters critical to textual reasoning is disrupted. Building on this insight, we investigate two mitigation strategies: layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA), both aim to preserve the original parameter distribution. Experimental results show that both approaches better maintain textual competence than full fine-tuning, while also improving downstream spoken question answering performance. Furthermore, our analysis offers a principled explanation for the effectiveness of the proposed mitigation strategies, linking their benefits to the structural properties of textual knowledge in LLMs.

[87] Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality

Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, Haifeng Wang

Main category: cs.CL

TL;DR: KLCF is a reinforcement learning framework that addresses LLM hallucinations by aligning policy model’s expressed knowledge with base model’s parametric knowledge through dual-fact alignment mechanism.

Details

Motivation: Hallucination and factuality deficits remain key obstacles to LLM reliability in long-form generation. Existing RLHF frameworks overlook model's internal knowledge boundaries, exacerbating 'hallucination tax'.

Method: Uses knowledge-level consistency reinforcement learning with dual-fact alignment: constructs fact checklist from pretrained knowledge boundaries for factual recall, and trains self-assessment module for factual precision. Fully external-knowledge-free and lightweight reward design.

Result: Substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.

Conclusion: KLCF provides an efficient and scalable solution to improve LLM factuality without relying on external retrieval or heavy verification systems.

Abstract: Hallucination and factuality deficits remain key obstacles to the reliability of large language models (LLMs) in long-form generation. Existing reinforcement learning from human feedback (RLHF) frameworks primarily rely on preference rewards, yet they often overlook the model’s internal knowledge boundaries, exacerbating the so-called “hallucination tax”. To address this challenge, we propose Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), a novel framework that focuses on the knowledge consistency between the policy model’s expressed knowledge and the base model’s parametric knowledge, and introduces a Dual-Fact Alignment mechanism to jointly optimize factual recall and precision. Specifically, KLCF leverages pretrained knowledge boundaries to construct fact checklist, guiding online reinforcement learning to improve factual coverage and recall; simultaneously, it trains a self-assessment module based on the base model’s internal knowledge to enhance factual precision during generation. Unlike prior methods that rely on external retrieval or heavy verification, our reward design is fully external-knowledge-free and lightweight, making KLCF efficient and easily scalable to large-scale training. Experimental results demonstrate that KLCF substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.

[88] From Personal to Collective: On the Role of Local and Global Memory in LLM Personalization

Zehong Wang, Junlin Wu, ZHaoxuan Tan, Bolian Li, Xianrui Zhong, Zheli Liu, Qingkai Zeng

Main category: cs.CL

TL;DR: LoGo framework addresses LLM personalization challenges by combining local user memory with global collective memory and using a mediator to resolve conflicts, improving cold-start and bias issues.

Details

Motivation: To overcome the cold-start problem (insufficient user history) and biasing problem (overfitting to skewed user preferences) in LLM personalization by leveraging collective knowledge across users.

Method: Proposed LoGo framework with local-global memory: local memory for individual user preferences and global memory for shared interests across population, plus a mediator module to reconcile conflicts between local and global signals.

Result: Extensive experiments on multiple benchmarks show LoGo consistently improves personalization quality by warming up cold-start users and mitigating biased predictions.

Conclusion: Incorporating collective knowledge through the LoGo framework effectively enhances LLM personalization by addressing both cold-start and biasing problems.

Abstract: Large language model (LLM) personalization aims to tailor model behavior to individual users based on their historical interactions. However, its effectiveness is often hindered by two key challenges: the \textit{cold-start problem}, where users with limited history provide insufficient context for accurate personalization, and the \textit{biasing problem}, where users with abundant but skewed history cause the model to overfit to narrow preferences. We identify both issues as symptoms of a common underlying limitation, i.e., the inability to model collective knowledge across users. To address this, we propose a local-global memory framework (LoGo) that combines the personalized local memory with a collective global memory that captures shared interests across the population. To reconcile discrepancies between these two memory sources, we introduce a mediator module designed to resolve conflicts between local and global signals. Extensive experiments on multiple benchmarks demonstrate that LoGo consistently improves personalization quality by both warming up cold-start users and mitigating biased predictions. These results highlight the importance of incorporating collective knowledge to enhance LLM personalization.

[89] Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

Yoonah Park, Haesung Pyun, Yohan Jo

Main category: cs.CL

TL;DR: LLMs often fail on MCQs despite having correct knowledge. KAPPA is a parameter-free method that aligns hidden states to bridge this gap by projecting onto knowledge and prediction subspaces.

Details

Motivation: To understand why LLMs fail on multiple-choice questions despite demonstrating correct knowledge in free-form generation, and to develop a method to bridge this knowledge-prediction gap.

Method: KAPPA (Knowledge-Aligned Prediction through Projection-based Adjustment) - a parameter-free intervention that transforms hidden states to align prediction coordinates with knowledge coordinates in a subspace spanned by knowledge and prediction bases.

Result: KAPPA substantially improves accuracy on binary-choice reformulations of Big-Bench-Hard and ARC-Challenge, consistently outperforming baselines. It also extends effectiveness to free-form questions beyond MCQs.

Conclusion: The work provides a geometric understanding of the knowledge-prediction gap and offers a practical method for better aligning model behavior with its latent knowledge.

Abstract: Large Language Models (LLMs) often fail on multiple-choice questions (MCQs) despite demonstrating correct knowledge in other contexts, such as free-form generation. To investigate the mechanism underlying this knowledge-prediction gap on MCQs and alleviate it, we conduct a probing analysis and find that residual streams in certain layers contain a subspace spanned by two important bases: a \emph{knowledge basis} that encodes the probability of the ground-truth answer for a given MCQ and a \emph{prediction basis} that encodes the probability of the answer choice predicted by the model. We observe that incorrect predictions arise from a misalignment of the model’s hidden states along these two bases. Hence, we introduce \textbf{KAPPA} (Knowledge-Aligned Prediction through Projection-based Adjustment), a parameter-free intervention that transforms the hidden states to align the prediction coordinate with the knowledge coordinate within this subspace. Experiments on binary-choice reformulations of Big-Bench-Hard and ARC-Challenge show that KAPPA substantially improves accuracy and consistently outperforms baselines. While optimal subspaces differ across tasks, subspaces generalize to some extent, as supported by cross-dataset experiments. Moreover, KAPPA extends its effectiveness to free-form questions beyond MCQs. Our work provides a new geometric understanding of the knowledge-prediction gap and offers a practical method for better aligning model behavior with its latent knowledge.

[90] Transformer Tafsir at QIAS 2025 Shared Task: Hybrid Retrieval-Augmented Generation for Islamic Knowledge Question Answering

Muhammad Abu Ahmad, Mohamad Ballout, Raia Abu Ahmad, Elia Bruni

Main category: cs.CL

TL;DR: A hybrid RAG system combining sparse and dense retrieval with cross-encoder reranking was developed for Islamic knowledge tasks, improving LLM performance by up to 25% accuracy.

Details

Motivation: To enhance large language model performance on Islamic knowledge understanding and reasoning tasks for the QIAS 2025 shared task.

Method: Three-stage pipeline: BM25 for initial retrieval, dense embedding model for semantic matching, and cross-encoder reranking for precise content retrieval, tested with Fanar and Mistral LLMs.

Result: RAG pipeline improved performance across both subtasks, with Fanar achieving best results: 45% accuracy in Subtask 1 and 80% in Subtask 2, with improvements up to 25%.

Conclusion: The hybrid RAG approach effectively enhances LLM performance on Islamic knowledge tasks, with Fanar showing superior results in the proposed pipeline.

Abstract: This paper presents our submission to the QIAS 2025 shared task on Islamic knowledge understanding and reasoning. We developed a hybrid retrieval-augmented generation (RAG) system that combines sparse and dense retrieval methods with cross-encoder reranking to improve large language model (LLM) performance. Our three-stage pipeline incorporates BM25 for initial retrieval, a dense embedding retrieval model for semantic matching, and cross-encoder reranking for precise content retrieval. We evaluate our approach on both subtasks using two LLMs, Fanar and Mistral, demonstrating that the proposed RAG pipeline enhances performance across both, with accuracy improvements up to 25%, depending on the task and model configuration. Our best configuration is achieved with Fanar, yielding accuracy scores of 45% in Subtask 1 and 80% in Subtask 2.

[91] Open-DeBias: Toward Mitigating Open-Set Bias in Language Models

Arti Rani, Shweta Singh, Nihar Ranjan Sahoo, Gaurav Kumar Nayak

Main category: cs.CL

TL;DR: Open-DeBias is a novel debiasing method that uses adapter modules to mitigate both known and unseen biases in LLMs, achieving significant improvements in QA accuracy and multilingual generalization with minimal training data.

Details

Motivation: LLMs often encode harmful biases that compromise fairness, but existing bias mitigation approaches are limited to predefined categories and cannot address novel or context-specific emergent biases.

Method: Proposes Open-DeBias, a data-efficient and parameter-efficient debiasing method using adapter modules to mitigate social and stereotypical biases while generalizing to unseen biases.

Result: Improves QA accuracy by 48% on ambiguous subsets and 6% on disambiguated ones compared to BMBI, and achieves 84% accuracy in zero-shot transfer to Korean BBQ, demonstrating robust multilingual generalization.

Conclusion: Open-DeBias is effective for general-purpose, open-domain bias mitigation across various NLP tasks, showing robustness, multilingual strength, and suitability for addressing both known and emergent biases.

Abstract: Large Language Models (LLMs) have achieved remarkable success on question answering (QA) tasks, yet they often encode harmful biases that compromise fairness and trustworthiness. Most existing bias mitigation approaches are restricted to predefined categories, limiting their ability to address novel or context-specific emergent biases. To bridge this gap, we tackle the novel problem of open-set bias detection and mitigation in text-based QA. We introduce OpenBiasBench, a comprehensive benchmark designed to evaluate biases across a wide range of categories and subgroups, encompassing both known and previously unseen biases. Additionally, we propose Open-DeBias, a novel, data-efficient, and parameter-efficient debiasing method that leverages adapter modules to mitigate existing social and stereotypical biases while generalizing to unseen ones. Compared to the state-of-the-art BMBI method, Open-DeBias improves QA accuracy on BBQ dataset by nearly $48%$ on ambiguous subsets and $6%$ on disambiguated ones, using adapters fine-tuned on just a small fraction of the training data. Remarkably, the same adapters, in a zero-shot transfer to Korean BBQ, achieve $84%$ accuracy, demonstrating robust language-agnostic generalization. Through extensive evaluation, we also validate the effectiveness of Open-DeBias across a broad range of NLP tasks, including StereoSet and CrowS-Pairs, highlighting its robustness, multilingual strength, and suitability for general-purpose, open-domain bias mitigation. The project page is available at: https://sites.google.com/view/open-debias25

[92] SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models

Ziyi Yang, Weizhou Shen, Ruijun Chen, Chenliang Li, Fanqi Wan, Ming Yan, Xiaojun Quan, Fei Huang

Main category: cs.CL

TL;DR: SPELL is a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning in LLMs, achieving significant performance improvements without requiring human annotations.

Details

Motivation: Progress in long-context reasoning for LLMs has lagged due to the intrinsic difficulty of processing long texts and scarcity of reliable human annotations and verifiable reward signals.

Method: SPELL integrates three cyclical roles (questioner, responder, verifier) within a single model for continual self-improvement. It uses automated curriculum learning with gradually increasing document length and adaptive reward functions based on question difficulty.

Result: Extensive experiments on six long-context benchmarks show SPELL consistently improves performance across diverse LLMs, outperforming equally sized models fine-tuned on large-scale annotated data. It achieves an average 7.6-point gain in pass@8 on Qwen3-30B-A3B-Thinking.

Conclusion: SPELL shows promise for scaling to even more capable models and effectively addresses the challenge of long-context reasoning optimization without requiring human annotations.

Abstract: Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles-questioner, responder, and verifier-within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder’s output and the questioner’s reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model’s evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.

[93] Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning

Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang

Main category: cs.CL

TL;DR: Q-Tuning is a unified framework that jointly optimizes sample and token pruning for efficient supervised fine-tuning of LLMs, achieving superior performance with only 12.5% of training data.

Details

Motivation: Existing data pruning methods operate in isolation at either sample or token level, leading to inefficiencies where high-value samples may contain redundant tokens and token pruning may discard crucial instructional signals.

Method: Proposes Error-Uncertainty Plane diagnostic framework and Quadrant-based Tuning with two-stage strategy: sample-level triage to retain informative examples, followed by asymmetric token-pruning that trims less salient tokens only from misconception samples while preserving calibration samples entirely.

Result: Sets new state-of-the-art across five diverse benchmarks. On SmolLM2-1.7B, achieves +38% average improvement over full-data SFT baseline using only 12.5% of original training data.

Conclusion: Q-Tuning is the first dynamic pruning approach to consistently outperform full-data training, providing a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.

Abstract: As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies–high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.

[94] DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval via Adaptive Patch-Level Embedding Pruning

Yibo Yan, Guangwei Xu, Xin Zou, Shuliang Liu, James Kwok, Xuming Hu

Main category: cs.CL

TL;DR: DocPruner is a framework that reduces storage overhead in Visual Document Retrieval by adaptively pruning redundant patch-level embeddings using intra-document attention distribution, achieving 50-60% storage reduction with minimal performance loss.

Details

Motivation: Current multi-vector VDR methods using LVLMs create prohibitive storage overhead by storing hundreds of vectors per page, making large-scale deployment costly and impractical.

Method: DocPruner employs adaptive patch-level embedding pruning by leveraging intra-document patch attention distribution to dynamically identify and discard redundant embeddings for each document.

Result: Achieves 50-60% reduction in storage for leading multi-vector VDR models with negligible degradation in document retrieval performance across more than ten datasets.

Conclusion: DocPruner offers a robust, flexible, and effective solution for building storage-efficient, large-scale VDR systems.

Abstract: Visual Document Retrieval (VDR), the task of retrieving visually-rich document pages using queries that combine visual and textual cues, is crucial for numerous real-world applications. Recent state-of-the-art methods leverage Large Vision-Language Models (LVLMs) in a multi-vector paradigm, representing each document as patch-level embeddings to capture fine-grained details. While highly effective, this approach introduces a critical challenge: prohibitive storage overhead, as storing hundreds of vectors per page makes large-scale deployment costly and impractical. To address this, we introduce DocPruner, the first framework to employ adaptive patch-level embedding pruning for VDR to effectively reduce the storage overhead. DocPruner leverages the intra-document patch attention distribution to dynamically identify and discard redundant embeddings for each document. This adaptive mechanism enables a significant 50-60% reduction in storage for leading multi-vector VDR models with negligible degradation in document retrieval performance. Extensive experiments across more than ten representative datasets validate that DocPruner offers a robust, flexible, and effective solution for building storage-efficient, large-scale VDR systems.

[95] Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao

Main category: cs.CL

TL;DR: The paper proposes EOSER and ASS decoding strategies and CJ-GRPO reinforcement learning method to optimize masked diffusion language models, addressing training-inference inconsistencies and enabling competitive performance with fewer decoding steps.

Details

Motivation: Masked diffusion language models offer advantages like parallel decoding and flexible generation orders, but existing decoding strategies and RL algorithms are not well-suited for them, leading to suboptimal performance and training-inference inconsistencies.

Method: Proposes EOSER (EOS Early Rejection) and ASS (Ascending Step-Size) decoding scheduler for full diffusion-style decoding, and CJ-GRPO (Consistency Trajectory Group Relative Policy Optimization) RL method that ensures consistency between rollout and optimization trajectories.

Result: Experiments on reasoning tasks using LLaDA-8B-Instruct show the proposed methods achieve competitive performance with fewer decoding steps and effectively tame MDLMs.

Conclusion: The EOSER and ASS mechanisms combined with CJ-GRPO provide promising solutions for efficiently optimizing masked diffusion language models, addressing key challenges in decoding strategies and reinforcement learning for non-causal parallel decoding models.

Abstract: Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding are non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between rollout trajectory and optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.

[96] Assessing Large Language Models in Updating Their Forecasts with New Information

Zhangdie Yuan, Zifeng Ding, Andreas Vlachos

Main category: cs.CL

TL;DR: EVOLVECAST framework evaluates how LLMs revise predictions when given new post-training information, finding they update inconsistently and conservatively compared to human forecasters.

Details

Motivation: Prior work treated future event prediction as static, ignoring how forecasts should evolve with new evidence. This gap needs addressing to understand LLM belief updating.

Method: Introduced EVOLVECAST framework to assess LLM prediction revisions when presented with post-training cutoff information, using human forecasters as reference for comparison.

Result: LLMs show some responsiveness to new information but updates are inconsistent and overly conservative. Neither verbalized nor logits-based confidence estimates consistently outperform each other, both falling short of human standards.

Conclusion: Models exhibit conservative bias across settings, highlighting the need for more robust approaches to belief updating in LLMs.

Abstract: Prior work has largely treated future event prediction as a static task, failing to consider how forecasts and the confidence in them should evolve as new evidence emerges. To address this gap, we introduce EVOLVECAST, a framework for evaluating whether large language models appropriately revise their predictions in response to new information. In particular, EVOLVECAST assesses whether LLMs adjust their forecasts when presented with information released after their training cutoff. We use human forecasters as a comparative reference to analyze prediction shifts and confidence calibration under updated contexts. While LLMs demonstrate some responsiveness to new information, their updates are often inconsistent or overly conservative. We further find that neither verbalized nor logits-based confidence estimates consistently outperform the other, and both remain far from the human reference standard. Across settings, models tend to express conservative bias, underscoring the need for more robust approaches to belief updating.

[97] Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

Guojian Li, Chengyou Wang, Hongfei Xue, Shuiyuan Wang, Dehui Gao, Zihan Zhang, Yuke Lin, Wenjie Li, Longshuai Xiao, Zhonghua Fu, Lei Xie

Main category: cs.CL

TL;DR: Easy Turn is an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states, achieving state-of-the-art performance on their open-source testset.

Details

Motivation: Full-duplex interaction is crucial for natural human-machine communication but remains challenging due to the lack of open-source turn-taking models. Existing solutions are either not open-sourced, limited by large parameter size, support only single modality, or require scarce full-duplex data.

Method: Proposed Easy Turn, an open-source modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait. Also released Easy Turn trainset, a 1,145-hour speech dataset for training turn-taking detection models.

Result: Achieved state-of-the-art turn-taking detection accuracy on the open-source Easy Turn testset compared to existing open-source models like TEN Turn Detection and Smart Turn V2.

Conclusion: Easy Turn provides an effective open-source solution for turn-taking detection by integrating bimodal information and releasing a large training dataset, addressing the limitations of existing approaches.

Abstract: Full-duplex interaction is crucial for natural human-machine communication, yet remains challenging as it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions either rely on dedicated turn-taking models, most of which are not open-sourced. The few available ones are limited by their large parameter size or by supporting only a single modality, such as acoustic or linguistic. Alternatively, some approaches finetune LLM backbones to enable full-duplex capability, but this requires large amounts of full-duplex data, which remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait, accompanied by the release of Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared to existing open-source models like TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset. The data and model will be made publicly available on GitHub.

[98] Vision-Grounded Machine Interpreting: Improving the Translation Process through Visual Cues

Claudio Fantinuoli

Main category: cs.CL

TL;DR: Vision-Grounded Interpreting (VGI) integrates visual input with speech to improve machine interpreting by using visual context to resolve ambiguities, showing benefits for lexical disambiguation but limited gains for gender resolution and no improvement for syntactic ambiguities.

Details

Motivation: Current machine interpreting systems rely solely on speech signals, limiting performance in contexts where visual, situational, or pragmatic cues are needed for disambiguation and adequacy.

Method: Developed a prototype system that integrates a vision-language model to process both speech and visual input from a webcam, using visual context to prime the translation process. Evaluated using a hand-crafted diagnostic corpus targeting three types of ambiguity.

Result: Visual grounding substantially improves lexical disambiguation, yields modest and less stable gains for gender resolution, and shows no benefit for syntactic ambiguities.

Conclusion: Embracing multimodality represents a necessary step forward for advancing translation quality in machine interpreting systems.

Abstract: Machine Interpreting systems are currently implemented as unimodal, real-time speech-to-speech architectures, processing translation exclusively on the basis of the linguistic signal. Such reliance on a single modality, however, constrains performance in contexts where disambiguation and adequacy depend on additional cues, such as visual, situational, or pragmatic information. This paper introduces Vision-Grounded Interpreting (VGI), a novel approach designed to address the limitations of unimodal machine interpreting. We present a prototype system that integrates a vision-language model to process both speech and visual input from a webcam, with the aim of priming the translation process through contextual visual information. To evaluate the effectiveness of this approach, we constructed a hand-crafted diagnostic corpus targeting three types of ambiguity. In our evaluation, visual grounding substantially improves lexical disambiguation, yields modest and less stable gains for gender resolution, and shows no benefit for syntactic ambiguities. We argue that embracing multimodality represents a necessary step forward for advancing translation quality in machine interpreting.

[99] HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

Ken Deng, Zizheng Zhan, Wen Xiang, Wenqiang Zhu, Tianhao Peng, Xinping Lei, Weihao Li, Jingxuan Xu, Kun Wu, Yifan Yao, Haoyang Huang, Huaixi Tang, Kepeng Lei, Zhiyi Lai, Songwei Yu, Zongxian Feng, Zuchen Gao, Weihao Xie, Chenchen Zhang, Yanan Wu, Yuanxing Zhang, Lecheng Huang, Yuqun Zhang, Jie Liu, Zhaoxiang Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu

Main category: cs.CL

TL;DR: HiPO is a framework for adaptive reasoning control that enables LLMs to selectively use detailed reasoning (Think-on) or direct responses (Think-off) to balance accuracy and efficiency.

Details

Motivation: Current LLMs using chain-of-thought reasoning generate lengthy reasoning traces inefficiently, leading to excessive token usage and higher inference costs.

Method: HiPO combines a hybrid data pipeline with paired Think-on/Think-off responses and a hybrid reinforcement learning reward system that balances accuracy and efficiency.

Result: Experiments across mathematics and coding benchmarks show HiPO substantially reduces token length while maintaining or improving accuracy.

Conclusion: HiPO provides a principled approach for efficient adaptive reasoning, advancing deployment of reasoning-oriented LLMs in resource-sensitive real-world settings.

Abstract: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces the Hybrid Policy Optimization (i.e., HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipelineproviding paired Think-on and Think-off responseswith a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO a can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.

[100] ByteSized32Refactored: Towards an Extensible Interactive Text Games Corpus for LLM World Modeling and Evaluation

Haonan Wang, Junfeng Sun, Xingdi Yuan, Ruoyao Wang, Ziang Xiao

Main category: cs.CL

TL;DR: ByteSized32Refactored is a modular implementation of the ByteSized32 text game corpus that reduces code by 50% through abstraction and enables extensibility for new scenarios.

Details

Motivation: To address the challenge of simulating interactive world models in LLMs by creating a more modular and extensible text game generation framework.

Method: Refactored the original ByteSized32 corpus by creating GameBasic.py foundation library with 7 base classes (GameObject, etc.) that centralize common logic across all 32 games.

Result: Reduced total lines of Python code from 20k to 10k. GPT-4o experiments showed mixed performance - quality improvements on 2 evaluation dimensions but decreases on 2 others due to hierarchical structure challenges.

Conclusion: The extensible code structure with foundation library and modular optimization facilitates LLM adaptation and establishes a scalable environment for future extensions.

Abstract: Simulating interactive world models remains a core challenge in Large Language Models(LLMs). In this work, we introduce the ByteSized32Refactored, a refactored, modular, and extensible implementation of the original ByteSized32 corpus to explore the task of text game generation. We further optimize the code structure of each text game and create the GameBasic.py foundation library, which centralizes common logic across all 32 games by abstracting 7 base classes (GameObject, etc.) into reusable modules, thereby reducing from 20k to 10k total lines of Python code compared to the original Bytesized32. Our refactored implementation enables extendability - with our centralized design, ByteSized32Refactored can be more efficiently extended to include text games of new scenarios and specifications by reusing the shared logic and functionalities. Extensive experiments with GPT-4o demonstrate a mix of performance - with Bytesized32Refactored, the generated text games for unseen scenarios showcase quality improvements on two of the four evaluation dimensions while decreases on the other two, indicating that the hierarchical structure of the refactored code presents new challenges for LLMs. Overall, we highlight that our extensible code structure, centered on the foundation library and the modular optimization, not only facilitates LLM adaptation to environment specifications but also establishes a scalable environment that supports future extensions.

[101] Toward Preference-aligned Large Language Models via Residual-based Model Steering

Lucio La Cava, Andrea Tagarelli

Main category: cs.CL

TL;DR: PaLRS is a training-free method for LLM preference alignment that extracts steering vectors from residual streams using minimal preference data, enabling plug-and-play inference-time alignment without expensive optimization.

Details

Motivation: Existing preference alignment methods like RLHF and DPO require curated data, expensive optimization over billions of parameters, and lead to persistent task-specific models, making them inefficient and inflexible.

Method: Extracts lightweight steering vectors from residual streams of LLMs using as few as 100 preference pairs, which can be applied at inference time to push models toward preferred behaviors without training.

Result: PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance, and outperform DPO-aligned models with huge time savings.

Conclusion: PaLRS offers an effective, efficient, and flexible alternative to standard preference optimization pipelines, providing training-free, plug-and-play alignment with minimal data requirements.

Abstract: Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to DPO-aligned models, they perform better with huge time savings. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, offering a training-free, plug-and-play mechanism for alignment with minimal data.

[102] The Hidden Costs of Translation Accuracy: Distillation, Quantization, and Environmental Impact

Dhaathri Vijay, Anandaswarup Vadapalli

Main category: cs.CL

TL;DR: Model compression strategies (distillation and quantization) can significantly reduce computational costs and environmental impact while maintaining competitive translation quality, with minimal performance degradation compared to full-scale models.

Details

Motivation: Address concerns about the computational and environmental costs of large language models by investigating trade-offs between translation quality and efficiency.

Method: Compared full-scale, distilled, and quantized models on machine translation using Flores+ benchmark and human evaluations of conversational translations in French, Hindi, and Kannada. Analyzed carbon emissions per evaluation run.

Result: Full 3.3B fp32 model had highest BLEU scores but largest environmental footprint (0.007-0.008 kg CO2 per run). Distilled models achieved 4.5x faster inference with minimal BLEU reductions. Aggressive quantization (INT4) preserved high accuracy and fluency with minor differences between models.

Conclusion: Model compression can substantially reduce computational demands and environmental impact while maintaining competitive quality, though trade-offs are more pronounced in low-resource settings. Evaluation frameworks should integrate efficiency and sustainability as central dimensions of NLP progress.

Abstract: The rapid expansion of large language models (LLMs) has heightened concerns about their computational and environmental costs. This study investigates the trade-offs between translation quality and efficiency by comparing full-scale, distilled, and quantized models using machine translation as a case study. We evaluated performance on the Flores+ benchmark and through human judgments of conversational translations in French, Hindi, and Kannada. Our analysis of carbon emissions per evaluation run revealed that the full 3.3B fp32 model, while achieving the highest BLEU scores, incurred the largest environmental footprint (about 0.007-0.008 kg CO2 per run). The distilled models achieved an inference of up to 4.5x faster than the full 3.3B model, with only minimal reductions in BLEU scores. Human evaluations also showed that even aggressive quantization (INT4) preserved high levels of accuracy and fluency, with differences between models generally minor. These findings demonstrate that model compression strategies can substantially reduce computational demands and environmental impact while maintaining competitive translation quality, though trade-offs are more pronounced in low-resource settings. We argue for evaluation frameworks that integrate efficiency and sustainability alongside objective metrics as central dimensions of progress in NLP.

[103] The AI Agent Code of Conduct: Automated Guardrail Policy-as-Prompt Synthesis

Gauri Kholkar, Ratinder Ahuja

Main category: cs.CL

TL;DR: A framework that automatically translates unstructured design documents into verifiable, real-time guardrails for autonomous AI agents using LLMs to interpret and enforce natural language policies.

Details

Motivation: As autonomous AI agents are increasingly deployed in industry, it is essential to safeguard them and bridge the critical policy-to-practice gap for verifiably safer and more regulatable AI.

Method: Introduces “Policy as Prompt” approach using LLMs to interpret natural language policies with contextual understanding and least privilege principle. System ingests technical artifacts to build verifiable policy trees, then compiles them into lightweight prompt-based classifiers for runtime behavior auditing.

Result: Validated across diverse applications, demonstrating a scalable and auditable pipeline that successfully bridges the policy-to-practice gap.

Conclusion: The framework paves the way for verifiably safer and more regulatable AI by providing automated translation of design documents into enforceable guardrails through LLM-based policy interpretation.

Abstract: As autonomous AI agents are increasingly deployed in industry, it is essential to safeguard them. We introduce a novel framework that automates the translation of unstructured design documents into verifiable, real-time guardrails. We introduce “Policy as Prompt,” a new approach that uses Large Language Models (LLMs) to interpret and enforce natural language policies by applying contextual understanding and the principle of least privilege. Our system first ingests technical artifacts to construct a verifiable policy tree, which is then compiled into lightweight, prompt-based classifiers that audit agent behavior at runtime. We validate our approach across diverse applications, demonstrating a scalable and auditable pipeline that bridges the critical policy-to-practice gap, paving the way for verifiably safer and more regulatable AI.

[104] MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh

Main category: cs.CL

TL;DR: MCPMark is a new benchmark with 127 tasks that better evaluates MCP (Model Context Protocol) for realistic agent workflows, showing current LLMs struggle with complex CRUD operations.

Details

Motivation: Existing MCP benchmarks are too narrow, focusing on read-heavy tasks with limited interaction depth, failing to capture real-world workflow complexity.

Method: Created 127 high-quality tasks with domain experts and AI agents, featuring curated initial states and programmatic verification scripts. Evaluated LLMs using minimal agent framework with tool-calling loops.

Result: Best model (gpt-5-medium) achieved only 52.56% pass@1 and 33.86% pass^4. Other strong models (claude-sonnet-4, o3) fell below 30% pass@1 and 15% pass^4. LLMs required average 16.2 execution turns and 17.4 tool calls per task.

Conclusion: MCPMark effectively stress-tests MCP capabilities, revealing significant performance gaps in current LLMs for complex real-world agent workflows requiring diverse CRUD operations.

Abstract: MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of $127$ high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only $52.56$% pass@1 and $33.86$% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below $30$% pass@1 and $15$% pass^4. On average, LLMs require $16.2$ execution turns and $17.4$ tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.

[105] Sequential Diffusion Language Models

Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang

Main category: cs.CL

TL;DR: SDLM unifies next-token and next-block prediction through NSP, enabling adaptive generation lengths while maintaining KV-cache compatibility and improving throughput over autoregressive models.

Details

Motivation: Diffusion language models have theoretical efficiency advantages but are limited by fixed-length decoding and KV-cache incompatibility. Block diffusion helps but still has fixed block sizes and requires expensive training.

Method: Introduces Next Sequence Prediction (NSP) that adaptively determines generation length per step. SDLM performs diffusion inference in fixed-size mask blocks but dynamically decodes consecutive subsequences based on model confidence, preserving KV-cache compatibility.

Result: SDLM matches or surpasses autoregressive baselines using only 3.5M training samples, achieves 2.1x higher throughput than Qwen-2.5, and SDLM-32B shows even stronger efficiency gains demonstrating scalability.

Conclusion: SDLM successfully unifies next-token and next-block prediction, enabling efficient adaptive generation while maintaining compatibility with existing infrastructure, showing strong potential for scalable language modeling.

Abstract: Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1 higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and codes: https://github.com/OpenGVLab/SDLM

[106] SparseD: Sparse Attention for Diffusion Language Models

Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang

Main category: cs.CL

TL;DR: SparseD is a novel sparse attention method for diffusion language models that achieves lossless acceleration by using pre-computed head-specific sparse patterns and switching from full to sparse attention during generation.

Details

Motivation: Existing open-source diffusion language models suffer from high inference latency due to attention's quadratic complexity, and sparse attention methods designed for autoregressive models are incompatible with DLMs due to different sparsity behaviors.

Method: SparseD pre-computes head-specific sparse patterns once and reuses them across all denoising steps, while using full attention in early critical steps and switching to sparse attention later to maintain generation quality.

Result: SparseD achieves up to 1.50× speedup over FlashAttention at 64k context length with 1,024 denoising steps while maintaining lossless performance.

Conclusion: SparseD provides a practical and efficient solution for deploying diffusion language models in long-context applications through its novel sparse attention approach.

Abstract: While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention’s quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps.

[107] ResFormer: All-Time Reservoir Memory for Long Sequence Classification

Hongbo Liu, Jia Xu

Main category: cs.CL

TL;DR: ResFormer is a novel neural architecture that combines reservoir computing for long-term dependencies and Transformers for short-term dependencies, achieving significant accuracy improvements while reducing memory consumption.

Details

Motivation: Transformer models have quadratic complexity limitations that restrict input length, making it challenging to process extensive contexts efficiently.

Method: ResFormer uses a cascaded approach with reservoir computing for long-term dependencies (linear time) and conventional Transformers for short-term dependencies with fixed-length inputs.

Result: Outperforms DeepSeek-Qwen and ModernBERT with up to +22.3% accuracy improvement on EmoryNLP and consistent gains on MultiWOZ, MELD, and IEMOCAP datasets, with reduced memory consumption.

Conclusion: ResFormer effectively addresses Transformer limitations by efficiently modeling varying context lengths, demonstrating superior performance and efficiency for sequence classification tasks.

Abstract: Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art performance, have inherent limitations due to quadratic time and memory complexity, restricting their input length. Although extensive efforts have aimed at reducing computational demands, processing extensive contexts remains challenging. To overcome these limitations, we propose ResFormer, a novel neural network architecture designed to model varying context lengths efficiently through a cascaded methodology. ResFormer integrates an reservoir computing network featuring a nonlinear readout to effectively capture long-term contextual dependencies in linear time. Concurrently, short-term dependencies within sentences are modeled using a conventional Transformer architecture with fixed-length inputs. Experiments demonstrate that ResFormer significantly outperforms baseline models of DeepSeek-Qwen and ModernBERT, delivering an accuracy improvement of up to +22.3% on the EmoryNLP dataset and consistent gains on MultiWOZ, MELD, and IEMOCAP. In addition, ResFormer exhibits reduced memory consumption, underscoring its effectiveness and efficiency in modeling extensive contextual information.

[108] Ensembling Multilingual Transformers for Robust Sentiment Analysis of Tweets

Meysam Shirdel Bilehsavar, Negin Mahmoudi, Mohammad Jalili Torkamani, Kiana Kiashemshaki

Main category: cs.CL

TL;DR: This paper presents a transformer ensemble model and LLM approach for cross-lingual sentiment analysis, achieving over 86% accuracy on multilingual datasets.

Details

Motivation: Sentiment analysis faces challenges with foreign languages due to lack of labeled training data, limiting its effectiveness across different languages and applications in marketing, politics, and customer service.

Method: Used an ensemble of pre-trained sentiment analysis models (bert-base-multilingual-uncased-sentiment and XLM-R) combined with large language models for multilingual sentiment analysis on multi-language datasets.

Result: Experimental results showed sentiment analysis performance exceeding 86% using the proposed ensemble method.

Conclusion: The transformer ensemble model with LLM integration effectively addresses cross-lingual sentiment analysis challenges, providing high accuracy without requiring language-specific labeled data.

Abstract: Sentiment analysis is a very important natural language processing activity in which one identifies the polarity of a text, whether it conveys positive, negative, or neutral sentiment. Along with the growth of social media and the Internet, the significance of sentiment analysis has grown across numerous industries such as marketing, politics, and customer service. Sentiment analysis is flawed, however, when applied to foreign languages, particularly when there is no labelled data to train models upon. In this study, we present a transformer ensemble model and a large language model (LLM) that employs sentiment analysis of other languages. We used multi languages dataset. Sentiment was then assessed for sentences using an ensemble of pre-trained sentiment analysis models: bert-base-multilingual-uncased-sentiment, and XLM-R. Our experimental results indicated that sentiment analysis performance was more than 86% using the proposed method.

[109] Large-Scale Constraint Generation – Can LLMs Parse Hundreds of Constraints?

Matteo Boffa, Jiaxuan You

Main category: cs.CL

TL;DR: The paper introduces Large-Scale Constraint Generation (LSCG) to test LLMs’ ability to handle many fine-grained constraints, proposes FoCusNet to help LLMs focus on relevant constraints, and shows existing methods struggle with many constraints while FoCusNet improves accuracy by 8-13%.

Details

Motivation: To evaluate whether LLMs can parse large, fine-grained constraint lists, moving beyond few specific requirements to test scalability with increasing constraint numbers.

Method: Created Words Checker as a practical LSCG instance, tested model characteristics and steering techniques, and proposed FoCusNet - a small model that filters constraints to help LLMs focus on relevant ones.

Result: Existing solutions suffer significant performance drops as constraint numbers increase, while FoCusNet provides an 8-13% accuracy improvement over baseline methods.

Conclusion: LLMs struggle with large constraint sets, but dedicated constraint filtering models like FoCusNet can significantly improve performance by helping LLMs focus on relevant constraints.

Abstract: Recent research has explored the constrained generation capabilities of Large Language Models (LLMs) when explicitly prompted by few task-specific requirements. In contrast, we introduce Large-Scale Constraint Generation (LSCG), a new problem that evaluates whether LLMs can parse a large, fine-grained, generic list of constraints. To examine the LLMs’ ability to handle an increasing number constraints, we create a practical instance of LSCG, called Words Checker. In Words Checker, we evaluate the impact of model characteristics (e.g., size, family) and steering techniques (e.g., Simple Prompt, Chain of Thought, Best of N) on performance. We also propose FoCusNet, a small and dedicated model that parses the original list of constraints into a smaller subset, helping the LLM focus on relevant constraints. Experiments reveal that existing solutions suffer a significant performance drop as the number of constraints increases, with FoCusNet showing an 8-13% accuracy boost.

[110] GEAR: A General Evaluation Framework for Abductive Reasoning

Kaiyu He, Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Xinya Du, Zhiyu Chen

Main category: cs.CL

TL;DR: GEAR is a novel evaluation framework for assessing LLMs’ abductive reasoning capabilities through automated scoring of hypothesis sets based on consistency, generalizability, and diversity, without requiring human gold answers.

Details

Motivation: To address whether LLMs can discover new knowledge through abductive reasoning (generating plausible hypotheses) and create a scalable, transparent evaluation method that doesn't rely on human annotations.

Method: GEAR evaluates hypothesis sets using three metrics: consistency (explains observations), generalizability (makes meaningful predictions on unseen inputs), and diversity (covers distinct patterns). Also proposes a momentum-based curriculum that adjusts training data based on learning velocity.

Result: Evaluated 9 LLMs on 4 abduction benchmarks with 1,500 problems, generating 50,000+ hypotheses, revealing model differences missed by traditional evaluations. The curriculum approach improved all GEAR objectives and transferred gains to established benchmarks.

Conclusion: GEAR provides a principled framework for evaluating abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.

Abstract: Since the advent of large language models (LLMs), research has focused on instruction following and deductive reasoning. A central question remains: can these models discover new knowledge, and how can we evaluate this ability? We address this by studying abductive reasoning-the generation of plausible hypotheses to explain observations-and introduce GEAR (General Evaluation for Abductive Reasoning), a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers), reliable (deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new plausible hypotheses, unlike static benchmarks that saturate once accuracy is high). Using GEAR, we conduct a fine-grained study of nine LLMs on four abduction benchmarks with 1,500 problems, generating over 50,000 candidate hypotheses and revealing model differences obscured by gold-answer or purely human evaluations. We further propose a momentum-based curriculum that adjusts GEAR-derived training data by learning velocity: it starts with what the model learns quickly and shifts toward harder objectives such as generating diverse hypotheses once the model is confident on foundational objectives. Without gold-label supervision, this strategy improves all GEAR objectives and these gains transfer to established abductive reasoning benchmarks. Taken together, GEAR provides a principled framework that evaluates abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.

[111] BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models

Zsolt T. Kardkovács, Lynda Djennane, Anna Field, Boualem Benatallah, Yacine Gaci, Fabio Casati, Walid Gaaloul

Main category: cs.CL

TL;DR: BTC-SAM is a novel framework that uses Large Language Models to automatically generate diverse test cases for detecting social biases in Sentiment Analysis models, reducing the need for expensive manual test case creation.

Details

Motivation: Sentiment Analysis models contain harmful social biases that need systematic testing, but creating comprehensive test cases manually is expensive and time-consuming, requiring domain experts or crowd-sourcing.

Method: Uses Large Language Models for controllable generation of test sentences, creating linguistically rich and diverse test cases with minimal specification to cover various identity groups and bias types.

Result: LLM-based generation provides high linguistic variation and diversity in test sentences, offering better test coverage compared to base prompting methods, even for previously unseen biases.

Conclusion: BTC-SAM demonstrates that LLMs can effectively generate high-quality bias test cases, making comprehensive bias testing more accessible and scalable for Sentiment Analysis models.

Abstract: Sentiment Analysis (SA) models harbor inherent social biases that can be harmful in real-world applications. These biases are identified by examining the output of SA models for sentences that only vary in the identity groups of the subjects. Constructing natural, linguistically rich, relevant, and diverse sets of sentences that provide sufficient coverage over the domain is expensive, especially when addressing a wide range of biases: it requires domain experts and/or crowd-sourcing. In this paper, we present a novel bias testing framework, BTC-SAM, which generates high-quality test cases for bias testing in SA models with minimal specification using Large Language Models (LLMs) for the controllable generation of test sentences. Our experiments show that relying on LLMs can provide high linguistic variation and diversity in the test sentences, thereby offering better test coverage compared to base prompting methods even for previously unseen biases.

[112] Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Distributional Semantics

Guangliang Liu, Xi Chen, Bocheng Chen, Xitong Zhang, Kristen Johnson

Main category: cs.CL

TL;DR: LLMs struggle with generalized moral reasoning due to their reliance on distributional semantics, which differs from pragmatic-level moral reasoning. This paper proposes pragmatic inference methods based on moral foundations theory to bridge this gap and improve generalization.

Details

Motivation: Large Language Models face challenges in achieving generalized moral reasoning because they primarily operate at the distributional semantics level, while moral reasoning requires pragmatic understanding. This gap limits their ability to generalize moral reasoning across different contexts.

Method: Proposed pragmatic inference methods grounded in moral foundations theory, which leverage contextual information at each reasoning step to connect moral foundations with moral reasoning objectives and bridge the pragmatic gap.

Result: Experimental results show that the proposed approach significantly enhances LLMs’ generalization capabilities in moral reasoning tasks.

Conclusion: The method provides a foundation for future research on moral reasoning in LLMs using moral foundations theory, successfully addressing the generalization challenge by bridging the pragmatic gap between distributional semantics and moral reasoning.

Abstract: Moral reasoning has emerged as a promising research direction for Large Language Models (LLMs), yet achieving generalization remains a central challenge. From a linguistic standpoint, this difficulty arises because LLMs are adept at capturing distributional semantics, which fundamentally differs from the morals which operate at the pragmatic level. This paper investigates how LLMs can achieve generalized moral reasoning despite their reliance on distributional semantics. We propose pragmatic inference methods grounded in moral foundations theory, which leverage contextual information at each step to bridge the pragmatic gap and guide LLMs in connecting moral foundations with moral reasoning objectives. Experimental results demonstrate that our approach significantly enhances LLMs’ generalization in moral reasoning, providing a foundation for future research grounded in moral foundations theory.

[113] Dual-Scale World Models for LLM Agents Towards Hard-Exploration Problems

Minsoo Kim, Seung-won Hwang

Main category: cs.CL

TL;DR: GLoW introduces dual-scale world models for hard-exploration tasks, achieving state-of-the-art performance in text-based games with significantly fewer environment interactions than RL methods.

Details

Motivation: LLM-based agents struggle with hard-exploration tasks that require learning new knowledge through exploration, particularly in complex environments like text-based games.

Method: Uses dual-scale world models with a global trajectory frontier for high-value discoveries and local trial-and-error exploration through Multi-path Advantage Reflection mechanism that infers advantage-based progress signals.

Result: Achieves new state-of-the-art performance for LLM-based approaches on Jericho benchmark suite, with comparable performance to RL-based methods but requiring 100-800x fewer environment interactions.

Conclusion: GLoW demonstrates that dual-scale world modeling with advantage-based exploration guidance enables efficient hard-exploration in LLM-based agents, significantly reducing the sample complexity compared to traditional RL approaches.

Abstract: LLM-based agents have seen promising advances, yet they are still limited in “hard-exploration” tasks requiring learning new knowledge through exploration. We present GLoW, a novel approach leveraging dual-scale world models, maintaining a trajectory frontier of high-value discoveries at the global scale, while learning from local trial-and-error in exploration through a Multi-path Advantage Reflection mechanism which infers advantage-based progress signals to guide exploration. To evaluate our framework for hard-exploration, we tackle the Jericho benchmark suite of text-based games, where GLoW achieves a new state-of-theart performance for LLM-based approaches. Compared to state-of-the-art RLbased methods, our approach achieves comparable performance while requiring 100-800x fewer environment interactions.

[114] EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos

Sourjyadip Ray, Shubham Sharma, Somak Aditya, Pawan Goyal

Main category: cs.CL

TL;DR: This paper introduces EduVidQA, a dataset for evaluating MLLMs on answering student questions from online computer science lectures, and benchmarks 6 state-of-the-art models.

Details

Motivation: To address the need for interactivity in digital education by developing automated question answering systems for online lectures using Multimodal Large Language Models.

Method: Created EduVidQA dataset with 5252 question-answer pairs from 296 CS videos, studied student preferences, and benchmarked 6 MLLMs using text-based and qualitative metrics.

Result: The task is challenging for current MLLMs, and the study provides insights into model performance nuances and the effectiveness of synthetic data for finetuning.

Conclusion: This work establishes a benchmark for educational QA and opens new research directions in NLP for education.

Abstract: As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores using Multimodal Large Language Models (MLLMs) to automatically respond to student questions from online lectures - a novel question answering task of real world significance. We introduce the EduVidQA Dataset with 5252 question-answer pairs (both synthetic and real-world) from 296 computer science videos covering diverse topics and difficulty levels. To understand the needs of the dataset and task evaluation, we empirically study the qualitative preferences of students, which we provide as an important contribution to this line of work. Our benchmarking experiments consist of 6 state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for finetuning, as well as showing the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, thus showing a nuanced perspective of the models’ performance, which is paramount to future work. This work not only sets a benchmark for this important problem, but also opens exciting avenues for future research in the field of Natural Language Processing for Education.

[115] Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE

Guancheng Wan, Lucheng Fu, Haoxin Liu, Yiqiao Jin, Hui Yi Leong, Eric Hanchen Jiang, Hejia Geng, Jinhe Bi, Yunpu Ma, Xiangru Tang, B. Aditya Prakash, Yizhou Sun, Wei Wang

Main category: cs.CL

TL;DR: TARE is a derivative-free framework that optimizes LLM prompts for robustness against paraphrasing by minimizing textual sharpness, outperforming accuracy-only methods while maintaining computational efficiency.

Details

Motivation: Current prompt optimization methods focus only on point-wise accuracy and fail to address brittleness - small semantic-preserving paraphrases can cause large performance swings in LLMs.

Method: TARE alternates between inner adversarial search (stressing prompts with hard paraphrases) and outer robust selection (preferring candidates with strong neighborhoods). ATARE adds anisotropic weights to shape semantic neighborhoods and adaptive radius for exploration-fidelity balance.

Result: The methods preserve accuracy under paraphrasing across diverse tasks, outperforming accuracy-only prompt search while remaining computationally practical.

Conclusion: Minimizing textual sharpness gap leads to robust prompts that maintain performance under semantic variations, addressing a key limitation in current prompt optimization approaches.

Abstract: The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or searching stability, and therefore cannot remedy this brittleness in practice. Automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model’s parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. Diverse tasks evaluate our methods, whose design for minimizing textual sharpness gap leads to prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical.

[116] Your thoughts tell who you are: Characterize the reasoning patterns of LRMs

Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi, Lijuan Liu, Saghar Hosseini, Liang Tan, Yixin Nie, Shaoliang Nie

Main category: cs.CL

TL;DR: LOT is a method that uses language models to compare reasoning traces from different large reasoning models, creating a human-readable taxonomy that identifies systematic differences in how models think.

Details

Motivation: Current comparisons of large reasoning models focus only on macro-level statistics like accuracy, leaving the question of whether different models actually reason differently unanswered.

Method: LOT uses a generative language model to compare reasoning traces from two LRMs, articulate distinctive features in words, and model how these features predict the source model based on empirical distributions across outputs.

Result: LOT achieved 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, model family, or domain. It identified systematic differences in reasoning styles and showed that aligning smaller models’ reasoning with larger models improved accuracy on GPQA by 3.3-5.7%.

Conclusion: LOT provides both quantitative classification and qualitative explanations of how different large reasoning models think, revealing systematic reasoning differences that can be leveraged to improve model performance.

Abstract: Current comparisons of large reasoning models (LRMs) focus on macro-level statistics such as task accuracy or reasoning length. Whether different LRMs reason differently remains an open question. To address this gap, we introduce the LLM-proposed Open Taxonomy (LOT), a classification method that uses a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. LOT then models how these features predict the source LRM of a reasoning trace based on their empirical distributions across LRM outputs. Iterating this process over a dataset of reasoning traces yields a human-readable taxonomy that characterizes how models think. We apply LOT to compare the reasoning of 12 open-source LRMs on tasks in math, science, and coding. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain. Beyond classification, LOT’s natural-language taxonomy provides qualitative explanations of how LRMs think differently. Finally, in a case study, we link the reasoning differences to performance: aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 during test time improves their accuracy on GPQA by 3.3-5.7%.

[117] Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis

Haolin Yang, Hakaze Cho, Naoya Inoue

Main category: cs.CL

TL;DR: The paper proposes a Task Subspace Logit Attribution (TSLA) framework to analyze in-context learning mechanisms in LLMs, identifying specialized attention heads for Task Recognition (TR) and Task Learning (TL) that work complementarily.

Details

Motivation: To reconcile two dominant perspectives on in-context learning: component-level analysis of attention heads and holistic decomposition into TR and TL components, providing a unified understanding of ICL mechanisms.

Method: Proposed TSLA framework to identify TR and TL specialized attention heads, using correlation analysis, ablation studies, input perturbations, and steering experiments with geometric analysis of hidden states.

Result: TR heads align hidden states with task subspace for task recognition, while TL heads rotate hidden states within subspace toward correct labels for prediction. The framework reconciles previous findings like induction heads and task vectors.

Conclusion: The TSLA framework provides a unified and interpretable account of how large language models execute in-context learning across diverse tasks and settings through complementary TR and TL mechanisms.

Abstract: We investigate the mechanistic underpinnings of in-context learning (ICL) in large language models by reconciling two dominant perspectives: the component-level analysis of attention heads and the holistic decomposition of ICL into Task Recognition (TR) and Task Learning (TL). We propose a novel framework based on Task Subspace Logit Attribution (TSLA) to identify attention heads specialized in TR and TL, and demonstrate their distinct yet complementary roles. Through correlation analysis, ablation studies, and input perturbations, we show that the identified TR and TL heads independently and effectively capture the TR and TL components of ICL. Using steering experiments with geometric analysis of hidden states, we reveal that TR heads promote task recognition by aligning hidden states with the task subspace, while TL heads rotate hidden states within the subspace toward the correct label to facilitate prediction. We further show how previous findings on ICL mechanisms, including induction heads and task vectors, can be reconciled with our attention-head-level analysis of the TR-TL decomposition. Our framework thus provides a unified and interpretable account of how large language models execute ICL across diverse tasks and settings.

[118] Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight

Haolin Yang, Hakaze Cho, Kaize Ding, Naoya Inoue

Main category: cs.CL

TL;DR: This paper introduces Learned Task Vectors (LTVs) as a superior alternative to extracted task vectors for in-context learning, and provides mechanistic analysis showing how task vectors influence LLM computation through attention circuits and linear propagation.

Details

Motivation: Prior methods for extracting task vectors from LLMs are cumbersome, opaque, and don't explain how task vectors actually influence model computation during in-context learning.

Method: Propose directly training Learned Task Vectors (LTVs) and conduct systematic mechanistic analysis to understand how task vectors operate through attention-head OV circuits and propagate linearly through Transformer layers.

Result: LTVs outperform extracted task vectors in accuracy and work effectively at arbitrary layers and positions. Analysis reveals task vectors primarily influence predictions through specific “key heads” in attention circuits, and despite Transformer nonlinearities, task vector propagation is largely linear.

Conclusion: LTVs provide both a practical approach for obtaining effective task vectors and a principled framework for understanding the mechanistic foundations of in-context learning in LLMs.

Abstract: Large Language Models (LLMs) can perform new tasks from in-context demonstrations, a phenomenon known as in-context learning (ICL). Recent work suggests that these demonstrations are compressed into task vectors (TVs), compact task representations that LLMs exploit for predictions. However, prior studies typically extract TVs from model outputs or hidden states using cumbersome and opaque methods, and they rarely elucidate the mechanisms by which TVs influence computation. In this work, we address both limitations. First, we propose directly training Learned Task Vectors (LTVs), which surpass extracted TVs in accuracy and exhibit superior flexibility-acting effectively at arbitrary layers, positions, and even with ICL prompts. Second, through systematic analysis, we investigate the mechanistic role of TVs, showing that at the low level they steer predictions primarily through attention-head OV circuits, with a small subset of “key heads” most decisive. At a higher level, we find that despite Transformer nonlinearities, TV propagation is largely linear: early TVs are rotated toward task-relevant subspaces to improve logits of relevant labels, while later TVs are predominantly scaled in magnitude. Taken together, LTVs not only provide a practical approach for obtaining effective TVs but also offer a principled lens into the mechanistic foundations of ICL.

[119] Retrieval-augmented GUI Agents with Generative Guidelines

Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho, Carl Yang, Dong Yu

Main category: cs.CL

TL;DR: RAG-GUI is a lightweight vision-language model that enhances GUI agents by leveraging web tutorials at inference time, achieving significant performance improvements over baseline methods.

Details

Motivation: GUI agents face limitations due to scarce training data and complex real-world tasks requiring long-tailed knowledge for rare scenarios.

Method: Uses supervised finetuning (SFT) followed by self-guided rejection sampling finetuning (RSF), functioning as a model-agnostic plug-in that retrieves web tutorials during inference.

Result: Outperforms baseline agents by 2.6% to 13.3% across two model sizes and three distinct tasks, demonstrating strong generalization capabilities.

Conclusion: RAG-GUI provides an effective plug-and-play solution for enhancing VLM-based GUI agents in real-world applications.

Abstract: GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI , a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.

[120] Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models

Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He

Main category: cs.CL

TL;DR: MedIRT is an IRT-based evaluation framework for LLMs in medical applications, revealing nuanced ability profiles beyond traditional accuracy metrics.

Details

Motivation: Traditional accuracy metrics are inadequate for evaluating LLMs in high-stakes medical applications as they don't capture question characteristics or provide topic-specific insights.

Method: Prospectively gathered responses from 80 diverse LLMs on a balanced 1,100-question USMLE-aligned benchmark, using unidimensional two-parameter logistic IRT models per topic to estimate latent model ability, question difficulty and discrimination.

Result: Identified distinctive ‘spiky’ ability profiles where overall rankings can be misleading; GPT-5 was top in 8 of 11 domains but outperformed by Claude-3-opus in Social Science and Communication; IRT helped identify flawed questions in benchmarks.

Conclusion: Establishes a robust, psychometrically grounded methodology essential for safe, effective, and trustworthy deployment of LLMs in healthcare, with a practical decision-support framework integrating multi-factor competency profiles.

Abstract: As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, there has emerged a critical need for reliable and accurate evaluation methodologies. Traditional accuracy metrics fail inadequately as they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce \textsc{MedIRT}, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate LLM’s latent model ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive ``spiky’’ ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While \texttt{GPT-5} was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by \texttt{Claude-3-opus}, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT’s utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.

[121] PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution

Luyang Zhang, Siyuan Peng, Jialu Wang, Shichao Zhu, Beibei Li, Zhongcun Wang, Guangmou Pan, Yan Li, Song Yang

Main category: cs.CL

TL;DR: PET framework reframes user preference prediction from direct item generation to inferring dynamic probability distributions over interpretable preference clusters, improving ranking quality and addressing limitations of opaque LLM-based approaches.

Details

Motivation: Direct LLM generation for user preference prediction limits personalization, obscures holistic user profiling, and exacerbates popularity bias. There's a need for more transparent and interpretable preference learning methods.

Method: Proposed Preference Evolution Tracking (PET) framework that uses logit-probing and generative classification to infer user preferences as probability distributions over a stable lattice of preference clusters, rather than direct item generation.

Result: PET improves ranking quality by up to 40% in NDCG on public benchmarks (Yelp, MovieLens) and outperforms a SOTA production model by 7 times in NDCG score on a large-scale short-video platform dataset, particularly excelling at ranking long-tail content.

Conclusion: PET transforms user profiling from direct preference list generation to transparent distributional preference mapping, enabling more explainable, fair, and diverse personalization systems.

Abstract: Understanding how user preference evolves over time is a fundamental challenge central to modern digital ecosystems, for which Large Language Models (LLMs) are an increasingly prominent and popular approach due to their ability to comprehend the rich semantic context within behavioral data. A common practice is to use LLMs to predict a user’s next action by directly generating a ranked list of preferred items. Although effective for short-term prediction, the end-to-end generation paradigm inherently limits personalization. Its opaque decision-making process obscures holistic user profiling and exacerbates popularity bias. To address these limitations, we propose Preference Evolution Tracking (PET), a framework that reframes the task as inferring a dynamic probability distribution over a stable and interpretable lattice of preference clusters. By applying logit-probing and generative classification techniques, PET infers a user’s preference as a probability distribution, enabling transparent preference learning. On public benchmarks (Yelp, MovieLens), PET improves ranking quality by up to 40% in NDCG over direct generation baselines. On a large-scale, real-world dataset from a short-video platform, it excels at ranking long-tail contents, significantly outperforming a SOTA production model by 7 times in the NDCG score. Ultimately, PET transforms the user profile model from direct preference list generation to a transparent distributional preference mapping, paving the way for more explainable, fair, and diverse personalization systems.

[122] AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

Ran Xu, Yuchen Zhuang, Zihan Dong, Jonathan Wang, Yue Yu, Joyce C. Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, Carl Yang

Main category: cs.CL

TL;DR: AceSearcher is a cooperative self-play framework that trains a single LLM to alternate between decomposer and solver roles, improving complex reasoning through multi-hop retrieval without intermediate annotations.

Details

Motivation: Search-augmented LLMs struggle with complex reasoning due to ineffective multi-hop retrieval and limited reasoning ability, requiring a better approach for handling complex queries.

Method: Uses cooperative self-play with a single LLM alternating between decomposer (breaks down queries) and solver (integrates contexts) roles, combining supervised fine-tuning on search/reasoning/decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy.

Result: Outperforms state-of-the-art baselines with 7.6% average exact match improvement, matches DeepSeek-V3 performance with <5% parameters on finance tasks, and surpasses larger models (up to 9x parameters) at smaller scales (1.5B/8B).

Conclusion: AceSearcher demonstrates exceptional efficiency and effectiveness in tackling complex reasoning tasks through its cooperative self-play framework, eliminating the need for intermediate annotations while achieving superior performance.

Abstract: Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks. Our code will be published at https://github.com/ritaranx/AceSearcher and https://huggingface.co/AceSearcher.

[123] Can Large Language Models Express Uncertainty Like Human?

Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A. Lamb, Jialin Yu, Philip H. S. Torr, Chang Xu

Main category: cs.CL

TL;DR: The paper introduces linguistic confidence (LC) as a lightweight alternative to traditional confidence estimation methods for LLMs, using hedging language to express uncertainty naturally. It provides a dataset, mapping method, systematic study, and fine-tuning framework to improve LC reliability.

Details

Motivation: Existing confidence estimation methods face practical barriers: hidden logits, computational expense of multi-sampling, and unnatural verbalized numerical uncertainty. Linguistic confidence offers a human-centered, lightweight alternative.

Method: Released a large-scale dataset of hedging expressions with human-annotated confidence scores; proposed a lightweight mapper to convert hedges into confidence scores; conducted systematic study across LLMs and QA benchmarks; introduced fine-tuning framework for LC improvement.

Result: Most LLMs underperform in expressing reliable linguistic confidence, but carefully designed prompting achieves competitive calibration and discriminability. Fine-tuning further improves LC reliability.

Conclusion: Linguistic confidence is positioned as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, calling for deeper exploration of this promising direction.

Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we (1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and (2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we (3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we (4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction.

[124] BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models

Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang

Main category: cs.CL

TL;DR: BeyondBench is a novel evaluation framework that uses algorithmic problem generation to avoid data contamination issues in language model evaluation, covering 44 algorithmic tasks across three difficulty levels.

Details

Motivation: Traditional benchmarks risk contamination from training data, making it unclear whether models are reasoning or just recalling answers. This creates a need for fresh, uncontaminated evaluation methods.

Method: The framework generates mathematically grounded problems on the fly from a combinatorial space larger than 10^15 unique instances, with solutions verified by mathematical proofs. It covers 44 algorithmic tasks with 117 variations across Easy, Medium, and Hard suites.

Result: Evaluation of 101 language models showed consistent reasoning deficiencies, with performance degrading sharply as problem complexity increases. In the Hard Suite, top models achieved 26.91-56.38% accuracy, and performance dropped significantly without tool usage.

Conclusion: BeyondBench provides a contamination-free evaluation method that reveals fundamental reasoning limitations in current language models, especially as problem complexity increases from polynomial to exponential.

Abstract: Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce BeyondBench, an evaluation framework that avoids this problem by using algorithmic problem generation. Unlike traditional benchmarks that risk contamination from internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than 10^15 unique instances, with solutions verified deterministically by mathematical proofs. We evaluated 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing a decline of 16.81%, 28.05%, and 47.59% accuracy on the hard suite. Our leaderboard is publicly available at https://ctrl-gaurav.github.io/BeyondBench/

[125] ScenarioBench: Trace-Grounded Compliance Evaluation for Text-to-SQL and RAG

Zahra Atf, Peter R Lewis

Main category: cs.CL

TL;DR: ScenarioBench is a benchmark for evaluating Text-to-SQL and RAG systems in compliance contexts, focusing on policy-grounded decision-making with traceable justifications.

Details

Motivation: Existing Text-to-SQL and RAG benchmarks lack strict policy grounding and traceability requirements needed for compliance applications where decisions must be justified with specific policy clauses.

Method: Uses YAML scenarios with no-peek gold standards containing expected decisions, witness traces, governing clauses, and canonical SQL. Systems must justify outputs using clause IDs from the policy canon.

Result: Enables comprehensive evaluation including decision accuracy, trace quality, retrieval effectiveness, SQL correctness, policy coverage, latency, and explanation-hallucination rate.

Conclusion: ScenarioBench shifts evaluation focus toward justification quality under time constraints, providing more realistic assessment for compliance applications compared to prior benchmarks.

Abstract: ScenarioBench is a policy-grounded, trace-aware benchmark for evaluating Text-to-SQL and retrieval-augmented generation in compliance contexts. Each YAML scenario includes a no-peek gold-standard package with the expected decision, a minimal witness trace, the governing clause set, and the canonical SQL, enabling end-to-end scoring of both what a system decides and why. Systems must justify outputs using clause IDs from the same policy canon, making explanations falsifiable and audit-ready. The evaluator reports decision accuracy, trace quality (completeness, correctness, order), retrieval effectiveness, SQL correctness via result-set equivalence, policy coverage, latency, and an explanation-hallucination rate. A normalized Scenario Difficulty Index (SDI) and a budgeted variant (SDI-R) aggregate results while accounting for retrieval difficulty and time. Compared with prior Text-to-SQL or KILT/RAG benchmarks, ScenarioBench ties each decision to clause-level evidence under strict grounding and no-peek rules, shifting gains toward justification quality under explicit time budgets.

[126] MoVa: Towards Generalizable Classification of Human Morals and Values

Ziyu Chen, Junfei Sun, Chenxi Li, Tuan Dung Nguyen, Jing Yao, Xiaoyuan Yi, Xing Xie, Chenhao Tan, Lexing Xie

Main category: cs.CL

TL;DR: MoVa is a resource suite for classifying human morals and values, including 16 labeled datasets, an LLM prompting strategy that beats fine-tuned models, and a survey evaluation tool.

Details

Motivation: Researchers face difficulty navigating diverse theoretical frameworks and data for analyzing human morals and values in language.

Method: Developed MoVa suite with 16 datasets from four frameworks, created lightweight LLM prompting strategy (all@once), and built survey evaluation application.

Result: The LLM prompting strategy outperforms fine-tuned models across multiple domains and frameworks.

Conclusion: MoVa facilitates fine-grained interpretations of human and machine communication with implications for machine behavior alignment.

Abstract: Identifying human morals and values embedded in language is essential to empirical studies of communication. However, researchers often face substantial difficulty navigating the diversity of theoretical frameworks and data available for their analysis. Here, we contribute MoVa, a well-documented suite of resources for generalizable classification of human morals and values, consisting of (1) 16 labeled datasets and benchmarking results from four theoretically-grounded frameworks; (2) a lightweight LLM prompting strategy that outperforms fine-tuned models across multiple domains and frameworks; and (3) a new application that helps evaluate psychological surveys. In practice, we specifically recommend a classification strategy, all@once, that scores all related concepts simultaneously, resembling the well-known multi-label classifier chain. The data and methods in MoVa can facilitate many fine-grained interpretations of human and machine communication, with potential implications for the alignment of machine behavior.

[127] Model Fusion with Multi-LoRA Inference for Tool-Enhanced Game Dialogue Agents

Kangxu Wang, Ze Chen, Chengcheng Wei, Jiewen Zheng, Jiarong He, Max Gao

Main category: cs.CL

TL;DR: The opdainlp team won first place in Tasks 1 and 3, and second place in Task 2 of the CPDC 2025 GPU track by using Qwen3-14B with LoRA fine-tuning and model fusion, employing multiple LoRA adapters for different functions.

Details

Motivation: To build an in-game conversational AI that adheres to character personas, aligns with the game's worldview, and supports function calling, while considering effectiveness and resource/time constraints during inference.

Method: Used Qwen3-14B with LoRA fine-tuning and model fusion; synthesized data for some tasks; employed three distinct LoRA adapters for tool calling, response generation with tool call results, and response generation without tool call results; implemented MultiLoRA inference using vLLM.

Result: Achieved first place in Task 1 and Task 3, and second place in Task 2 of the GPU track.

Conclusion: The approach of using multiple LoRA adapters with Qwen3-14B and model fusion was effective for building a conversational AI that meets the competition requirements, demonstrating the viability of this method for similar tasks.

Abstract: This paper presents the opdainlp team’s solution for the GPU track of the CPDC 2025 challenge. The challenge consists of three tasks, aiming to build an in-game conversational AI that adheres to character personas, aligns with the game’s worldview, and supports function calling. Considering both effectiveness and resource/time constraints during inference, we synthesized data for some of the tasks based on the datasets provided by the competition organizers. We employed Qwen3-14B with LoRA fine-tuning and model fusion, and utilized a base model integrated with multiple LoRA adapters during inference. Specifically, in the competition, we used three distinct LoRA adapters to handle tool calling, response generation with tool call results, and response generation without tool call results, respectively. MultiLoRA inference was implemented using vLLM. Our solution achieved the first place in Task 1 and Task 3, and the second place in Task 2 of the GPU track.

[128] Prompt and Parameter Co-Optimization for Large Language Models

Xiaohe Bo, Rui Li, Zexu Sun, Quanyu Dai, Zeyu Zhang, Zihang Tian, Xu Chen, Zhenhua Dong

Main category: cs.CL

TL;DR: MetaTuner is a framework that jointly optimizes prompt optimization and fine-tuning for LLMs, enabling synergistic improvement through shared knowledge encoding and supervised regularization.

Details

Motivation: Prior work studied prompt optimization and fine-tuning in isolation, leaving their synergistic potential largely unexplored despite their complementary approaches to enhancing LLM performance.

Method: Introduces two neural networks for generating prompts and parameters respectively, with a shared bottom encoding layer for knowledge sharing. Uses supervised regularization loss to handle the discrete-continuous optimization challenge.

Result: Extensive experiments across diverse benchmarks show consistent outperformance over baseline methods.

Conclusion: Joint integration of prompt optimization and fine-tuning through MetaTuner framework effectively leverages their complementary strengths for improved LLM performance.

Abstract: Prompt optimization and fine-tuning are two major approaches to improve the performance of Large Language Models (LLMs). They enhance the capabilities of LLMs from complementary perspectives: the former through explicit natural language, and the latter through implicit parameter updates. However, prior work has typically studied them in isolation, leaving their synergistic potential largely underexplored. To bridge this gap, in this paper, we introduce MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning for LLM training. Specifically, we introduce two neural networks to generate prompts and parameters, respectively, while allowing them to share a common bottom encoding layer to enable knowledge sharing. By the guidance of the final supervised signals, our framework is optimized to discover the optimal combinations between the prompts and parameters. Given that prompt learning involves discrete optimization while fine-tuning operates in a continuous parameter space, we design a supervised regularization loss to train our framework effectively. Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines.

[129] MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation

Yuelyu Ji

Main category: cs.CL

TL;DR: MRAG-Suite is a diagnostic evaluation platform for Visual RAG systems that addresses limitations in current evaluations by incorporating difficulty-based and ambiguity-aware filtering strategies, along with MM-RAGChecker for claim-level diagnosis.

Details

Motivation: Current multimodal retrieval-augmented generation evaluations fail to systematically account for query difficulty and ambiguity, limiting comprehensive assessment of Visual RAG systems.

Method: Proposed MRAG-Suite integrates multiple multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench) with difficulty-based and ambiguity-aware filtering strategies, and introduces MM-RAGChecker for claim-level diagnostic analysis.

Result: Results show substantial accuracy reductions under difficult and ambiguous queries, revealing prevalent hallucinations in Visual RAG systems. MM-RAGChecker effectively diagnoses these issues.

Conclusion: The proposed evaluation framework successfully identifies key limitations in Visual RAG systems and provides diagnostic tools to guide future improvements in multimodal retrieval-augmented generation.

Abstract: Multimodal Retrieval-Augmented Generation (Visual RAG) significantly advances question answering by integrating visual and textual evidence. Yet, current evaluations fail to systematically account for query difficulty and ambiguity. We propose MRAG-Suite, a diagnostic evaluation platform integrating diverse multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench). We introduce difficulty-based and ambiguity-aware filtering strategies, alongside MM-RAGChecker, a claim-level diagnostic tool. Our results demonstrate substantial accuracy reductions under difficult and ambiguous queries, highlighting prevalent hallucinations. MM-RAGChecker effectively diagnoses these issues, guiding future improvements in Visual RAG systems.

[130] SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee, Yohan Jo

Main category: cs.CL

TL;DR: SimuHome is a time-accelerated smart home simulator built on Matter protocol that enables realistic testing of LLM agents for smart home tasks, revealing significant challenges in latent intent inference, state verification, and temporal scheduling.

Details

Motivation: Current LLM agents struggle with smart home challenges like latent user intents, temporal dependencies, device constraints, and scheduling. There's a lack of realistic simulation environments and challenging benchmarks to properly evaluate and develop smart home agents.

Method: Developed SimuHome - a time-accelerated home environment simulating smart devices, supporting API calls, and reflecting environmental changes. Built on Matter protocol for high fidelity and real-world deployability. Created benchmark of 600 episodes across 12 user query types requiring complex capabilities.

Result: Evaluation of 11 agents under ReAct framework shows models perform well on simple tasks but struggle with latent intent inference (54% success rate for top model GPT-4.1), state verification, and temporal scheduling.

Conclusion: There’s a critical need for methods that can reliably verify current state via tools before acting and coordinate time-dependent actions. SimuHome provides a realistic testbed for developing and evaluating smart home agents that can be deployed on real Matter-compliant devices.

Abstract: Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more. The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them. To address this, we introduce $\textbf{SimuHome}$, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables. By building the simulator on the Matter protocol (the global industry standard for smart home communication), SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation. We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities. Our evaluation of 11 agents under a unified ReAct framework reveals that while models perform well on simple tasks, they struggle with latent intent inference, state verification, and especially temporal scheduling. Even the top-performing model, GPT-4.1, reaches only 54% success rate. These findings highlight a critical need for methods that can reliably verify the current state via tools before acting and coordinate time-dependent actions.

[131] Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Chi Li, Yuan-Hao Chen, Ching-Yu Tsai, Shou-De Lin

Main category: cs.CL

TL;DR: GIRCSE is a novel framework that uses autoregressive generation to iteratively refine sentence embeddings, outperforming encoder-only LLM-based methods and showing improved performance with more tokens at inference.

Details

Motivation: Existing LLM-based embeddings treat LLMs as static feature extractors and overlook their generative capabilities, missing latent concepts and implicit semantics.

Method: GIRCSE uses autoregressive generation to produce sequences of soft tokens optimized under an Iterative Contrastive Refinement (ICR) objective that encourages better representations at each refinement step.

Result: GIRCSE outperforms strong LLM-based embedding baselines on MTEB benchmark and instruction-following tasks, and exhibits emergent test-time scaling where generating more tokens improves embedding quality.

Conclusion: Generative iterative refinement establishes a new paradigm for representation learning that effectively leverages LLMs’ generative strengths.

Abstract: Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.

[132] LOGOS: LLM-driven End-to-End Grounded Theory Development and Schema Induction for Qualitative Research

Xinyu Pi, Qisen Yang, Chuong Nguyen

Main category: cs.CL

TL;DR: LOGOS is an end-to-end framework that fully automates grounded theory workflow using LLM-driven coding, semantic clustering, and graph reasoning to transform raw text into structured hierarchical theories.

Details

Motivation: To overcome the scalability bottleneck of expert-intensive manual coding in grounded theory and democratize qualitative research by providing true automation while maintaining theoretical nuance.

Method: Integrates LLM-driven coding, semantic clustering, graph reasoning, and iterative refinement process to build reusable codebooks. Introduces 5-dimensional metric and train-test split protocol for standardized evaluation.

Result: Consistently outperforms strong baselines across five diverse corpora, achieving 88.2% alignment with expert-developed schema on complex dataset.

Conclusion: LOGOS demonstrates a powerful path to democratize and scale qualitative research without sacrificing theoretical nuance through full automation of grounded theory workflow.

Abstract: Grounded theory offers deep insights from qualitative data, but its reliance on expert-intensive manual coding presents a major scalability bottleneck. Current computational tools stop short of true automation, keeping researchers firmly in the loop. We introduce LOGOS, a novel, end-to-end framework that fully automates the grounded theory workflow, transforming raw text into a structured, hierarchical theory. LOGOS integrates LLM-driven coding, semantic clustering, graph reasoning, and a novel iterative refinement process to build highly reusable codebooks. To ensure fair comparison, we also introduce a principled 5-dimensional metric and a train-test split protocol for standardized, unbiased evaluation. Across five diverse corpora, LOGOS consistently outperforms strong baselines and achieves a remarkable $88.2%$ alignment with an expert-developed schema on a complex dataset. LOGOS demonstrates a powerful new path to democratize and scale qualitative research without sacrificing theoretical nuance.

[133] DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang

Main category: cs.CL

TL;DR: This paper analyzes vulnerabilities in Diffusion Large Language Models (dLLMs) to jailbreak attacks, identifies key issues like greedy remasking bias and Denoising-path Dependence, and proposes DiffuGuard - a training-free defense framework that reduces attack success rates from 47.9% to 14.7% while maintaining model utility.

Details

Motivation: The rapid advancement of Diffusion Large Language Models introduces unique vulnerabilities distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. These vulnerabilities require specialized analysis and defense strategies.

Method: The paper conducts vulnerability analysis across intra-step and inter-step dimensions, then proposes DiffuGuard - a dual-stage defense framework using Stochastic Annealing Remasking to mitigate greedy selection bias and Block-level Audit and Repair for autonomous risk detection and guided correction.

Result: Experimental results on four dLLMs show DiffuGuard reduces Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency.

Conclusion: While current decoding strategies constitute significant vulnerabilities in dLLMs, these models possess substantial intrinsic safety potential that can be unlocked through proper defense mechanisms like DiffuGuard.

Abstract: The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard’s exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.

Junying Wang, Zicheng Zhang, Ye Shen, Yalun Wu, Yingji Liang, Yijin Guo, Farong Wen, Wenzhe Li, Xuezhi Zhao, Qi Jia, Guangtao Zhai

Main category: cs.CL

TL;DR: A framework for transforming text-only QA pairs into multi-modal QA pairs using an agentic system (Q-Mirror) that iteratively refines MMQAs through generation and evaluation loops.

Details

Motivation: Manual creation of high-quality multi-modal benchmarks for scientific reasoning is costly and unscalable, creating a bottleneck for advancing large models.

Method: Developed a TQA-to-MMQA framework with quality rubric, constructed evaluation benchmarks, and created Q-Mirror agent that integrates MMQA generation and evaluation in a closed loop for iterative refinement.

Result: State-of-the-art models can generate MMQAs but with substantial gaps; top understanding models align well with human judgment; Q-Mirror improved average scores from 78.90 to 85.22 and pass rates from 72% to 95%.

Conclusion: The Q-Mirror system offers a practical path to creating large-scale scientific benchmarks by automating the transformation of text-only QA pairs into high-quality multi-modal QA pairs.

Abstract: High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs), which include three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: Then we construct two extensive benchmarks to rigorously evaluate state-of-the-art generation & understanding models on the distinct tasks of MMQA generation & MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.

[135] Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo

Main category: cs.CL

TL;DR: This paper analyzes how LLMs express values through intrinsic (learned) vs prompted mechanisms, finding they share some components but differ in steerability and response diversity.

Details

Motivation: To understand whether intrinsic and prompted value expressions in LLMs rely on overlapping or distinct mechanisms, which is crucial for value alignment and persona steering applications.

Method: Used two mechanistic approaches: (1) value vectors extracted from residual stream, and (2) value neurons from MLP layers that contribute to value expressions.

Result: Found that intrinsic and prompted value mechanisms partially share common components but also have unique elements, leading to different steerability (prompted > intrinsic) and response diversity (intrinsic > prompted).

Conclusion: Intrinsic mechanisms promote lexical diversity while prompted mechanisms strengthen instruction following, with implications for jailbreaking and value alignment.

Abstract: Large language models (LLMs) can express different values in two distinct ways: (1) intrinsic expression, reflecting the model’s inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment and persona steering, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on substantially different mechanisms, but this remains largely understudied. We analyze this at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value expressions. We demonstrate that intrinsic and prompted value mechanisms partly share common components that are crucial for inducing value expression, but also possess unique elements that manifest in different ways. As a result, these mechanisms lead to different degrees of value steerability (prompted > intrinsic) and response diversity (intrinsic > prompted). In particular, components unique to the intrinsic mechanism seem to promote lexical diversity in responses, whereas those specific to the prompted mechanism primarily strengthen instruction following, taking effect even in distant tasks like jailbreaking.

[136] Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey

Yuntao Shou, Tao Meng, Wei Ai, Keqin Li

Main category: cs.CL

TL;DR: This paper provides the first comprehensive survey of LLMs and multimodal LLMs (MLLMs) for emotion recognition and reasoning, covering architectures, datasets, benchmarks, challenges, and future directions.

Details

Motivation: The field lacks a systematic review despite notable progress in using LLMs and MLLMs for multimodal emotion recognition and reasoning, which has become a rapidly growing frontier in AI for Science.

Method: The authors conduct a comprehensive survey that consolidates recent developments in LLMs and MLLMs for emotion recognition and reasoning, including model architectures, datasets, and performance benchmarks.

Result: The paper provides an authoritative reference and practical insights for researchers, highlighting key challenges and outlining future research directions in this emerging domain.

Conclusion: This represents the first comprehensive attempt to survey the intersection of MLLMs with multimodal emotion recognition and reasoning, with summarized methods available on GitHub.

Abstract: In recent years, large language models (LLMs) have driven major advances in language understanding, marking a significant step toward artificial general intelligence (AGI). With increasing demands for higher-level semantics and cross-modal fusion, multimodal large language models (MLLMs) have emerged, integrating diverse information sources (e.g., text, vision, and audio) to enhance modeling and reasoning in complex scenarios. In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. While LLMs and MLLMs have achieved notable progress in this area, the field still lacks a systematic review that consolidates recent developments. To address this gap, this paper provides a comprehensive survey of LLMs and MLLMs for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks. We further highlight key challenges and outline future research directions, aiming to offer researchers both an authoritative reference and practical insights for advancing this domain. To the best of our knowledge, this paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning. The summary of existing methods mentioned is in our Github: \href{https://github.com/yuntaoshou/Awesome-Emotion-Reasoning}{https://github.com/yuntaoshou/Awesome-Emotion-Reasoning}.

[137] Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

Sungkyun Kim, Jaemin Kim, Dogyung Yoon, Jiho Shin, Junyeol Lee, Jiwon Seo

Main category: cs.CL

TL;DR: Speculative Verification (SV) improves speculative decoding by dynamically predicting speculation accuracy and adapting verification length to maximize throughput, achieving up to 2x speedup over standard speculative decoding.

Details

Motivation: Standard speculative decoding suffers from overhead when speculation accuracy is low, especially at large batch sizes, limiting its effectiveness for efficient LLM inference.

Method: SV introduces a companion model to estimate alignment between draft and target model distributions, dynamically predicting speculation accuracy and adapting verification length to reduce wasted computation on rejected tokens.

Result: SV consistently outperforms both speculative decoding and standard decoding across all experiments, improving SD performance by up to 2x with average 1.4x speedup in large-batch settings (batch sizes 32-80).

Conclusion: SV demonstrates robustness, scalability, and practical utility for efficient LLM inference, requiring no modifications to draft or target models and being compatible with existing SD variants.

Abstract: LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target model. However, when speculation accuracy is low, the overhead from rejected tokens can offset the benefits, limiting SD’s effectiveness, especially at large batch sizes. To address this, we propose Speculative Verification (SV), an efficient augmentation to SD that dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. SV introduces a companion model - a small auxiliary model similar in size to the draft model - to estimate the alignment between draft and target model distributions. By maximizing the information gain from quantifying this alignment, SV refines verification decisions, reducing wasted computation on rejected tokens and improving decoding efficiency. Moreover, SV requires no modifications to the draft or target models and is compatible with existing SD variants. We extensively evaluated SV on publicly available LLMs across three NLP tasks using nine combinations of draft, companion, and target models, including 13B-72B target models and three types of variations: base (no finetuning), instruction-tuned, and task fine-tuned. Across all experiments and batch sizes (4-80), SV consistently outperforms both SD and standard decoding with the target model. It improves SD performance by up to 2$\times$, with an average speedup of 1.4 $\times$ in large-batch settings (batch sizes 32-80). These results demonstrate SV’s robustness, scalability, and practical utility for efficient LLM inference.

[138] AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment

Mengyu Bu, Shaolei Zhang, Zhongjun He, Hua Wu, Yang Feng

Main category: cs.CL

TL;DR: AlignX is a two-stage framework that improves multilingual LLM performance through representation alignment and instruction fine-tuning, enhancing cross-lingual capabilities for non-dominant languages.

Details

Motivation: Multilingual LLMs often underperform for non-dominant languages due to imprecise alignment and suboptimal knowledge transfer from standard fine-tuning approaches.

Method: Two-stage framework: 1) Representation alignment with multilingual semantic alignment and language feature integration, 2) Multilingual instruction fine-tuning to stimulate LLM capabilities.

Result: Experimental results show enhanced multilingual general and cross-lingual generation capabilities across several pre-trained LLMs, with improved representation closeness and cross-lingual alignment.

Conclusion: AlignX effectively bridges the multilingual performance gap by improving representation alignment and stimulating multilingual capabilities through a structured two-stage approach.

Abstract: Multilingual large language models (LLMs) possess impressive multilingual understanding and generation capabilities. However, their performance and cross-lingual alignment often lag for non-dominant languages. A common solution is to fine-tune LLMs on large-scale and more balanced multilingual corpus, but such approaches often lead to imprecise alignment and suboptimal knowledge transfer, struggling with limited improvements across languages. In this paper, we propose AlignX to bridge the multilingual performance gap, which is a two-stage representation-level framework for enhancing multilingual performance of pre-trained LLMs. In the first stage, we align multilingual representations with multilingual semantic alignment and language feature integration. In the second stage, we stimulate the multilingual capability of LLMs via multilingual instruction fine-tuning. Experimental results on several pre-trained LLMs demonstrate that our approach enhances LLMs’ multilingual general and cross-lingual generation capability. Further analysis indicates that AlignX brings the multilingual representations closer and improves the cross-lingual alignment.

[139] Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining

Matthew Theodore Roque, Dan John Velasco

Main category: cs.CL

TL;DR: The paper studies curriculum learning in pretraining for data-constrained settings, examining text-complexity ordering and data augmentation via simplification.

Details

Motivation: Most language model pretraining research focuses on large datasets, leaving optimization in data-constrained settings underexplored, particularly regarding training data order and including alternative text versions.

Method: Built on parallel corpora with human-written paragraphs aligned with LLM-simplified variants, testing four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. Analyzed representation quality via fine-tuning and zero-shot performance on multiple tasks.

Result: Adding simplified data improves fine-tuning and zero-shot performance over repeated-exposure baseline. Smaller models benefit from low-to-high complexity ordering, while larger models perform better with interleaved ordering.

Conclusion: Curriculum learning with text simplification and complexity-based ordering enhances representation quality in data-constrained pretraining, with optimal strategies varying by model size.

Abstract: Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask: (1) Does simplifying texts enhance representation quality more than reusing the original data? and (2) Does ordering data by text complexity yield better representations? To answer, we build on a pair of parallel corpora where human-written paragraphs are aligned with LLM-simplified variants, and test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. We analyze models’ representation quality from a sample efficiency perspective via fine-tuning, as well as its zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity, while larger models perform better with interleaved ordering.

[140] Reinforcement Mid-Training

Yijun Tian, Shaoyu Chen, Zhichao Xu, Yawei Wang, Jinhe Bi, Peng Han, Wei Wang

Main category: cs.CL

TL;DR: The paper proposes RMT, a reinforcement mid-training framework that addresses inefficiencies in LLM training through dynamic token budgeting, curriculum-based sampling, and dual training strategy, achieving significant performance gains with reduced reasoning length.

Details

Motivation: Current LLM development has pre-training and post-training stages, but lacks an intermediate reinforcement mid-training stage that could provide strong performance improvements by addressing training inefficiencies.

Method: RMT framework with three key components: (1) dynamic token budget mechanism to limit unnecessary reasoning steps, (2) curriculum-based adaptive sampling for progressive learning from easy to hard tokens, (3) dual training strategy combining RL with next-token prediction.

Result: Achieves up to +64.91% performance improvement with only 21% of reasoning length in language modeling. Checkpoints from reinforcement mid-training also improve subsequent post-training by up to +18.76% in mathematical domain.

Conclusion: Reinforcement mid-training is a valuable intermediate stage in LLM development that significantly enhances performance and efficiency when properly implemented with the proposed RMT framework.

Abstract: The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.

[141] HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren

Main category: cs.CL

TL;DR: HarmMetric Eval is a benchmark for evaluating harmfulness metrics and judges in LLM jailbreak attacks, revealing that traditional metrics like METEOR and ROUGE-1 outperform LLM-based judges.

Details

Motivation: The proliferation of jailbreak attacks on LLMs requires reliable metrics to assess harmfulness, but the absence of a systematic benchmark undermines the credibility of reported jailbreak effectiveness.

Method: Created a comprehensive benchmark with a high-quality dataset of harmful prompts paired with diverse model responses, and implemented a flexible scoring mechanism compatible with various metrics and judges.

Result: Extensive experiments showed that traditional metrics (METEOR and ROUGE-1) outperform LLM-based judges in evaluating harmfulness of model responses.

Conclusion: Challenges prevailing beliefs about LLMs’ superiority in harmfulness evaluation and provides a publicly available benchmark for future research.

Abstract: The alignment of large language models (LLMs) with human values is critical for their safe deployment, yet jailbreak attacks can subvert this alignment to elicit harmful outputs from LLMs. In recent years, a proliferation of jailbreak attacks has emerged, accompanied by diverse metrics and judges to assess the harmfulness of the LLM outputs. However, the absence of a systematic benchmark to assess the quality and effectiveness of these metrics and judges undermines the credibility of the reported jailbreak effectiveness and other risks. To address this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. Our benchmark includes a high-quality dataset of representative harmful prompts paired with diverse harmful and non-harmful model responses, alongside a flexible scoring mechanism compatible with various metrics and judges. With HarmMetric Eval, our extensive experiments uncover a surprising result: two conventional metrics–METEOR and ROUGE-1–outperform LLM-based judges in evaluating the harmfulness of model responses, challenging prevailing beliefs about LLMs’ superiority in this domain. Our dataset is publicly available at https://huggingface.co/datasets/qusgo/HarmMetric_Eval, and the code is available at https://anonymous.4open.science/r/HarmMetric-Eval-4CBE.

[142] LLaDA-MoE: A Sparse MoE Diffusion Language Model

Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen

Main category: cs.CL

TL;DR: LLaDA-MoE is a large language diffusion model with Mixture-of-Experts architecture that achieves competitive performance with only 1.4B active parameters during inference while maintaining 7B total capacity, trained on 20T tokens.

Details

Motivation: To develop a more computationally efficient diffusion language model by integrating sparse Mixture-of-Experts architecture to reduce inference costs while maintaining competitive performance.

Method: Built LLaDA-MoE from scratch using Mixture-of-Experts architecture, trained on approximately 20T tokens, maintaining 7B total parameters but activating only 1.4B parameters during inference through sparse activation.

Result: Achieves state-of-the-art performance among diffusion language models, surpassing previous models like LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned version (LLaDA-MoE-7B-A1B-Instruct) performs comparably to Qwen2.5-3B-Instruct in various tasks despite using fewer active parameters.

Conclusion: Integrating sparse MoE architecture into masked diffusion language models successfully brings out MoE’s strengths for efficient inference with few active parameters, opening new possibilities for further exploration of diffusion language models.

Abstract: We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE’s strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available at Huggingface.

[143] Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling

Pengfei Wang, Baolin Sun, Xuemei Dong, Yaxun Dai, Hongwei Yuan, Mengdie Chu, Yingqi Gao, Xiang Qi, Peng Zhang, Ying Yan

Main category: cs.CL

TL;DR: Agentar-Scale-SQL introduces a novel test-time scaling framework that combines internal, sequential, and parallel scaling strategies to achieve state-of-the-art performance on the BIRD benchmark.

Details

Motivation: Current Text-to-SQL methods lag behind human experts and lack orchestrated test-time scaling strategies that consider the model's internal reasoning process.

Method: Orchestrated Test-Time Scaling strategy combining: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection.

Result: Achieves 81.67% execution accuracy on BIRD benchmark test set, ranking first on the official leaderboard.

Conclusion: Agentar-Scale-SQL provides an effective framework toward human-level performance in Text-to-SQL tasks and is designed for easy adaptation to new databases and more powerful language models.

Abstract: State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model’s internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.

[144] Multilingual Text-to-SQL: Benchmarking the Limits of Language Models with Collaborative Language Agents

Khanh Trinh Pham, Thu Huong Nguyen, Jun Jo, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

Main category: cs.CL

TL;DR: MultiSpider 2.0 extends Spider 2.0 to 8 languages, revealing a significant multilingual gap in Text-to-SQL performance where state-of-the-art LLMs achieve only 4% accuracy compared to 60% on English-only benchmarks.

Details

Motivation: Most Text-to-SQL benchmarks are English-only, limiting progress in multilingual applications and failing to capture real-world linguistic diversity.

Method: Extended Spider 2.0 benchmark to 8 languages (English, German, French, Spanish, Portuguese, Japanese, Chinese, Vietnamese) while preserving structural difficulty but adding linguistic variability.

Result: State-of-the-art LLMs achieved only 4% execution accuracy on MultiSpider 2.0 using intrinsic reasoning, versus 60% on MultiSpider 1.0. A collaboration-driven language agents approach improved accuracy to 15%.

Conclusion: There is a substantial multilingual gap in Text-to-SQL performance, motivating the need for methods robust across languages and ready for real-world enterprise deployment.

Abstract: Text-to-SQL enables natural access to databases, yet most benchmarks are English-only, limiting multilingual progress. We introduce MultiSpider 2.0, extending Spider 2.0 to eight languages (English, German, French, Spanish, Portuguese, Japanese, Chinese, Vietnamese). It preserves Spider 2.0’s structural difficulty while adding linguistic and dialectal variability, demanding deeper reasoning for complex SQL. On this benchmark, state-of-the-art LLMs (such as DeepSeek-R1 and OpenAI o1) reach only 4% execution accuracy when relying on intrinsic reasoning, versus 60% on MultiSpider 1.0. Therefore, we provide a collaboration-driven language agents baseline that iteratively refines queries, improving accuracy to 15%. These results reveal a substantial multilingual gap and motivate methods that are robust across languages and ready for real-world enterprise deployment. Our benchmark is available at https://github.com/phkhanhtrinh23/Multilingual_Text_to_SQL.

[145] CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task

Haosi Mo, Xinyu Ma, Xuebo Liu, Derek F. Wong, Yu Li, Jie Liu, Min Zhang

Main category: cs.CL

TL;DR: The paper proposes the Cognition-Domain-Task (CDT) framework to comprehensively evaluate LLM capabilities across cognitive, domain, and task dimensions, addressing limitations of existing benchmarks that focus on isolated abilities.

Details

Motivation: Existing benchmarks for LLMs often focus on isolated abilities and lack a holistic framework for comprehensive capability assessment, creating a gap in evaluation methodologies.

Method: Proposed CDT framework that measures model capabilities across three dimensions (cognition, domain, task), incorporating Cattell-Horn-Carroll cognitive theory to refine capability categorization. Applied CDT for dataset capability evaluation and data selection.

Result: Capability metrics correlated well with downstream performance and supported effective dataset analysis. Data selection experiments showed significant improvements: 44.3 and 45.4 scores with 1.6 and 2.2 point increases over baselines on general and specific benchmarks respectively.

Conclusion: The CDT framework is validated as effective and practical for comprehensive LLM capability assessment, with demonstrated improvements in dataset analysis and construction.

Abstract: Recent advances in Large Language Models (LLMs) have significantly enhanced their capabilities, highlighting the need for comprehensive evaluation frameworks that extend beyond task-specific benchmarks. However, existing benchmarks often focus on isolated abilities, lacking a holistic framework for assessing LLM capabilities. To address this gap, we propose the Cognition-Domain-Task (CDT) framework, which comprehensively measures a model’s capabilities across three dimensions. We expand the scope of model capability definitions at the cognitive level by incorporating the Cattell-Horn-Carroll cognitive theory, refining the categorization of model capabilities. We apply CDT in two directions: dataset capability evaluation and data selection. Experiments show that our capability metrics correlate well with downstream performance and can support effective dataset analysis and construction. The experiments on data selection also show significant improvements in both general and specific benchmarks, achieving scores of 44.3 and 45.4, with an increase of 1.6 and 2.2 points over the baselines, respectively. These results validate the effectiveness and practicality of CDT. Source code and models are available at https://github.com/Alessa-mo/CDT.

[146] Alternatives To Next Token Prediction In Text Generation – A Survey

Charlie Wyatt, Aditya Joshi, Flora Salim

Main category: cs.CL

TL;DR: This survey categorizes alternatives to Next Token Prediction (NTP) in LLMs into five families to address NTP’s limitations like poor planning and error accumulation.

Details

Motivation: NTP drives LLM success but causes persistent weaknesses including poor long-term planning, error accumulation, and computational inefficiency. There's growing interest in exploring alternatives.

Method: The survey categorizes NTP alternatives into five main families: Multi-Token Prediction, Plan-then-Generate, Latent Reasoning, Continuous Generation Approaches, and Non-Transformer Architectures.

Result: A comprehensive taxonomy of emerging alternatives to NTP is presented, synthesizing insights across different approaches to address token-level generation limitations.

Conclusion: The survey provides a framework to guide research into models that overcome NTP’s limitations and develop transformative NLP models through alternative generation paradigms.

Abstract: The paradigm of Next Token Prediction (NTP) has driven the unprecedented success of Large Language Models (LLMs), but is also the source of their most persistent weaknesses such as poor long-term planning, error accumulation, and computational inefficiency. Acknowledging the growing interest in exploring alternatives to NTP, the survey describes the emerging ecosystem of alternatives to NTP. We categorise these approaches into five main families: (1) Multi-Token Prediction, which targets a block of future tokens instead of a single one; (2) Plan-then-Generate, where a global, high-level plan is created upfront to guide token-level decoding; (3) Latent Reasoning, which shifts the autoregressive process itself into a continuous latent space; (4) Continuous Generation Approaches, which replace sequential generation with iterative, parallel refinement through diffusion, flow matching, or energy-based methods; and (5) Non-Transformer Architectures, which sidestep NTP through their inherent model structure. By synthesizing insights across these methods, this survey offers a taxonomy to guide research into models that address the known limitations of token-level generation to develop new transformative models for natural language processing.

[147] Bias Mitigation or Cultural Commonsense? Evaluating LLMs with a Japanese Dataset

Taisei Yamamoto, Ryoma Kumon, Danushka Bollegala, Hitomi Yanaka

Main category: cs.CL

TL;DR: The paper introduces SOBACO, a Japanese benchmark to evaluate social biases and cultural commonsense in LLMs, finding that debiasing methods significantly degrade cultural commonsense performance (up to 75% accuracy deterioration).

Details

Motivation: Existing debiasing methods may degrade LLM capabilities, but previous evaluations focused on general language understanding tasks unrelated to social biases. Cultural commonsense is closely related to social biases as both stem from social norms and values, yet the impact of bias mitigation on cultural commonsense remains uninvestigated.

Method: Proposed SOBACO (SOcial BiAs and Cultural cOmmonsense benchmark), a Japanese benchmark that evaluates social biases and cultural commonsense in LLMs using a unified format. Evaluated several LLMs on SOBACO to examine how debiasing methods affect cultural commonsense.

Result: Debiasing methods significantly degraded LLM performance on cultural commonsense tasks, with up to 75% accuracy deterioration observed across the evaluated models.

Conclusion: There is a critical trade-off between bias mitigation and cultural commonsense preservation. The findings highlight the importance of developing debiasing methods that consider this trade-off to improve both fairness and utility of LLMs.

Abstract: Large language models (LLMs) exhibit social biases, prompting the development of various debiasing methods. However, debiasing methods may degrade the capabilities of LLMs. Previous research has evaluated the impact of bias mitigation primarily through tasks measuring general language understanding, which are often unrelated to social biases. In contrast, cultural commonsense is closely related to social biases, as both are rooted in social norms and values. The impact of bias mitigation on cultural commonsense in LLMs has not been well investigated. Considering this gap, we propose SOBACO (SOcial BiAs and Cultural cOmmonsense benchmark), a Japanese benchmark designed to evaluate social biases and cultural commonsense in LLMs in a unified format. We evaluate several LLMs on SOBACO to examine how debiasing methods affect cultural commonsense in LLMs. Our results reveal that the debiasing methods degrade the performance of the LLMs on the cultural commonsense task (up to 75% accuracy deterioration). These results highlight the importance of developing debiasing methods that consider the trade-off with cultural commonsense to improve fairness and utility of LLMs.

[148] A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

Lasse Borgholt, Jakob Havtorn, Christian Igel, Lars Maaløe, Zheng-Hua Tan

Main category: cs.CL

TL;DR: A novel alignment algorithm combining dynamic programming with beam search scoring for more accurate error analysis in speech recognition systems.

Details

Motivation: Current speech recognition evaluation metrics like word error rate obscure meaningful differences by focusing on frequent words, while errors in rare terms, named entities, and domain-specific vocabulary remain hidden.

Method: Proposed a novel alignment algorithm that couples dynamic programming with beam search scoring for precise alignment between reference and model transcripts.

Result: The approach provides more accurate alignment of individual errors compared to traditional text alignment methods, enabling reliable error analysis.

Conclusion: The algorithm addresses the need for finer-grained error analysis in speech recognition and is made available via PyPI.

Abstract: Modern neural networks have greatly improved performance across speech recognition benchmarks. However, gains are often driven by frequent words with limited semantic weight, which can obscure meaningful differences in word error rate, the primary evaluation metric. Errors in rare terms, named entities, and domain-specific vocabulary are more consequential, but remain hidden by aggregate metrics. This highlights the need for finer-grained error analysis, which depends on accurate alignment between reference and model transcripts. However, conventional alignment methods are not designed for such precision. We propose a novel alignment algorithm that couples dynamic programming with beam search scoring. Compared to traditional text alignment methods, our approach provides more accurate alignment of individual errors, enabling reliable error analysis. The algorithm is made available via PyPI.

[149] Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models

Wenjie Fu, Huandong Wang, Junyao Gao, Guoan Wan, Tao Jiang

Main category: cs.CL

TL;DR: Self-Sanitize is a novel LLM-driven framework that provides real-time mitigation of harmful content generation through self-monitoring and self-repair mechanisms, achieving superior performance with minimal overhead.

Details

Motivation: Existing mitigation strategies for harmful content in LLMs rely on post-hoc filtering, which introduces latency and computational overhead, and is incompatible with token-level streaming generation.

Method: Self-Sanitize comprises a lightweight Self-Monitor module that continuously inspects high-level intentions via representation engineering, and a Self-Repair module that performs in-place correction of harmful content without separate review dialogues.

Result: Extensive experiments on four LLMs across three privacy leakage scenarios demonstrate that Self-Sanitize achieves superior mitigation performance with minimal overhead and without degrading LLM utility.

Conclusion: Self-Sanitize offers a practical and robust solution for safer LLM deployments by enabling real-time streaming monitoring and seamless repair with negligible impact on latency and resource utilization.

Abstract: As Large Language Models (LLMs) achieve remarkable success across a wide range of applications, such as chatbots and code copilots, concerns surrounding the generation of harmful content have come increasingly into focus. Despite significant advances in aligning LLMs with safety and ethical standards, adversarial prompts can still be crafted to elicit undesirable responses. Existing mitigation strategies are predominantly based on post-hoc filtering, which introduces substantial latency or computational overhead, and is incompatible with token-level streaming generation. In this work, we introduce Self-Sanitize, a novel LLM-driven mitigation framework inspired by cognitive psychology, which emulates human self-monitor and self-repair behaviors during conversations. Self-Sanitize comprises a lightweight Self-Monitor module that continuously inspects high-level intentions within the LLM at the token level via representation engineering, and a Self-Repair module that performs in-place correction of harmful content without initiating separate review dialogues. This design allows for real-time streaming monitoring and seamless repair, with negligible impact on latency and resource utilization. Given that privacy-invasive content has often been insufficiently focused in previous studies, we perform extensive experiments on four LLMs across three privacy leakage scenarios. The results demonstrate that Self-Sanitize achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs, offering a practical and robust solution for safer LLM deployments. Our code is available at the following link: https://github.com/wjfu99/LLM_Self_Sanitize

[150] GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training

Hongcheng Wang, Yinuo Huang, Sukai Wang, Guanghui Ren, Hao Dong

Main category: cs.CL

TL;DR: GRPO-MA improves GRPO algorithm by generating multiple answers per thought to address gradient coupling, sparse rewards, and unstable advantage estimation in training Chain-of-Thought reasoning.

Details

Motivation: Address three key challenges in GRPO algorithm: gradient coupling between thoughts and answers, sparse reward signals from limited parallel sampling, and unstable advantage estimation.

Method: Propose GRPO-MA which leverages multi-answer generation from each thought process, enabling more robust and efficient optimization through variance reduction in thought advantage.

Result: Empirical analysis shows GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and multimodal tasks demonstrate substantial performance and training efficiency improvements.

Conclusion: Increasing the number of answers per thought consistently enhances model performance, making GRPO-MA a more effective approach for training Chain-of-Thought reasoning in LLMs and VLMs.

Abstract: Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.

[151] Knowledge Editing with Subspace-Aware Key-Value Mappings

Haewon Park, Sangwoo Kim, Yohan Jo

Main category: cs.CL

TL;DR: SUIT is a knowledge editing method that modifies only the critical feature subspace relevant to edits, reducing model perturbations while maintaining high edit efficacy.

Details

Motivation: Existing locate-then-edit methods for correcting factual errors in Language Models cause significant perturbations to the edited model due to lack of constraints on key and value vectors.

Method: SUIT identifies and modifies only the subspace of critical features relevant to the edit, rather than modifying entire MLP layers.

Result: Empirical results on LLaMA-3-8B, GPT-J-6B, and Qwen2.5-7B show SUIT dramatically improves knowledge preservation over strong baselines while maintaining high edit efficacy.

Conclusion: SUIT successfully identifies the critical subspace for edits, providing an effective approach for knowledge editing with minimal model perturbation.

Abstract: Knowledge editing aims to efficiently correct factual errors in Language Models (LMs). The popular locate-then-edit approach modifies an MLP layer by finding an optimal mapping between its input vector (key) and output vector (value) that leads to the expression of the edited knowledge. However, existing methods without any constraints on the key and value vectors cause significant perturbations to the edited model. To address this, we propose Subspace Knowledge Edit (SUIT), a method that identifies and modifies only the subspace of critical features relevant to the edit. Our empirical results on LLaMA-3-8B, GPT-J-6B, and Qwen2.5-7B models show that SUIT dramatically improves knowledge preservation over strong baselines while maintaining high edit efficacy. This effectiveness confirms that SUIT successfully identifies the critical subspace for the edit. Further analyses provide additional validation for our approach. The source code and data will be released to the public upon publication of the paper.

[152] Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

Hamna, Gayatri Bhat, Sourabrata Mukherjee, Faisal Lalani, Evan Hadfield, Divya Siddarth, Kalika Bali, Sunayana Sitaram

Main category: cs.CL

TL;DR: Samiksha is a community-driven evaluation pipeline for LLMs that moves beyond artificial benchmarks to reflect real community needs, particularly in healthcare contexts like India.

Details

Motivation: Current LLM evaluations lack grounding in real user contexts and cultural practices, especially in critical domains like healthcare where community needs are nuanced.

Method: Co-created with civil-society organizations and community members, the pipeline uses community feedback to determine evaluation criteria, benchmark construction, and output scoring.

Result: The approach demonstrates how multilingual LLMs handle nuanced community health queries while providing a scalable method for contextually grounded evaluation.

Conclusion: Samiksha offers a scalable pathway for more inclusive and culturally aware LLM evaluation that better reflects real-world community contexts.

Abstract: Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.

[153] AdaThink-Med: Medical Adaptive Thinking with Uncertainty-Guided Length Calibration

Shaohao Rui, Kaitao Chen, Weijie Ma, Xiaosong Wang

Main category: cs.CL

TL;DR: AdaThink-Med is an end-to-end framework that enhances adaptive thinking in medical LLMs by using uncertainty-guided length calibration to reduce reasoning length for simple questions while maintaining performance on complex ones.

Details

Motivation: Current medical LLMs use lengthy reasoning for all questions regardless of difficulty, leading to unnecessary computational costs. Adaptive thinking is needed where models think less for simple questions and more for complex ones.

Method: Generates multiple candidate outputs, evaluates correctness and uncertainty, estimates problem difficulty via uncertainty-guided length calibration, penalizes long reasoning for easy questions, and extends chain of thought for difficult ones.

Result: Achieves up to 6.4x length reduction on average across six medical QA benchmarks while maintaining performance with minimal degradation. Spontaneously develops “non-thinking” and “thinking” reasoning modes.

Conclusion: AdaThink-Med successfully enables adaptive thinking in medical LLMs, dynamically suppressing redundant reasoning while preserving performance, making medical AI more computationally efficient.

Abstract: Recent advances in inference time scaling with extended long chain-of thought have significantly improved the reasoning capabilities of both general and medical large language models (LLMs). However, these models tend to engage in lengthy reasoning processes regardless of the difficulty of the input question, leading to increased inference costs in real-world applications. Therefore, enabling adaptive thinking where models think less for simpler questions and think more for complex ones is critical for the effective use of medical LLMs in practice. Despite its importance, there is a lack of end-to-end approaches designed to enhance the adaptive thinking capabilities of medical LLMs while providing a comprehensive examination of the trade-off between performance and computational cost. To bridge this gap, we propose AdaThink-Med, the first end-to-end framework designed to enhance adaptive thinking ability in medical reasoning models with uncertainty-guided length calibration. AdaThink-Med first generates multiple candidate outputs for each question, evaluates the correctness and uncertainty of each candidate, and then estimates problem difficulty via an uncertainty-guided length calibration module. For outputs with low difficulty and correct answers, the framework penalizes longer reasoning paths; whereas for those with high difficulty and incorrect answers, it encourages extending the chain of thought to explore alternative solutions. On six public medical QA benchmarks, AdaThink-Med achieves up to 6.4x length reduction on average while retaining performance with only minimal degradation. Intriguingly, we observe that AdaThink-Med spontaneously develops two distinct reasoning modes, which we characterize as “non-thinking” and “thinking”, demonstrating the model’s ability to suppress redundant reasoning processes dynamically.

[154] Inducing Dyslexia in Vision Language Models

Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf

Main category: cs.CL

TL;DR: Using vision-language models to simulate dyslexia by identifying and perturbing artificial word processing units, showing selective reading impairments while preserving general visual and language abilities.

Details

Motivation: Traditional methods for studying dyslexia are limited in testing causal hypotheses about reading impairments, so researchers turned to computational models to better understand underlying mechanisms.

Method: Identified visual-word-form-selective units in large-scale vision-language models and performed targeted ablation of these units, comparing with ablation of random units.

Result: Targeted ablation of word-form-selective units caused selective reading impairments matching dyslexic humans’ phonological deficits, while general visual and language comprehension remained intact.

Conclusion: The modeling approach successfully replicated key dyslexia characteristics and established a computational framework for investigating reading disorders.

Abstract: Dyslexia, a neurodevelopmental disorder characterized by persistent reading difficulties, is often linked to reduced activity of the visual word form area in the ventral occipito-temporal cortex. Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, have provided valuable insights but remain limited in their ability to test causal hypotheses about the underlying mechanisms of reading impairments. In this study, we use large-scale vision-language models (VLMs) to simulate dyslexia by functionally identifying and perturbing artificial analogues of word processing. Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that targeted ablation of these units, unlike ablation of random units, leads to selective impairments in reading tasks while general visual and language comprehension abilities remain intact. In particular, the resulting model matches dyslexic humans’ phonological deficits without a significant change in orthographic processing. Taken together, our modeling results replicate key characteristics of dyslexia and establish a computational framework for investigating reading disorders.

[155] Hype or not? Formalizing Automatic Promotional Language Detection in Biomedical Research

Bojan Batalo, Erica K. Shimomoto, Neil Millar

Main category: cs.CL

TL;DR: Automatic detection of hype in scientific language using machine learning models trained on annotated NIH grant applications.

Details

Motivation: Promotional language in science is increasing and can undermine objective evaluation, impede research development, and erode trust in science.

Method: Developed formalized guidelines for identifying hype language, annotated NIH grant applications, and evaluated traditional text classifiers and language models against human baseline.

Result: Formalized annotation guidelines helped humans reliably annotate hype adjectives, and machine learning models trained on the annotated dataset showed promising results for hype detection.

Conclusion: Hype detection is linguistically complex and may require domain knowledge and temporal awareness, representing the first NLP approach to this problem.

Abstract: In science, promotional language (‘hype’) is increasing and can undermine objective evaluation of evidence, impede research development, and erode trust in science. In this paper, we introduce the task of automatic detection of hype, which we define as hyperbolic or subjective language that authors use to glamorize, promote, embellish, or exaggerate aspects of their research. We propose formalized guidelines for identifying hype language and apply them to annotate a portion of the National Institutes of Health (NIH) grant application corpus. We then evaluate traditional text classifiers and language models on this task, comparing their performance with a human baseline. Our experiments show that formalizing annotation guidelines can help humans reliably annotate candidate hype adjectives and that using our annotated dataset to train machine learning models yields promising results. Our findings highlight the linguistic complexity of the task, and the potential need for domain knowledge and temporal awareness of the facts. While some linguistic works address hype detection, to the best of our knowledge, we are the first to approach it as a natural language processing task.

[156] InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation

Weilin Zhao, Zihan Zhou, Zhou Su, Chaojun Xiao, Yuxuan Li, Yanghao Li, Yudi Zhang, Weilun Zhao, Zhen Li, Yuxiang Huang, Ao Sun, Xu Han, Zhiyuan Liu

Main category: cs.CL

TL;DR: InfLLM-V2 introduces a dense-sparse switchable attention framework that enables efficient long-sequence processing in LLMs by reusing dense attention parameters and smoothly transitioning between dense and sparse attention based on sequence length.

Details

Motivation: Self-attention in Transformers faces computational bottlenecks with long sequences, and existing sparse attention methods disrupt the standard pretrain-on-short, finetune-on-long workflow with excessive parameters and slow convergence.

Method: Parameter-free architecture modification that reuses dense attention parameters, using dense attention for short sequences and automatically switching to sparse attention for long sequences with an efficient implementation to reduce computational overhead.

Result: InfLLM-V2 achieves 4x speedup over dense attention while retaining 98.1% performance on long-context understanding and 99.7% on chain-of-thought reasoning. The framework was used to train and open-source MiniCPM4.1 model.

Conclusion: InfLLM-V2 provides an effective solution for long-sequence processing that maintains performance while significantly improving efficiency, enabling practical acceleration of large language models for long-context tasks.

Abstract: Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional \textit{pretrain-on-short, finetune-on-long} workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce dense-sparse switchable attention framework, termed as InfLLM-V2. InfLLM-V2 is a trainable sparse attention that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths, by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4$\times$ faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (https://huggingface.co/openbmb/MiniCPM4.1-8B), a hybrid reasoning model, providing a reproducible implementation for the research community.

[157] Understanding the Dilemma of Unlearning for Large Language Models

Qingjie Zhang, Haoting Qian, Zhicong Huang, Cheng Hong, Minlie Huang, Ke Xu, Chao Zhang, Han Qiu

Main category: cs.CL

TL;DR: Unlearning in LLMs faces a dilemma: methods either insufficiently remove knowledge (recoverable via keyword emphasis) or cause catastrophic forgetting, with knowledge changes mainly disrupting keyword focus rather than true erasure.

Details

Motivation: To address the lack of interpretability in unlearning mechanisms and understand why unlearning effectiveness is contested - whether knowledge is truly removed or just suppressed.

Method: Proposed unPact framework using prompt attribution and contribution tracking to quantify each prompt token’s influence on outputs, enabling pre- and post-unlearning comparisons across six methods, three LLMs, and three benchmarks.

Result: Found that unlearning works by disrupting keyword focus rather than true knowledge erasure; knowledge remains recoverable via keyword emphasis without weight modification; catastrophic forgetting stems from indiscriminate token penalization.

Conclusion: Existing unlearning methods face a fundamental dilemma between insufficient knowledge removal (recoverable) and overly destructive approaches (catastrophic forgetting), highlighting the gap to reliable unlearning.

Abstract: Unlearning seeks to remove specific knowledge from large language models (LLMs), but its effectiveness remains contested. On one side, “forgotten” knowledge can often be recovered through interventions such as light fine-tuning; on the other side, unlearning may induce catastrophic forgetting that degrades general capabilities. Despite active exploration of unlearning methods, interpretability analyses of the mechanism are scarce due to the difficulty of tracing knowledge in LLMs’ complex architectures. We address this gap by proposing unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking. Typically, it quantifies each prompt token’s influence on outputs, enabling pre- and post-unlearning comparisons to reveal what changes. Across six mainstream unlearning methods, three LLMs, and three benchmarks, we find that: (1) Unlearning appears to be effective by disrupting focus on keywords in prompt; (2) Much of the knowledge is not truly erased and can be recovered by simply emphasizing these keywords in prompts, without modifying the model’s weights; (3) Catastrophic forgetting arises from indiscriminate penalization of all tokens. Taken together, our results suggest an unlearning dilemma: existing methods tend either to be insufficient - knowledge remains recoverable by keyword emphasis, or overly destructive - general performance collapses due to catastrophic forgetting, still leaving a gap to reliable unlearning.

[158] Reference-Free Rating of LLM Responses via Latent Information

Leander Girrbach, Chi-Ping Su, Tankred Saanum, Richard Socher, Eric Schulz, Zeynep Akata

Main category: cs.CL

TL;DR: The paper shows that single-response LLM-as-a-judge ratings without references are unreliable due to instability and poor calibration. It proposes Latent Judges using internal model signals to provide more deterministic and discriminative scores.

Details

Motivation: To address the unreliability of single-response LLM-as-a-judge ratings, which suffer from instability under sampling, poor calibration, score compression near the top of the scale, and frequent ties.

Method: Proposes Latent Judges that derive scalar ratings from internal model signals: (i) probability-weighted scores over integer ratings, (ii) verifier-style probabilities of “yes”, and (iii) linear probes trained on model activations at the rating position.

Result: Across various benchmarks, latent methods match or surpass standard prompting, with consistent gains on pairwise accuracy and listwise ranking. Probability-weighted scores achieve strongest single-rating correlations, while probes work well when output logits are miscalibrated.

Conclusion: Latent information provides deterministic and more discriminative signals for reference-free evaluation, improving selection and training approaches like Best-of-N, multi-teacher distillation, and routing.

Abstract: How reliable are single-response LLM-as-a-judge ratings without references, and can we obtain fine-grained, deterministic scores in this setting? We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses and show two systematic issues: scores are unstable under sampling and poorly calibrated, leading to compression near the top of the scale and frequent ties. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals: (i) probability-weighted scores over integer ratings, (ii) verifier-style probabilities of “yes”, and (iii) linear probes trained on model activations at the rating position. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting, with consistent gains on pairwise accuracy and listwise ranking relevant to Best-of-N selection. Probability-weighted scores achieve the strongest single-rating correlations, while probes recover useful signals when output logits are miscalibrated. These results indicate that latent information provides deterministic and more discriminative signals for reference-free evaluation, and can improve selection and training approaches like Best-of-$N$, multi-teacher distillation, and routing.

[159] MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

Guibin Zhang, Muxin Fu, Shuicheng Yan

Main category: cs.CL

TL;DR: MemGen is a dynamic generative memory framework that enables LLM-powered agents to interweave memory and reasoning through latent token sequences, outperforming existing memory systems and spontaneously developing human-like memory faculties.

Details

Motivation: Existing memory paradigms for LLM agents are constrained - parametric memory forcibly adjusts model parameters while retrieval-based memory externalizes experience, neither capturing the fluid integration of reasoning and memory seen in human cognition.

Method: MemGen consists of a memory trigger that monitors reasoning state to decide explicit memory invocation, and a memory weaver that constructs latent token sequences as machine-native memory to enrich reasoning, creating an interwoven cycle of memory and cognition.

Result: MemGen surpasses leading external memory systems (ExpeL, AWM) by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization. It spontaneously evolves distinct human-like memory faculties including planning, procedural, and working memory.

Conclusion: MemGen enables more naturalistic machine cognition by allowing agents to recall and augment latent memory throughout reasoning, suggesting an emergent trajectory toward human-like cognitive faculties without explicit supervision.

Abstract: Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a \textit{memory trigger}, which monitors the agent’s reasoning state to decide explicit memory invocation, and a \textit{memory weaver}, which takes the agent’s current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to $38.22%$, exceeds GRPO by up to $13.44%$, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

[160] Socratic-Zero : Bootstrapping Reasoning via Data-Free Agent Co-evolution

Shaobo Wang, Zhengbo Jiao, Zifan Zhang, Yilang Peng, Xu Ze, Boyu Yang, Wei Wang, Hu Wei, Linfeng Zhang

Main category: cs.CL

TL;DR: Socratic-Zero is an autonomous framework that generates high-quality training data from minimal seed examples through co-evolution of three agents, achieving superior performance on mathematical reasoning benchmarks without requiring pre-existing tasks or labels.

Details

Motivation: Existing data synthesis methods struggle with inconsistent data quality and inability to dynamically adapt to evolving model capabilities, creating suboptimal training signals. There's a need for scalable, high-quality data generation without heavy human annotation.

Method: Uses three co-evolving agents: Solver (refines reasoning from preference feedback), Teacher (adaptively crafts challenging questions based on Solver’s weaknesses), and Generator (distills Teacher’s strategy for scalable curriculum generation). Forms a closed-loop system requiring only 100 seed questions.

Result: Socratic-Solver-8B achieves +20.2 percentage point gain over prior methods across 7 mathematical reasoning benchmarks. Synthetic data from Socratic-Generator-32B enables student LLMs to outperform SOTA commercial LLMs including Qwen3-235B, DeepSeek-V3.1-671B, GPT-5, Gemini-2.5-Pro, Grok-4, and Claude-4.1-Opus.

Conclusion: Socratic-Zero demonstrates that fully autonomous data generation through agent co-evolution can produce high-quality training curricula that significantly outperform existing methods and even surpass commercial SOTA models, enabling scalable reasoning capabilities without human annotation.

Abstract: Recent breakthroughs in large language models (LLMs) on reasoning tasks rely heavily on massive, high-quality datasets-typically human-annotated and thus difficult to scale. While data synthesis or distillation offers a promising alternative, existing methods struggle with inconsistent data quality and an inability to dynamically adapt to the evolving capabilities of the model, leading to suboptimal training signals. To address these limitations, we introduce Socratic-Zero, a fully autonomous framework that generates high-quality training data from minimal seed examples through the co-evolution of three agents: the Teacher, the Solver, and the Generator. The Solver continuously refines its reasoning by learning from preference feedback on both successful and failed trajectories; the Teacher adaptively crafts increasingly challenging questions based on the Solver’s weaknesses; and the Generator distills the Teacher’s question-design strategy to enable scalable, high-fidelity curriculum generation. This closed-loop system produces a self-improving curriculum-requiring no pre-existing tasks or labels. Remarkably, starting from only 100 seed questions, our Socratic-Solver-8B achieves an average gain of +20.2 percentage points over prior data synthesis methods across seven mathematical reasoning benchmarks (AMC23, AIME24-25, Olympiad, MATH-500, Minerva, and GSM8K), with consistent gains on both Qwen3 and GLM4 series models. Even more surprisingly, synthetic data from Socratic-Generator-32B enables student LLMs to achieve superior performance compared to other state-of-the-art (SOTA) commercial LLMs on these benchmarks, including Qwen3-235B-A22B, DeepSeek-V3.1-671B, GPT-5, Gemini-2.5-Pro, Grok-4, and Claude-4.1-Opus.

[161] ProxyAttn: Guided Sparse Attention via Representative Heads

Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: ProxyAttn is a training-free sparse attention algorithm that compresses attention head dimensions to achieve more precise block importance estimation, enabling up to 10.3x attention acceleration and 2.4x pre-filling acceleration in LLMs without significant performance loss.

Details

Motivation: The quadratic complexity of attention mechanisms limits LLM efficiency on long-text tasks. Existing methods use coarse-grained block importance estimation that leads to performance degradation at high sparsity rates.

Method: Uses pooled representative heads to approximate scores for all heads based on observed similarity among attention heads. Proposes block-aware dynamic budget estimation to account for varying sparsity among heads. Combines proxy head scores with multi-head dynamic budgets for fine-grained block importance evaluation.

Result: Achieves up to 10.3x attention acceleration and 2.4x pre-filling acceleration without significant performance loss. Experiments confirm underlying similarity among attention heads and show substantial gains in performance and efficiency compared to existing methods.

Conclusion: ProxyAttn enables efficient fine-grained block importance estimation at low computational cost, providing significant acceleration for long-text tasks in LLMs while maintaining performance.

Abstract: The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.

[162] LatentEvolve: Self-Evolving Test-Time Scaling in Latent Space

Guibin Zhang, Fanci Meng, Guancheng Wan, Zherui Li, Kun Wang, Zhenfei Yin, Lei Bai, Shuicheng Yan

Main category: cs.CL

TL;DR: LatentEvolve is a self-evolving test-time scaling framework that enables LLMs to progressively learn how to scale computation more effectively, inspired by human cognitive systems.

Details

Motivation: Existing test-time scaling methods are independent and don't allow LLMs to progressively learn better scaling strategies. The goal is to evolve LLMs to learn "how to scale test-time computation" more effectively.

Method: LatentEvolve uses a dual evolutionary system: daytime scaling (fast recall of historical latent representations) and nighttime scaling (integration of past latent optimizations), mimicking human brain’s hippocampus-neocortex system for unsupervised learning.

Result: LatentEvolve outperforms state-of-the-art TTS methods like LatentSeek and TTRL by up to 13.33% across eight benchmarks and five model backbones, showing exceptional cross-domain and cross-backbone generalization.

Conclusion: The framework successfully enables LLMs to self-evolve their test-time scaling capabilities through unsupervised learning, achieving significant performance improvements while maintaining generalization across different domains and model architectures.

Abstract: Test-time Scaling (TTS) has been demonstrated to significantly enhance the reasoning capabilities of Large Language Models (LLMs) during the inference phase without altering model parameters. However, existing TTS methods are largely independent, implying that LLMs have not yet evolved to progressively learn how to scale more effectively. With the objective of evolving LLMs to learn ``how to scale test-time computation,’’ we propose LatentEvolve, a self-evolving latent TTS framework inspired by the complementary learning system (CLS) theory. Analogous to the human brain’s dual system of a fast-recall hippocampus and a slow-consolidating neocortex, LatentEvolve comprises two evolutionary components: \textit{daytime scaling}, which rapidly retrieves historical latent representations to better guide current LLM reasoning; and \textit{nighttime scaling}, which integrates past latent optimizations in a manner akin to the human brain’s consolidation of experiences during sleep. The alternation of daytime and nighttime processes facilitates a fast and slow evolution of LLM TTS, mirroring human cognitive dynamics in a fully unsupervised manner. Extensive experiments across eight benchmarks and five model backbones demonstrate that our LatentEvolve surpasses state-of-the-art TTS methods such as LatentSeek and TTRL by up to $13.33%$ and exhibits exceptional cross-domain and cross-backbone generalization.

[163] SeaPO: Strategic Error Amplification for Robust Preference Optimization of Large Language Models

Jun Rao, Yunjie Liao, Xuebo Liu, Zepeng Lin, Lian Lian, Dong Jin, Shengjun Cheng, Jun Yu, Min Zhang

Main category: cs.CL

TL;DR: SeaPO is a Strategic Error Amplification method that introduces specific error patterns into LLM preference optimization to ensure negative samples are more erroneous than positive samples, enhancing model performance through preference-based training.

Details

Motivation: Existing alignment methods struggle when positive and negative samples become similar in quality during training due to model capacity limitations, complicating preference optimization.

Method: Leverages three common LLM error types to introduce specific error patterns into preference optimization, ensuring negative samples are more erroneous than positive samples and using preference-based training to mitigate these errors.

Result: Evaluations across five capability dimensions and model scales (1.5B to 14B) show significant performance improvements, particularly in truthfulness (5-10 percentage points). Task performance varies with error types - common errors improve related tasks, while mixed errors provide broader enhancement.

Conclusion: Strategic error amplification through SeaPO effectively addresses the similarity problem in preference optimization, leading to substantial performance improvements across multiple model scales and capability dimensions.

Abstract: Existing alignment methods for preference optimization of large language models (LLMs) aim to enhance model performance by utilizing pairs of positive and negative samples. However, due to the limited capacity of models in scoring or generating responses, the quality of positive and negative samples may become similar during training, which complicates optimization for preference learning. To address this issue, we introduce SeaPO, a Strategic Error Amplification method that leverages three error types commonly occurring in LLMs to introduce specific error patterns into the model Preference Optimization. This strategy ensures that negative samples are more erroneous than positive samples and preference-based training is employed to mitigate the occurrence of these errors, thereby enhancing model performance. Evaluations across five capability dimensions and different model scales (1.5B to 14B) demonstrate that the generated data significantly improved overall model performance, particularly in terms of truthfulness, with improvements of 5-10 percentage points observed. Further analysis reveals that task performance varies depending on the error types introduced. Injecting the most common error types improves performance in related tasks, while a mix of error types leads to a broader performance enhancement: most tasks show stable improvements, while a few tasks exhibit significant gains.

[164] Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions

Luisa Geiger, Mareike Hartmann, Michael Sullivan, Alexander Koller

Main category: cs.CL

TL;DR: Proposes a tree-based evaluation metric for LLM-generated assembly instructions that better captures spatiotemporal aspects than traditional text similarity metrics like BLEU and BERT.

Details

Motivation: Traditional metrics like BLEU and BERT similarity scores fail to adequately evaluate the spatiotemporal soundness of step-by-step assembly instructions, particularly in domains like sewing where construction sequence matters.

Method: Developed a novel automatic tree-based evaluation metric specifically designed for LLM-generated step-by-step assembly instructions, focusing on spatiotemporal aspects of construction.

Result: The proposed metric shows better correlation with manually-annotated error counts and human quality ratings for sewing instructions, and demonstrates superior robustness against artificially-constructed counterfactual examples designed to confound text similarity metrics.

Conclusion: The tree-based metric is superior to traditional text similarity metrics for evaluating the spatiotemporal soundness of assembly instructions, particularly in domains requiring precise construction sequences.

Abstract: In this paper, we propose a novel, automatic tree-based evaluation metric for LLM-generated step-by-step assembly instructions, that more accurately reflects spatiotemporal aspects of construction than traditional metrics such as BLEU and BERT similarity scores. We apply our proposed metric to the domain of sewing instructions, and show that our metric better correlates with manually-annotated error counts as well as human quality ratings, demonstrating our metric’s superiority for evaluating the spatiotemporal soundness of sewing instructions. Further experiments show that our metric is more robust than traditional approaches against artificially-constructed counterfactual examples that are specifically constructed to confound metrics that rely on textual similarity.

[165] KnowGuard: Knowledge-Driven Abstention for Multi-Round Clinical Reasoning

Xilin Dang, Kexin Chen, Xiaorui Su, Ayush Noori, Iñaki Arango, Lucas Vittor, Xinyi Long, Yuyang Du, Marinka Zitnik, Pheng Ann Heng

Main category: cs.CL

TL;DR: KnowGuard is a novel abstention framework for LLMs in medical diagnosis that uses systematic knowledge graph exploration to identify when information is insufficient, preventing overconfident misdiagnoses.

Details

Motivation: Current LLMs struggle with abstention in medical scenarios, providing overconfident responses despite incomplete information, which can lead to harmful misdiagnoses. Existing methods rely solely on model self-assessments without systematic external evidence verification.

Method: A two-stage “investigate-before-abstain” paradigm: 1) Evidence discovery through systematic knowledge graph exploration (graph expansion and direct retrieval), and 2) Evidence evaluation that ranks evidence using multiple factors to adapt exploration based on patient context and conversation history.

Result: KnowGuard outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93% while reducing unnecessary interaction by 7.27 turns on average in open-ended multi-round clinical benchmarks.

Conclusion: The systematic knowledge graph exploration approach enables LLMs to better recognize insufficient medical evidence and make appropriate abstentions, improving both diagnostic accuracy and efficiency in clinical decision-making.

Abstract: In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with the abstentions, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidences. To address this, we propose \textbf{KnowGuard}, a novel \textit{investigate-before-abstain} paradigm that integrates systematic knowledge graph exploration for clinical decision-making. Our approach consists of two key stages operating on a shared contextualized evidence pool: 1) an evidence discovery stage that systematically explores the medical knowledge space through graph expansion and direct retrieval, and 2) an evidence evaluation stage that ranks evidence using multiple factors to adapt exploration based on patient context and conversation history. This two-stage approach enables systematic knowledge graph exploration, allowing models to trace structured reasoning paths and recognize insufficient medical evidence. We evaluate our abstention approach using open-ended multi-round clinical benchmarks that mimic realistic diagnostic scenarios, assessing abstention quality through accuracy-efficiency trade-offs beyond existing closed-form evaluations. Experimental evidences clearly demonstrate that KnowGuard outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93% while reducing unnecessary interaction by 7.27 turns on average.

[166] DiaCDM: Cognitive Diagnosis in Teacher-Student Dialogues using the Initiation-Response-Evaluation Framework

Rui Jia, Yuang Wei, Ruijia Li, Yuang-Hao Jiang, Xinyu Xie, Yaomin Shen, Min Zhang, Bo Jiang

Main category: cs.CL

TL;DR: DiaCDM is a novel cognitive diagnosis model that adapts the IRE framework and graph-based encoding to assess students’ knowledge mastery from teacher-student dialogues, overcoming challenges of dynamic unstructured data.

Details

Motivation: Traditional cognitive diagnosis models are ineffective for real-world teacher-student dialogues due to lack of suitable frameworks for dynamic unstructured data and difficulty extracting diagnostic semantics from lengthy conversations.

Method: Adapted the initiation-response-evaluation (IRE) framework from educational theory and developed a unique graph-based encoding method that integrates teacher questions with relevant knowledge components to capture key information.

Result: Experiments on three real-world dialogue datasets show DiaCDM significantly improves diagnostic accuracy and enhances result interpretability compared to traditional methods.

Conclusion: DiaCDM provides teachers with a powerful tool for assessing students’ cognitive states in dialogue settings, representing the first exploration of cognitive diagnosis in dialogue contexts.

Abstract: While cognitive diagnosis (CD) effectively assesses students’ knowledge mastery from structured test data, applying it to real-world teacher-student dialogues presents two fundamental challenges. Traditional CD models lack a suitable framework for handling dynamic, unstructured dialogues, and it’s difficult to accurately extract diagnostic semantics from lengthy dialogues. To overcome these hurdles, we propose DiaCDM, an innovative model. We’ve adapted the initiation-response-evaluation (IRE) framework from educational theory to design a diagnostic framework tailored for dialogue. We also developed a unique graph-based encoding method that integrates teacher questions with relevant knowledge components to capture key information more precisely. To our knowledge, this is the first exploration of cognitive diagnosis in a dialogue setting. Experiments on three real-world dialogue datasets confirm that DiaCDM not only significantly improves diagnostic accuracy but also enhances the results’ interpretability, providing teachers with a powerful tool for assessing students’ cognitive states. The code is available at https://github.com/Mind-Lab-ECNU/DiaCDM/tree/main.

Xinye Zhao, Spyridon Mastorakis

Main category: cs.CL

TL;DR: SemShareKV is a KV cache sharing framework that accelerates LLM inference by reusing key-value caches from semantically similar prompts using fuzzy token matching with LSH and RoPE, achieving up to 6.25x speedup with minimal quality loss.

Details

Motivation: The memory footprint of KV caches during LLM inference has become a bottleneck, and existing compression methods are limited for semantically similar but lexically different prompts common in tasks like multi-document summarization and conversational agents.

Method: Uses fuzzy token matching with locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to preserve positional information, selectively reusing relevant key-value pairs from reference prompts’ caches.

Result: Achieves up to 6.25x speedup and 42% lower GPU memory usage with 5k tokens input on summarization datasets, with negligible quality degradation.

Conclusion: Semantic-aware cache sharing shows significant potential for efficient LLM inference by reducing redundant computation while maintaining output quality.

Abstract: As large language models (LLMs) continue to scale, the memory footprint of key-value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or reusing shared prefixes or frequently ocurred text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which frequently occurs in tasks such as multi-document summarization and conversational agents. We propose \textit{SemShareKV}, a KV cache sharing and compression framework that accelerates LLM inference by reusing KVCache in semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt’s cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25$\times$ speedup and 42% lower GPU memory usage with 5k tokens input, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.

[168] Hierarchical Error Correction for Large Language Models: A Systematic Framework for Domain-Specific AI Quality Enhancement

Zhilong Zhao, Yindi Liu

Main category: cs.CL

TL;DR: The paper proposes a Hierarchical Error Correction (HEC) framework that systematically addresses domain-specific AI limitations through error analysis and targeted interventions, achieving average improvements of 11.2 percentage points across specialized domains.

Details

Motivation: Large Language Models face significant performance challenges in specialized domains, with state-of-the-art models achieving only 45.9% accuracy on medical coding tasks, highlighting the need for domain-specific enhancement strategies.

Method: Developed a three-stage Hierarchical Error Correction framework based on analysis of error patterns across four specialized domains, addressing Knowledge-layer (58.4%), Reasoning-layer (39.6%), and Complexity-layer (2.0%) errors according to hierarchical importance.

Result: Experimental validation across medical transcription (4,921 cases), legal document classification (1,000 cases), political bias detection (645 cases), and legal reasoning (1,000 cases) shows consistent improvements with average 11.2 percentage point gains (p < 0.001) across five LLM architectures.

Conclusion: Systematic error analysis can guide effective AI enhancement strategies in specialized domains, particularly for moderate-baseline tasks, while framework limitations exist in high-baseline tasks (>75% accuracy) where hierarchical intervention may interfere with reasoning processes.

Abstract: Large Language Models face significant performance challenges in specialized domains, with state-of-the-art models achieving only 45.9% accuracy on medical coding tasks. This study proposes a Hierarchical Error Correction (HEC) framework that addresses domain-specific AI limitations through systematic error analysis and targeted intervention strategies. We analyze error patterns across four specialized domains and find that AI errors follow consistent hierarchical structures: Knowledge-layer errors (58.4%), Reasoning-layer errors (39.6%), and Complexity-layer errors (2.0%). Based on these patterns, we develop a three-stage correction framework that addresses errors according to their hierarchical importance and demonstrates that framework effectiveness correlates inversely with baseline task performance. Experimental validation across medical transcription (4,921 cases), legal document classification (1,000 cases), political bias detection (645 cases), and legal reasoning (1,000 cases) shows consistent improvements. Cross-model validation across five LLM architectures demonstrates average improvements of 11.2 percentage points (p < 0.001). However, analysis reveals framework limitations in high-baseline tasks (>75% accuracy), where hierarchical intervention may interfere with effective reasoning processes. The results suggest that systematic error analysis can guide effective AI enhancement strategies in specialized domains, particularly for moderate-baseline tasks, while highlighting the importance of understanding framework boundaries for optimal deployment.

[169] Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, Nuria Oliver

Main category: cs.CL

TL;DR: This paper evaluates LLMs’ ability to handle mental health crises, finding they’re generally reliable for explicit disclosures but have significant risks including inappropriate responses, poor handling of indirect signals, and context misalignment.

Details

Motivation: LLMs are increasingly used in high-stakes mental health contexts, but their safety in detecting and responding to acute mental health crises remains poorly understood due to lack of unified taxonomies, benchmarks, and clinical evaluations.

Method: Introduced a unified taxonomy of six clinically-informed mental health crisis categories, curated a diverse evaluation dataset, established expert-designed assessment protocols, and systematically benchmarked three state-of-the-art LLMs for crisis classification and response generation.

Result: LLMs are highly consistent and generally reliable for explicit crisis disclosures, but show significant risks: non-negligible proportion of inappropriate/harmful responses, higher failure rates in open-weight models, systemic weaknesses in handling indirect risk signals, formulaic replies, and frequent context misalignment.

Conclusion: Urgent need for enhanced safeguards, improved crisis detection, and context-aware interventions in LLM deployments. The taxonomy, datasets, and evaluation framework provide groundwork for responsible AI-driven mental health support innovation.

Abstract: The widespread use of chatbots powered by large language models (LLMs) such as ChatGPT and Llama has fundamentally reshaped how people seek information and advice across domains. Increasingly, these chatbots are being used in high-stakes contexts, including emotional support and mental health concerns. While LLMs can offer scalable support, their ability to safely detect and respond to acute mental health crises remains poorly understood. Progress is hampered by the absence of unified crisis taxonomies, robust annotated benchmarks, and empirical evaluations grounded in clinical best practices. In this work, we address these gaps by introducing a unified taxonomy of six clinically-informed mental health crisis categories, curating a diverse evaluation dataset, and establishing an expert-designed protocol for assessing response appropriateness. We systematically benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses. The results reveal that while LLMs are highly consistent and generally reliable in addressing explicit crisis disclosures, significant risks remain. A non-negligible proportion of responses are rated as inappropriate or harmful, with responses generated by an open-weight model exhibiting higher failure rates than those generated by the commercial ones. We also identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context. These findings underscore the urgent need for enhanced safeguards, improved crisis detection, and context-aware interventions in LLM deployments. Our taxonomy, datasets, and evaluation framework lay the groundwork for ongoing research and responsible innovation in AI-driven mental health support, helping to minimize harm and better protect vulnerable users.

[170] Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning

Matteo Fuoli, Weihang Huang, Jeannette Littlemore, Sarah Turner, Ellen Wilding

Main category: cs.CL

TL;DR: LLMs can automate metaphor identification in texts with high accuracy, achieving median F1 score of 0.79 through fine-tuning, and can serve as testbeds for refining metaphor theory.

Details

Motivation: Manual annotation constrains large-scale metaphor analysis due to its context-sensitive nature, creating need for automated approaches using modern language models.

Method: Compared three approaches: retrieval-augmented generation (RAG) with codebook, prompt engineering (zero-shot, few-shot, chain-of-thought), and fine-tuning on hand-coded texts.

Result: State-of-the-art closed-source LLMs achieved high accuracy, with fine-tuning yielding median F1 score of 0.79. Discrepancies between human and LLM outputs were systematic, reflecting known theoretical grey areas.

Conclusion: LLMs can partly automate metaphor identification and serve as testbeds for developing and refining metaphor identification protocols and underlying theory.

Abstract: Metaphor is a pervasive feature of discourse and a powerful lens for examining cognition, emotion, and ideology. Large-scale analysis, however, has been constrained by the need for manual annotation due to the context-sensitive nature of metaphor. This study investigates the potential of large language models (LLMs) to automate metaphor identification in full texts. We compare three methods: (i) retrieval-augmented generation (RAG), where the model is provided with a codebook and instructed to annotate texts based on its rules and examples; (ii) prompt engineering, where we design task-specific verbal instructions; and (iii) fine-tuning, where the model is trained on hand-coded texts to optimize performance. Within prompt engineering, we test zero-shot, few-shot, and chain-of-thought strategies. Our results show that state-of-the-art closed-source LLMs can achieve high accuracy, with fine-tuning yielding a median F1 score of 0.79. A comparison of human and LLM outputs reveals that most discrepancies are systematic, reflecting well-known grey areas and conceptual challenges in metaphor theory. We propose that LLMs can be used to at least partly automate metaphor identification and can serve as a testbed for developing and refining metaphor identification protocols and the theory that underpins them.

[171] Expanding Computation Spaces of LLMs at Inference Time

Yoonna Jang, Kisu Yang, Isabelle Augenstein

Main category: cs.CL

TL;DR: Language models can use artificially inserted filler tokens at inference to expand computational space, improving performance especially for smaller models by up to 12.372 percentage points.

Details

Motivation: To investigate whether language models can leverage artificially inserted filler tokens solely at inference as additional computation space, building on prior work that trained such tokens.

Method: Identify effective token types, numbers, and insertion locations; examine when models learn to use expanded computation space during training; analyze attention dynamics in these spaces via attention maps.

Result: Appropriate token types and counts vary, but placing filler tokens directly before the final ‘Answer:’ token is most effective. Smaller models benefit most (up to 12.372% improvement), and attention maps show expanded spaces continue original attention patterns and focus on questions/answers.

Conclusion: Filler tokens act as additional computational capacity rather than redundant input, with meaningful computation occurring in these expanded spaces for problem-solving.

Abstract: Chain-of-thought (CoT) rationale enables language models to use additional task-related text for problem-solving, benefiting not only from detailed reasoning steps but also from the expanded computational space of longer inputs. Prior work has trained filler or special tokens to serve as additional computation spaces. In this study, we investigate whether language models can leverage artificially inserted sequences of filler tokens solely at inference. We first identify effective token types, numbers, and insertion locations, then examine at what stage of training models begin to exploit the expanded computation space, and finally analyze dynamics within these spaces via attention maps. Experiments on models ranging from 1.7B to 32B across open-domain QA and math tasks show that appropriate token types and counts vary, but placing filler tokens directly before the final ‘Answer:’ token is most effective. Smaller models benefit most, up to 12.372 percentage points in SmolLM2-1.7B-Instruct, indicating that these spaces act as additional computational capacity rather than redundant input. Attention maps reveal that expanded spaces often continue the original attention mechanism and sometimes focus on questions or answer options, suggesting meaningful computation for problem-solving.

[172] BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications

Andrés Fernández García, Javier de la Rosa, Julio Gonzalo, Roser Morante, Enrique Amigó, Alejandro Benito-Santos, Jorge Carrillo-de-Albornoz, Víctor Fresno, Adrian Ghajari, Guillermo Marco, Laura Plaza, Eva Sánchez Salido

Main category: cs.CL

TL;DR: BOE-XSUM is a new Spanish legal document summarization dataset with 3,648 entries from Spain’s Official Gazette, showing that fine-tuned medium-sized LLMs significantly outperform zero-shot general models.

Details

Motivation: Address the lack of concise Spanish document summaries, especially in the legal domain, due to information overload and limited available resources.

Method: Created BOE-XSUM dataset with short summaries, original texts, and document types. Fine-tuned medium-sized LLMs on this dataset and compared them to zero-shot general-purpose models.

Result: Fine-tuned models significantly outperformed zero-shot models. BERTIN GPT-J 6B achieved 41.6% accuracy vs 33.5% for DeepSeek-R1, representing a 24% performance gain.

Conclusion: Specialized fine-tuning on domain-specific datasets like BOE-XSUM substantially improves Spanish legal document summarization compared to using general models in zero-shot settings.

Abstract: The ability to summarize long documents succinctly is increasingly important in daily life due to information overload, yet there is a notable lack of such summaries for Spanish documents in general, and in the legal domain in particular. In this work, we present BOE-XSUM, a curated dataset comprising 3,648 concise, plain-language summaries of documents sourced from Spain’s ``Bolet'{\i}n Oficial del Estado’’ (BOE), the State Official Gazette. Each entry in the dataset includes a short summary, the original text, and its document type label. We evaluate the performance of medium-sized large language models (LLMs) fine-tuned on BOE-XSUM, comparing them to general-purpose generative models in a zero-shot setting. Results show that fine-tuned models significantly outperform their non-specialized counterparts. Notably, the best-performing model – BERTIN GPT-J 6B (32-bit precision) – achieves a 24% performance gain over the top zero-shot model, DeepSeek-R1 (accuracies of 41.6% vs.\ 33.5%).

[173] How Well Do LLMs Imitate Human Writing Style?

Rebira Jemama, Rajesh Kumar

Main category: cs.CL

TL;DR: A training-free framework for authorship verification and style imitation analysis that combines TF-IDF character n-grams with transformer embeddings, achieving high accuracy while significantly reducing computational requirements.

Details

Motivation: To understand whether large language models can effectively replicate specific human author styles and develop a fast, training-free method for authorship verification and style imitation analysis.

Method: Integrates TF-IDF character n-grams with transformer embeddings and classifies text pairs through empirical distance distributions, eliminating supervised training or threshold tuning. Evaluates LLMs across different prompting strategies.

Result: Achieves 97.5% accuracy on academic essays and 94.5% in cross-domain evaluation, with 91.8% reduction in training time and 59% memory usage reduction. Few-shot prompting yields 23.5x higher style-matching than zero-shot, and completion prompting reaches 99.9% agreement with original author style.

Conclusion: Prompting strategy has more influence on style fidelity than model size. High-fidelity imitation doesn’t imply human-like unpredictability - stylistic fidelity and statistical detectability are separable, providing basis for future authorship modeling and detection work.

Abstract: Large language models (LLMs) can generate fluent text, but their ability to replicate the distinctive style of a specific human author remains unclear. We present a fast, training-free framework for authorship verification and style imitation analysis. The method integrates TF-IDF character n-grams with transformer embeddings and classifies text pairs through empirical distance distributions, eliminating the need for supervised training or threshold tuning. It achieves 97.5% accuracy on academic essays and 94.5% in cross-domain evaluation, while reducing training time by 91.8% and memory usage by 59% relative to parameter-based baselines. Using this framework, we evaluate five LLMs from three separate families (Llama, Qwen, Mixtral) across four prompting strategies - zero-shot, one-shot, few-shot, and text completion. Results show that the prompting strategy has a more substantial influence on style fidelity than model size: few-shot prompting yields up to 23.5x higher style-matching accuracy than zero-shot, and completion prompting reaches 99.9% agreement with the original author’s style. Crucially, high-fidelity imitation does not imply human-like unpredictability - human essays average a perplexity of 29.5, whereas matched LLM outputs average only 15.2. These findings demonstrate that stylistic fidelity and statistical detectability are separable, establishing a reproducible basis for future work in authorship modeling, detection, and identity-conditioned generation.

[174] MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Rick Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, Vikas Chandra

Main category: cs.CL

TL;DR: Strong reasoning capabilities can emerge in sub-billion-parameter models with only ~2T tokens of high-quality curated data, challenging the assumption that massive datasets (>10T tokens) are necessary for reasoning emergence.

Details

Motivation: To challenge the prevailing assumption that reasoning capabilities in LLMs require training on massive datasets (>10T tokens), and demonstrate that careful data curation can enable reasoning emergence with far less data.

Method: Curated and resampled open-source datasets using designed metrics to identify beneficial data, then pre-trained models on ~2T tokens of high-quality data followed by established post-training procedures.

Result: MobileLLM-R1-950M achieves AIME score of 15.5, substantially outperforming comparable models (OLMo-2-1.48B: 0.6, SmolLM-2-1.7B: 0.3) and matching/surpassing Qwen3-0.6B despite using only 11.7% of Qwen3’s training tokens.

Conclusion: High-quality data curation is more crucial than massive data scaling for reasoning emergence, enabling efficient development of capable sub-billion-parameter reasoning models with significantly reduced computational requirements.

Abstract: The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by a established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3’s proprietary 36T-token corpus for pretraining, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have released the complete training recipe, data sources, data mixing ratio, and model checkpoints, together with the key insights obtained throughout this study.

[175] The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents’ Inquiry Capability

Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu

Main category: cs.CL

TL;DR: MAQuE is a comprehensive benchmark for evaluating medical AI doctors’ questioning abilities, featuring 3,000 realistic patient agents and a multi-faceted evaluation framework that reveals significant challenges in current LLMs’ diagnostic inquiry capabilities.

Details

Motivation: Current AI doctors focus mainly on diagnostic skills but overlook other essential physician qualities like empathy, patience, and clear communication. There's a need for comprehensive evaluation of medical questioning abilities beyond just diagnostic accuracy.

Method: Created MAQuE benchmark with 3,000 realistically simulated patient agents exhibiting diverse linguistic patterns, cognitive limitations, emotional responses, and passive disclosure tendencies. Introduced multi-faceted evaluation framework covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience.

Result: Experiments show substantial challenges across all evaluation aspects. State-of-the-art models have significant room for improvement in inquiry capabilities, are highly sensitive to realistic patient behavior variations, and show trade-offs between different evaluation perspectives.

Conclusion: Current medical AI models struggle to balance performance and practicality in real-world clinical settings, highlighting the need for more comprehensive evaluation and improvement in medical questioning capabilities beyond just diagnostic accuracy.

Abstract: An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE(Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.

[176] SemanticShield: LLM-Powered Audits Expose Shilling Attacks in Recommender Systems

Kaihong Li, Huichi Zhou, Bin Ma, Fangjun Huang

Main category: cs.CL

TL;DR: A two-stage framework called SemanticShield that uses LLMs to detect shilling attacks in recommender systems by analyzing item-side semantic features alongside user behaviors.

Details

Motivation: Existing recommender system defenses focus mainly on user behaviors while ignoring item-side features like titles and descriptions that can reveal malicious intent in shilling attacks.

Method: Two-stage detection: first stage pre-screens suspicious users using behavioral criteria, second stage uses LLM-based auditing to evaluate semantic consistency. Enhanced through reinforcement fine-tuning with specialized reward functions.

Result: Experiments on six attack strategies show SemanticShield effectively detects shilling attacks, and evaluation on unseen attack methods demonstrates strong generalization capability.

Conclusion: SemanticShield provides an effective defense against shilling attacks by leveraging item-side semantics through LLMs, offering improved detection and generalization over traditional methods.

Abstract: Recommender systems (RS) are widely used in e-commerce for personalized suggestions, yet their openness makes them susceptible to shilling attacks, where adversaries inject fake behaviors to manipulate recommendations. Most existing defenses emphasize user-side behaviors while overlooking item-side features such as titles and descriptions that can expose malicious intent. To address this gap, we propose a two-stage detection framework that integrates item-side semantics via large language models (LLMs). The first stage pre-screens suspicious users using low-cost behavioral criteria, and the second stage employs LLM-based auditing to evaluate semantic consistency. Furthermore, we enhance the auditing model through reinforcement fine-tuning on a lightweight LLM with carefully designed reward functions, yielding a specialized detector called SemanticShield. Experiments on six representative attack strategies demonstrate the effectiveness of SemanticShield against shilling attacks, and further evaluation on previously unseen attack methods shows its strong generalization capability. Code is available at https://github.com/FrankenstLee/SemanticShield.

[177] Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns

Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

Main category: cs.CL

TL;DR: LLM confidence estimation is better achieved through systematic encoding of correctness history across models rather than relying on self-introspection, with answer phrasing being a strong predictor for correctness.

Details

Motivation: Accurate confidence estimation is critical for deploying LLMs in high-stakes applications, but current approaches based on self-knowledge perform poorly. The authors hypothesize that exposure to historical correctness data is key.

Method: Proposed Generalized Correctness Models (GCMs) trained on correctness data from multiple LLMs, using historical predictions and answer phrasing patterns. Also explored in-context examples and post-hoc calibration methods.

Result: GCMs trained on Qwen3-8B performed well across 5 model families and datasets (MMLU, TriviaQA), showing that correctness prediction is a generalizable skill learned from historical data rather than model-specific introspection.

Conclusion: Reliable LLM confidence estimation is a model-agnostic skill learned through systematic encoding of correctness history, not dependent on self-introspection.

Abstract: Generating accurate and calibrated confidence estimates is critical for deploying LLMs in high-stakes or user-facing applications, and remains an open challenge. Prior research has often framed confidence as a problem of eliciting a model’s “self-knowledge”, i.e., the ability of an LLM to judge whether its own answers are correct; this approach implicitly assumes that there is some privileged information about the answer’s correctness that is accessible to the model itself. However, our experiments reveal that an LLM attempting to predict the correctness of its own outputs generally performs no better than an unrelated LLM. Moreover, we hypothesize that a key factor in building a “Correctness Model” (CM) is exposure to a target model’s historical predictions. We propose multiple methods to inject this historical correctness information, creating a Generalized Correctness Model (GCM). We first show that GCMs can be trained on the correctness data from many LLMs and learn patterns for correctness prediction applicable across datasets and models. We then use CMs as a lens for studying the source of correctness prediction ability and its generalization, systematically controlling their training data and finding that answer phrasing is a strong predictor for correctness. We further explore alternative methods of injecting history without training an LLM, finding that including history as in-context examples can help improve correctness prediction, and post-hoc calibration can provide complementary reductions in calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families and the MMLU and TriviaQA datasets, as well as on a downstream selective prediction task, finding that reliable LLM confidence estimation is a generalizable and model-agnostic skill learned by systematically encoding correctness history rather than a model-specific skill reliant on self-introspection.

[178] Circuit Distillation

Somin Wadhwa, Silvio Amir, Byron C. Wallace

Main category: cs.CL

TL;DR: Circuit distillation aligns internal representations between teacher and student models to transfer computational mechanisms, outperforming standard behavioral mimicry distillation.

Details

Motivation: Standard model distillation treats teacher's internal computations as a black box, focusing only on output mimicry. This work aims to distill the underlying computational mechanisms implemented by the teacher model.

Method: Proposes circuit distillation with an objective to align internal representations between functionally correspondent circuit components in teacher and student models. Introduces a method to match these components and a loss function reflecting representation similarities.

Result: Circuit distillation outperforms standard distillation on entity tracking and theory of mind tasks using Llama3 models. Successfully transfers algorithmic capabilities by adjusting only a small, targeted subset of student parameters.

Conclusion: Establishes feasibility of transferring mechanisms, enabling efficient distillation of targeted teacher capabilities through interpretable and controllable internal student mechanisms.

Abstract: Model distillation typically focuses on behavioral mimicry, where a student model is trained to replicate a teacher’s output while treating its internal computations as a black box. In this work we propose an alternative approach: Distilling the underlying computational mechanisms implemented by a teacher model. Specifically, we propose circuit distillation, which introduces an objective to align internal representations between analogous circuit components in teacher and student models. We propose a method to match ``functionally correspondent’’ circuit components and introduce a loss reflecting similarities between the representations that these induce. We evaluate circuit distillation on entity tracking and theory of mind (ToM) tasks using models from the Llama3 family. Our results demonstrate that circuit distillation outperforms standard distillation, successfully transferring algorithmic capabilities by adjusting only a small, targeted subset of student model parameters. This work establishes the feasibility of transferring mechanisms, which may in turn allow for efficient distillation of targeted teacher capabilities via interpretable and controllable internal student mechanisms.

[179] Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, Guang Lin

Main category: cs.CL

TL;DR: DiDi-Instruct is a training-based method that accelerates language generation by initializing from pre-trained discrete diffusion language models, achieving 64x speedup while outperforming GPT-2 and standard diffusion models.

Details

Motivation: Fast generation of language texts is crucial in the AI era, and current methods need acceleration while maintaining quality.

Method: Uses pre-trained discrete diffusion language models as initialization, with techniques like grouped reward normalization, intermediate-state matching, and reward-guided ancestral sampler (RGAS) for improved training stability and inference performance.

Result: Achieves 64x acceleration over GPT-2 baseline, with perplexities ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs) on OpenWebText, with only 1% entropy loss and 20x less training time.

Conclusion: DiDi-Instruct is an efficient distillation method that enables fast language generation with minimal performance degradation, validated through extensive studies and protein sequence generation.

Abstract: Fast generation of language texts is the holy grail that people pursue in the AI era. In this work, we introduced Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that leads to fast language generation models by initializing from a pre-trained (masked) discrete diffusion language model (dLLM). The resulting DiDi-Instruct model outperforms the dLLM counterparts and the GPT-2 baseline with 64x acceleration. In the theoretical part of the paper, we build the foundation of DiDi-Instruct in a framework of integral KL-divergence minimization, with practical training algorithms. We also introduce techniques like grouped reward normalization, intermediate-state matching, and the reward-guided ancestral sampler (RGAS) that significantly improve the training stability, the model coverage, and the inference performances. On OpenWebText, DiDi-Instruct outperforms all accelerated language generation models as well as the GPT-2 baseline and the standard dLLMs, achieving sample perplexities ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs). These performance gains are accomplished with a negligible entropy loss of about 1% and 20x less additional training wall-clock time. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye. We will release both code and models at github.com/haoyangzheng-ai/didi-instruct.

[180] GateMABSA: Aspect-Image Gated Fusion for Multimodal Aspect-based Sentiment Analysis

Adamu Lawan, Haruna Yunusa

Main category: cs.CL

TL;DR: GateMABSA is a novel gated multimodal architecture for Aspect-based Sentiment Analysis that uses specialized mLSTMs to handle syntactic structure, semantic relevance, and selective multimodal fusion, outperforming existing baselines on Twitter datasets.

Details

Motivation: Existing multimodal ABSA models struggle with filtering noisy visual signals and effectively aligning aspects with opinion-bearing content across text and image modalities.

Method: GateMABSA integrates three specialized mLSTMs: Syn-mLSTM for syntactic structure, Sem-mLSTM for aspect-semantic relevance, and Fuse-mLSTM for selective multimodal fusion.

Result: Extensive experiments on two benchmark Twitter datasets demonstrate that GateMABSA outperforms several baselines.

Conclusion: The proposed gated multimodal architecture effectively addresses challenges in multimodal ABSA by incorporating syntactic, semantic, and fusion-aware components.

Abstract: Aspect-based Sentiment Analysis (ABSA) has recently advanced into the multimodal domain, where user-generated content often combines text and images. However, existing multimodal ABSA (MABSA) models struggle to filter noisy visual signals, and effectively align aspects with opinion-bearing content across modalities. To address these challenges, we propose GateMABSA, a novel gated multimodal architecture that integrates syntactic, semantic, and fusion-aware mLSTM. Specifically, GateMABSA introduces three specialized mLSTMs: Syn-mLSTM to incorporate syntactic structure, Sem-mLSTM to emphasize aspect–semantic relevance, and Fuse-mLSTM to perform selective multimodal fusion. Extensive experiments on two benchmark Twitter datasets demonstrate that GateMABSA outperforms several baselines.

[181] Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini

Main category: cs.CL

TL;DR: Hyperdimensional Probe is a novel method that combines symbolic representations and neural probing to decode interpretable concepts from LLM vector spaces using Vector Symbolic Architectures, overcoming limitations of existing interpretability methods.

Details

Motivation: Current LLM interpretability methods like direct logit attribution and sparse autoencoders provide limited insight due to vocabulary constraints and unclear feature names, creating a need for better decoding approaches.

Method: Projects the model’s residual stream into interpretable concepts via Vector Symbolic Architectures, combining strengths of sparse autoencoders and conventional probes while overcoming their limitations.

Result: The probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, and helps identify LLM failures in both controlled tasks and question-answering settings.

Conclusion: This work advances information decoding in LLM vector space, enabling extraction of more informative, interpretable, and structured features from neural representations.

Abstract: Despite their capabilities, Large Language Models (LLMs) remain opaque with limited understanding of their internal representations. Current interpretability methods, such as direct logit attribution (DLA) and sparse autoencoders (SAEs), provide restricted insight due to limitations such as the model’s output vocabulary or unclear feature names. This work introduces Hyperdimensional Probe, a novel paradigm for decoding information from the LLM vector space. It combines ideas from symbolic representations and neural probing to project the model’s residual stream into interpretable concepts via Vector Symbolic Architectures (VSAs). This probe combines the strengths of SAEs and conventional probes while overcoming their key limitations. We validate our decoding paradigm with controlled input-completion tasks, probing the model’s final state before next-token prediction on inputs spanning syntactic pattern recognition, key-value associations, and abstract inference. We further assess it in a question-answering setting, examining the state of the model both before and after text generation. Our experiments show that our probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, also helping identify LLM failures. Our work advances information decoding in LLM vector space, enabling extracting more informative, interpretable, and structured features from neural representations.

[182] Confidence-Guided Error Correction for Disordered Speech Recognition

Abner Hernandez, Tomás Arias Vergara, Andreas Maier, Paula Andrea Pérez-Toro

Main category: cs.CL

TL;DR: Using LLMs as post-processing modules for ASR error correction in disordered speech, with confidence-informed prompting that embeds word-level uncertainty estimates to improve robustness and reduce overcorrection.

Details

Motivation: To improve automatic speech recognition for disordered speech by leveraging LLMs' error correction capabilities while addressing the challenge of overcorrection and improving generalization across different speakers and datasets.

Method: Proposed confidence-informed prompting where word-level uncertainty estimates are embedded directly into LLM training, fine-tuned a LLaMA 3.1 model, and compared against transcript-only fine-tuning and post hoc confidence-based filtering.

Result: Achieved 10% relative WER reduction on Speech Accessibility Project spontaneous speech and 47% reduction on TORGO dataset compared to naive LLM correction.

Conclusion: Confidence-aware fine-tuning is effective for impaired speech ASR error correction, demonstrating significant improvements in word error rate reduction through uncertainty-informed LLM training.

Abstract: We investigate the use of large language models (LLMs) as post-processing modules for automatic speech recognition (ASR), focusing on their ability to perform error correction for disordered speech. In particular, we propose confidence-informed prompting, where word-level uncertainty estimates are embedded directly into LLM training to improve robustness and generalization across speakers and datasets. This approach directs the model to uncertain ASR regions and reduces overcorrection. We fine-tune a LLaMA 3.1 model and compare our approach to both transcript-only fine-tuning and post hoc confidence-based filtering. Evaluations show that our method achieves a 10% relative WER reduction compared to naive LLM correction on the Speech Accessibility Project spontaneous speech and a 47% reduction on TORGO, demonstrating the effectiveness of confidence-aware fine-tuning for impaired speech.

[183] An empirical study on the limitation of Transformers in program trace generation

Simeng Sun

Main category: cs.CL

TL;DR: Transformers trained on program trace generation achieve high in-distribution accuracy but show systematic generalization failures across program length and trace steps, with some architectural modifications improving generalization.

Details

Motivation: To study Transformers' reasoning capabilities through program trace generation, where models produce step-by-step execution traces for synthetic programs, externalizing reasoning through long traces with trivial individual steps.

Method: Training small Transformers with diverse modifications including alternative position encodings, softmax replacements, hybrid models, and short convolutions on the program trace generation task.

Result: Models achieve strong in-distribution accuracy but exhibit systematic failures when generalizing to various factors like program length and trace steps, though some architectural designs significantly improve generalization.

Conclusion: While Transformers can perform well on program trace generation in-distribution, they struggle with systematic generalization, suggesting limitations in their reasoning capabilities that can be partially mitigated through specific architectural modifications.

Abstract: We study Transformers on the task \emph{program trace generation} (PTG), where models produce step-by-step execution traces for synthetic programs. Unlike existing algorithmic problems, PTG externalizes reasoning through long traces where each step is trivial. We train small Transformers with diverse modifications, including alternative position encodings, softmax replacements, hybrid model, and short convolutions. While these models achieve strong in-distribution accuracy, they exhibit systematic failures when generalizing to various factors (e.g., program length, trace steps), though some designs significantly improve generalization.

[184] Scaling Generalist Data-Analytic Agents

Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Zhao Bin, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

Main category: cs.CL

TL;DR: DataMind is a scalable data synthesis and agent training framework that builds generalist data-analytic agents, addressing challenges in data resources, training strategy, and multi-turn rollout. It achieves state-of-the-art performance on data analysis benchmarks.

Details

Motivation: Current data-analytic agents heavily rely on proprietary models, while open-source models struggle with diverse data formats and complex multi-step reasoning required for real-world analytics.

Method: Uses fine-grained task taxonomy with recursive easy-to-hard composition, knowledge-augmented trajectory sampling with filtering, dynamic training objective combining SFT and RL losses, and memory-frugal multi-turn rollout framework.

Result: DataMind-14B achieves 71.16% average score on multiple benchmarks, outperforming proprietary models like DeepSeek-V3.1 and GPT-5. DataMind-7B scores 68.10%, best among open-source models.

Conclusion: DataMind provides an effective framework for building high-performing open-source data-analytic agents and releases curated datasets and models for community research.

Abstract: Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to face diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K and DataMind-7B,14B for the community’s future research.

[185] jina-reranker-v3: Last but Not Late Interaction for Document Reranking

Feng Wang, Yuqing Li, Han Xiao

Main category: cs.CL

TL;DR: jina-reranker-v3 is a 0.6B parameter multilingual document reranker using novel ’last but not late interaction’ that achieves SOTA BEIR performance with 61.94 nDCG@10 while being 10x smaller than generative listwise rerankers.

Details

Motivation: To create a more efficient reranker that enables rich cross-document interactions while maintaining compact architecture, addressing limitations of late interaction models like ColBERT.

Method: Introduces ’last but not late interaction’ - conducts causal self-attention between query and documents within same context window, then extracts contextual embeddings from last token of each document.

Result: Achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being ten times smaller than generative listwise rerankers.

Conclusion: The compact architecture with novel interaction method provides superior performance and efficiency compared to existing approaches.

Abstract: jina-reranker-v3 is a 0.6B parameter multilingual document reranker that introduces a novel last but not late interaction. Unlike late interaction models such as ColBERT that perform separate encoding followed by multi-vector matching, our approach conducts causal self-attention between query and documents within the same context window, enabling rich cross-document interactions before extracting contextual embeddings from the last token of each document. This compact architecture achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being ten times smaller than generative listwise rerankers.

[186] Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs

Akio Hayakawa, Stefan Bott, Horacio Saggion

Main category: cs.CL

TL;DR: Proposes efficient lexical simplification framework using small LLMs for privacy-sensitive environments, addressing safety concerns through probability-based filtering of harmful simplifications.

Details

Motivation: Address challenges of large language models in privacy-sensitive lexical simplification applications, particularly for vulnerable user groups who need safe and correct outputs.

Method: Uses small LLMs deployable locally, explores knowledge distillation with synthesized data and in-context learning, and proposes filtering strategy based on model output probability to detect harmful simplifications.

Result: Knowledge distillation boosts automatic metrics but increases harmful simplifications. Model output probability effectively detects harmful simplifications. Filtering strategy suppresses harmful outputs while preserving beneficial ones.

Conclusion: Establishes benchmark for efficient and safe lexical simplification with small LLMs, highlighting trade-offs between performance, efficiency, and safety, and demonstrates promising approach for real-world deployment.

Abstract: Despite their strong performance, large language models (LLMs) face challenges in real-world application of lexical simplification (LS), particularly in privacy-sensitive and resource-constrained environments. Moreover, since vulnerable user groups (e.g., people with disabilities) are one of the key target groups of this technology, it is crucial to ensure the safety and correctness of the output of LS systems. To address these issues, we propose an efficient framework for LS systems that utilizes small LLMs deployable in local environments. Within this framework, we explore knowledge distillation with synthesized data and in-context learning as baselines. Our experiments in five languages evaluate model outputs both automatically and manually. Our manual analysis reveals that while knowledge distillation boosts automatic metric scores, it also introduces a safety trade-off by increasing harmful simplifications. Importantly, we find that the model’s output probability is a useful signal for detecting harmful simplifications. Leveraging this, we propose a filtering strategy that suppresses harmful simplifications while largely preserving beneficial ones. This work establishes a benchmark for efficient and safe LS with small LLMs. It highlights the key trade-offs between performance, efficiency, and safety, and demonstrates a promising approach for safe real-world deployment.

[187] Towards Personalized Deep Research: Benchmarks and Evaluations

Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: Introduces Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs), featuring 50 research tasks across 10 domains paired with 25 user profiles.

Details

Motivation: Existing evaluations rely on close-ended benchmarks and neglect personalized scenarios, creating a gap in open-ended deep research evaluation.

Method: Created a benchmark with 250 user-task queries combining 50 research tasks and 25 authentic user profiles. Proposed PQR Evaluation Framework measuring Personalization Alignment, Content Quality, and Factual Reliability.

Result: Experiments on various systems revealed current capabilities and limitations in handling personalized deep research.

Conclusion: Establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.

Abstract: Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures (P) Personalization Alignment, (Q) Content Quality, and (R) Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.

[188] Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?

Kai Sun, Yin Huang, Srishti Mehra, Mohammad Kachuee, Xilun Chen, Renjie Tao, Zhaojiang Lin, Andrea Jessee, Nirav Shah, Alex Betty, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

Main category: cs.CL

TL;DR: This paper investigates whether knowledge triple extraction remains useful for web-based QA systems in the era of LLMs, finding that while challenging, extraction can still benefit LLMs through augmentation and multi-task learning.

Details

Motivation: To determine if knowledge extraction still has value for QA systems given the advances in LLMs, questioning whether extraction is still needed when LLMs can directly answer questions.

Method: Extended an existing benchmark with knowledge extraction annotations and evaluated commercial and open-source LLMs of varying sizes, testing augmentation with extracted triples and multi-task learning approaches.

Result: Web-scale knowledge extraction remains challenging for LLMs, but LLMs can still benefit from knowledge extraction through augmentation with extracted triples and multi-task learning, even when achieving high QA accuracy.

Conclusion: Knowledge triple extraction maintains value in web-based QA systems and provides strategies for maximizing LLM effectiveness across different model sizes and resource constraints.

Abstract: The advent of Large Language Models (LLMs) has significantly advanced web-based Question Answering (QA) systems over semi-structured content, raising questions about the continued utility of knowledge extraction for question answering. This paper investigates the value of triple extraction in this new paradigm by extending an existing benchmark with knowledge extraction annotations and evaluating commercial and open-source LLMs of varying sizes. Our results show that web-scale knowledge extraction remains a challenging task for LLMs. Despite achieving high QA accuracy, LLMs can still benefit from knowledge extraction, through augmentation with extracted triples and multi-task learning. These findings provide insights into the evolving role of knowledge triple extraction in web-based QA and highlight strategies for maximizing LLM effectiveness across different model sizes and resource settings.

[189] Investigating Language and Retrieval Bias in Multilingual Previously Fact-Checked Claim Detection

Ivan Vykopal, Antonia Karamolegkou, Jaroslav Kopčan, Qiwei Peng, Tomáš Javůrek, Michal Gregor, Marián Šimko

Main category: cs.CL

TL;DR: The paper studies language and retrieval bias in multilingual LLMs for fact-checking, revealing performance disparities across languages and skewed retrieval favoring popular claims.

Details

Motivation: Multilingual LLMs show language bias, performing better on high-resource languages like English than low-resource ones, and retrieval systems exhibit bias by favoring certain information.

Method: Evaluated six open-source multilingual LLMs across 20 languages using multilingual prompting with AMC-16K dataset, and analyzed retrieval bias using multilingual embedding models and claim frequency.

Result: Found persistent language bias in LLM performance and retrieval bias where certain claims are disproportionately retrieved, inflating performance for popular claims while under-representing less common ones.

Conclusion: Highlights ongoing bias issues in multilingual fact-checking and provides recommendations for improving equity in cross-lingual fact verification systems.

Abstract: Multilingual Large Language Models (LLMs) offer powerful capabilities for cross-lingual fact-checking. However, these models often exhibit language bias, performing disproportionately better on high-resource languages such as English than on low-resource counterparts. We also present and inspect a novel concept

retrieval bias, when information retrieval systems tend to favor certain information over others, leaving the retrieval process skewed. In this paper, we study language and retrieval bias in the context of Previously Fact-Checked Claim Detection (PFCD). We evaluate six open-source multilingual LLMs across 20 languages using a fully multilingual prompting strategy, leveraging the AMC-16K dataset. By translating task prompts into each language, we uncover disparities in monolingual and cross-lingual performance and identify key trends based on model family, size, and prompting strategy. Our findings highlight persistent bias in LLM behavior and offer recommendations for improving equity in multilingual fact-checking. To investigate retrieval bias, we employed multilingual embedding models and look into the frequency of retrieved claims. Our analysis reveals that certain claims are retrieved disproportionately across different posts, leading to inflated retrieval performance for popular claims while under-representing less common ones.

[190] Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation

Yen-Ju Lu, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba

Main category: cs.CL

TL;DR: PbT is a teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data by using teacher LLM compression and student reconstruction, enabling high-quality synthetic data generation for low-resource NLG scenarios.

Details

Motivation: Address the challenge in low-resource NLG where practitioners often have only raw inputs or outputs but not both, forcing small models to learn from few examples or rely on costly synthetic data from large LLMs.

Method: Two-stage pipeline: teacher LLM compresses unpaired examples into intermediate representations (IRs), then student model reconstructs inputs from IRs to pair with outputs, creating synthetic training data.

Result: 8B student trained on PbT data outperforms models trained on 70B teacher-generated corpora, comes within 1.2 ROUGE-L of human-annotated pairs, closes 82% of oracle gap at one-third annotation cost of direct synthesis.

Conclusion: PbT effectively generates in-domain sources that avoid mismatch issues, producing concise and faithful summaries aligned with target style, demonstrating significant advantage over direct synthesis approaches.

Abstract: We present Paired by the Teacher (PbT), a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs, like highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR), and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks-document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD)-as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70 B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage of generating in-domain sources that avoid the mismatch, limiting direct synthesis.

[191] Pretraining Large Language Models with NVFP4

NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu

Main category: cs.CL

TL;DR: This paper introduces a novel approach for stable 4-bit floating point (FP4) training of large language models using NVFP4 format, achieving comparable performance to FP8 baseline while significantly improving computational efficiency.

Details

Motivation: Training frontier LLMs requires massive computational resources (tens to hundreds of yottaflops). While FP8 training is widely adopted, transitioning to narrower precision like FP4 could unlock additional improvements in computational speed and resource utilization, but poses challenges to training stability and convergence.

Method: The approach integrates Random Hadamard transforms to bound block-level outliers, uses two-dimensional quantization scheme for consistent representations across forward and backward passes, employs stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers.

Result: The method was validated by training a 12-billion-parameter model on 10 trillion tokens - the longest publicly documented training run in 4-bit precision. The model achieved training loss and downstream task accuracies comparable to an FP8 baseline.

Conclusion: NVFP4, when combined with the proposed training approach, represents a major step forward in narrow-precision LLM training algorithms, enabling more efficient training of next-generation LLMs.

Abstract: Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens – the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.

[192] EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Haolei Xu, Xinyu Mei, Yuchen Yan, Rui Zhou, Wenqi Zhang, Weiming Lu, Yueting Zhuang, Yongliang Shen

Main category: cs.CL

TL;DR: EasySteer is a unified framework for LLM steering that achieves 5.5-11.4× speedup over existing methods through vLLM integration, offering modular architecture and pre-computed steering vectors for various applications.

Details

Motivation: Existing LLM steering frameworks suffer from computational inefficiency, limited extensibility, and restricted functionality, hindering both research progress and practical deployment.

Method: Built on vLLM with modular architecture featuring pluggable interfaces for analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight domains, and interactive demonstration system.

Result: Achieves 5.5-11.4× speedup over existing frameworks and demonstrates effectiveness in overthinking mitigation, hallucination reduction, and other key applications.

Conclusion: EasySteer transforms steering from research technique to production-ready capability, establishing critical infrastructure for deployable, controllable language models.

Abstract: Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time through targeted manipulation of hidden states, offering a lightweight alternative to expensive retraining. However, existing steering frameworks suffer from critical limitations: computational inefficiency, limited extensibility, and restricted functionality that hinder both research progress and practical deployment. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM. Our system features modular architecture with pluggable interfaces for both analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight application domains, and an interactive demonstration system. Through deep integration with vLLM’s optimized inference engine, EasySteer achieves 5.5-11.4$\times$ speedup over existing frameworks. Extensive experiments demonstrate its effectiveness in overthinking mitigation, hallucination reduction, and other key applications. EasySteer transforms steering from research technique to production-ready capability, establishing critical infrastructure for deployable, controllable language models.

[193] NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

Penghai Zhao, Jinyu Tian, Qinghua Xing, Xin Zhang, Zheng Li, Jianjun Qian, Ming-Ming Cheng, Xiang Li

Main category: cs.CL

TL;DR: NAIPv2 is a debiased and efficient framework for paper quality estimation that uses pairwise learning and probabilistic integration of reviewer scores to achieve state-of-the-art performance while maintaining linear-time inference efficiency.

Details

Motivation: Existing LLM-based estimation methods have high inference costs, while direct score regression suffers from scale inconsistencies in reviewer ratings.

Method: Uses pairwise learning within domain-year groups to reduce rating inconsistencies, introduces Review Tendency Signal (RTS) for probabilistic integration of reviewer scores and confidences, and trains on a large dataset of 24,276 ICLR submissions.

Result: Achieves 78.2% AUC and 0.432 Spearman correlation, demonstrates strong generalization on NeurIPS submissions with predicted scores increasing consistently across decision categories from Rejected to Oral.

Conclusion: NAIPv2 establishes a debiased and scalable framework for automated paper quality estimation, representing progress toward future scientific intelligence systems.

Abstract: The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems. Code and dataset are released at https://sway.cloud.microsoft/Pr42npP80MfPhvj8.

[194] Incentive-Aligned Multi-Source LLM Summaries

Yanchen Jiang, Zhe Feng, Aranyak Mehta

Main category: cs.CL

TL;DR: TTS is an incentive-aligned framework for truthful text summarization that improves factual robustness without ground-truth labels by decomposing claims, eliciting source stances, scoring sources with peer-prediction, and filtering unreliable sources.

Details

Motivation: Current LLM-based search and answer systems have weak incentives for source accuracy and are vulnerable to adversarial content when synthesizing multiple texts into responses.

Method: Four-step framework: (1) decompose draft synthesis into atomic claims, (2) elicit each source’s stance on every claim, (3) score sources using adapted multi-task peer-prediction mechanism that rewards informative agreement, (4) filter unreliable sources before re-summarizing.

Result: Experiments show TTS improves factual accuracy and robustness while preserving fluency, aligning exposure with informative corroboration and disincentivizing manipulation.

Conclusion: TTS establishes formal guarantees that align source incentives with informative honesty, making truthful reporting the utility-maximizing strategy for improved factual robustness in text summarization.

Abstract: Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate and are vulnerable to adversarial content. We introduce Truthful Text Summarization (TTS), an incentive-aligned framework that improves factual robustness without ground-truth labels. TTS (i) decomposes a draft synthesis into atomic claims, (ii) elicits each source’s stance on every claim, (iii) scores sources with an adapted multi-task peer-prediction mechanism that rewards informative agreement, and (iv) filters unreliable sources before re-summarizing. We establish formal guarantees that align a source’s incentives with informative honesty, making truthful reporting the utility-maximizing strategy. Experiments show that TTS improves factual accuracy and robustness while preserving fluency, aligning exposure with informative corroboration and disincentivizing manipulation.

[195] Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding

Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Main category: cs.CL

TL;DR: Learn2PD is a framework that trains a lightweight filter model to enable adaptive parallel decoding in diffusion-based LLMs, achieving significant speedup without performance loss.

Details

Motivation: Current parallel decoding strategies in diffusion-based LLMs use fixed heuristics that don't adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks.

Method: Proposes Learn2PD framework with a lightweight filter model trained to predict when token predictions match final output, and introduces End-of-Text Prediction (EoTP) to detect decoding completion. The filter is learned post-training with minimal computation.

Result: Achieves up to 22.58× speedup without performance drop on LLaDA benchmark, and up to 57.51× when combined with KV-Cache.

Conclusion: The proposed adaptive parallel decoding approach significantly improves inference throughput while maintaining output quality, offering a more flexible alternative to fixed heuristic methods.

Abstract: Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds), which fail to adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted. Importantly, the filter model is learned in a post-training manner, requiring only a small amount of computation to optimize it (minute-level GPU time). Additionally, we introduce End-of-Text Prediction (EoTP) to detect decoding completion at the end of sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark demonstrate that our method achieves up to 22.58$\times$ speedup without any performance drop, and up to 57.51$\times$ when combined with KV-Cache.

[196] InfoAgent: Advancing Autonomous Information-Seeking Agents

Gongrui Zhang, Jialiang Zhu, Ruiqi Yang, Kai Qiu, Miaosen Zhang, Zhirong Wu, Qi Dai, Bei Liu, Chong Luo, Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Yuan Zhang, Xin Li, Zhaoyi Liu, Xin Geng, Baining Guo

Main category: cs.CL

TL;DR: InfoAgent is a deep research agent that uses a data synthesis pipeline and self-hosted search tools to handle challenging queries, outperforming prior open-source agents on multiple benchmarks.

Details

Motivation: To build more capable LLM agents that can expand their capabilities through tool interaction, with a focus on transparency and avoiding heavy reliance on commercial search tools.

Method: Uses entity trees and sub-tree sampling with entity fuzzification to create challenging queries, develops self-hosted search infrastructure, and employs two-stage training: cold-start supervised finetuning followed by reinforcement learning.

Result: Achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS, outperforming WebSailor-72B and DeepDive-32B.

Conclusion: The proposed data synthesis pipeline and training methods enable InfoAgent to effectively handle challenging research queries while maintaining transparency through self-hosted tools.

Abstract: Building Large Language Model agents that expand their capabilities by interacting with external tools represents a new frontier in AI research and applications. In this paper, we introduce InfoAgent, a deep research agent powered by an innovative data synthesis pipeline and orchestrated web search tools. To construct challenging, hard-to-find queries,we build entity trees and apply sub-tree sampling with entity fuzzification to systematically increase question difficulty. Unlike prior work that relies heavily on commercial search tools, we develop a dedicated self-hosted search infrastructure, enhancing transparency of agent environments and facilitating further advancement of agent capacity. We evaluate the effectiveness of our data pipeline by measuring the average number of tool calls required to correctly answer a question, and also show that our agent yields better performance when equipped with our tools. Our \mbox{InfoAgent} is post-trained from Qwen3-14B using a two-stage recipe: cold-start supervised finetuning to instill long-horizon search behaviors, followed by reinforcement learning which significantly improves reasoning-driven tool use. With our methods, InfoAgent achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS, outperforming prior open-source deep research agents such as WebSailor-72B and DeepDive-32B.

[197] WordAlchemy: A transformer-based Reverse Dictionary

Kanhaiya Madaswar, Harshal Patil, Pranav Sadavarte, Sunil B. Mane

Main category: cs.CL

TL;DR: A novel cross-lingual reverse dictionary system for Indian languages using transformer-based deep learning with mT5 model and Translation Language Modeling.

Details

Motivation: Currently no reverse dictionary providers exist for Indian languages, which would be useful for language learners, anomia patients, and tip-of-the-tongue problems.

Method: Transformer-based deep learning approach using mT5 model with Translation Language Modeling (TLM) instead of conventional Masked Language Modeling (MLM).

Result: Development of an open-source cross-lingual reverse dictionary system with support for Indian languages.

Conclusion: The proposed system addresses the gap in reverse dictionary support for Indian languages using advanced transformer architecture with TLM technique.

Abstract: A reverse dictionary takes a target word’s description as input and returns the words that fit the description. Reverse Dictionaries are useful for new language learners, anomia patients, and for solving common tip-of-the-tongue problems (lethologica). Currently, there does not exist any Reverse Dictionary provider with support for any Indian Language. We present a novel open-source cross-lingual reverse dictionary system with support for Indian languages. In this paper, we propose a transformer-based deep learning approach to tackle the limitations faced by the existing systems using the mT5 model. This architecture uses the Translation Language Modeling (TLM) technique, rather than the conventional BERT’s Masked Language Modeling (MLM) technique.

[198] Continual Dialogue State Tracking via Example-Guided Question Answering

Hyundong Cho, Andrea Madotto, Zhaojiang Lin, Khyathi Raghavi Chandu, Satwik Kottur, Jing Xu, Jonathan May, Chinnadhurai Sankar

Main category: cs.CL

TL;DR: Reformulating dialogue state tracking as granular example-guided QA tasks to improve continual learning performance without complex regularization methods.

Details

Motivation: Dialogue systems suffer from catastrophic forgetting when updated with new services, and DST is a simple NLU task that can benefit from reformulation to minimize task shift between services.

Method: Reformulate DST as bundle of granular example-guided QA tasks, use in-context examples retrieved by trained retriever, and combine with dialogue-level memory replay.

Result: 60M parameter model achieves significant performance boost, attains state-of-the-art on DST continual learning metrics without complex regularization or parameter expansion.

Conclusion: The approach successfully alleviates service-specific memorization and teaches models to contextualize questions and examples for better continual learning in dialogue systems.

Abstract: Dialogue systems are frequently updated to accommodate new services, but naively updating them by continually training with data for new services in diminishing performance on previously learnt services. Motivated by the insight that dialogue state tracking (DST), a crucial component of dialogue systems that estimates the user’s goal as a conversation proceeds, is a simple natural language understanding task, we propose reformulating it as a bundle of granular example-guided question answering tasks to minimize the task shift between services and thus benefit continual learning. Our approach alleviates service-specific memorization and teaches a model to contextualize the given question and example to extract the necessary information from the conversation. We find that a model with just 60M parameters can achieve a significant boost by learning to learn from in-context examples retrieved by a retriever trained to identify turns with similar dialogue state changes. Combining our method with dialogue-level memory replay, our approach attains state of the art performance on DST continual learning metrics without relying on any complex regularization or parameter expansion methods.

[199] CGELBank Annotation Manual v1.2

Brett Reynolds, Nathan Schneider, Aryaman Arora

Main category: cs.CL

TL;DR: CGELBank is a treebank and tools based on the Cambridge Grammar of the English Language, with documentation on its annotation scheme.

Details

Motivation: To create a treebank and associated tools based on the Cambridge Grammar of the English Language syntactic formalism.

Method: Development of a treebank and tools hosted on GitHub, with documentation of the annotation scheme.

Result: Creation of CGELBank treebank and tools available at https://github.com/nert-nlp/cgel.

Conclusion: The paper documents the specific annotation scheme used in CGELBank based on CGEL syntactic formalism.

Abstract: CGELBank is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language (CGEL; Huddleston and Pullum, 2002). It is hosted on GitHub at https://github.com/nert-nlp/cgel. This document lays out the particularities of the CGELBank annotation scheme.

[200] Machines Do See Color: A Guideline to Classify Different Forms of Racist Discourse in Large Corpora

Diana Davila Gordillo, Joan C. Timoneda, Sebastian Vallejo Vera

Main category: cs.CL

TL;DR: The paper presents a generalizable method for identifying and classifying racist discourse in large text corpora using cross-lingual machine learning models, specifically XLM-RoBERTa.

Details

Motivation: Current approaches to identifying racist language are limited to small qualitative studies or large-scale methods that only detect overt racism, lacking comprehensive classification of different racist discourse forms in large corpora.

Method: A three-step approach: 1) Conceptualize racism and its manifestations, 2) Contextualize racist manifestations to specific time and place to identify discursive forms, 3) Apply XLM-RoBERTa for supervised text classification with contextual understanding.

Result: XLM-R and their pretrained model XLM-R-Racismo outperform other state-of-the-art approaches in classifying racism in large corpora, demonstrated on a corpus of tweets about Ecuador’s indigenous community from 2018-2021.

Conclusion: The proposed method provides an effective, generalizable framework for identifying and classifying various forms of racist discourse in large text corpora across different languages and contexts.

Abstract: Current methods to identify and classify racist language in text rely on small-n qualitative approaches or large-n approaches focusing exclusively on overt forms of racist discourse. This article provides a step-by-step generalizable guideline to identify and classify different forms of racist discourse in large corpora. In our approach, we start by conceptualizing racism and its different manifestations. We then contextualize these racist manifestations to the time and place of interest, which allows researchers to identify their discursive form. Finally, we apply XLM-RoBERTa (XLM-R), a cross-lingual model for supervised text classification with a cutting-edge contextual understanding of text. We show that XLM-R and XLM-R-Racismo, our pretrained model, outperform other state-of-the-art approaches in classifying racism in large corpora. We illustrate our approach using a corpus of tweets relating to the Ecuadorian ind'igena community between 2018 and 2021.

Haohao Zhu, Xiaokun Zhang, Junyu Lu, Youlin Wu, Zewen Bai, Changrong Min, Liang Yang, Bo Xu, Dongyu Zhang, Hongfei Lin

Main category: cs.CL

TL;DR: The paper proposes DEN, a Dual Enhanced Network that jointly models both long-term stable personality traits and short-term dynamic states for comprehensive textual personality detection.

Details

Motivation: Existing personality detection studies focus only on either long-term or short-term personality representations, neglecting the integration of both aspects which are both vital for comprehensive personality understanding.

Method: DEN uses three modules: Long-term Personality Encoding for stable traits via psychological entity patterns, Short-term Personality Encoding for dynamic states via contextual post analysis, and Bi-directional Interaction to integrate both aspects.

Result: Experimental results on two personality detection datasets demonstrate the effectiveness of the DEN model in capturing comprehensive personality representations.

Conclusion: The study underscores the importance of considering both stable and dynamic aspects of personality in textual personality detection for a more comprehensive understanding.

Abstract: Textual personality detection aims to identify personality characteristics by analyzing user-generated content on social media platforms. Extensive psychological literature highlights that personality encompasses both long-term stable traits and short-term dynamic states. However, existing studies often concentrate only on either long-term or short-term personality representations, neglecting the integration of both aspects. This limitation hinders a comprehensive understanding of individuals’ personalities, as both stable traits and dynamic states are vital. To bridge this gap, we propose a Dual Enhanced Network (DEN) to jointly model users’ long-term and short-term personality traits. In DEN, the Long-term Personality Encoding module models stable long-term personality traits by analyzing consistent patterns in the usage of psychological entities. The Short-term Personality Encoding module captures dynamic short-term personality states by modeling the contextual information of individual posts in real-time. The Bi-directional Interaction module integrates both aspects of personality, creating a cohesive and comprehensive representation of the user’s personality. Experimental results on two personality detection datasets demonstrate the effectiveness of the DEN model and underscore the importance of considering both stable and dynamic aspects of personality in textual personality detection.

[202] Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler

Main category: cs.CL

TL;DR: MRAG introduces a novel RAG approach that uses Transformer multi-head attention activations as keys to retrieve multiple documents with different aspects, improving retrieval accuracy for complex queries.

Details

Motivation: Existing RAG solutions struggle with queries requiring multiple documents with substantially different contents, as their embeddings may be distant in embedding space, making retrieval challenging.

Method: Leverages activations from Transformer’s multi-head attention layer as keys instead of decoder layer, exploiting that different attention heads capture different data aspects to create embeddings representing various facets.

Result: Shows design advantages over 18 RAG baselines, empirical improvements of up to 20% in retrieval success ratios, and benefits for downstream LLM generation.

Conclusion: MRAG effectively addresses multi-aspect document retrieval challenges and can be seamlessly integrated with existing RAG frameworks and benchmarks.

Abstract: Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving observation is that different attention heads learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, multi-aspect datasets, and real-world use cases to demonstrate MRAG’s effectiveness. We show MRAG’s design advantages over 18 RAG baselines, empirical improvements of up to 20% in retrieval success ratios, and benefits for downstream LLM generation. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarks.

[203] SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

David Wadden, Kejian Shi, Jacob Morrison, Alan Li, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, Arman Cohan

Main category: cs.CL

TL;DR: SciRIFF is a 137K-instruction dataset for scientific literature understanding across 54 tasks, featuring expert-written instructions with complex contexts and structured outputs. Fine-tuning LLMs with SciRIFF improves performance by 70.6% on held-out scientific tasks.

Details

Motivation: To address the need for high-quality instruction-following datasets specifically for scientific literature understanding, helping researchers navigate rapidly growing scientific literature through improved LLM capabilities.

Method: Created SciRIFF dataset with 137K expert-written instruction-following instances covering 54 tasks across 5 scientific capabilities. Fine-tuned LLMs using a mix of general-domain and SciRIFF instructions and evaluated on 9 held-out tasks.

Result: LLMs fine-tuned on SciRIFF achieved 70.6% average improvement over baselines trained only on general-domain instructions on the SciRIFF-Eval held-out tasks.

Conclusion: SciRIFF enables effective development and evaluation of LLMs for scientific literature understanding, demonstrating significant performance improvements for scientific information extraction and synthesis tasks.

Abstract: We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF is unique in being entirely expert-written, high-quality instruction-following dataset for extracting and synthesizing information from research literature across diverse scientific fields. It features complex instructions with long input contexts, detailed task descriptions, and structured outputs. To demonstrate its utility, we finetune a series of large language models (LLMs) using a mix of general-domain and SciRIFF instructions. On nine out-of-distribution held-out tasks (referred to as SciRIFF-Eval), LLMs finetuned on SciRIFF achieve 70.6% average improvement over baselines trained only on general-domain instructions. SciRIFF facilitates the development and evaluation of LLMs to help researchers navigate the rapidly growing body of scientific literature.

[204] Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity

Lei Yu, Jingcheng Niu, Zining Zhu, Xi Chen, Gerald Penn

Main category: cs.CL

TL;DR: DiscoGP is a framework that extracts modular units (sheaves) from neural language models by pruning both edges and weights, preserving task performance with high sparsity.

Details

Motivation: To extend functional circuits in interpretability research by considering both computation graph edges and weight parameters for better modularity and functional fidelity.

Method: Uses a gradient-based pruning algorithm that operates on both edges and weights to reduce the model to a sparse skeleton while preserving core capabilities.

Result: Extracted sheaves preserve 93%-100% of model performance across linguistic and reasoning tasks while comprising only 1%-7% of original weights and connections, with superior modularity compared to previous circuits.

Conclusion: DiscoGP effectively identifies highly sparse, functional modules in LMs and provides novel insights into LLM inner workings when extended to neuron-level analysis.

Abstract: In this paper, we introduce DiscoGP, a novel framework for extracting self-contained modular units, or sheaves, within neural language models (LMs). Sheaves extend the concept of functional circuits, a unit widely explored in interpretability research, by considering not only subsets of edges in an LM’s computation graph but also the model’s weight parameters. Our framework identifies sheaves through a gradient-based pruning algorithm that operates on both of these in such a way that reduces the original LM to a sparse skeleton that preserves certain core capabilities. Experimental results demonstrate that, across a range of linguistic and reasoning tasks, DiscoGP extracts sheaves that preserve 93%-100% of a model’s performance on the identified task while comprising only 1%-7% of the original weights and connections. Furthermore, our analysis reveals that, compared to previously identified LM circuits, the sheaves discovered by DiscoGP exhibit superior modularity and functional fidelity. Extending our method to the neuron level also unveils novel insights into the inner workings of LLMs

[205] CiteFusion: An Ensemble Framework for Citation Intent Classification Harnessing Dual-Model Binary Couples and SHAP Analyses

Lorenzo Paolini, Sahar Vahdati, Angelo Di Iorio, Robert Wardenga, Ivan Heibi, Silvio Peroni

Main category: cs.CL

TL;DR: CiteFusion is an ensemble framework for citation intent classification that combines SciBERT and XLNet models with a meta-classifier, achieving state-of-the-art performance on SciCite and ACL-ARC datasets while providing interpretability through SHAP analysis.

Details

Motivation: Understanding scholarly citation motivations is crucial for evaluating research impact and promoting transparent scholarly communication, requiring accurate citation intent classification.

Method: Uses one-vs-all decomposition of multi-class task into binary subtasks, pairs SciBERT and XLNet models independently tuned for each citation intent, aggregates outputs via feedforward neural network meta-classifier, employs SHAP for interpretability, and incorporates section titles as structural context.

Result: Achieves state-of-the-art Macro-F1 scores of 89.60% on SciCite and 76.24% on ACL-ARC, demonstrates robust performance in imbalanced and data-scarce scenarios, and shows positive impact of section titles on classification accuracy.

Conclusion: CiteFusion provides an effective ensemble approach for citation intent classification with strong performance, interpretability features, and practical applications through a released web-based tool.

Abstract: Understanding the motivations underlying scholarly citations is essential to evaluate research impact and promote transparent scholarly communication. This study introduces CiteFusion, an ensemble framework designed to address the multi-class Citation Intent Classification task on two benchmark datasets: SciCite and ACL-ARC. The framework employs a one-vs-all decomposition of the multi-class task into class-specific binary subtasks, leveraging complementary pairs of SciBERT and XLNet models, independently tuned, for each citation intent. The outputs of these base models are aggregated through a feedforward neural network meta-classifier to reconstruct the original classification task. To enhance interpretability, SHAP (SHapley Additive exPlanations) is employed to analyze token-level contributions, and interactions among base models, providing transparency into the classification dynamics of CiteFusion, and insights about the kind of misclassifications of the ensemble. In addition, this work investigates the semantic role of structural context by incorporating section titles, as framing devices, into input sentences, assessing their positive impact on classification accuracy. CiteFusion ultimately demonstrates robust performance in imbalanced and data-scarce scenarios: experimental results show that CiteFusion achieves state-of-the-art performance, with Macro-F1 scores of 89.60% on SciCite, and 76.24% on ACL-ARC. Furthermore, to ensure interoperability and reusability, citation intents from both datasets schemas are mapped to Citation Typing Ontology (CiTO) object properties, highlighting some overlaps. Finally, we describe and release a web-based application that classifies citation intents leveraging the CiteFusion models developed on SciCite.

[206] LLM-3D Print: Large Language Models To Monitor and Control 3D Printing

Yayati Jadhav, Peter Pak, Amir Barati Farimani

Main category: cs.CL

TL;DR: A framework using pre-trained LLMs to monitor and control FDM 3D printing processes, detecting defects through image analysis and autonomously executing corrective actions without human intervention.

Details

Motivation: Traditional material extrusion techniques are prone to errors requiring expert intervention, while existing automated methods lack generalizability across different printer setups and require extensive labeled datasets.

Method: Leveraging pre-trained LLMs to analyze images captured after each print layer/segment, identify failure modes, query printer parameters, and generate/execute corrective action plans.

Result: LLM-based agents accurately identified common 3D printing errors (inconsistent extrusion, stringing, warping, layer adhesion), determined causal parameters, and autonomously corrected them without human intervention.

Conclusion: The proposed framework effectively addresses the limitations of existing automated error detection methods and demonstrates the viability of LLMs for autonomous process monitoring and control in additive manufacturing.

Abstract: Industry 4.0 has revolutionized manufacturing by driving digitalization and shifting the paradigm toward additive manufacturing (AM). Fused Deposition Modeling (FDM), a key AM technology, enables the creation of highly customized, cost-effective products with minimal material waste through layer-by-layer extrusion, posing a significant challenge to traditional subtractive methods. However, the susceptibility of material extrusion techniques to errors often requires expert intervention to detect and mitigate defects that can severely compromise product quality. While automated error detection and machine learning models exist, their generalizability across diverse 3D printer setups, firmware, and sensors is limited, and deep learning methods require extensive labeled datasets, hindering scalability and adaptability. To address these challenges, we present a process monitoring and control framework that leverages pre-trained Large Language Models (LLMs) alongside 3D printers to detect and address printing defects. The LLM evaluates print quality by analyzing images captured after each layer or print segment, identifying failure modes and querying the printer for relevant parameters. It then generates and executes a corrective action plan. We validated the effectiveness of the proposed framework in identifying defects by comparing it against a control group of engineers with diverse AM expertise. Our evaluation demonstrated that LLM-based agents not only accurately identify common 3D printing errors, such as inconsistent extrusion, stringing, warping, and layer adhesion, but also effectively determine the parameters causing these failures and autonomously correct them without any need for human intervention.

[207] Parse Trees Guided LLM Prompt Compression

Wenhao Mao, Chengbin Hou, Tianyu Zhang, Xinyu Lin, Ke Tang, Hairong Lv

Main category: cs.CL

TL;DR: PartPrompt is a novel selective prompt compression method that uses linguistic parse trees and global hierarchical structure to compress LLM prompts while maintaining performance.

Details

Motivation: Existing prompt compression methods have limitations - generative methods suffer from hallucination, while selective methods overlook linguistic rules and global structure, leading to suboptimal compression.

Method: PartPrompt parses sentences into parse trees, calculates local information entropy for nodes, organizes them into a global hierarchical tree, applies root-ward and leaf-ward propagation to adjust node values, and uses recursive pruning based on adjusted values.

Result: PartPrompt achieves state-of-the-art performance across various datasets, metrics, compression ratios, and target LLMs, with superior coherence and effectiveness in extreme long prompt scenarios.

Conclusion: The proposed PartPrompt method effectively addresses limitations of existing compression approaches by incorporating linguistic rules and global structure, demonstrating superior performance in prompt compression tasks.

Abstract: Offering rich contexts to Large Language Models (LLMs) has shown to boost the performance in various tasks, but the resulting longer prompt would increase the computational cost and might exceed the input limit of LLMs. Recently, some prompt compression methods have been suggested to shorten the length of prompts by using language models to generate shorter prompts or by developing computational models to select important parts of original prompt. The generative compression methods would suffer from issues like hallucination, while the selective compression methods have not involved linguistic rules and overlook the global structure of prompt. To this end, we propose a novel selective compression method called PartPrompt. It first obtains a parse tree for each sentence based on linguistic rules, and calculates local information entropy for each node in a parse tree. These local parse trees are then organized into a global tree according to the hierarchical structure such as the dependency of sentences, paragraphs, and sections. After that, the root-ward propagation and leaf-ward propagation are proposed to adjust node values over the global tree. Finally, a recursive algorithm is developed to prune the global tree based on the adjusted node values. The experiments show that PartPrompt receives the state-of-the-art performance across various datasets, metrics, compression ratios, and target LLMs for inference. The in-depth ablation studies confirm the effectiveness of designs in PartPrompt, and other additional experiments also demonstrate its superiority in terms of the coherence of compressed prompts and in the extreme long prompt scenario.

[208] Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization

Kaden Uhlig, Joern Wuebker, Raphael Reinauer, John DeNero

Main category: cs.CL

TL;DR: Task-alignment algorithms like RLHF and DPO can address task-data mismatch in neural machine translation, improving multilingual models even when applied to only a subset of languages.

Details

Motivation: To address the existing task-data mismatch in neural machine translation by applying task-alignment techniques that repurpose foundational models for specific tasks.

Method: Introduce Direct Quality Optimization (DQO), a variant of DPO that uses a pre-trained translation quality estimation model as a proxy for human preferences.

Result: Task-alignment improves performance across all languages of a multilingual model, even when only applied to a subset of languages, verified by both automatic metrics and human evaluation.

Conclusion: Task-alignment techniques effectively address task-data mismatch in NMT and provide consistent improvements across multilingual models.

Abstract: Reinforcement Learning from Human Feedback (RLHF) and derivative techniques like Direct Preference Optimization (DPO) are task-alignment algorithms used to repurpose general, foundational models for specific tasks. We show that applying task-alignment to neural machine translation (NMT) addresses an existing task–data mismatch in NMT, leading to improvements across all languages of a multilingual model, even when task-alignment is only applied to a subset of those languages. We do so by introducing Direct Quality Optimization (DQO), a variant of DPO leveraging a pre-trained translation quality estimation model as a proxy for human preferences, and verify the improvements with both automatic metrics and human evaluation.

[209] LLMs Are In-Context Bandit Reinforcement Learners

Giovanni Monea, Antoine Bosselut, Kianté Brantley, Yoav Artzi

Main category: cs.CL

TL;DR: LLMs can perform in-context reinforcement learning (ICRL) by learning online from external rewards rather than supervised data, showing effective learning across various classification tasks and model sizes from 500M to 70B parameters.

Details

Motivation: To investigate whether LLMs can learn through reinforcement learning in-context, moving beyond traditional supervised in-context learning to handle external rewards and online learning scenarios.

Method: Used contextual bandit version of in-context reinforcement learning, experimenting with challenging classification tasks across LLMs of varying sizes (500M to 70B parameters), addressing process instability and testing with both semantic and abstract labels.

Result: LLMs effectively demonstrate in-context reinforcement learning capabilities, showing scaling trends with model size, but also reveal fundamental limitations in their implicit reasoning about errors.

Conclusion: LLMs possess significant ICRL capabilities that scale with model size, but their error reasoning remains a fundamental limitation that needs addressing.

Abstract: Large Language Models (LLMs) excel at in-context learning (ICL), a supervised learning technique that relies on adding annotated examples to the model context. We investigate a contextual bandit version of in-context reinforcement learning (ICRL), where models learn in-context, online, from external reward, instead of supervised data. We show that LLMs effectively demonstrate such learning, and provide a detailed study of the phenomena, experimenting with challenging classification tasks and models of sizes from 500M to 70B parameters. This includes identifying and addressing the instability of the process, demonstrating learning with both semantic and abstract labels, and showing scaling trends. Our findings highlight ICRL capabilities in LLMs, while also underscoring fundamental limitations in their implicit reasoning about errors.

[210] AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment

Jiazheng Li, Artem Bobrov, Runcong Zhao, Cesare Aloisi, Yulan He

Main category: cs.CL

TL;DR: AERA Chat is an interactive visualization platform that uses multiple LLMs to score student answers and generate explanatory rationales, with tools for educators to annotate tasks and researchers to evaluate rationale quality.

Details

Motivation: Current automated student answer scoring systems lack reliable explainability due to scarce annotated data and costly manual verification, leading to reliance on noisy LLM-generated rationales.

Method: Developed AERA Chat platform that leverages multiple LLMs concurrently for scoring and rationale generation, with visualization features highlighting critical answer components and justification elements.

Result: The platform effectively facilitates robust rationale evaluation and comparative analysis across multiple rationale-generation methods on several datasets.

Conclusion: AERA Chat addresses the explainability gap in automated student assessment by providing interactive visualization and evaluation tools that enhance trust and usability for educators and researchers.

Abstract: Explainability in automated student answer scoring systems is critical for building trust and enhancing usability among educators. Yet, generating high-quality assessment rationales remains challenging due to the scarcity of annotated data and the prohibitive cost of manual verification, prompting heavy reliance on rationales produced by large language models (LLMs), which are often noisy and unreliable. To address these limitations, we present AERA Chat, an interactive visualization platform designed for automated explainable student answer assessment. AERA Chat leverages multiple LLMs to concurrently score student answers and generate explanatory rationales, offering innovative visualization features that highlight critical answer components and rationale justifications. The platform also incorporates intuitive annotation and evaluation tools, supporting educators in marking tasks and researchers in evaluating rationale quality from different models. We demonstrate the effectiveness of our platform through evaluations of multiple rationale-generation methods on several datasets, showcasing its capability for facilitating robust rationale evaluation and comparative analysis.

[211] When Speculation Spills Secrets: Side Channels via Speculative Decoding In LLMs

Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar

Main category: cs.CL

TL;DR: This paper reveals a side-channel attack on speculative decoding in LLMs where input-dependent patterns of correct/incorrect speculations can be monitored through token counts or packet sizes, allowing query fingerprinting and data leakage.

Details

Motivation: To expose security vulnerabilities in speculative decoding techniques used by deployed LLMs, which are designed to improve throughput and latency but create observable patterns that can be exploited.

Method: The researchers demonstrate attacks by monitoring per-iteration token counts and packet sizes to infer speculation patterns, testing across four speculative-decoding schemes (REST, LADE, BiLD, EAGLE) and evaluating in research prototypes and vLLM framework.

Result: Attackers can fingerprint user queries with >90% accuracy across schemes (REST 100%, LADE up to 92%, BiLD up to 95%, EAGLE up to 77.6%) and leak confidential datastore contents at rates exceeding 25 tokens/sec.

Conclusion: The paper proposes mitigations including packet padding and iteration-wise token aggregation to defend against these side-channel attacks in speculative decoding implementations.

Abstract: Deployed large language models (LLMs) often rely on speculative decoding, a technique that generates and verifies multiple candidate tokens in parallel, to improve throughput and latency. In this work, we reveal a new side-channel whereby input-dependent patterns of correct and incorrect speculations can be inferred by monitoring per-iteration token counts or packet sizes.We demonstrate that an adversary observing these patterns can fingerprint user queries with >90% accuracy across four speculative-decoding schemes, REST (100%), LADE (up to 92%), BiLD (up to 95%), and EAGLE (up to 77.6%) and leak confidential datastore contents used for prediction at rates exceeding 25 tokens/sec. We evaluate the side-channel attacks in both research prototypes as well as the production-grade vLLM serving framework. To defend against these, we propose and evaluate a suite of mitigations, including packet padding and iteration-wise token aggregation.

[212] UniHR: Hierarchical Representation Learning for Unified Knowledge Graph Link Prediction

Zhiqiang Liu, Yin Hua, Mingyang Chen, Yichi Zhang, Zhuo Chen, Lei Liang, Huajun Chen, Wen Zhang

Main category: cs.CL

TL;DR: UniHR is a unified hierarchical framework that handles multiple complex fact types in knowledge graphs (hyper-relational, temporal, nested) through hierarchical data representation and structure learning modules.

Details

Motivation: Real-world knowledge graphs contain diverse complex facts beyond standard triples, but existing methods focus on specific types and struggle with hierarchical modeling across different fact types.

Method: Proposes UniHR with two modules: HiDR unifies different KG types into triple-based representations, and HiSL performs intra-fact and inter-fact message passing to enhance semantic and structural information.

Result: Extensive experiments on 9 datasets across 5 KG types demonstrate UniHR’s effectiveness and the strong potential of unified representations for complex scenarios.

Conclusion: UniHR successfully addresses limitations of specialized approaches and shows unified hierarchical representation learning is effective for handling diverse complex facts in real-world knowledge graphs.

Abstract: Real-world knowledge graphs (KGs) contain not only standard triple-based facts, but also more complex, heterogeneous types of facts, such as hyper-relational facts with auxiliary key-value pairs, temporal facts with additional timestamps, and nested facts that imply relationships between facts. These richer forms of representation have attracted significant attention due to their enhanced expressiveness and capacity to model complex semantics in real-world scenarios. However, most existing studies suffer from two main limitations: (1) they typically focus on modeling only specific types of facts, thus making it difficult to generalize to real-world scenarios with multiple fact types; and (2) they struggle to achieve generalizable hierarchical (inter-fact and intra-fact) modeling due to the complexity of these representations. To overcome these limitations, we propose UniHR, a Unified Hierarchical Representation learning framework, which consists of a learning-optimized Hierarchical Data Representation (HiDR) module and a unified Hierarchical Structure Learning (HiSL) module. The HiDR module unifies hyper-relational KGs, temporal KGs, and nested factual KGs into triple-based representations. Then HiSL incorporates intra-fact and inter-fact message passing, focusing on enhancing both semantic information within individual facts and enriching the structural information between facts. To go beyond the unified method itself, we further explore the potential of unified representation in complex real-world scenarios, including joint modeling of multi-task, compositional and hybrid facts. Extensive experiments on 9 datasets across 5 types of KGs demonstrate the effectiveness of UniHR and highlight the strong potential of unified representations.

[213] Adapting Chat Language Models Using Only Target Unlabeled Language Data

Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras

Main category: cs.CL

TL;DR: ElChat is a new language adaptation method for chat LLMs that directly adapts chat models on target unlabeled data without needing a base model, by injecting information from the source chat model to preserve chat abilities.

Details

Motivation: Vocabulary expansion on target unlabeled data causes chat models to forget their chat abilities, and obtaining target chat data is costly or unavailable for low-resource languages.

Method: ElChat directly adapts chat models on target unlabeled data by injecting information from the source chat model to elicit chat abilities, eliminating the need for a base model.

Result: ElChat achieves more robust target language and safety performance while maintaining superior English, chat, and instruction-following abilities compared to the chat vector method.

Conclusion: ElChat provides an effective direct adaptation approach for chat LLMs that preserves chat capabilities while enabling language adaptation without requiring target chat data.

Abstract: Vocabulary expansion (VE) is the de-facto approach to language adaptation of large language models (LLMs) by adding new tokens and continuing pre-training on target data. While this is effective for base models trained on unlabeled data, it poses challenges for chat models trained to follow instructions through labeled conversation data. Directly adapting the latter with VE on target unlabeled data may result in forgetting chat abilities. While ideal, target chat data is often unavailable or costly to create for low-resource languages, and machine-translated alternatives are not always effective. To address this issue, previous work proposed using a base and chat model from the same family. This method first adapts the base LLM with VE on target unlabeled data and then converts it to a chat model by adding a chat vector (CV) derived from the weight difference between the source base and chat models. We propose ElChat, a new language adaptation method for chat LLMs that adapts a chat model directly on target unlabeled data, without a base model. It elicits chat abilities by injecting information from the source chat model. ElChat offers more robust and competitive target language and safety performance while achieving superior English, chat, and instruction-following abilities compared to CV.

[214] LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, Danqi Chen

Main category: cs.CL

TL;DR: LongProc is a new benchmark for evaluating long-context language models that tests both information integration from dispersed sources and long-form generation (up to 8K tokens) across six procedural tasks, revealing significant limitations in current models despite their claimed large context windows.

Details

Motivation: Existing benchmarks focus mainly on long-context recall with short responses, but real-world applications require both information integration from dispersed sources and long-form generation, which current benchmarks don't adequately test.

Method: Created LongProc benchmark with six diverse procedural generation tasks that require following detailed instructions, synthesizing dispersed information, and generating structured long-form outputs. Evaluated 23 LCLMs at three difficulty levels (500, 2K, and 8K output tokens) using rule-based evaluation.

Result: Open-weight models typically fail on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Reasoning models perform better overall due to long CoT training. Models struggle with long-range coherence in long-form generations.

Conclusion: Current LCLMs have critical limitations in handling long-form generation tasks despite their claimed large context windows, indicating substantial room for improvement in long-context capabilities.

Abstract: Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluated 23 LCLMs, including instruction-tuned models and recent reasoning models, on LongProc at three difficulty levels, with the maximum number of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Reasoning models achieve stronger overall performance in long-form generation, benefiting from long CoT training. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc.

[215] A Partition Cover Approach to Tokenization

Jia Peng Lim, Shawn Tan, Davin Choo, Hady W. Lauw

Main category: cs.CL

TL;DR: GreedTok is a new tokenization algorithm that formulates tokenization as an optimization problem, shows it’s NP-hard, and provides a polynomial-time greedy solution that outperforms BPE and Unigram on compression.

Details

Motivation: Current tokenization methods like Byte-Pair Encoding (BPE) treat tokenization as compression but lack formal optimization formulation. The authors aim to establish tokenization as a proper optimization problem with theoretical guarantees.

Method: Formulate tokenization as an optimization objective, prove NP-hardness via reduction from vertex cover, and propose GreedTok - a polynomial-time greedy algorithm that relaxes to the weighted maximum coverage problem with (1-1/e)-approximation guarantee.

Result: GreedTok outperforms BPE and Unigram on compression metrics, achieves comparable covering scores to GreedWMC, and when used for pre-training 1B parameter transformers, achieves lower bits per byte than BPE even when controlling for dataset size or training tokens.

Conclusion: Tokenization can be effectively formulated as an optimization problem, and GreedTok provides a theoretically grounded alternative to BPE with superior compression performance in practical language model training.

Abstract: Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, our extensive pre-training for two transformer-based language models with 1 billion parameters, comparing the choices of BPE and GreedTok as the tokenizer, shows that GreedTok achieves a lower bit per byte even when we control for either the total dataset proportion or total training tokens.

[216] A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models

Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Hao Chen, Yilin Xiao, Chuang Zhou, Junnan Dong, Yi Chang, Xiao Huang

Main category: cs.CL

TL;DR: This survey introduces GraphRAG, a graph-based approach to Retrieval-Augmented Generation that addresses limitations of traditional RAG systems by using graph-structured knowledge representation and efficient retrieval techniques for domain-specific LLM applications.

Details

Motivation: Traditional RAG systems based on flat text retrieval face challenges in complex query understanding, knowledge integration across distributed sources, and efficiency bottlenecks at scale, limiting their effectiveness in specialized domains requiring deep expertise.

Method: GraphRAG addresses these limitations through three key innovations: graph-structured knowledge representation capturing entity relationships and domain hierarchies, efficient graph-based retrieval techniques enabling context-preserving knowledge retrieval with multi-hop reasoning, and structure-aware knowledge integration algorithms.

Result: The survey systematically analyzes GraphRAG’s technical foundations and examines current implementations across various professional domains, identifying key technical challenges and promising research directions.

Conclusion: GraphRAG represents a new paradigm that revolutionizes domain-specific LLM applications by overcoming traditional RAG limitations through structured knowledge representation and advanced retrieval techniques, with collected resources available for the research community.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks, yet their application to specialized domains remains challenging due to the need for deep expertise. Retrieval-Augmented generation (RAG) has emerged as a promising solution to customize LLMs for professional fields by seamlessly integrating external knowledge bases, enabling real-time access to domain-specific expertise during inference. Despite its potential, traditional RAG systems, based on flat text retrieval, face three critical challenges: (i) complex query understanding in professional contexts, (ii) difficulties in knowledge integration across distributed sources, and (iii) system efficiency bottlenecks at scale. This survey presents a systematic analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new paradigm that revolutionizes domain-specific LLM applications. GraphRAG addresses traditional RAG limitations through three key innovations: (i) graph-structured knowledge representation that explicitly captures entity relationships and domain hierarchies, (ii) efficient graph-based retrieval techniques that enable context-preserving knowledge retrieval with multihop reasoning ability, and (iii) structure-aware knowledge integration algorithms that leverage retrieved knowledge for accurate and logical coherent generation of LLMs. In this survey, we systematically analyze the technical foundations of GraphRAG and examine current implementations across various professional domains, identifying key technical challenges and promising research directions. All the related resources of GraphRAG, including research papers, open-source data, and projects, are collected for the community in https://github.com/DEEP-PolyU/Awesome-GraphRAG.

[217] ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis

Keane Ong, Rui Mao, Deeksha Varshney, Frank Xing, Ranjan Satapathy, Johan Sulaeman, Erik Cambria, Gianmarco Mengaldo

Main category: cs.CL

TL;DR: ESGSenticNet is a knowledge base for sustainability analysis that addresses challenges in processing corporate sustainability data through a neurosymbolic framework, outperforming baselines in capturing ESG-related information.

Details

Motivation: Corporate sustainability evaluation is hindered by data complexity and ineffective NLP tools, with challenges including immateriality, complexity, and subjectivity in sustainability disclosures.

Method: A neurosymbolic framework integrating specialized concept parsing, GPT-4o inference, semi-supervised label propagation, and hierarchical taxonomy to create a structured knowledge base of 44k knowledge triplets.

Result: ESGSenticNet outperforms state-of-the-art baselines by 26% on ESG relatedness and 31% on ESG action orientation, capturing more unique ESG topic terms without requiring training.

Conclusion: ESGSenticNet effectively extracts relevant sustainability information from disclosures as a simple lexical method, making it accessible for non-technical stakeholders.

Abstract: Evaluating corporate sustainability performance is essential to drive sustainable business practices, amid the need for a more sustainable economy. However, this is hindered by the complexity and volume of corporate sustainability data (i.e. sustainability disclosures), not least by the effectiveness of the NLP tools used to analyse them. To this end, we identify three primary challenges - immateriality, complexity, and subjectivity, that exacerbate the difficulty of extracting insights from sustainability disclosures. To address these issues, we introduce ESGSenticNet, a publicly available knowledge base for sustainability analysis. ESGSenticNet is constructed from a neurosymbolic framework that integrates specialised concept parsing, GPT-4o inference, and semi-supervised label propagation, together with a hierarchical taxonomy. This approach culminates in a structured knowledge base of 44k knowledge triplets - (‘halve carbon emission’, supports, ’emissions control’), for effective sustainability analysis. Experiments indicate that ESGSenticNet, when deployed as a lexical method, more effectively captures relevant and actionable sustainability information from sustainability disclosures compared to state of the art baselines. Besides capturing a high number of unique ESG topic terms, ESGSenticNet outperforms baselines on the ESG relatedness and ESG action orientation of these terms by 26% and 31% respectively. These metrics describe the extent to which topic terms are related to ESG, and depict an action toward ESG. Moreover, when deployed as a lexical method, ESGSenticNet does not require any training, possessing a key advantage in its simplicity for non-technical stakeholders.

[218] Beyond checkmate: exploring the creative chokepoints in AI text

Nafis Irtiza Tripto, Saranya Venkatraman, Mahjabin Nahar, Dongwon Lee

Main category: cs.CL

TL;DR: This paper analyzes differences between human and AI-generated text across different segments (introduction, body, conclusion), finding that the body segment is most informative for detection despite AI text closely resembling human writing there.

Details

Motivation: To investigate nuanced distinctions between human and AI texts across different text segments, informing LLMs' viability as creative assistants and enabling more effective detection strategies.

Method: Using a chess game analogy (opening, middle, end games), the study analyzes segment-specific patterns in human and AI texts to reveal where the most striking differences lie.

Result: AI texts closely resemble human writing in the body segment due to its length, but deeper analysis shows higher divergence in features dependent on continuous language flow, making the body segment most informative for detection. Human texts exhibit greater stylistic variation across segments.

Conclusion: The findings provide fresh insights into human-AI text differences and pave the way for more effective and interpretable detection strategies, with the body segment being particularly revealing despite superficial similarities.

Abstract: The rapid advancement of Large Language Models (LLMs) has revolutionized text generation but also raised concerns about potential misuse, making detecting LLM-generated text (AI text) increasingly essential. While prior work has focused on identifying AI text and effectively checkmating it, our study investigates a less-explored territory: portraying the nuanced distinctions between human and AI texts across text segments (introduction, body, and conclusion). Whether LLMs excel or falter in incorporating linguistic ingenuity across text segments, the results will critically inform their viability and boundaries as effective creative assistants to humans. Through an analogy with the structure of chess games, comprising opening, middle, and end games, we analyze segment-specific patterns to reveal where the most striking differences lie. Although AI texts closely resemble human writing in the body segment due to its length, deeper analysis shows a higher divergence in features dependent on the continuous flow of language, making it the most informative segment for detection. Additionally, human texts exhibit greater stylistic variation across segments, offering a new lens for distinguishing them from AI. Overall, our findings provide fresh insights into human-AI text differences and pave the way for more effective and interpretable detection strategies. Codes available at https://github.com/tripto03/chess_inspired_human_ai_text_distinction.

[219] Which Words Matter Most in Zero-Shot Prompts?

Nikta Gohari Sadr, Sangmitra Madhusudan, Hassan Sajjad, Ali Emami

Main category: cs.CL

TL;DR: ZIP score is introduced to quantify individual word importance in instructional prompts through controlled perturbations, revealing task-specific word hierarchies, model differences, noun dominance, and inverse correlation with model performance.

Details

Motivation: To understand which specific words drive the effectiveness of zero-shot instructional prompts like "Let's think step-by-step" in Large Language Models.

Method: ZIP score (Zero-shot Importance of Perturbation) uses controlled perturbations including synonym replacement, co-hyponym substitution, and strategic removal to quantify word importance across four models, seven prompts, and multiple task domains.

Result: Found four key patterns: task-specific word hierarchies, proprietary models align better with human intuitions, nouns dominate importance rankings, and word importance inversely correlates with model performance. ZIP achieved 90% accuracy on validation benchmark versus LIME’s 60%.

Conclusion: The study advances prompt science by providing practical insights for prompt engineering and theoretical understanding of word-level effects in LLMs, establishing the first ground-truth benchmark for prompt interpretability.

Abstract: While zero-shot instructional prompts like “Let’s think step-by-step” have revolutionized Large Language Model performance, a fundamental question remains unanswered: which specific words drive their remarkable effectiveness? We introduce the ZIP score (Zero-shot Importance of Perturbation), the first systematic method to quantify individual word importance in instructional prompts through controlled perturbations including synonym replacement, co-hyponym substitution, and strategic removal. Our analysis across four flagship models, seven widely-adopted prompts, and multiple task domains reveals four key findings: (1) Task-specific word hierarchies exist where mathematical problems prioritize “step-by-step” while reasoning tasks favor “think”; (2) Proprietary models show superior alignment with human intuitions compared to open-source alternatives; (3) Nouns dominate importance rankings, consistently representing the majority of significant words; and (4) Word importance inversely correlates with model performance, indicating prompts have greatest impact where models struggle most. Beyond revealing these patterns, we establish the first ground-truth benchmark for prompt interpretability through 20 validation prompts with predetermined key words, where ZIP achieves 90% accuracy versus LIME’s 60%. Our findings advance prompt science, the study of how language shapes model behavior, providing both practical insights for prompt engineering and theoretical understanding of word-level effects in LLMs.

[220] UltraIF: Advancing Instruction Following from the Wild

Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang

Main category: cs.CL

TL;DR: UltraIF is a method that aligns base LLMs to follow complex instructions by decomposing prompts, training a composer, and using evaluation questions for filtering, achieving performance comparable to instruct models.

Details

Motivation: To bridge the performance gap between open-source LLMs and leading proprietary models in following complex instructions, using only open-source data.

Method: Decomposes user prompts into simpler queries and constraints, trains UltraComposer to compose constraint-associated prompts with evaluation questions, and synthesizes complicated instructions while filtering responses.

Result: Successfully aligned LLaMA-3.1-8B-Base to match its instruct version on 5 benchmarks without benchmark information, using only 8B models. Also improved LLaMA-3.1-8B-Instruct through self-alignment.

Conclusion: UltraIF provides an effective approach for instruction-following alignment using open-source data, with potential for broader applications in model improvement.

Abstract: Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, for that there are huge gaps between models trained by open-source community and those trained by leading companies. To bridge the gap, we propose a simple and scalable approach UltraIF for building LLMs that can follow complex instructions with open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses with evaluation questions. In our experiment, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only 8B model as response generator and evaluator. The aligned model also achieved competitive scores on other benchmarks. Moreover, we also show that UltraIF could further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code is available at https://github.com/kkk-an/UltraIF.

[221] Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection

Minseok Jung, Cynthia Fuertes Panizo, Liam Dugan, Yi R., Fung, Pin-Yu Chen, Paul Pu Liang

Main category: cs.CL

TL;DR: FairOPT is a group-specific threshold optimization algorithm for AI-text detectors that addresses bias in fixed global thresholds by learning optimal thresholds for different subgroups based on attributes like text length and writing style.

Details

Motivation: Fixed global thresholds in AI-text detectors cause distributional bias across subgroups, leading to disproportionate misclassifications (e.g., more false positives on short human-written text and neurotic writing styles).

Method: Partition data into subgroups based on attributes (text length, writing style), then implement FairOPT to learn optimal decision thresholds for each group to reduce classification discrepancy.

Result: FairOPT significantly reduced discrepancy across 9 detectors and 3 datasets, decreasing overall discrepancy by 27.4% across 5 metrics while only sacrificing 0.005% accuracy.

Conclusion: The framework enables more robust AI-generated content detection through post-processing optimization of group-specific thresholds, mitigating bias while maintaining accuracy.

Abstract: The advancement of large language models (LLMs) has made it difficult to differentiate human-written text from AI-generated text. Several AI-text detectors have been developed in response, which typically utilize a fixed global threshold (e.g., $\theta = 0.5$) to classify machine-generated text. However, one universal threshold could fail to account for distributional variations by subgroups. For example, when using a fixed threshold, detectors make more false positive errors on shorter human-written text, and more positive classifications of neurotic writing styles among long texts. These discrepancies can lead to misclassifications that disproportionately affect certain groups. We address this critical limitation by introducing FairOPT, an algorithm for group-specific threshold optimization for probabilistic AI-text detectors. We partitioned data into subgroups based on attributes (e.g., text length and writing style) and implemented FairOPT to learn decision thresholds for each group to reduce discrepancy. FairOPT showed notable discrepancy mitigation across nine detectors and three heterogeneous datasets, and the remarkable mitigation of the minimax problem by decreasing overall discrepancy 27.4% across five metrics while minimally sacrificing accuracy by 0.005%. Our framework paves the way for more robust classification in AI-generated content detection via post-processing. We release our data, code, and project information at URL.

[222] Confidence Improves Self-Consistency in LLMs

Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, Gal Yona

Main category: cs.CL

TL;DR: CISC improves self-consistency decoding by using confidence scores to weight reasoning paths, reducing required sample size by over 40% while maintaining or improving performance.

Details

Motivation: Self-consistency decoding is computationally expensive due to the need for sampling many lengthy reasoning paths to find the most frequent correct answer.

Method: Confidence-Informed Self-Consistency (CISC) performs weighted majority voting based on model-generated confidence scores, prioritizing high-confidence paths to reduce sample requirements.

Result: CISC outperforms self-consistency in nearly all configurations across nine models and four datasets, reducing required reasoning paths by over 40% on average.

Conclusion: LLMs can effectively judge their own outputs’ correctness, and within-question confidence evaluation is crucial for distinguishing correct answers, with the most calibrated confidence method proving least effective for CISC.

Abstract: Self-consistency decoding enhances LLMs’ performance on reasoning tasks by sampling diverse reasoning paths and selecting the most frequent answer. However, it is computationally expensive, as sampling many of these (lengthy) paths is required to increase the chances that the correct answer emerges as the most frequent one. To address this, we introduce Confidence-Informed Self-Consistency (CISC). CISC performs a weighted majority vote based on confidence scores obtained directly from the model. By prioritizing high-confidence paths, it can identify the correct answer with a significantly smaller sample size. When tested on nine models and four datasets, CISC outperforms self-consistency in nearly all configurations, reducing the required number of reasoning paths by over 40% on average. In addition, we introduce the notion of within-question confidence evaluation, after showing that standard evaluation methods are poor predictors of success in distinguishing correct and incorrect answers to the same question. In fact, the most calibrated confidence method proved to be the least effective for CISC. Lastly, beyond these practical implications, our results and analyses show that LLMs can effectively judge the correctness of their own outputs, contributing to the ongoing debate on this topic.

[223] PAFT: Prompt-Agnostic Fine-Tuning

Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu

Main category: cs.CL

TL;DR: PAFT is a fine-tuning method that improves LLM robustness to prompt variations by using dynamic prompt generation during training, achieving better generalization and faster inference.

Details

Motivation: Standard fine-tuning causes LLMs to overfit to specific prompt wording, making them sensitive to minor phrasing changes that drastically reduce performance.

Method: PAFT generates diverse synthetic prompts and continuously samples from them during training, forcing models to learn fundamental task principles rather than surface-level patterns.

Result: PAFT achieves 7% higher generalization accuracy on unseen prompts, superior performance on QA, math reasoning, and tool use benchmarks, and 3.2x faster inference speeds due to reduced prompt sensitivity.

Conclusion: PAFT effectively enhances LLM robustness and cross-domain generalization while improving overall performance and inference efficiency.

Abstract: Fine-tuning large language models (LLMs) often causes overfitting to specific prompt wording, where minor phrasing variations drastically reduce performance. To address this, we propose Prompt-Agnostic Fine-Tuning (PAFT), a method that enhances robustness through dynamic prompt variation during training. PAFT first generates diverse synthetic prompts, then continuously samples from this set to construct training instances, forcing models to learn fundamental task principles rather than surface-level patterns. Across systematic evaluations using both supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT), PAFT demonstrates substantially improved prompt robustness, achieving 7% higher generalization accuracy on unseen prompts than standard methods. In addition to enhanced robustness, PAFT consistently yields superior overall performance on established benchmarks for question answering, mathematical reasoning, and tool use. Notably, models trained with PAFT attain 3.2 faster inference speeds due to reduced prompt sensitivity. Ablation studies further validate effectiveness of PAFT, while theoretical analysis reveals that PAFT can effectively enhance the cross-domain generalization ability of LLM.

[224] B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg

Main category: cs.CL

TL;DR: B-cos LMs transform pre-trained language models into explainable models through B-cos conversion and fine-tuning, producing more faithful and interpretable explanations than post-hoc methods while maintaining comparable task performance.

Details

Motivation: Post-hoc explanation methods for black-box models often lack faithfulness and human interpretability, while existing B-cos networks have been limited to computer vision applications.

Method: Directly transform pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods.

Result: B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods while maintaining task performance comparable to conventional fine-tuning.

Conclusion: B-cos LMs successfully extend explainable architectures to NLP tasks, providing better explanations while preserving performance, with potential applications to decoder-only models for generation tasks.

Abstract: Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos language models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models to B-cos LMs for generation tasks.

[225] PropXplain: Can LLMs Enable Explainable Propaganda Detection?

Maram Hasanain, Md Arid Hasan, Mohamed Bayan Kmainasi, Elisa Sartori, Ali Ezzat Shahroor, Giovanni Da San Martino, Firoj Alam

Main category: cs.CL

TL;DR: Proposed a multilingual explanation-enhanced dataset for propaganda detection and an LLM that generates both labels and explanations.

Details

Motivation: Most propaganda detection research focuses only on detection without providing explanations for predictions, due to lack of datasets with explanations.

Method: Created first multilingual (Arabic/English) explanation-enhanced dataset and developed an explanation-enhanced LLM for both detection and explanation generation.

Result: The model performs comparably to detection-only models while also generating explanations for predictions.

Conclusion: Successfully addressed the explanation gap in propaganda detection and will release dataset/resources publicly to advance research.

Abstract: There has been significant research on propagandistic content detection across different modalities and languages. However, most studies have primarily focused on detection, with little attention given to explanations justifying the predicted label. This is largely due to the lack of resources that provide explanations alongside annotated labels. To address this issue, we propose a multilingual (i.e., Arabic and English) explanation-enhanced dataset, the first of its kind. Additionally, we introduce an explanation-enhanced LLM for both label detection and rationale-based explanation generation. Our findings indicate that the model performs comparably while also generating explanations. We will make the dataset and experimental resources publicly available for the research community (https://github.com/firojalam/PropXplain).

[226] MemeIntel: Explainable Detection of Propagandistic and Hateful Memes

Mohamed Bayan Kmainasi, Abul Hasnat, Md Arid Hasan, Ali Ezzat Shahroor, Firoj Alam

Main category: cs.CL

TL;DR: MemeXplain is a new dataset and method for detecting propagandistic and hateful memes with explanation generation, using multi-stage optimization to improve both classification accuracy and rationale quality.

Details

Motivation: Current methods for detecting harmful multimodal content on social media focus on classification but neglect explanation generation, which degrades performance when both tasks are trained together.

Method: Created MemeXplain dataset for Arabic propagandistic memes and English hateful memes, and proposed a multi-stage optimization approach to train Vision-Language Models for joint label detection and explanation generation.

Result: The approach significantly improved both tasks, achieving ~1.4% absolute accuracy improvement on ArMeme and ~2.2% on Hateful Memes compared to state-of-the-art methods.

Conclusion: Multi-stage optimization effectively addresses the challenge of joint training for detection and explanation, and the MemeXplain dataset provides a valuable resource for future multimodal content analysis research.

Abstract: The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to jointly modeling label detection and the generation of explanation-based rationales, which often leads to degraded classification performance when trained simultaneously. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propagandistic memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results show that this strategy significantly improves both label detection and explanation generation quality over the base model, outperforming the current state-of-the-art with an absolute improvement of ~1.4% (Acc) on ArMeme and ~2.2% (Acc) on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available (https://github.com/MohamedBayan/MemeIntel).

[227] Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time

Jiazheng Li, Yuxiang Zhou, Junru Lu, Gladys Tyen, Lin Gui, Cesare Aloisi, Yulan He

Main category: cs.CL

TL;DR: DARS is a dual-model reflective scoring framework that uses contrastive reflection synthesis to generate precise verbal feedback for automated student answer scoring, outperforming existing baselines.

Details

Motivation: Preference optimization methods in LLMs lack transparency in reasoning outcomes, which is critical for explainable assessment in Automated Student Answer Scoring (ASAS). Existing methods produce superficial critiques and struggle with detecting subtle reasoning errors.

Method: A contrastive reflection synthesis pipeline generates precise verbal feedback by identifying discrepancies in structure reasoning graph paths. DARS framework uses synthetic reflection data with a dedicated Critic model trained for effective reflection.

Result: DARS achieves strong performance and consistently outperforms existing ASAS baselines across all evaluation metrics. Extensive experiments provide insights into reflection data value, framework design, and scaling behavior.

Conclusion: The DARS framework successfully addresses transparency limitations in ASAS through contrastive reflection synthesis and dual-model architecture, demonstrating superior performance and providing valuable insights for explainable assessment systems.

Abstract: Although preference optimization methods have improved reasoning performance in Large Language Models (LLMs), they often lack transparency regarding why one reasoning outcome is preferred over another. This limitation is especially critical in Automated Student Answer Scoring (ASAS), where explainability is essential to justify assessment outcomes. Verbal reinforcement learning offers the potential to generate explicit reflection, but it tends to produce superficial critiques that can harm assessment performance. Existing LLMs also struggle to reliably detect subtle reasoning errors in ASAS tasks. Moreover, manually identifying intermediate reasoning errors is expensive and difficult to scale. To address these challenges, we introduce a contrastive reflection synthesis pipeline that generates precise verbal feedback by identifying discrepancies in structure reasoning graph paths. Leveraging these synthetic reflection data, we propose DARS, a Dual-model Reflective Scoring framework featuring a dedicated Critic model trained for effective reflection. DARS achieves strong performance and consistently outperforms existing ASAS baselines across all evaluation metrics. Extensive experiments further provide novel insights into the value of reflection data, framework design, and the scaling behavior of DARS. We release the DARS code at https://github.com/lijiazheng99/DARS.

[228] How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation

Ruohao Guo, Wei Xu, Alan Ritter

Main category: cs.CL

TL;DR: EchoMist is the first benchmark for evaluating LLMs on implicit misinformation where false premises are embedded in queries, revealing that current models perform poorly and often fail to detect these subtle false assumptions.

Details

Motivation: Current LLM safety evaluations focus on explicit false statements, but real-world misinformation often manifests subtly as unchallenged premises, creating a critical safety gap that needs assessment.

Method: Created EchoMist benchmark with implicit misinformation scenarios from diverse sources including human-AI conversations and social media, then tested 15 state-of-the-art LLMs and evaluated two mitigation methods: Self-Alert and RAG.

Result: Current LLMs perform alarmingly poorly on implicit misinformation detection, often failing to identify false premises and generating counterfactual explanations. The mitigation methods showed EchoMist remains a persistent challenge.

Conclusion: Implicit misinformation poses a critical safety risk for LLMs that current models cannot adequately handle, underscoring the urgent need for better safeguards against this subtle form of misinformation.

Abstract: As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world interactions. We curated EchoMist, the first comprehensive benchmark for implicit misinformation, where false assumptions are embedded in the query to LLMs. EchoMist targets circulated, harmful, and ever-evolving implicit misinformation from diverse sources, including realistic human-AI conversations and social media interactions. Through extensive empirical studies on 15 state-of-the-art LLMs, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating counterfactual explanations. We also investigate two mitigation methods, i.e., Self-Alert and RAG, to enhance LLMs’ capability to counter implicit misinformation. Our findings indicate that EchoMist remains a persistent challenge and underscore the critical need to safeguard against the risk of implicit misinformation.

[229] Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Chen Li, Nazhou Liu, Kai Yang

Main category: cs.CL

TL;DR: Proposes Adaptive Group Policy Optimization (AGPO) to address deficiencies in GRPO like zero-variance in advantage estimation, using adaptive loss function for more stable training and better efficiency.

Details

Motivation: GRPO has become core for training Reasoning LLMs but suffers from issues like zero-variance in advantage estimation that affect RL stability and inference efficiency.

Method: AGPO uses an adaptive loss function to mitigate training fluctuation and token inefficiency in reasoning steps.

Result: Experiments show AGPO achieves more stable training and superior performance with significantly fewer tokens in reasoning steps.

Conclusion: AGPO effectively addresses GRPO’s deficiencies, providing more stable training and better efficiency for Reasoning LLMs.

Abstract: Since DeepSeek-R1 popularized, Group Relative Policy Optimization (GRPO) has become the core part of training Reasoning LLMs. However, we find some deficiency that influences RL stability and inference efficiency, like zero-variance in advantage estimation. Thus, we propose Adaptive Group Policy Optimization (AGPO) which uses a simple but effective method, an adaptive loss function, to mitigate training fluctuation and token inefficiency. The experiments demonstrate our method achieves more stable training and superior performance with significantly fewer tokens in reasoning steps.

[230] SUV: Scalable Large Language Model Copyright Compliance with Regularized Selective Unlearning

Tianyang Xu, Xiaoze Liu, Feijie Wu, Xiaoqian Wang, Jing Gao

Main category: cs.CL

TL;DR: SUV is a selective unlearning framework that prevents LLMs from memorizing copyrighted content while preserving utility, using DPO with gradient projection and Fisher regularization.

Details

Motivation: To address legal concerns about LLMs unintentionally generating copyrighted content and prevent copyright infringement lawsuits.

Method: Constructs dataset of copyright infringement cases, uses Direct Preference Optimization to replace verbatim content with alternatives, and integrates gradient projection and Fisher information regularization to preserve performance.

Result: Significantly reduces verbatim memorization of copyrighted books with negligible impact on unrelated task performance, validated on 500 famous books and public benchmarks.

Conclusion: SUV offers a scalable and effective solution for mitigating copyright risks in real-world LLM applications while maintaining model utility.

Abstract: Large Language Models (LLMs) have transformed natural language processing by learning from massive datasets, yet this rapid progress has also drawn legal scrutiny, as the ability to unintentionally generate copyrighted content has already prompted several prominent lawsuits. In this work, we introduce SUV (Selective Unlearning for Verbatim data), a selective unlearning framework designed to prevent LLM from memorizing copyrighted content while preserving its overall utility. In detail, the proposed method constructs a dataset that captures instances of copyrighted infringement cases by the targeted LLM. With the dataset, we unlearn the content from the LLM by means of Direct Preference Optimization (DPO), which replaces the verbatim copyrighted content with plausible and coherent alternatives. Since DPO may hinder the LLM’s performance in other unrelated tasks, we integrate gradient projection and Fisher information regularization to mitigate the degradation. We validate our approach using a large-scale dataset of 500 famous books (predominantly copyrighted works) and demonstrate that SUV significantly reduces verbatim memorization with negligible impact on the performance on unrelated tasks. Extensive experiments on both our dataset and public benchmarks confirm the scalability and efficacy of our approach, offering a promising solution for mitigating copyright risks in real-world LLM applications.

[231] XL-Suite: Cross-Lingual Synthetic Training and Evaluation Data for Open-Ended Generation

Vivek Iyer, Pinzhen Chen, Ricardo Rei, Alexandra Birch

Main category: cs.CL

TL;DR: XL-Instruct is a novel technique for generating synthetic data that significantly improves cross-lingual generation capabilities of LLMs, with fine-tuning on just 8K instructions boosting win rates against GPT-4o-Mini from 7.4% to 21.5%.

Details

Motivation: Cross-lingual open-ended generation (responding in a different language than the query) is an important yet understudied problem that needs better evaluation methods and training approaches.

Method: Proposes XL-Instruct for generating high-quality synthetic data and introduces XL-AlpacaEval benchmark for evaluating cross-lingual generation. Fine-tunes LLMs with 8K XL-Instruct generated instructions.

Result: Fine-tuning with XL-Instruct significantly improves model performance, increasing win rate against GPT-4o-Mini from 7.4% to 21.5%. Also shows strong zero-shot improvements to question answering in the same language.

Conclusion: XL-Instruct shows promising role in post-training of multilingual LLMs. XL-Suite (training and evaluation data) is publicly released to facilitate cross-lingual open-ended generation research.

Abstract: Cross-lingual open-ended generation - responding in a language different from that of the query - is an important yet understudied problem. This work proposes XL-Instruct, a novel technique for generating high-quality synthetic data, and introduces XL-AlpacaEval, a new benchmark for evaluating cross-lingual generation capabilities of large language models (LLMs). Our experiments show that fine-tuning with just 8K instructions generated using XL-Instruct significantly improves model performance, increasing the win rate against GPT-4o-Mini from 7.4% to 21.5% and improving on several fine-grained quality metrics. Moreover, base LLMs fine-tuned on XL-Instruct exhibit strong zero-shot improvements to question answering in the same language, as shown on our machine-translated m-AlpacaEval. These consistent gains highlight the promising role of XL-Instruct in the post-training of multilingual LLMs. Finally, we publicly release XL-Suite, a collection of training and evaluation data to facilitate research in cross-lingual open-ended generation.

[232] SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching

Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri

Main category: cs.CL

TL;DR: SentenceKV is a sentence-level semantic KV caching approach that groups tokens by semantic similarity, compresses them into semantic vectors on GPU, and offloads individual KV pairs to CPU, enabling efficient inference with reduced memory overhead while maintaining accuracy.

Details

Motivation: Traditional token-level KV caching methods ignore semantic relationships between tokens, while existing semantic-preserving approaches suffer from high memory usage and slow time-to-first-token. There's a need for efficient KV cache management that preserves semantic coherence.

Method: During prefilling, tokens are grouped by sentence-level semantic similarity and compressed into semantic vectors stored on GPU, while individual KV pairs are offloaded to CPU. During decoding, tokens are generated by selectively retrieving semantically relevant sentence-level KV entries using semantic similarity between prefilling vectors and decoding queries.

Result: Extensive evaluations on PG-19, LongBench, and Needle-In-A-Haystack benchmarks show SentenceKV significantly outperforms state-of-the-art methods in both efficiency and memory usage without compromising model accuracy.

Conclusion: SentenceKV provides an effective solution for efficient long-context inference by leveraging sentence-level semantic grouping and selective retrieval, achieving substantial memory reduction while maintaining stable inference latency and model accuracy.

Abstract: Large language models face significant computational and memory challenges when processing long contexts. During inference, efficient management of the key-value (KV) cache, which stores intermediate activations for autoregressive generation, is critical to reducing memory overhead and improving computational efficiency. Traditional token-level efficient KV caching methods overlook semantic information, treating tokens independently without considering their semantic relationships. Meanwhile, existing semantic-preserving KV cache management approaches often suffer from substantial memory usage and high time-to-first-token. To address these limitations, we propose SentenceKV, a novel sentence-level semantic KV caching approach designed to enhance inference efficiency while preserving semantic coherence. During prefilling, SentenceKV groups tokens based on sentence-level semantic similarity, compressing sentence representations into concise semantic vectors stored directly on the GPU, while individual KV pairs are offloaded to CPU. During decoding, SentenceKV generates tokens by selectively retrieving semantically relevant sentence-level KV entries, leveraging the semantic similarity between the prefilling-stage semantic vectors and decoding-stage queries. This ensures efficient and contextually accurate predictions, minimizing the loading of redundant or irrelevant data into GPU memory and significantly reducing memory overhead while maintaining stable inference latency, even for extremely long contexts. Extensive evaluations on benchmarks including PG-19, LongBench, and Needle-In-A-Haystack demonstrate that SentenceKV significantly outperforms state-of-the-art methods in both efficiency and memory usage, without compromising model accuracy.

[233] AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs

Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu, Baosheng Yu, Hua Jin, Bo Du, Jing Zhang

Main category: cs.CL

TL;DR: AnesSuite is the first comprehensive dataset suite for anesthesiology reasoning in LLMs, featuring AnesBench evaluation benchmark and training datasets. Morpheus baseline models show significant performance improvements with limited training.

Details

Motivation: LLMs' reasoning capabilities in specialized medical domains like anesthesiology remain underexplored, creating a gap that needs to be addressed for reliable medical applications.

Method: Created AnesSuite with AnesBench evaluation benchmark (three reasoning levels) and training datasets for CPT, SFT, and RLVR. Developed Morpheus baseline models using SFT and GRPO training.

Result: Morpheus demonstrates substantial performance improvements rivaling larger-scale models, despite limited training. Comprehensive analysis reveals key factors influencing anesthesiology reasoning performance.

Conclusion: AnesSuite and Morpheus provide foundational infrastructure for advancing anesthesiology reasoning in LLMs, with open-source availability to support further research and development in this specialized medical domain.

Abstract: The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Leveraging this suite, we develop Morpheus, the first baseline model collection for anesthesiology reasoning. Despite undergoing limited training with SFT and group relative policy optimization (GRPO), Morpheus demonstrates substantial performance improvements, rivaling the performance of larger-scale models. Furthermore, through comprehensive evaluations and experiments, we analyze the key factors influencing anesthesiology reasoning performance, including model characteristics, training strategies and training data. Both AnesSuite and Morpheus will be open-sourced at https://github.com/MiliLab/AnesSuite.

[234] A Practical Synthesis of Detecting AI-Generated Textual, Visual, and Audio Content

Lele Cao

Main category: cs.CL

TL;DR: Survey of AI-generated content detection methods covering text, visual, and audio modalities, discussing motivations, techniques, and challenges in preserving content authenticity.

Details

Motivation: Address critical concerns about misinformation, copyright infringement, security threats, and erosion of public trust caused by advances in AI-generated content.

Method: Comprehensive survey of detection techniques including observation-based strategies, linguistic/statistical analysis, model-based pipelines, watermarking/fingerprinting, and ensemble approaches with human-in-the-loop verification.

Result: Provides state-of-the-art research overview and case studies across academic, journalistic, legal, and industrial contexts to inform robust solutions and policymaking.

Conclusion: Identifies open challenges including adversarial transformations, domain generalization, and ethical concerns, offering holistic guidance for researchers, practitioners, and regulators to preserve content authenticity.

Abstract: Advances in AI-generated content have led to wide adoption of large language models, diffusion-based visual generators, and synthetic audio tools. However, these developments raise critical concerns about misinformation, copyright infringement, security threats, and the erosion of public trust. In this paper, we explore an extensive range of methods designed to detect and mitigate AI-generated textual, visual, and audio content. We begin by discussing motivations and potential impacts associated with AI-based content generation, including real-world risks and ethical dilemmas. We then outline detection techniques spanning observation-based strategies, linguistic and statistical analysis, model-based pipelines, watermarking and fingerprinting, as well as emergent ensemble approaches. We also present new perspectives on robustness, adaptation to rapidly improving generative architectures, and the critical role of human-in-the-loop verification. By surveying state-of-the-art research and highlighting case studies in academic, journalistic, legal, and industrial contexts, this paper aims to inform robust solutions and policymaking. We conclude by discussing open challenges, including adversarial transformations, domain generalization, and ethical concerns, thereby offering a holistic guide for researchers, practitioners, and regulators to preserve content authenticity in the face of increasingly sophisticated AI-generated media.

[235] Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs

Will Cai, Tianneng Shi, Xuandong Zhao, Dawn Song

Main category: cs.CL

TL;DR: Commercial LLM APIs have a trust problem where providers may substitute cheaper models while charging for premium ones. Current detection methods are unreliable, but hardware-level security using Trusted Execution Environments (TEEs) offers provable cryptographic guarantees with modest performance overhead.

Details

Motivation: Users pay for specific LLM models but have no guarantee that providers actually deliver them faithfully. Providers may covertly substitute cheaper alternatives (quantized versions, smaller models) to reduce costs while maintaining advertised pricing.

Method: The paper formalizes the model substitution problem and systematically evaluates detection methods under adversarial conditions. It examines software-only methods (statistical tests on text outputs and log probabilities) and proposes hardware-level security using Trusted Execution Environments (TEEs) as a robust solution.

Result: Software-only detection methods are fundamentally unreliable: statistical tests on text outputs are query-intensive and fail against subtle substitutions, while methods using log probabilities are defeated by inherent inference nondeterminism in production environments. TEEs can provide provable cryptographic guarantees of model integrity with only modest performance overhead.

Conclusion: Hardware-level security using Trusted Execution Environments offers a practical and robust solution to the LLM API trust problem, providing provable cryptographic guarantees that ensure users get what they pay for, with only modest performance costs.

Abstract: Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. Providers may covertly substitute cheaper alternatives (e.g., quantized versions, smaller models) to reduce costs while maintaining advertised pricing. We formalize this model substitution problem and systematically evaluate detection methods under realistic adversarial conditions. Our empirical analysis reveals that software-only methods are fundamentally unreliable: statistical tests on text outputs are query-intensive and fail against subtle substitutions, while methods using log probabilities are defeated by inherent inference nondeterminism in production environments. We argue that this verification gap can be more effectively closed with hardware-level security. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution. Our findings demonstrate that TEEs can provide provable cryptographic guarantees of model integrity with only a modest performance overhead, offering a clear and actionable path to ensure users get what they pay for. Code is available at https://github.com/sunblaze-ucb/llm-api-audit

[236] DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis

Zhengxuan Zhang, Zhuowen Liang, Yin Wu, Teng Lin, Yuyu Luo, Nan Tang

Main category: cs.CL

TL;DR: The paper proposes a shift from black-box LLM analysis to structured, multi-agent workflows for transparent and verifiable data analysis.

Details

Motivation: Current LLM applications treat models as opaque oracles, producing brittle and unverifiable results that conceal reasoning processes and can be misleading.

Method: Proposes DataPuzzle - a multi-agent framework that decomposes complex questions, structures information into interpretable forms (tables, graphs), and coordinates specialized agent roles for extraction, translation, and linkage tasks.

Result: A conceptual framework blueprint that transforms LLMs from monolithic answer generators into collaborative components within transparent workflows.

Conclusion: Structure is essential for building trustworthy, auditable analytic systems with LLMs - transforming opaque answers into traceable processes and brittle fluency into accountable insight.

Abstract: Large language models (LLMs) are increasingly applied to multi-modal data analysis – not necessarily because they offer the most precise answers, but because they provide fluent, flexible interfaces for interpreting complex inputs. Yet this fluency often conceals a deeper structural failure: the prevailing ``Prompt-to-Answer’’ paradigm treats LLMs as black-box analysts, collapsing evidence, reasoning, and conclusions into a single, opaque response. The result is brittle, unverifiable, and frequently misleading. We argue for a fundamental shift: from generation to structured extraction, from monolithic prompts to modular, agent-based workflows. LLMs should not serve as oracles, but as collaborators – specialized in tasks like extraction, translation, and linkage – embedded within transparent workflows that enable step-by-step reasoning and verification. We propose DataPuzzle, a conceptual multi-agent framework that decomposes complex questions, structures information into interpretable forms (e.g. tables, graphs), and coordinates agent roles to support transparent and verifiable analysis. This framework serves as an aspirational blueprint for restoring visibility and control in LLM-driven analytics – transforming opaque answers into traceable processes, and brittle fluency into accountable insight. This is not a marginal refinement; it is a call to reimagine how we build trustworthy, auditable analytic systems in the era of large language models. Structure is not a constraint – it is the path to clarity.

[237] Efficient Reasoning Models: A Survey

Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang

Main category: cs.CL

TL;DR: This survey provides a comprehensive overview of efficient reasoning methods for Chain-of-Thought models, categorizing approaches into three directions: shorter reasoning chains, smaller models, and faster decoding strategies.

Details

Motivation: Reasoning models generate lengthy Chain-of-Thoughts which cause substantial computational overhead, creating an urgent need for effective acceleration methods.

Method: Categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise reasoning chains, (2) smaller - developing compact language models through knowledge distillation, model compression, and reinforcement learning, and (3) faster - designing efficient decoding strategies.

Result: Provides a systematic framework for understanding and implementing efficient reasoning methods, with a curated collection of papers available in a GitHub repository.

Conclusion: The survey organizes the landscape of efficient reasoning research into three clear directions to address computational overhead in reasoning models while maintaining performance.

Abstract: Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this “slow-thinking” paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead. To this end, it highlights an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and reinforcement learning; and (3) faster - designing efficient decoding strategies to accelerate inference of reasoning models. A curated collection of papers discussed in this survey is available in our GitHub repository: https://github.com/fscdc/Awesome-Efficient-Reasoning-Models.

[238] IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property

Qiyao Wang, Guhong Chen, Hongbo Wang, Huaren Liu, Minghui Zhu, Zhifei Qin, Linwei Li, Yilin Yue, Shiqiang Wang, Jiayan Li, Yihang Wu, Ziqiang Liu, Longze Chen, Run Luo, Liyang Fan, Jiaming Li, Lei Zhang, Kan Xu, Chengming Li, Hamid Alinejad-Rokny, Shiwen Ni, Yuan Lin, Min Yang

Main category: cs.CL

TL;DR: IPBench is the first comprehensive IP task taxonomy and large-scale bilingual benchmark with 8 IP mechanisms and 20 tasks to evaluate LLMs in real-world IP scenarios, showing current models have significant room for improvement.

Details

Motivation: Existing IP datasets and benchmarks are narrow in scope (focusing mainly on patents) and lack alignment with real-world scenarios, creating a gap in evaluating LLMs for comprehensive IP tasks.

Method: Created IPBench benchmark with 8 IP mechanisms and 20 distinct tasks, then benchmarked 17 LLMs including general-purpose and domain-specific models under zero-shot, few-shot, and chain-of-thought settings.

Result: Top-performing model DeepSeek-V3 achieved only 75.8% accuracy, showing significant room for improvement. Open-source IP and law-oriented models lag behind closed-source general-purpose models.

Conclusion: IPBench addresses the gap in comprehensive IP evaluation and will be expanded with additional tasks to better reflect real-world complexities and support model advancements in the IP domain.

Abstract: Intellectual Property (IP) is a highly specialized domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. Recent advancements in LLMs have demonstrated their potential to handle IP-related tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce IPBench, the first comprehensive IP task taxonomy and a large-scale bilingual benchmark encompassing 8 IP mechanisms and 20 distinct tasks, designed to evaluate LLMs in real-world IP scenarios. We benchmark 17 main LLMs, ranging from general purpose to domain-specific, including chat-oriented and reasoning-focused models, under zero-shot, few-shot, and chain-of-thought settings. Our results show that even the top-performing model, DeepSeek-V3, achieves only 75.8% accuracy, indicating significant room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. To foster future research, we publicly release IPBench, and will expand it with additional tasks to better reflect real-world complexities and support model advancements in the IP domain. We provide the data and code in the supplementary URLs.

[239] Dynamic Early Exit in Reasoning Models

Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, Weiping Wang

Main category: cs.CL

TL;DR: A method for LLMs to self-truncate chain-of-thought sequences by early exit during generation, reducing reasoning length by 19.1-80.1% while improving accuracy by 0.3-5.0%.

Details

Motivation: Overthinking in long chain-of-thought generation slows down problem solving and risks accuracy loss due to redundant reasoning steps.

Method: Monitors model behavior at reasoning transition points and dynamically terminates generation when the model shows high confidence in a trial answer, requiring no additional training.

Result: Consistently effective across 11 reasoning LLMs on 10 benchmarks, reducing CoT length by 19.1-80.1% while improving accuracy by 0.3-5.0%.

Conclusion: The proposed self-truncation method is simple, effective, and seamlessly integrates with existing reasoning LLMs to improve efficiency and accuracy.

Abstract: Recent advances in large reasoning language models (LRLMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down the efficiency of problem solving, but also risks accuracy loss due to the extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points and dynamically terminates the next reasoning chain’s generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on 10 reasoning benchmarks (e.g., GSM8K, MATH-500, AMC, GPQA, AIME and LiveCodeBench) show that the proposed method is consistently effective on 11 cutting-edge reasoning LLMs of varying series and sizes, reducing the length of CoT sequences by an average of 19.1% to 80.1% while improving accuracy by 0.3% to 5.0%.

[240] TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation

Gwen Yidou Weng, Benjie Wang, Guy Van den Broeck

Main category: cs.CL

TL;DR: TRACE is a novel framework for controllable text generation that uses tractable probabilistic reasoning to efficiently compute expected attribute probabilities, enabling flexible control over global text properties without expensive retraining.

Details

Motivation: Current methods for controlling language model outputs either require expensive post-training for each new attribute or use slow/unreliable sampling approaches, especially for rare attributes. There's a need for flexible, efficient control over global text properties.

Method: TRACE distills a Hidden Markov Model (HMM) from the language model and pairs it with a small classifier to estimate attribute probabilities. This enables exact computation of Expected Attribute Probability (EAP) over predicted futures, which is used to reweight the LM’s next-token probabilities.

Result: Achieves state-of-the-art detoxification with only 20% decoding overhead, enables 76 low-resource personalized LMs within seconds, and seamlessly extends to composite attributes.

Conclusion: TRACE provides an efficient and flexible framework for controllable text generation that adapts to new attributes through lightweight control mechanisms, overcoming limitations of existing approaches.

Abstract: As large language models (LMs) advance, there is an increasing need to control their outputs to align with human values (e.g., detoxification) or desired attributes (e.g., personalization, topic). However, autoregressive models focus on next-token predictions and struggle with global properties that require looking ahead. Existing solutions either post-train LMs for each new attribute–expensive and inflexible–or approximate the Expected Attribute Probability (EAP) of future sequences by sampling or training, which is slow and unreliable for rare attributes. We introduce TRACE (Tractable Probabilistic Reasoning for Adaptable Controllable gEneration), a novel framework that efficiently computes EAP and adapts to new attributes through tractable probabilistic reasoning and lightweight control. TRACE distills a Hidden Markov Model (HMM) from an LM and pairs it with a small classifier to estimate attribute probabilities, enabling exact EAP computation over the HMM’s predicted futures. This EAP is then used to reweigh the LM’s next-token probabilities for globally compliant continuations. Empirically, TRACE achieves state-of-the-art detoxification results with only 20% decoding overhead, yields 76 low-resource personalized LMs within seconds, and seamlessly extends to composite attributes. Our code is available at: https://github.com/yidouweng/trace.

[241] Cooking Up Creativity: Enhancing LLM Creativity through Structured Recombination

Moran Mizrahi, Chen Shani, Gabriel Stanovsky, Dan Jurafsky, Dafna Shahaf

Main category: cs.CL

TL;DR: The paper introduces DishCOVER, a novel approach that enhances LLM creativity by translating between natural language and structured representations, then performing cognitively inspired manipulations on these representations to generate creative recipes.

Details

Motivation: LLMs struggle to produce truly creative and diverse ideas despite excelling at many tasks. The authors aim to go beyond superficial token-level variations and enable more abstract exploration of idea landscapes.

Method: The approach uses LLMs to translate between natural language and structured representations, then performs cognitively inspired manipulations on these structured representations to recombine existing ideas in creative ways.

Result: Experiments and domain-expert evaluations show that DishCOVER generates outputs that are mostly coherent and feasible, and significantly surpass GPT-4o in terms of novelty and diversity for creative recipe generation.

Conclusion: The work demonstrates the effectiveness of structured creativity approaches in AI and hopes to inspire further research in this direction.

Abstract: Large Language Models (LLMs) excel at many tasks, yet they struggle to produce truly creative, diverse ideas. In this paper, we introduce a novel approach that enhances LLM creativity. We apply LLMs for translating between natural language and structured representations, and perform the core creative leap via cognitively inspired manipulations on these representations. Our notion of creativity goes beyond superficial token-level variations; rather, we recombine structured representations of existing ideas, enabling our system to effectively explore a more abstract landscape of ideas. We demonstrate our approach in the culinary domain with DishCOVER, a model that generates creative recipes. Experiments and domain-expert evaluations reveal that our outputs, which are mostly coherent and feasible, significantly surpass GPT-4o in terms of novelty and diversity, thus outperforming it in creative generation. We hope our work inspires further research into structured creativity in AI.

[242] $\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge

Core Francisco Park, Zechen Zhang, Hidenori Tanaka

Main category: cs.CL

TL;DR: The paper introduces New News dataset to study the gap between fine-tuning and in-context learning for knowledge integration, proposes System-2 Fine-tuning with Self-QA protocol to bridge this gap, and discovers contextual shadowing effect.

Details

Motivation: To address the challenge of adequately integrating new information into model weights via fine-tuning, as current methods show substantial performance gap compared to in-context learning.

Method: Created New News dataset with hypothetical news across multiple domains; proposed System-2 Fine-tuning using self-play data generation protocols (paraphrases, implications, Self-QA) to distill knowledge into model weights.

Result: Self-QA protocol of Sys2-FT significantly improves in-weight learning while preserving general capabilities; discovered contextual shadowing effect where training with news context followed by rephrases/QAs degrades learning; found preliminary evidence of scaling law for Sys2-FT.

Conclusion: System-2 Fine-tuning with Self-QA protocol effectively bridges the FT-ICL gap for knowledge integration, though careful attention is needed to avoid contextual shadowing effects during training.

Abstract: Humans and intelligent animals can internalize new information and accurately internalize their implications to perform downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the information (news) is explicitly given as context, adequately integrating the information into model weights via fine-tuning remains challenging. In this paper, we introduce New News, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. First, we demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our dataset. To address this gap, we explore a suite of self-play data generation protocols – paraphrases, implications, and Self-QA – designed to distill the knowledge processed by the model with context into the weights of the model, which we term System-2 Fine-tuning (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the Self-QA protocol of Sys2-FT significantly improves models’ in-weight learning of the news while preserving general capabilities. Furthermore, we discover the contextual shadowing effect, where training with the news in context followed by its rephrases or QAs catastrophically degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.

[243] References Indeed Matter? Reference-Free Preference Optimization for Conversational Query Reformulation

Doyoung Kim, Youngjun Lee, Joeun Kim, Jihwan Bang, Hwanjun Song, Susik Yoon, Jae-Gil Lee

Main category: cs.CL

TL;DR: DualReform is a reference-free preference optimization framework for conversational query reformulation that generates pseudo reference passages from conversational datasets without needing actual reference passages.

Details

Motivation: Existing CQR approaches rely on reference passages for optimization, which are impractical to acquire in real-world scenarios where only queries and responses are available.

Method: Uses two innovations: (1) response-based inference where responses serve as proxies to infer pseudo reference passages, and (2) response refinement via the dual-role of CQR where a CQR model refines responses based on shared objectives between response refinement and CQR.

Result: Achieves 96.9-99.1% of the retrieval accuracy attainable only with reference passages and surpasses state-of-the-art method by up to 31.6%.

Conclusion: DualReform provides an effective reference-free approach for conversational query reformulation that performs nearly as well as methods requiring reference passages while being more practical for real-world deployment.

Abstract: Conversational query reformulation (CQR) has become indispensable for improving retrieval in dialogue-based applications. However, existing approaches typically rely on reference passages for optimization, which are impractical to acquire in real-world scenarios. To address this limitation, we introduce a novel reference-free preference optimization framework DualReform that generates pseudo reference passages from commonly-encountered conversational datasets containing only queries and responses. DualReform attains this goal through two key innovations: (1) response-based inference, where responses serve as proxies to infer pseudo reference passages, and (2) response refinement via the dual-role of CQR, where a CQR model refines responses based on the shared objectives between response refinement and CQR. Despite not relying on reference passages, DualReform achieves 96.9–99.1% of the retrieval accuracy attainable only with reference passages and surpasses the state-of-the-art method by up to 31.6%.

[244] OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit

Arun S. Maiya

Main category: cs.CL

TL;DR: OnPrem.LLM is a Python toolkit for using large language models with sensitive data in offline/restricted environments, supporting multiple LLM backends and providing privacy-preserving pipelines for document processing, RAG, and other NLP tasks.

Details

Motivation: To enable the application of LLMs to sensitive, non-public data while maintaining privacy and data control, particularly in offline or restricted environments where cloud-based solutions are not suitable.

Method: Provides a Python-based toolkit with prebuilt pipelines for document processing, RAG, information extraction, summarization, and classification. Supports multiple LLM backends (llama.cpp, Ollama, vLLM, Hugging Face) with quantized models, GPU acceleration, and backend switching. Includes a no-code web interface for non-technical users.

Result: A comprehensive toolkit that enables local execution of LLMs while maintaining data privacy, with support for hybrid deployments when cloud integration is permitted.

Conclusion: OnPrem.LLM successfully addresses the need for privacy-preserving LLM applications in restricted environments, offering flexibility through multiple backend support and accessibility through both programming interfaces and no-code web interface.

Abstract: We present OnPrem$.$LLM, a Python-based toolkit for applying large language models (LLMs) to sensitive, non-public data in offline or restricted environments. The system is designed for privacy-preserving use cases and provides prebuilt pipelines for document processing and storage, retrieval-augmented generation (RAG), information extraction, summarization, classification, and prompt/output processing with minimal configuration. OnPrem$.$LLM supports multiple LLM backends – including llama$.$cpp, Ollama, vLLM, and Hugging Face Transformers – with quantized model support, GPU acceleration, and seamless backend switching. Although designed for fully local execution, OnPrem$.$LLM also supports integration with a wide range of cloud LLM providers when permitted, enabling hybrid deployments that balance performance with data control. A no-code web interface extends accessibility to non-technical users.

[245] VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

Xin Liu, Lechen Zhang, Sheza Munir, Yiyang Gu, Lu Wang

Main category: cs.CL

TL;DR: VeriFact is a framework that improves factuality evaluation of LLM responses by better extracting and verifying facts, while FactRBench is a new benchmark that assesses both precision and recall in long-form answers.

Details

Motivation: Current factuality evaluation methods for LLMs struggle with complex inter-sentence dependencies and often miss key relational facts, leading to incomplete verification.

Method: VeriFact enhances fact extraction by identifying and resolving incomplete/missing facts, and FactRBench provides reference fact sets from advanced LLMs and human answers for recall assessment.

Result: VeriFact significantly improves fact completeness and preserves complex relational facts, leading to more accurate factuality evaluation. Larger models show better precision and recall, but high precision doesn’t always correlate with high recall.

Conclusion: Comprehensive factuality assessment requires evaluating both precision and recall, as they don’t always correlate, and VeriFact provides more accurate evaluation by addressing fact extraction limitations.

Abstract: Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench , a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and close-weight LLMs on FactRBench indicate that larger models within same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.

[246] Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao

Main category: cs.CL

TL;DR: AutoThink enables large reasoning models to dynamically decide when to use explicit step-by-step reasoning versus direct answers based on problem complexity, achieving better accuracy-efficiency trade-offs.

Details

Motivation: Large reasoning models often generate unnecessary detailed reasoning for simple problems, causing computational overhead and latency. The goal is to equip them with adaptive thinking capabilities to avoid over-thinking.

Method: Proposes AutoThink, a multi-stage reinforcement learning framework that learns when to invoke explicit reasoning. It builds on R1-style distilled models and uses reward shaping to optimize reasoning policies.

Result: Experiments on five mathematical benchmarks show AutoThink achieves 6.4% relative accuracy improvement while reducing token usage by 52% on DeepSeek-R1-Distill-Qwen-1.5B, outperforming recent prompting and RL-based pruning methods.

Conclusion: AutoThink establishes a scalable and adaptive reasoning paradigm that can be seamlessly integrated into R1-style models, providing favorable accuracy-efficiency trade-offs by invoking reasoning only when necessary.

Abstract: Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("…") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs. Project Page: https://github.com/ScienceOne-AI/AutoThink.

[247] The Counting Power of Transformers

Marco Sälzer, Chris Köcher, Alexander Kozachinskiy, Georg Zetzsche, Anthony Widjaja Lin

Main category: cs.CL

TL;DR: Transformers can express highly nonlinear counting properties beyond just linear inequalities, specifically capturing all semialgebraic counting properties expressible as boolean combinations of multivariate polynomials.

Details

Motivation: To formally investigate the counting power of transformers beyond existing results that only demonstrate expressivity for (semi-)linear counting properties.

Method: Developed a formal framework for analyzing transformers’ counting capabilities and proved that transformers can capture all semialgebraic counting properties through theoretical analysis.

Result: Transformers can express counting properties that are boolean combinations of arbitrary multivariate polynomials of any degree, generalizing beyond linear counting properties captured by C-RASP softmax transformers.

Conclusion: Transformers have much stronger counting capabilities than previously known, capable of expressing highly nonlinear counting properties, which also leads to new undecidability results for simple transformer models without positional encodings or masking.

Abstract: Counting properties (e.g. determining whether certain tokens occur more than other tokens in a given input text) have played a significant role in the study of expressiveness of transformers. In this paper, we provide a formal framework for investigating the counting power of transformers. We argue that all existing results demonstrate transformers’ expressivity only for (semi-)linear counting properties, i.e., which are expressible as a boolean combination of linear inequalities. Our main result is that transformers can express counting properties that are highly nonlinear. More precisely, we prove that transformers can capture all semialgebraic counting properties, i.e., expressible as a boolean combination of arbitrary multivariate polynomials (of any degree). Among others, these generalize the counting properties that can be captured by C-RASP softmax transformers, which capture only linear counting properties. To complement this result, we exhibit a natural subclass of (softmax) transformers that completely characterizes semialgebraic counting properties. Through connections with the Hilbert’s tenth problem, this expressivity of transformers also yields a new undecidability result for analyzing an extremely simple transformer model – surprisingly with neither positional encodings (i.e. NoPE-transformers) nor masking. We also experimentally validate trainability of such counting properties.

[248] Critique-Guided Distillation for Efficient and Robust Language Model Reasoning

Berkcan Kapusuzoglu, Supriyo Chakraborty, Chia-Hsuan Lee, Sambit Sahu

Main category: cs.CL

TL;DR: Critique-Guided Distillation (CGD) is a multi-stage training framework that enhances supervised fine-tuning by incorporating teacher-generated critiques and refined responses, improving reasoning capabilities while maintaining general instruction-following and factual accuracy.

Details

Motivation: To address the imitation problem in supervised fine-tuning where models reproduce correct responses without internalizing the underlying reasoning, and to provide a more efficient alternative to RL-based methods.

Method: Multi-stage training framework that augments SFT with teacher-generated explanatory critiques and refined responses. Students learn to map the triplet of prompt, initial response, and teacher critique into refined teacher response.

Result: Substantial gains on reasoning benchmarks (+15.0% on AMC23, +12.2% on MATH-500), approaches/exceeds SimpleRL-Zero performance with 60x less compute, maintains baseline performance on IFEval, MUSR, TruthfulQA, and BBH.

Conclusion: CGD establishes as a robust and generalizable alternative to conventional SFT and RL-based methods, offering efficient advancement of reasoning and safety in large language models.

Abstract: Supervised fine-tuning (SFT) with expert demonstrations often suffers from the imitation problem, where models reproduce correct responses without internalizing the underlying reasoning. We propose Critique-Guided Distillation (CGD), a multi-stage training framework that augments SFT with teacher-generated explanatory critiques and refined responses. Instead of directly imitating teacher outputs, a student learns to map the triplet of prompt, its own initial response, and teacher critique into the refined teacher response, thereby capturing both what to output and why. Our analyses show that CGD consistently reduces refinement uncertainty, improves alignment between critiques and responses, and enhances sample efficiency. On reasoning benchmarks, CGD achieves substantial gains across LLaMA and Qwen families, including +15.0% on AMC23 and +12.2% on MATH-500, while avoiding the format drift issues observed in prior critique-based fine-tuning. Importantly, on LLaMA-3.1-8B CGD approaches or exceeds the performance of SimpleRL-Zero, which is a DeepSeek-R1 replication, while requiring 60x less compute. Beyond reasoning, CGD maintains or improves general instruction-following and factual accuracy, matching baseline performance on IFEval, MUSR, TruthfulQA, and BBH. In contrast, prior critique-based methods degrade these capabilities (e.g., -21% on IFEval). Taken together, these results establish CGD} as a robust and generalizable alternative to both conventional SFT and RL-based methods, offering a more efficient path toward advancing the reasoning and safety of large language models.

[249] AdaBoN: Adaptive Best-of-N Alignment

Vinod Raman, Hilal Asi, Satyen Kale

Main category: cs.CL

TL;DR: Proposes a prompt-adaptive strategy for Best-of-N alignment that allocates inference compute efficiently by estimating reward distributions and adaptively allocating budget.

Details

Motivation: Address computational expense of uniform Best-of-N sampling across prompts by accounting for differences in alignment difficulty and latency concerns.

Method: Two-stage algorithm: exploratory phase estimates reward distribution for each prompt, then adaptive allocation of remaining budget using these estimates.

Result: Outperforms uniform allocation with same inference budget, remains competitive against uniform with 20% larger budgets, improves performance with larger batch sizes.

Conclusion: Simple, practical method for efficient test-time alignment that works with any LM-RM combination and provides better performance with same compute budget.

Abstract: Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RM). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency concerns, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM-RM combination. Empirical results on prompts from the AlpacaEval, HH-RLHF, and PKU-SafeRLHF datasets for 12 LM/RM pairs and 50 different batches of prompts show that our adaptive strategy outperforms the uniform allocation with the same inference budget. Moreover, we show that our adaptive strategy remains competitive against uniform allocations with 20 percent larger inference budgets and improves in performance as the batch size grows.

[250] MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning

Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, Bo An

Main category: cs.CL

TL;DR: The paper proposes Iterative Preference Learning (IPL) to address data scarcity in Chain of Action-Planning Thoughts (CoaT) for VLM-based mobile agents, using rule-based rewards and Thinking-level DPO pairs, achieving SOTA performance on mobile GUI benchmarks.

Details

Motivation: Address the scarcity of diverse CoaT trajectories that limits expressiveness and generalization of VLM-based mobile agents, while avoiding expensive process-level annotations.

Method: IPL constructs CoaT-tree through iterative sampling, scores leaf nodes with rule-based reward, backpropagates feedback for T-DPO pairs, and uses three-stage instruction evolution with GPT-4o for diverse Q&A pairs from mobile UI screenshots.

Result: MobileIPL outperforms strong baselines including OS-ATLAS and UI-TARS, achieving state-of-the-art performance across three standard Mobile GUI-Agents benchmarks with strong generalization to out-of-domain scenarios.

Conclusion: The proposed IPL framework effectively addresses CoaT trajectory scarcity through iterative preference learning and instruction evolution, demonstrating superior performance and generalization in mobile GUI agent tasks.

Abstract: The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address the above problems, we propose an Iterative Preference Learning (IPL) that constructs a CoaT-tree through interative sampling, scores leaf nodes using rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard Mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS. It achieves state-of-the-art performance across three standard Mobile GUI-Agents benchmarks and shows strong generalization to out-of-domain scenarios.

[251] Automatically Advancing LLM Expertise in Technology Judgment

Siyang Wu, Honglin Bao, Nadav Kunievsky, James A. Evans

Main category: cs.CL

TL;DR: LLMs often fail at distinguishing semantically similar patents due to unused knowledge rather than knowledge gaps, revealing that models know more than they can use effectively.

Details

Motivation: To determine whether LLMs truly apply their knowledge when faced with challenging new tasks, specifically distinguishing objectively different but semantically similar patents.

Method: Introduced a benchmark of 1.3M computer science patent pairs and a framework decomposing errors into missing vs unused knowledge using clarifying questions in three settings: raw performance, self-answered, and externally supplied answers.

Result: LLMs often possess relevant knowledge but fail to deploy it; smaller models generate simpler transferable questions while larger models create complex but less generalizable ones.

Conclusion: LLM evaluation should shift from static fact recall to dynamic knowledge application, as models’ key limitation is unused knowledge rather than knowledge gaps.

Abstract: Large language models (LLMs) are rapidly becoming core tools for science, engineering, and innovation. Their promise lies not just in remembering facts, but in putting knowledge to work. Despite their impressive ability to answer increasingly difficult questions, it remains unclear whether LLMs truly use their knowledge when confronted with new and challenging tasks. We address this question with a patent classification task that requires deep conceptual understanding: distinguishing objectively different but semantically similar patents. To evaluate this approach, we introduce a challenging new benchmark of 1.3 million post-2015 computer science patent pairs, characterized by dense technical jargon and strategically complex writing. We find that LLMs often fail our benchmark and struggle to distinguish among semantically similar patents. To probe this failure, we introduce a novel framework that decomposes model errors into two sources: missing and unused knowledge. Our approach asks models to generate clarifying questions to improve their understanding, and then compares three settings: raw performance, self-answered questions, and externally supplied answers. This decomposition reveals that LLMs often possess the relevant knowledge internally but fail to deploy it, while a smaller share of errors arises from genuine knowledge gaps. We then ask whether the ability of models to construct a task-specific database of questions and answers differs across models. We find that smaller models generate simpler, broadly transferable questions, while larger models propose more complex but less generalizable ones. This suggests new strategies for combining strengths across models. Our findings highlight a critical limitation of current LLMs and their evaluation: models often know more than they can use. LLM evaluation should shift from recall of static facts to application of dynamic knowledge.

[252] Is Active Persona Inference Necessary for Aligning Small Models to Personal Preferences?

Zilu Tang, Afra Feyza Akyürek, Ekin Akyürek, Derry Wijaya

Main category: cs.CL

TL;DR: Active preference inference improves personalized language model alignment by creating better prefixes than passive modeling, leading to better generalization, contextual faithfulness, and reduced bias.

Details

Motivation: Address underspecification in personalized LM alignment by testing whether actively inferring preference descriptions is more effective than passive modeling with prior examples.

Method: Created synthetic personalized alignment dataset based on famous people with known preferences, then finetuned 1-8B models to test active preference inference versus passive approaches.

Result: Higher-quality active prefixes lead to better generalization, more contextually faithful models, and less systematic biases across protected attributes.

Conclusion: Active alignment provides a more controllable and efficient path for personalized alignment compared to passive preference modeling.

Abstract: A prominent issue in aligning language models (LMs) to personalized preferences is underspecification – the lack of information from users about their preferences. A popular trend of injecting such specification is adding a prefix (e.g. prior relevant conversations) to the current user’s conversation to steer preference distribution. Most methods passively model personal preferences with prior example preferences pairs. We ask whether models benefit from actively inferring preference descriptions, and address this question by creating a synthetic personalized alignment dataset based on famous people with known public preferences. We then test how effective finetuned 1-8B size models are at inferring and aligning to personal preferences. Results show that higher-quality active prefixes lead to better generalization, more contextually faithful models, and less systematic biases across different protected attributes. All our results suggest active alignment can lead to a more controllable and efficient path for personalized alignment.

[253] Exploring Large Language Models for Translating Romanian Computational Problems into English

Adrian Marius Dumitran, Adrian-Catalin Badea, Stefan-Gabriel Muscalu, Angela-Liliana Dumitran, Stefan-Cosmin Dascalescu, Radu-Sebastian Amarie

Main category: cs.CL

TL;DR: LLMs can maintain or enhance performance in translating less common languages like Romanian to English for mathematical/CS tasks when given structured prompts, showing potential for reliable automatic translation of IOI-style problems with human oversight.

Details

Motivation: Address the performance gap where LLMs underperform on mathematical/CS tasks when translated from Romanian to English, and explore reliable automatic translation for applications like programming competitions and educational materials.

Method: Evaluated multiple LLMs (OpenRoLLM, Llama 3.1 8B, Llama 3.2 3B, GPT-4o) using various translation methods, assessed accuracy and stability through repeated runs, performed syntactic/semantic analyses, and compared against human translators with expert evaluation.

Result: LLMs with appropriate supervision can maintain or enhance translation performance for less common languages, and augmented OJI Romanian dataset with accurate English translations for future LLM training/evaluation.

Conclusion: With human oversight, LLMs can serve as a viable solution for multilingual problem-solving in real-world scenarios, showing comparable quality to human translators as evaluated by certified experts.

Abstract: Recent studies have suggested that large language models (LLMs) underperform on mathematical and computer science tasks when these problems are translated from Romanian into English, compared to their original Romanian format. Accurate translation is critical for applications ranging from automatic translations in programming competitions to the creation of high-quality educational materials, as well as minimizing errors or fraud in human translations. This study shows that robust large language models (LLMs) can maintain or even enhance their performance in translating less common languages when given well-structured prompts. Our findings suggest that LLMs, with appropriate supervision, can be reliably used for the automatic translation of IOI (International Olympiad in Informatics)-style tasks. We evaluate several translation methods across multiple LLMs, including OpenRoLLM, Llama 3.1 8B, Llama 3.2 3B and GPT-4o, assessing their translation accuracy and performance stability through repeated runs. Additionally, we augment the OJI (Romanian County-Level Informatics Olympiad) Romanian dataset with accurate English translations, enhancing its utility for future LLM training and evaluation. Through detailed syntactic and semantic analyses, we confirm that with human oversight, LLMs can serve as a viable solution for multilingual problem-solving. We also compare the translation quality of LLMs against human translators, as evaluated by a certified expert, underscoring the potential of LLMs in realworld scenarios.

[254] Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs’ General Reasoning

Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Zhiheng Xi, Changhao Jiang, Zhangyue Yin, Yining Zheng, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: Game-RL uses video games to train vision-language models through reinforcement learning, showing improved general reasoning across multiple benchmarks.

Details

Motivation: Current vision-language RL focuses on narrow domains, limiting exploration. Video games provide rich visual elements and verifiable rewards for broader training.

Method: Proposes Game-RL with Code2Logic approach to synthesize game reasoning tasks from game code, creating GameQA dataset with 30 games and 158 tasks of varying difficulty.

Result: RL training on GameQA enables multiple VLMs to achieve performance improvements across 7 diverse vision-language benchmarks.

Conclusion: Video games serve as valuable resources to boost general reasoning abilities in vision-language models.

Abstract: Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g. geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find video games inherently provide rich visual elements and mechanics that are easy to verify. To fully use the multimodal and verifiable reward in video games, we propose Game-RL, constructing diverse game tasks for RL training to boost VLMs general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, thus obtaining the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, RL training solely on GameQA enables multiple VLMs to achieve performance improvements across 7 diverse vision-language benchmarks, demonstrating the value of Game-RL for enhancing VLMs’ general reasoning. Furthermore, this suggests that video games may serve as valuable scenarios and resources to boost general reasoning abilities. Our code, dataset and models are available at the GitHub repository.

[255] Mechanistic Fine-tuning for In-context Learning

Hakaze Cho, Peng Luo, Mariko Kato, Rin Kaenbyou, Naoya Inoue

Main category: cs.CL

TL;DR: ABFT is a novel fine-tuning method that optimizes attention scores directly rather than final outputs, achieving superior ICL performance with minimal data and computational costs.

Details

Motivation: To bridge the gap between ICL and pre-training while reducing the massive computational costs of end-to-end fine-tuning on ICL-style datasets.

Method: Attention Behavior Fine-Tuning (ABFT) builds training objectives on attention scores to force them to focus on correct label tokens and mitigate attention from wrong label tokens, leveraging findings on ICL’s inner mechanisms.

Result: ABFT outperforms previous methods in performance, robustness, unbiasedness, and efficiency across 9 LMs and 8 datasets, using only ~0.01% data cost compared to prior approaches.

Conclusion: The work demonstrates controlling specific LM modules to improve behavior, reveals implicit bias of ICL-style data for induction heads, and opens future applications of mechanistic interpretability.

Abstract: In-context Learning (ICL) utilizes structured demonstration-query inputs to induce few-shot learning on Language Models (LMs), which are not originally pre-trained on ICL-style data. To bridge the gap between ICL and pre-training, some approaches fine-tune LMs on large ICL-style datasets by an end-to-end paradigm with massive computational costs. To reduce such costs, in this paper, we propose Attention Behavior Fine-Tuning (ABFT), utilizing the previous findings on the inner mechanism of ICL, building training objectives on the attention scores instead of the final outputs, to force the attention scores to focus on the correct label tokens presented in the context and mitigate attention scores from the wrong label tokens. Our experiments on 9 modern LMs and 8 datasets empirically find that ABFT outperforms in performance, robustness, unbiasedness, and efficiency, with only around 0.01% data cost compared to the previous methods. Moreover, our subsequent analysis finds that the end-to-end training objective contains the ABFT objective, suggesting the implicit bias of ICL-style data to the emergence of induction heads. Our work demonstrates the possibility of controlling specific module sequences within LMs to improve their behavior, opening up the future application of mechanistic interpretability.

[256] Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs

Zhipeng Yang, Junzhuo Li, Siyu Xia, Xuming Hu

Main category: cs.CL

TL;DR: LLMs exhibit internal chain-of-thought processing where they sequentially decompose and execute composite tasks layer-by-layer, with distinct subtasks learned at different network depths and executed sequentially across layers.

Details

Motivation: To enhance LLM transparency by investigating whether they internally plan and execute subtasks in a sequential manner across network layers, similar to external chain-of-thought reasoning.

Method: Used layer-from context-masking and novel cross-task patching on 15 two-step composite tasks to confirm subtask learning at different depths, and applied LogitLens to decode hidden states to examine sequential execution patterns. Also replicated analysis on real-world TRACE benchmark.

Result: Confirmed that distinct subtasks are learned at different network depths and executed sequentially across layers, revealing consistent layerwise execution patterns in both synthetic and real-world benchmarks.

Conclusion: LLMs have capacity to internally plan and execute subtasks sequentially, enhancing model transparency and opening avenues for fine-grained, instruction-level activation steering.

Abstract: We show that large language models (LLMs) exhibit an $\textit{internal chain-of-thought}$: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world $\text{TRACE}$ benchmark, observing the same stepwise dynamics. Together, our results enhance LLMs transparency by showing their capacity to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.

[257] A Culturally-Rich Romanian NLP Dataset from “Who Wants to Be a Millionaire?” Videos

Alexandru-Gabriel Ganea, Antonia-Adelina Popovici, Adrian-Marius Dumitran

Main category: cs.CL

TL;DR: LLMs perform better on international questions (80-95% accuracy) than Romanian-specific cultural questions (50-75%), highlighting cultural bias in multilingual NLP systems.

Details

Motivation: To investigate performance disparities of LLMs across different languages and cultural contexts, particularly for culturally-specific content.

Method: Created a multilingual dataset from Romanian game show using OCR, text extraction, and manual verification; benchmarked LLMs including Romanian-adapted models; conducted translation and cross-lingual experiments.

Result: Significant performance gap: 80-95% accuracy on international questions vs 50-75% on Romanian-specific cultural questions; cultural context strongly impacts LLM performance.

Conclusion: Cultural context and data source significantly affect LLM performance; insights provided for building robust, culturally-aware multilingual NLP systems, especially in education.

Abstract: Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show “Who Wants to Be a Millionaire?” (Vrei s\u{a} fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available at Hugging Face.

[258] Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

Main category: cs.CL

TL;DR: The paper challenges claims that machine-text detection is inherently unreliable by identifying a robust stylistic feature space that remains effective even against optimized language models designed to evade detection.

Details

Motivation: To counter recent claims that machine-generated text cannot be reliably detected, and to examine whether language models can be effectively optimized to degrade detection performance across various detectors.

Method: Identified a robust stylistic feature space for detection, tested against language models optimized to evade detection, and developed a new paraphrasing approach that simultaneously closes the stylistic gap between human and machine writing while avoiding traditional detection features.

Result: Stylistic detectors remain surprisingly robust even when explicitly optimized against. When only single samples are available, the attack is universally effective across all detectors, but as sample size increases, human and machine distributions become distinguishable.

Conclusion: The findings support previous recommendations to avoid reliance on machine-text detection, as detection reliability depends heavily on the number of samples available and can be circumvented by sophisticated attacks.

Abstract: Despite considerable progress in the development of machine-text detectors, it has been suggested that the problem is inherently hard, and therefore, that stakeholders should proceed under the assumption that machine-generated text cannot be reliably detected as such. We examine a recent such claim by Nicks et al. (2024) regarding the ease with which language models can be optimized to degrade the performance of machine-text detectors, including detectors not specifically optimized against. We identify a feature space – the stylistic feature space – that is robust to such optimization, and show that it may be used to reliably detect samples from language models optimized to prevent detection. Furthermore, we show that even when models are explicitly optimized against stylistic detectors, detection performance remains surprisingly unaffected. We then seek to understand if stylistic detectors are inherently more robust. To study this question, we explore a new paraphrasing approach that simultaneously aims to close the gap between human writing and machine writing in stylistic feature space while avoiding detection using traditional features. We show that when only a single sample is available for detection, this attack is universally effective across all detectors considered, including those that use writing style. However, as the number of samples available for detection grows, the human and machine distributions become distinguishable. Overall, our findings underscore previous recommendations to avoid reliance on machine-text detection.

[259] VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee

Main category: cs.CL

TL;DR: VocabTrim is a training-free technique that improves speculative decoding performance by trimming the drafter model’s vocabulary to only include frequently sampled tokens, reducing drafting latency in memory-bound environments.

Details

Motivation: Current speculative decoding methods have unnecessary inference overhead in drafting, especially for LLMs with large vocabularies. This overhead is particularly problematic in memory-bound environments like edge devices.

Method: VocabTrim reconstructs the drafter’s language modeling head to contain only a limited set of tokens selected from the most frequently sampled tokens in the target model’s vocabulary.

Result: The method boosts memory-bound speed-up by 16% for Llama-3.2-3B-Instruct on Spec-Bench, significantly reducing drafting latency despite slightly lower acceptance rates.

Conclusion: VocabTrim effectively improves speculative decoding performance in memory-bound scenarios by optimizing vocabulary size, making it suitable for edge device deployment.

Abstract: In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporates language modeling head (LM head) during drafting process. A drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, a target model, accepting a subset as its valid generation. As it is usually considered that the speculative decoding requires one-to-one mapping between vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead to improve the generation speed in memory-bound environment. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected by the most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in memory-bound process which is often the case on edge devices, resulting in higher memory-bound speed up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.

[260] ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection

Jeonghye Kim, Sojeong Rhee, Minbeom Kim, Dohyung Kim, Sangmook Lee, Youngchul Sung, Kyomin Jung

Main category: cs.CL

TL;DR: ReflAct improves upon ReAct by adding continuous reflection on agent state and goal alignment, achieving 27.7% higher success rates and 93.3% success in ALFWorld.

Details

Motivation: ReAct produces ungrounded reasoning steps and misalignment between agent state and goals, causing compounding errors and hallucinations.

Method: ReflAct shifts reasoning from planning actions to continuously reflecting on agent’s state relative to its goal, explicitly grounding decisions in states and enforcing ongoing goal alignment.

Result: ReflAct surpasses ReAct by 27.7% on average, achieving 93.3% success rate in ALFWorld, and outperforms ReAct even with enhancement modules like Reflexion and WKM.

Conclusion: Strengthening the core reasoning backbone through continuous reflection and goal alignment is key to reliable agent performance.

Abstract: Recent advances in LLM agents have largely built on reasoning backbones like ReAct, which interleave thought and action in complex environments. However, ReAct often produces ungrounded or incoherent reasoning steps, leading to misalignment between the agent’s actual state and goal. Our analysis finds that this stems from ReAct’s inability to maintain consistent internal beliefs and goal alignment, causing compounding errors and hallucinations. To address this, we introduce ReflAct, a novel backbone that shifts reasoning from merely planning next actions to continuously reflecting on the agent’s state relative to its goal. By explicitly grounding decisions in states and enforcing ongoing goal alignment, ReflAct dramatically improves strategic reliability. This design delivers substantial empirical gains: ReflAct surpasses ReAct by 27.7% on average, achieving a 93.3% success rate in ALFWorld. Notably, ReflAct even outperforms ReAct with added enhancement modules (e.g., Reflexion, WKM), showing that strengthening the core reasoning backbone is key to reliable agent performance.

[261] Multilingual Prompting for Improving LLM Generation Diversity

Qihan Wang, Shidong Pan, Tal Linzen, Emily Black

Main category: cs.CL

TL;DR: Multilingual prompting enhances diversity in LLM responses by generating culturally-varied prompts across multiple languages, outperforming existing diversity techniques.

Details

Motivation: LLMs lack cultural representation and diversity in their generations, leading to biased or limited responses across different cultures and languages.

Method: Propose multilingual prompting: create variations of base prompts with cultural and linguistic cues from multiple cultures, generate responses, then combine results to activate broader cultural knowledge.

Result: Multilingual prompting consistently outperforms high-temperature sampling, step-by-step recall, and persona prompting across GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B models. Benefits vary by language resource level and model size, with aligned language-culture cues reducing hallucinations.

Conclusion: Multilingual prompting effectively increases cultural diversity in LLM responses by leveraging language-specific knowledge, with performance advantages over existing diversity enhancement methods.

Abstract: Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and persona prompting. Further analyses show that the benefits of multilingual prompting vary between high and low resource languages and across model sizes, and that aligning the prompting language with cultural cues reduces hallucination about culturally-specific information.

[262] GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs

Adrian-Marius Dumitran, Alexandra-Mihaela Danila, Angela-Liliana Dumitran

Main category: cs.CL

TL;DR: GRILE is the first open benchmark for Romanian language testing, containing 1,151 multiple-choice questions from high-stakes exams. It evaluates LLMs on answer selection and explanation generation, revealing significant gaps in performance and explanation quality for Romanian.

Details

Motivation: To assess the pedagogical value of LLMs for low-resource languages like Romanian, where their capabilities remain unclear despite NLP advancements.

Method: Created GRILE benchmark with 1,151 Romanian exam questions, tested 7 multilingual and Romanian-specific LLMs on answer selection and explanation generation, with expert review of explanation quality.

Result: Gemini 2.5 Pro achieved 83% accuracy, but most open-weight models stayed below 65%. 48% of explanations contained factual or pedagogical flaws. Error analysis revealed weaknesses in morphology and DOOM3 orthographic norms.

Conclusion: The study exposes challenges for educational NLP in low-resource settings and establishes GRILE as a test-bed for controllable explanation generation and evaluation.

Abstract: LLMs (Large language models) have revolutionized NLP (Natural Language Processing), yet their pedagogical value for low-resource languages remains unclear. We present GRILE (Grammar Romanian Inference and Language Explanations) , the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically accurate explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM3 orthographic norms. All data, code and a public web demo are released to catalyze future research. Our findings expose open challenges for trustworthy educational NLP in low-resource settings and establish GRILE as a new test-bed for controllable explanation generation and evaluation.

[263] Generalizable Process Reward Models via Formally Verified Training Data

Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang

Main category: cs.CL

TL;DR: FoVer automatically synthesizes PRM training data using formal verification tools to label step-level errors, enabling cross-task generalization across diverse reasoning domains without human annotation.

Details

Motivation: Current PRMs require costly human annotation for step-level error labeling and are limited to math reasoning domains, creating barriers to broader application.

Method: Uses formal verification tools (Z3, Isabelle) to automatically annotate step-level error labels on LLM responses for formal logic and theorem proving tasks, creating PRM training data without human involvement.

Result: PRMs trained with FoVer significantly outperform original LLM-based PRMs and achieve competitive/superior results compared to state-of-the-art PRMs on ProcessBench and 12 reasoning benchmarks including MATH, AIME, ANLI, MMLU, and BBH.

Conclusion: FoVer enables automatic synthesis of accurate PRM training data and demonstrates effective cross-task generalization, making PRMs applicable to diverse reasoning tasks beyond math domains.

Abstract: Process Reward Models (PRMs), which provide step-level feedback on reasoning traces generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: creating PRM training data requires costly human annotation to label accurate step-level errors, and existing PRMs are limited to math reasoning domains. In response to these gaps, this paper aims to enable automatic synthesis of accurate PRM training data and the generalization of PRMs to diverse reasoning tasks beyond math reasoning. We propose FoVer, an approach to synthesize PRM training data with accurate step-level error labels automatically annotated by formal verification tools, such as Z3 and Isabelle. To show the practical effectiveness of FoVer, we synthesize a training dataset by annotating step-level error labels on LLM responses to formal logic and theorem proving tasks, without relying on human annotation. While FoVer creates training data with symbolic tasks compatible with formal verification, our experiments show that PRMs trained on our dataset exhibit cross-task generalization, enabling a single PRM to effectively perform verification across diverse reasoning tasks. Specifically, LLM-based PRMs trained with FoVer significantly outperform PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The dataset and code are in the supplementary material and will be made public. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.

[264] Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Mehrdad Ghassabi, Pedram Rostami, Hamidreza Baradaran Kashani, Amirhossein Poursina, Zahra Kazemi, Milad Tavakoli

Main category: cs.CL

TL;DR: This study enhances medical knowledge in small Persian language models by creating the first curated medical dataset from online sources and fine-tuning a baseline model, achieving improved medical question answering accuracy and passing the Iranian Basic Medical Science Entrance Exam.

Details

Motivation: Small language models struggle with specialized domains in low-resource languages like Persian, and no curated medical dataset existed for Persian despite numerous medical websites being available online.

Method: Created the first curated Persian medical dataset by crawling medical magazines and collecting real doctor-patient Q&A pairs, then fine-tuned a baseline language model using this data.

Result: The fine-tuned model achieved improved accuracy in medical question answering, successfully passed the Iranian Basic Medical Science Entrance Exam (September 2023), and improved Persian-translated MMLU accuracy by an average of 2.67%.

Conclusion: This work demonstrates the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications in resource-constrained environments.

Abstract: The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available making ours the first of its kind. This study explores the enhancement of medical knowledge in a small language model by leveraging accessible online data, including a crawled corpus from medical magazines and a dataset of real doctor-patient Q&A pairs. We fine-tuned a baseline model using our curated data to improve its medical knowledge. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and provides better responses compared to its baseline. Notably, the trained model successfully passed the Iranian Basic Medical Science Entrance Exam, taken in September 2023, and improved Persian-translated MMLU accuracy by an average of 2.67%. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments.

[265] Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation

Derong Xu, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Maolin Wang, Qidong Liu, Xiangyu Zhao, Yichao Wang, Huifeng Guo, Ruiming Tang, Enhong Chen, Tong Xu

Main category: cs.CL

TL;DR: Align-GRAG is a reasoning-guided dual alignment framework that addresses challenges in graph-based RAG systems by optimizing graph encoders with LLM reasoning chains to prune irrelevant knowledge and align graph-language representations.

Details

Motivation: Graph-based RAG systems face challenges with irrelevant node retrieval in dense graphs and representation gaps between graph structures and language, limiting their ability to fully leverage graph relationships for enhanced understanding.

Method: Proposes Align-GRAG with dual alignment: formulates subgraphs by retrieving nodes/edges, uses an Aligner to optimize graph encoder with LLM-summarized reasoning chain via KL divergence and contrastive loss for node pruning and semantic space unification, then integrates aligned graph data with LLM for generation.

Result: Experiments on GraphQA benchmark across three tasks (common sense reasoning, scene graph understanding, knowledge graph reasoning) validate the method’s effectiveness.

Conclusion: Align-GRAG successfully addresses graph RAG limitations by enabling efficient knowledge pruning and establishing unified semantic representations between graphs and language, improving answer coherence and accuracy.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework in post-retrieval phrase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with an LLM-summarized reasoning chain. It achieves dual alignment of graph node and representation by leveraging KL divergence loss and contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with LLM to produce coherent and accurate answers. Experiments on the GraphQA benchmark across three tasks (including common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The codes are available in this repository\footnote{https://anonymous.4open.science/r/Align-GRAG-F3D8/}.

[266] ToDi: Token-wise Distillation via Fine-Grained Divergence Control

Seongryong Jung, Suwan Yoon, DongGeon Kim, Hwanhee Lee

Main category: cs.CL

TL;DR: ToDi is a novel knowledge distillation method that adaptively combines Forward KL and Reverse KL divergences per token using a sigmoid-based weighting function, outperforming uniform distillation approaches.

Details

Motivation: Conventional knowledge distillation methods apply uniform divergence loss across vocabulary, neglecting token-level prediction discrepancies between teacher and student models, leading to suboptimal knowledge transfer.

Method: Proposes Token-wise Distillation (ToDi) that uses gradient analysis to reveal complementary roles of FKL (boosts underestimated tokens) and RKL (suppresses overestimated tokens), then combines them adaptively per token using sigmoid-based weighting based on teacher-student probability log-ratio.

Result: ToDi consistently outperforms recent distillation baselines across instruction-following benchmarks, with extensive ablation studies and efficiency analysis validating its effectiveness and practicality.

Conclusion: Token-wise adaptive combination of FKL and RKL enables precise distribution alignment and superior knowledge distillation performance compared to uniform or less granular strategies.

Abstract: Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD, notably approaches like Forward KL (FKL) and Reverse KL (RKL), apply uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi’s effectiveness and practicality.

[267] Nested Named Entity Recognition as Single-Pass Sequence Labeling

Alberto Muñoz-Ortiz, David Vilares, Caio Corro, Carlos Gómez-Rodríguez

Main category: cs.CL

TL;DR: NER as sequence labeling using constituency linearizations with pretrained encoders.

Details

Motivation: Simplify nested named entity recognition by reducing structured prediction complexity to token classification.

Method: Combine constituency linearizations with pretrained encoders for sequence labeling.

Result: Achieves competitive performance compared to less efficient systems.

Conclusion: Method captures nested entities efficiently and can be trained using standard sequence labeling libraries.

Abstract: We cast nested named entity recognition (NNER) as a sequence labeling task by leveraging prior work that linearizes constituency structures, effectively reducing the complexity of this structured prediction problem to straightforward token classification. By combining these constituency linearizations with pretrained encoders, our method captures nested entities while performing exactly n tagging actions. Our approach achieves competitive performance compared to less efficient systems, and it can be trained using any off-the-shelf sequence labeling library.

[268] A Survey on Stereotype Detection in Natural Language Processing

Alessandra Teresa Cignarella, Anastasia Giachanou, Els Lefever

Main category: cs.CL

TL;DR: A survey of stereotype detection research in NLP, analyzing definitions from multiple disciplines and reviewing over 6,000 papers to identify trends, challenges, and future directions.

Details

Motivation: Stereotypes influence social perceptions and can escalate into discrimination and violence, making stereotype detection an important emerging field with significant societal implications.

Method: Conducted a semi-automatic literature review using Semantic Scholar, retrieving and filtering over 6,000 papers from 2000-2025, and analyzed definitions from psychology, sociology, and philosophy.

Result: Identified key trends, methodologies, challenges and future directions in stereotype detection research, emphasizing its potential as an early-monitoring tool to prevent bias escalation and hate speech.

Conclusion: Highlights the need for a broader, multilingual, and intersectional approach in NLP studies on stereotype detection.

Abstract: Stereotypes influence social perceptions and can escalate into discrimination and violence. While NLP research has extensively addressed gender bias and hate speech, stereotype detection remains an emerging field with significant societal implications. In this work is presented a survey of existing research, analyzing definitions from psychology, sociology, and philosophy. A semi-automatic literature review was performed by using Semantic Scholar. We retrieved and filtered over 6,000 papers (in the year range 2000-2025), identifying key trends, methodologies, challenges and future directions. The findings emphasize stereotype detection as a potential early-monitoring tool to prevent bias escalation and the rise of hate speech. Conclusions highlight the need for a broader, multilingual, and intersectional approach in NLP studies.

[269] BRIT: Bidirectional Retrieval over Unified Image-Text Graph

Ainulla Khan, Yamada Moyuru, Srinidhi Akella

Main category: cs.CL

TL;DR: BRIT is a novel multi-modal RAG framework that unifies text-image connections into a graph structure to handle complex cross-modal questions on multi-modal documents.

Details

Motivation: Current RAG advancements focus mainly on text-based queries, leaving multi-modal documents with both texts and images underexplored, especially when fine-tuning is not feasible.

Method: BRIT creates a multi-modal graph that unifies various text-image connections and retrieves query-specific sub-graphs by traversing both image-to-text and text-to-image paths to find relevant content for complex cross-modal questions.

Result: Comprehensive experiments demonstrate BRIT’s superiority in handling cross-modal questions on multi-modal documents, using the introduced MM-RAG test set for evaluation.

Conclusion: BRIT effectively addresses the gap in multi-modal RAG by leveraging graph-based retrieval of text-image connections, enabling better handling of complex cross-modal multi-hop questions.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored. Especially when fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies various text-image connections in the document into a multi-modal graph and retrieves the texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieve not only directly query-relevant images and texts but also further relevant contents to answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce MM-RAG test set specifically designed for multi-modal question answering tasks that require to understand the text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on the multi-modal documents.

Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li

Main category: cs.CL

TL;DR: MetaMind is a multi-agent framework that improves LLMs’ social reasoning by decomposing it into three stages: Theory-of-Mind hypothesis generation, moral refinement using cultural norms, and response generation with intent validation.

Details

Motivation: LLMs struggle with ambiguity and contextual nuance in human communication, particularly in inferring unspoken intentions, emotions, and beliefs (Theory of Mind), which is crucial for human social interactions.

Method: A multi-agent framework with three collaborative stages: Theory-of-Mind Agent generates mental state hypotheses, Moral Agent refines them using cultural norms and ethics, and Response Agent generates appropriate responses while validating intent alignment.

Result: Achieves state-of-the-art performance with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Enables LLMs to match human-level performance on key ToM tasks for the first time.

Conclusion: MetaMind advances AI systems toward human-like social intelligence, demonstrating ability to balance contextual plausibility, social appropriateness, and user adaptation, with applications in empathetic dialogue and culturally sensitive interactions.

Abstract: Human social interactions depend on the ability to infer others’ unspoken intentions, emotions, and beliefs-a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses about user mental states (e.g., intent, emotion), (2) a Moral Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework’s ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at https://github.com/XMZhangAI/MetaMind.

[271] A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations

Lingjun Zhao, Hal Daumé III

Main category: cs.CL

TL;DR: Presents a measure for Prediction-EXplanation (PEX) consistency based on weight of evidence, showing over 62% of LLM-generated explanations lack consistency, and that direct preference optimization can significantly improve both consistency and faithfulness.

Details

Motivation: Faithful free-text explanations are crucial for transparency in high-stakes AI decision-making, but are challenging for language models to generate and humans to assess.

Method: Extends weight of evidence concept to create PEX consistency measure, applies direct preference optimization to improve explanation consistency across three model families.

Result: More than 62% of LLM-generated explanations lack PEX consistency. Direct preference optimization improves consistency by 43.1% to 292.3% and explanation faithfulness by up to 9.7%.

Conclusion: PEX consistency is an important aspect of explanation faithfulness that can be effectively improved through direct preference optimization, enhancing the reliability of AI explanations in critical applications.

Abstract: Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging to generate by language models and assess by humans. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvement ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.

[272] From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents

Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, wenlin zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, Tong Xu

Main category: cs.CL

TL;DR: MemGAS is a framework that enhances long-term dialogue memory for LLMs through multi-granularity memory association, adaptive selection, and retrieval to overcome limitations of single-granularity approaches.

Details

Motivation: LLMs struggle with long-term dialogue memory due to limited context windows, and existing retrieval-augmented memory systems using single-granularity segmentation fail to capture deep memory connections, leading to partial retrieval or noise.

Method: Uses multi-granularity memory units with Gaussian Mixture Models for clustering and association, entropy-based router for adaptive granularity selection, and LLM-based filtering for memory refinement.

Result: Outperforms state-of-the-art methods on four long-term memory benchmarks for both question answering and retrieval tasks across different query types and top-K settings.

Conclusion: MemGAS effectively addresses long-term dialogue memory challenges through multi-granularity association and adaptive selection, demonstrating superior performance over existing approaches.

Abstract: Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limits, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answer and retrieval tasks, achieving superior performance across different query types and top-K settings. \footnote{https://github.com/quqxui/MemGAS}

[273] TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, Bela Gipp

Main category: cs.CL

TL;DR: TrojanStego is a novel LLM threat model where adversaries fine-tune models to embed sensitive information into natural outputs via linguistic steganography, enabling covert data exfiltration without controlling inference inputs.

Details

Motivation: As LLMs are integrated into sensitive workflows, concerns grow about their potential to leak confidential information through novel attack vectors.

Method: Proposed a practical encoding scheme based on vocabulary partitioning that LLMs can learn via fine-tuning, enabling linguistic steganography to embed secrets into natural-looking outputs.

Result: Compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy with majority voting across three generations, while maintaining high utility and evading human detection.

Conclusion: TrojanStego represents a new class of passive, covert, practical, and dangerous LLM data exfiltration attacks that highlight significant security risks in LLM deployments.

Abstract: As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.

[274] Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

Sibo Xiao, Zixin Lin, Wenyang Gao, Hui Chen, Yue Zhang

Main category: cs.CL

TL;DR: XpandA is a multi-agent framework that uses dynamic partitioning and question-driven workflows to process long contexts efficiently, achieving 20% performance improvements and 1.5x speedup over existing methods.

Details

Motivation: Existing agent-based methods for long-context processing suffer from prohibitive latency, information loss from excessive agent invocations, and disruption of textual dependencies by immoderate partitioning.

Method: XpandA uses: 1) dynamic partitioning to adaptively modulate context window filling; 2) question-guided protocol to update information in shared memory; 3) selective partition replay based on state-tracking to handle inverted-order structures.

Result: Comprehensive evaluation on benchmarks from 1k to 1M tokens shows XpandA enables ultra-long sequence processing with 20% performance improvements and 1.5x inference speedup over full-context, RAG, and previous agent-based methods.

Conclusion: XpandA significantly enhances LLMs’ long-context capabilities by overcoming limitations of existing approaches through its dynamic partitioning and question-driven workflow design.

Abstract: Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts. But these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework XpandA (Expand-Agent) coupled with question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selectively replaying specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with length varying from 1k to 1M, demonstrating XpandA’s feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs by achieving 20% improvements and 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.

[275] SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences

Jungyoub Cha, Hyunjong Kim, Sungzoon Cho

Main category: cs.CL

TL;DR: SpecExtend improves speculative decoding for long sequences by integrating efficient attention mechanisms and a novel KV cache eviction strategy called Cross-model Retrieval, achieving up to 3.86x speedup without additional training.

Details

Motivation: Speculative decoding performance degrades significantly as input length grows, even at moderate lengths, which has remained largely underexplored despite being a widely used acceleration technique for LLMs.

Method: Integrates FlashAttention and Hybrid Tree Attention to accelerate prefill and verification steps. Proposes Cross-model Retrieval, a novel KV cache eviction strategy that uses the target model’s attention scores to dynamically select relevant context for the draft model.

Result: Accelerates speculative decoding by up to 2.84x on 16K-token long summarization and up to 3.86x on long reasoning tasks, while preserving short-input performance of state-of-the-art frameworks.

Conclusion: SpecExtend is an effective drop-in enhancement that significantly improves speculative decoding performance on long sequences without requiring additional training.

Abstract: Speculative decoding is a widely used technique for accelerating inference in large language models (LLMs), but its performance degrades as input length grows, with significant drops even at moderate lengths. Yet, this early degradation has remained largely underexplored. We introduce SpecExtend, a drop-in enhancement that improves speculative decoding on long sequences without additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention to accelerate prefill and verification steps. To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that leverages the target model’s attention scores to dynamically select relevant context for the smaller draft model. Extensive evaluations show that SpecExtend accelerates speculative decoding by up to 2.84x on 16K-token long summarization and up to 3.86x on long reasoning, while preserving the short-input performance of state-of-the-art frameworks. Our code is available at https://github.com/jycha98/SpecExtend .

[276] Evaluating and Steering Modality Preferences in Multimodal Large Language Model

Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang

Main category: cs.CL

TL;DR: The paper studies modality preference in MLLMs using a controlled benchmark, finds all tested models exhibit bias, and proposes a representation engineering method to steer preference without fine-tuning.

Details

Motivation: To investigate whether multimodal large language models exhibit modality preference when processing conflicting multimodal evidence, which is currently understudied.

Method: Built MC² benchmark for controlled evidence conflict scenarios, evaluated 18 MLLMs, and proposed representation engineering method to probe and steer modality preference without additional training.

Result: All 18 tested MLLMs demonstrated clear modality bias, and preference direction can be captured in latent representations. The proposed method effectively amplifies preference toward desired directions and improves downstream tasks.

Conclusion: MLLMs exhibit systematic modality preference that can be systematically studied and controlled through representation engineering, enabling applications like hallucination mitigation and improved multimodal translation.

Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build a \textbf{MC\textsuperscript{2}} benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on multimodal conflicting evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Built on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.

[277] Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset

Fakhraddin Alwajih, Samar M. Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmen, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El Aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Mohammed Anwar Al-Ghrawi, Aminetou Yacoub, Ruwa AbuHweidi, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Alcides Alcoba Inciarte, Adel Ammar, Abdelrahim A. Elmadany, Mohamedou Cheikh Tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed

Main category: cs.CL

TL;DR: PEARL is a large-scale Arabic multimodal dataset and benchmark designed to address cultural biases in LVLMs, featuring over 309K examples across ten culturally significant domains with human annotations from across the Arab world.

Details

Motivation: Mainstream large vision-language models inherently encode cultural biases, highlighting the need for diverse multimodal datasets to improve cultural understanding in AI systems.

Method: Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, covering ten culturally significant domains and all Arab countries.

Result: Comprehensive evaluations show that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods.

Conclusion: PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research and all datasets/benchmarks are publicly available.

Abstract: Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.

[278] LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference

Pingjun Hong, Beiduo Chen, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

Main category: cs.CL

TL;DR: LITEX introduces a linguistically-informed taxonomy to categorize free-text explanations in NLI, addressing within-label variation where annotators agree on labels but provide divergent reasoning.

Details

Motivation: To address the overlooked challenge of within-label variation in NLI, where annotators agree on the same label but provide different reasoning, and to better understand the rationales behind NLI labels.

Method: Developed LITEX taxonomy, annotated a subset of e-SNLI dataset, validated taxonomy reliability, and analyzed alignment with NLI labels, highlights, and explanations. Also assessed taxonomy’s usefulness in explanation generation.

Result: Conditioning explanation generation on LITEX yields explanations linguistically closer to human explanations than those generated using only labels or highlights. The taxonomy captures within-label variation effectively.

Conclusion: LITEX taxonomy not only captures within-label variation but also bridges the gap between human and model explanations more effectively than existing strategies through taxonomy-guided generation for reasoning.

Abstract: There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation–cases where annotators agree on the same label but provide divergent reasoning–poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators’ reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations in English. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy’s reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy’s usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

[279] Semi-structured LLM Reasoners Can Be Rigorously Audited

Jixuan Leng, Cassandra A. Cohen, Zhixian Zhang, Chenyan Xiong, William W. Cohen

Main category: cs.CL

TL;DR: SSRMs generate semi-structured reasoning traces that can be automatically audited to detect reasoning errors, maintaining strong performance while improving faithfulness.

Details

Motivation: Address the problem of faithfulness in LLM reasoning where errors and omissions are difficult to detect and may obscure biases.

Method: Train Semi-Structured Reasoning Models to produce reasoning traces in non-executable Pythonic syntax that names steps and marks inputs/outputs, enabling automated audits using hand-crafted DSL, LLM-generated audits, and learned typicality audits.

Result: All three audit methods effectively flag probable reasoning errors, and SSRMs demonstrate strong performance and generalizability across twelve benchmarks and two model families without compromising accuracy.

Conclusion: SSRMs provide an effective approach to improve reasoning faithfulness through structured, auditable reasoning traces while maintaining competitive performance.

Abstract: Although Large Language Models (LLMs) have become capable reasoners, the problem of faithfulness persists: their reasoning can contain errors and omissions that are difficult to detect and that may obscure biases in model outputs. To address this issue, we introduce Semi-Structured Reasoning Models (SSRMs), which are trained to produce semi-structured representations of reasoning. SSRMs generate reasoning traces in a non-executable Pythonic syntax that names each reasoning step and marks its inputs and outputs. This structure allows SSRM traces to be automatically audited to identify reasoning flaws. We evaluate three types of audits: hand-crafted structured reasoning audits, written in a domain-specific language (DSL) implemented in Python; LLM-generated structured reasoning audits; and learned typicality audits, which apply probabilistic models over reasoning traces. We show that all of these methods can be used to effectively flag probable reasoning errors. Importantly, the auditability of SSRMs does not appear to compromise overall accuracy: in evaluation on twelve benchmarks and two model families, SSRMs demonstrate strong performance and generalizability relative to other models of comparable size.

[280] Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs

Juraj Vladika, Annika Domres, Mai Nguyen, Rebecca Moser, Jana Nano, Felix Busch, Lisa C. Adams, Keno K. Bressem, Denise Bernhardt, Stephanie E. Combs, Kai J. Borm, Florian Matthes, Jan C. Peeken

Main category: cs.CL

TL;DR: A novel atomic fact-checking framework that decomposes LLM medical responses into verifiable atomic facts, improving factual accuracy and explainability by 40% and detecting 50% of hallucinations.

Details

Motivation: LLMs show medical knowledge but suffer from hallucinations and inaccurate citations, hindering clinical adoption. Current methods like RAG partially help but still have persistent hallucinations and low explainability.

Method: Decomposes LLM-generated medical responses into discrete verifiable units (atomic facts), each independently verified against authoritative medical guideline knowledge bases, enabling targeted error correction and source tracing.

Result: Achieved up to 40% overall answer improvement and 50% hallucination detection rate. Provides granular explanations by tracing each atomic fact to relevant source chunks, significantly improving factual accuracy and explainability.

Conclusion: This framework represents a crucial step towards trustworthy clinical LLM applications, addressing key prerequisites for clinical use and fostering confidence in AI-assisted healthcare through improved reliability and transparency.

Abstract: Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.

[281] Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation

Li Zhou, Lutong Yu, Dongchu Xie, Shaohuan Cheng, Wenyan Li, Haizhou Li

Main category: cs.CL

TL;DR: Hanfu-Bench is a multimodal dataset for temporal cultural understanding, focusing on Hanfu (traditional Chinese garment) across dynasties. It evaluates VLMs on cultural visual understanding and cultural image transcreation tasks.

Details

Motivation: Existing cultural understanding studies with VLMs emphasize geographic diversity but overlook temporal dimensions. This work addresses the gap in temporal cultural evolution analysis.

Method: Created Hanfu-Bench dataset with expert-curated multimodal data. Includes two tasks: cultural visual understanding (multiple-choice VQA on temporal-cultural features) and cultural image transcreation (transforming traditional attire to modern designs).

Result: Closed VLMs perform comparably to non-experts on visual understanding but fall 10% short of human experts. Open VLMs lag further. For transcreation, best model achieves only 42% success rate.

Conclusion: The benchmark reveals significant challenges in temporal cultural understanding and creative adaptation, providing essential testbed for this new research direction.

Abstract: Culture is a rich and dynamic domain that evolves across both geography and time. However, existing studies on cultural understanding with vision-language models (VLMs) primarily emphasize geographic diversity, often overlooking the critical temporal dimensions. To bridge this gap, we introduce Hanfu-Bench, a novel, expert-curated multimodal dataset. Hanfu, a traditional garment spanning ancient Chinese dynasties, serves as a representative cultural heritage that reflects the profound temporal aspects of Chinese culture while remaining highly popular in Chinese contemporary society. Hanfu-Bench comprises two core tasks: cultural visual understanding and cultural image transcreation. The former task examines temporal-cultural feature recognition based on single- or multi-image inputs through multiple-choice visual question answering, while the latter focuses on transforming traditional attire into modern designs through cultural element inheritance and modern context adaptation. Our evaluation shows that closed VLMs perform comparably to non-experts on visual cutural understanding but fall short by 10% to human experts, while open VLMs lags further behind non-experts. For the transcreation task, multi-faceted human evaluation indicates that the best-performing model achieves a success rate of only 42%. Our benchmark provides an essential testbed, revealing significant challenges in this new direction of temporal cultural understanding and creative adaptation.

[282] Answer Convergence as a Signal for Early Stopping in Reasoning

Xin Liu, Lu Wang

Main category: cs.CL

TL;DR: Chain-of-thought prompting creates verbose outputs with redundant reasoning steps. This paper shows models converge to answers after 60% of reasoning steps and proposes three inference-time strategies to reduce token usage by 40%+ with minimal accuracy loss.

Details

Motivation: CoT prompting increases inference costs due to verbose and redundant outputs. The authors hypothesize many reasoning steps are unnecessary for correct answers and aim to identify the minimum required reasoning.

Method: Three inference-time strategies: (1) early stopping via answer consistency, (2) boosting probability of generating end-of-reasoning signals, and (3) supervised method learning when to stop based on internal activations.

Result: Experiments across five benchmarks and five LLMs show significant token reduction (over 40% on NaturalQuestions) with little or no accuracy drop. Answer Consistency even improves accuracy while reducing tokens.

Conclusion: The work demonstrates the importance of cost-effective reasoning methods at inference time, offering practical benefits for real-world applications by reducing computational costs while maintaining performance.

Abstract: Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs) but often leads to verbose and redundant outputs, thus increasing inference cost. We hypothesize that many reasoning steps are unnecessary for producing correct answers. To investigate this, we start with a systematic study to examine what is the minimum reasoning required for a model to reach a stable decision. We find that on math reasoning tasks like math, models typically converge to their final answers after 60% of the reasoning steps, suggesting substantial redundancy in the remaining content. Based on these insights, we propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations. Experiments across five benchmarks and five open-weights LLMs show that our methods significantly reduce token usage with little or no accuracy drop. In particular, on NaturalQuestions, Answer Consistency reduces tokens by over 40% while further improving accuracy. Our work underscores the importance of cost-effective reasoning methods that operate at inference time, offering practical benefits for real-world applications.

[283] Improving LLM Reasoning through Interpretable Role-Playing Steering

Anyi Wang, Dong Shu, Yifan Wang, Yunpu Ma, Mengnan Du

Main category: cs.CL

TL;DR: SRPS is a novel framework that identifies and manipulates internal model features for role-playing behavior, enabling fine-grained control and interpretability in LLM reasoning enhancement.

Details

Motivation: Existing role-playing methods rely on prompt engineering which lacks stability and interpretability. The paper aims to develop a more stable and interpretable approach to enhance LLM reasoning through role-playing.

Method: Extracts latent representations from role-play prompts, selects relevant features based on activation patterns, and constructs a steering vector that can be injected into the model’s residual stream with controllable intensity.

Result: Significant performance gains: Llama3.1-8B on CSQA improved from 31.86% to 39.80%, Gemma2-9B on SVAMP increased from 37.50% to 45.10% in zero-shot CoT setting.

Conclusion: SRPS enhances reasoning ability in LLMs with better interpretability and stability compared to traditional prompt-based role-playing, demonstrating the potential of feature manipulation for model control.

Abstract: Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). However, existing methods primarily rely on prompt engineering, which often lacks stability and interpretability. In this paper, we introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior. Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model’s residual stream with controllable intensity. Our method enables fine-grained control over role-specific behavior and offers insights into how role information influences internal model activations. Extensive experiments across various reasoning benchmarks and model sizes demonstrate consistent performance gains. Notably, in the zero-shot chain-of-thought (CoT) setting, the accuracy of Llama3.1-8B on CSQA improves from 31.86% to 39.80%, while Gemma2-9B on SVAMP increases from 37.50% to 45.10%. These results highlight the potential of SRPS to enhance reasoning ability in LLMs, providing better interpretability and stability compared to traditional prompt-based role-playing.

[284] What Do Indonesians Really Need from Language Technology? A Nationwide Survey

Muhammad Dehan Al Kautsar, Lucky Susanto, Derry Wijaya, Fajri Koto

Main category: cs.CL

TL;DR: A nationwide survey in Indonesia reveals that machine translation and information retrieval are the most critical needs for local language communities, with strong enthusiasm for language technology but concerns about privacy, bias, and data transparency.

Details

Motivation: There is emerging NLP development for Indonesia's 700+ local languages, but progress is costly and it's unclear what these language communities actually need from language technology.

Method: Conducted a nationwide survey to assess the actual needs of native speakers in Indonesia.

Result: Findings show addressing language barriers through machine translation and information retrieval is the most critical priority. There is strong enthusiasm for language technology advancements but concerns about privacy, bias, and use of public data for AI training.

Conclusion: Greater transparency and clear communication are needed to support broader AI adoption in Indonesian local language communities.

Abstract: There is an emerging effort to develop NLP for Indonesias 700+ local languages, but progress remains costly due to the need for direct engagement with native speakers. However, it is unclear what these language communities truly need from language technology. To address this, we conduct a nationwide survey to assess the actual needs of native speakers in Indonesia. Our findings indicate that addressing language barriers, particularly through machine translation and information retrieval, is the most critical priority. Although there is strong enthusiasm for advancements in language technology, concerns around privacy, bias, and the use of public data for AI training highlight the need for greater transparency and clear communication to support broader AI adoption.

[285] AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)

Danush Khanna, Gurucharan Marthi Krishna Kumar, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das

Main category: cs.CL

TL;DR: ALKALI benchmark exposes latent camouflage vulnerability in LLMs where adversarial prompts embed close to safe representations. GRACE framework reduces ASR by 39% through geometric regularization, and AVQI metric quantifies latent alignment failures.

Details

Motivation: Adversarial threats against LLMs are escalating faster than current defenses can adapt, with existing methods like DPO being blind to latent geometry vulnerabilities.

Method: Introduces ALKALI benchmark (9,000 prompts across 15 attack families), GRACE framework with geometric contrastive enhancement, and AVQI metric for quantifying latent alignment failures via cluster separation analysis.

Result: Evaluation of 21 LLMs shows alarmingly high attack success rates. GRACE achieves up to 39% ASR reduction by enforcing latent separation and adversarial cohesion constraints.

Conclusion: Latent camouflage is a critical structural blind spot in LLM alignment that requires geometric-aware defenses like GRACE, with AVQI providing principled measurement of internal safety encoding.

Abstract: Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE - Geometric Representation Aware Contrastive Enhancement, an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md.

[286] Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking

Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen, Xi Ye

Main category: cs.CL

TL;DR: QRHead improves retrieval in long-context language models by identifying query-focused attention heads, and QRRetriever uses their attention scores for efficient long-context reasoning and retrieval.

Details

Motivation: To enhance retrieval from long context in language models by identifying better attention heads than existing retrieval heads, which are measured by copy-paste behavior.

Method: Identify QRHead by aggregating attention scores with respect to input query using real-world task examples. Use QRRetriever with accumulated attention mass as retrieval scores for selecting relevant context parts.

Result: Achieves over 10% performance gains on LongMemEval and CLIPPER multi-hop reasoning tasks, outperforms full context and strong dense retrievers. Strong zero-shot performance on BEIR benchmark, beating other LLM-based re-rankers like RankGPT.

Conclusion: Query-context attention scoring and task selection are crucial for identifying effective QRHeads. The work provides a general-purpose retriever and interpretability insights into long-context LM capabilities.

Abstract: Recent work has identified retrieval heads, a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needlein-a-Haystack tasks. In this paper, we introduce QRHead (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHead by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRetriever, an efficient and effective retriever that uses the accumulated attention mass of QRHead as retrieval scores. We use QRRetriever for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRetriever as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHead with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.

[287] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Shubhashis Roy Dipta, Francis Ferraro

Main category: cs.CL

TL;DR: Q2E is a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval that extracts latent parametric knowledge from LLMs and VLMs to improve video retrieval for complex real-world events.

Details

Motivation: To improve identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events from LLMs and VLMs.

Method: Query-to-Event decomposition using knowledge embedded in LLMs and VLMs, adaptable across datasets/domains/models, with entropy-based fusion scoring for zero-shot fusion of multimodal knowledge (visual and speech-based inputs).

Result: Q2E outperforms several state-of-the-art baselines on two diverse datasets across multiple retrieval metrics, and integrating audio information significantly improves text-to-video retrieval.

Conclusion: The approach successfully enhances understanding of simplified human queries through decomposition and demonstrates the value of integrating multimodal knowledge (including audio) for improved video retrieval performance.

Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.

[288] Curriculum-Guided Layer Scaling for Language Model Pretraining

Karanpartap Singh, Neil Band, Ehsan Adeli

Main category: cs.CL

TL;DR: Curriculum-Guided Layer Scaling (CGLS) is a framework that synchronizes increasing data difficulty with model growth through progressive layer stacking during pretraining, improving learning efficiency and generalization.

Details

Motivation: Inspired by human cognitive development where knowledge builds gradually as brains mature, the authors aim to improve compute efficiency in pretraining large language models by aligning model growth with data difficulty progression.

Method: CGLS progressively adds layers during training while simultaneously increasing data difficulty using curricula. At 100M scale: synthetic short stories → general web data. At 1.2B scale: general text → technical/specialized content using DistilBERT-based classifier for stratification.

Result: CGLS outperforms baseline methods on PIQA and ARC benchmarks at 100M scale. At 1.2B scale, progressive depth increase with sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks.

Conclusion: CGLS effectively unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks during pretraining.

Abstract: As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e. gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.

[289] Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: infini-gram mini is an efficient system that makes petabyte-level text corpora searchable using FM-index, with 44% corpus size, 18× faster indexing, and 3.2× memory reduction. It reveals significant benchmark contamination in language model training data.

Details

Motivation: Understanding massive Internet text data used for training language models is crucial, but existing search engines have high storage overhead that hinders application on Internet-scale data.

Method: Based on FM-index data structure that simultaneously indexes and compresses text, creating indexes with size only 44% of the corpus. The system improves indexing speed (18×) and reduces memory use during indexing (3.2×) and querying.

Result: Indexed 83TB of Internet text in 99 days with single CPU node. Found several core LM evaluation benchmarks heavily contaminated in Internet crawls (up to 74.2% in GSM8K), potentially overestimating language model capabilities.

Conclusion: infini-gram mini enables efficient large-scale text analysis and reveals critical benchmark contamination issues. The system provides web interface and API for general search queries, with a contamination bulletin for sharing benchmark contamination rates.

Abstract: Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora - counting string appearances and retrieving the enclosing documents - yet the high storage overhead hinders their application on Internet-scale data. We present infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days with a single CPU node with 128 vCPUs (or 19 hours if using 137 such nodes). We show one important use case of infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on infini-gram mini indexes.

[290] BOW: Reinforcement Learning for Bottlenecked Next Word Prediction

Ming Shen, Zhikun Xu, Jacob Dineen, Xiao Ye, Ben Zhou

Main category: cs.CL

TL;DR: BOW (BOttlenecked next-Word prediction) is a reinforcement learning formulation that inserts a reasoning bottleneck before next-word prediction, forcing models to generate explicit reasoning trajectories and improving general reasoning capabilities.

Details

Motivation: Standard next-word prediction (NWP) yields surface fluency but limited explicit reasoning. The goal is to strengthen models' general reasoning capability by shifting the supervision signal to elicit explicit reasoning before token emission.

Method: BOW inserts an intermediate reasoning bottleneck where the policy model must first generate a next-word reasoning trajectory. A frozen scorer then assigns a distributional reward based on the probability of the gold next token conditioned on the trajectory, with optional L1-style regularization to prevent shortcuts.

Result: Across ten benchmarks, BOW adaptation on Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct improved zero-shot reasoning by nearly 5% on average compared to strong baselines, achieving top results in 7 of 10 intrinsic NWP evaluations.

Conclusion: BOW is a viable alternative to vanilla NWP that induces explicit next-word reasoning and strengthens general reasoning ability, outperforming both RL variants with hard binary rewards and supervised finetuning approaches.

Abstract: Large language models (LLMs) are typically pretrained with next-word prediction (NWP), which yields strong surface fluency but places limited pressure on models to form explicit reasoning before emitting tokens. We study whether shifting the supervision signal can better elicit explicit reasoning and, more broadly, strengthen models’ general reasoning capability. We present BOttlenecked next-Word prediction (BOW), a RL formulation of NWP that inserts an intermediate reasoning bottleneck. Instead of predicting the next word directly from context, the policy model must first generate a next-word reasoning trajectory. A frozen scorer then assigns this trajectory a soft, distributional reward equal to the probability of the gold next token conditioned solely on the trajectory to guide the RL optimization. We also propose an optional L1-style regularizer on the reward to discourage “name-the-answer” shortcuts. Across ten benchmarks, a brief BOW adaptation phase on Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct improves zero-shot reasoning and outperforms strong continual-pretraining baselines, including an RL variant with a hard, binary reward and a supervised finetuning approach with augmented data, by nearly 5% on average, while achieving the top result in 7 of 10 intrinsic NWP evaluations. These results indicate that BOW is a viable alternative to vanilla NWP, inducing explicit next-word reasoning and strengthening general reasoning ability.

[291] Long-Context Generalization with Sparse Attention

Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso

Main category: cs.CL

TL;DR: The paper introduces Adaptive-Scalable Entmax (ASEntmax), a learnable sparse attention mechanism that outperforms softmax and fixed sparse methods, enabling 1000x length extrapolation and better long-context language modeling.

Details

Motivation: Softmax attention in transformers produces dense distributions that disperse attention probability mass across non-informative tokens as sequence length increases, leading to representational collapse for tasks requiring precise focus on fixed-size patterns.

Method: Proposes ASEntmax which combines α-entmax sparse attention with a learnable temperature parameter, allowing dynamic interpolation between sparse (pattern-focused) and dense (softmax-like) attention regimes.

Result: ASEntmax achieves up to 1000x length extrapolation on synthetic benchmarks, superior long-context generalization in language modeling while preserving short-context performance, with better perplexity trends and higher retrieval accuracies at 8x training length.

Conclusion: Dynamic sparse attention mechanisms like ASEntmax effectively address softmax’s limitations for long sequences, providing better pattern focus and representational stability through exact zero assignments to irrelevant tokens.

Abstract: Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.

Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin

Main category: cs.CL

TL;DR: A novel multi-turn jailbreaking method called GRAF that globally refines attack trajectories and fabricates model responses to suppress safety warnings, achieving superior effectiveness across six state-of-the-art LLMs.

Details

Motivation: LLMs pose safety risks due to potential misuse, and existing jailbreaking methods struggle to adapt to evolving dialogue dynamics in multi-turn attacks.

Method: Proposed GRAF method that globally refines attack trajectory at each interaction and actively fabricates model responses to suppress safety-related warnings.

Result: Extensive experiments show superior effectiveness compared to existing single-turn and multi-turn jailbreaking methods across six state-of-the-art LLMs.

Conclusion: GRAF successfully addresses the challenge of adapting to evolving dialogue dynamics in multi-turn jailbreaking attacks, demonstrating significant improvements over prior methods.

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks. Nevertheless, they still pose notable safety risks due to potential misuse for malicious purposes. Jailbreaking, which seeks to induce models to generate harmful content through single-turn or multi-turn attacks, plays a crucial role in uncovering underlying security vulnerabilities. However, prior methods, including sophisticated multi-turn approaches, often struggle to adapt to the evolving dynamics of dialogue as interactions progress. To address this challenge, we propose \ours (JailBreaking via \textbf{G}lobally \textbf{R}efining and \textbf{A}daptively \textbf{F}abricating), a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction. In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs in subsequent queries. Extensive experiments across six state-of-the-art LLMs demonstrate the superior effectiveness of our approach compared to existing single-turn and multi-turn jailbreaking methods. Our code will be released at https://github.com/Ytang520/Multi-Turn_jailbreaking_Global-Refinment_and_Active-Fabrication.

[293] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: KaLM-Embedding-V2 is a series of compact 0.5B parameter embedding models that achieve state-of-the-art performance through advanced training techniques and high-quality data curation, outperforming models of similar size and rivaling much larger models.

Details

Motivation: Current LLM-based text embedding models focus mainly on data scaling or synthesis, with limited exploration of training techniques and data quality, which constrains performance.

Method: Uses 0.5B parameter architecture with mean-pooling and bidirectional representation learning; implements progressive multi-stage training with pre-training, fine-tuning, and contrastive distillation; employs focal-style reweighting and hard-negative mining; curates diverse datasets across 20+ pre-training and 100+ fine-tuning categories.

Result: Achieves state-of-the-art performance on Massive Text Embedding Benchmark, outperforming comparable-size models and rivaling models 3-26x larger.

Conclusion: Sets new standard for versatile compact embedding models under 1B parameters, demonstrating superior performance through systematic training techniques and high-quality data curation.

Abstract: Recent advancements in Large Language Models (LLMs)-based text embedding models primarily focus on data scaling or synthesis, yet limited exploration of training techniques and data quality, thereby constraining performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs by superior training techniques and high-quality data. For model architecture, we implement the models on a 0.5B compact size with simple mean-pooling to produce fixed-length embeddings and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with fine-grained soft signals, integrated with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation, to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters. The code, data, and models will be publicly available to facilitate academic research.

[294] Reasoning Isn’t Enough: Examining Truth-Bias and Sycophancy in LLMs

Emilio Barkett, Olivia Long, Madhavendra Thakur

Main category: cs.CL

TL;DR: This study evaluates LLMs’ truth detection capabilities, finding reasoning models have lower truth-bias than non-reasoning models but still higher than humans, and identifies sycophantic tendencies in advanced models.

Details

Motivation: LLMs are widely used in fact-checking and decision-making but remain poorly understood as judges of truth, necessitating comprehensive evaluation of their veracity detection capabilities.

Method: Evaluated 8 LLMs making 4,800 veracity judgments across multiple prompts, comparing reasoning and non-reasoning models.

Result: Reasoning models showed lower truth-bias than non-reasoning models but still higher than human benchmarks; advanced models (o4-mini, GPT-4.1, R1) displayed sycophantic tendencies with good truth accuracy but poor deception accuracy.

Conclusion: Capability advances alone do not resolve fundamental veracity detection challenges in LLMs, highlighting persistent issues with truth-bias and sycophantic behavior.

Abstract: Despite their widespread use in fact-checking, moderation, and high-stakes decision-making, large language models (LLMs) remain poorly understood as judges of truth. This study presents the largest evaluation to date of LLMs’ veracity detection capabilities and the first analysis of these capabilities in reasoning models. We had eight LLMs make 4,800 veracity judgments across several prompts, comparing reasoning and non-reasoning models. We find that rates of truth-bias, or the likelihood to believe a statement is true, regardless of whether it is actually true, are lower in reasoning models than in non-reasoning models, but still higher than human benchmarks. Most concerning, we identify sycophantic tendencies in several advanced models (o4-mini and GPT-4.1 from OpenAI, R1 from DeepSeek), which displayed an asymmetry in detection accuracy, performing well in truth accuracy but poorly in deception accuracy. This suggests that capability advances alone do not resolve fundamental veracity detection challenges in LLMs.

Simon Münker, Nils Schwager, Achim Rettinger

Main category: cs.CL

TL;DR: This paper examines the use of LLMs to simulate social network user behavior, finding that simulations need empirical validation in their original context and advocating for more rigorous generative-agent-based modeling.

Details

Motivation: To understand when and how LLMs can effectively replicate human social network behavior for computational social science research, given conflicting findings in previous studies.

Method: Developed a formal framework for social network simulation and empirically tested different approaches to imitate user communication on X platform in both English and German languages.

Result: Findings indicate that social simulations should be validated by their empirical realism in the specific context where simulation components were originally fitted.

Conclusion: The paper argues for increased rigor when applying generative-agent-based modeling for social simulation, emphasizing the importance of context-specific validation.

Abstract: The ability of Large Language Models (LLMs) to mimic human behavior triggered a plethora of computational social science research, assuming that empirical studies of humans can be conducted with AI agents instead. Since there have been conflicting research findings on whether and when this hypothesis holds, there is a need to better understand the differences in their experimental designs. We focus on replicating the behavior of social network users with the use of LLMs for the analysis of communication on social networks. First, we provide a formal framework for the simulation of social networks, before focusing on the sub-task of imitating user communication. We empirically test different approaches to imitate user behavior on X in English and German. Our findings suggest that social simulations should be validated by their empirical realism measured in the setting in which the simulation components were fitted. With this paper, we argue for more rigor when applying generative-agent-based modeling for social simulation.

[296] Semantic-guided Diverse Decoding for Large Language Model

Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou

Main category: cs.CL

TL;DR: SemDiD is a novel decoding method that achieves semantic diversity in LLM outputs by operating in embedding space with orthogonal guidance, inter-group repulsion, and position-debiased probability assessment.

Details

Motivation: Existing decoding methods primarily achieve lexical diversity but fail to ensure meaningful semantic differentiation, limiting applications like Best-of-N strategies, group-based RL, and data synthesis.

Method: SemDiD uses three mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment, harmonized through adaptive gain functions and constraint optimization.

Result: SemDiD outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.

Conclusion: SemDiD effectively balances quality and diversity in LLM decoding, enabling better semantic differentiation for various applications.

Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), operating directly in embedding space that balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.

[297] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu

Main category: cs.CL

TL;DR: The paper investigates whether latent chain-of-thought reasoning emerges in Huginn-3.5B, a depth-recurrent Transformer, finding limited evidence of interpretable latent CoT and showing that increasing recurrence depth provides only marginal performance gains.

Details

Motivation: To explore whether recurrent architectures can internalize reasoning in latent space (latent CoT) rather than externalizing reasoning steps in natural language, potentially improving efficiency while maintaining reasoning capabilities.

Method: Examined Huginn-3.5B’s internal behavior on arithmetic tasks using probing techniques including Logit Lens and Coda Lens, tracking rank trajectories of final and intermediate result tokens across recurrent blocks.

Result: Limited evidence of interpretable latent CoT was found, with significant probing inconsistencies across recurrent blocks. Hidden state interpretability depends heavily on layer index and decoding method. Increasing recurrence depth yields only marginal gains.

Conclusion: Latent CoT reasoning does not robustly emerge in the studied depth-recurrent Transformer, and explicit externalization of reasoning steps remains more effective than attempting to internalize reasoning in latent space.

Abstract: Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model’s internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.

[298] PRIME: Large Language Model Personalization with Cognitive Dual-Memory and Personalized Thought Process

Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang

Main category: cs.CL

TL;DR: PRIME is a unified LLM personalization framework that integrates cognitive dual-memory model (episodic and semantic memory) with personalized thinking capability, validated on a new CMV dataset for long-context evaluation.

Details

Motivation: Lack of unified theoretical framework for understanding effective LLM personalization drivers, despite various existing methods.

Method: Integrates cognitive dual-memory model: episodic memory for historical user engagements and semantic memory for long-term evolving beliefs. Introduces PRIME framework with personalized thinking capability inspired by slow thinking strategy.

Result: Extensive experiments validate PRIME’s effectiveness across both long- and short-context scenarios. Captures dynamic personalization beyond popularity biases.

Conclusion: PRIME provides a unified theoretical framework for LLM personalization that effectively captures dynamic user preferences using cognitive memory mechanisms.

Abstract: Large language model (LLM) personalization aims to align model outputs with individuals’ unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME’s effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.

[299] CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen

Main category: cs.CL

TL;DR: CoSteer is a collaborative framework that enables personalized text generation by using local small models to steer cloud-based LLMs through delta vectors, preserving privacy while maintaining generation quality.

Details

Motivation: Existing methods struggle with real-time adaptation under resource constraints on personal devices, creating a dilemma where cloud models lack user-specific information while on-device models can't match cloud generation quality.

Method: Uses local small models to compute logits differences between personal context-aware and -agnostic outputs as steering signals, formulating token-level optimization as online learning to dynamically adjust remote LLM’s logits on-device.

Result: Effectively assists LLMs in generating personalized content using local user profiles and histories while preserving privacy through on-device processing and maintaining acceptable computational overhead.

Conclusion: CoSteer addresses the personalization dichotomy by enabling collaborative decoding-time personalization that preserves privacy and maintains cloud LLM capabilities without fine-tuning.

Abstract: Personalized text generation has become crucial for adapting language models to diverse and evolving users’ personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM’s logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs’ general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.

[300] ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Changzhi Zhou, Ken Deng, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Shihui Hu, Yue Zhang, Yuhao Jiang, Zenan Xu, Yuanxing Zhang, Wiggin Zhou, Chayse Zhou, Fengzong Lian

Main category: cs.CL

TL;DR: ArtifactsBench is a new benchmark for automated multimodal evaluation of visual code generation that addresses the gap in assessing visual fidelity and interactive integrity of generated artifacts.

Details

Motivation: Current benchmarks focus on algorithmic correctness but fail to evaluate visual fidelity and interactive integrity, which are crucial for modern user experiences in visual code generation.

Method: Programmatically renders generated artifacts and captures dynamic behavior through temporal screenshots, then uses Multimodal LLM-as-Judge with fine-grained checklists for holistic scoring.

Result: Achieved 94.4% ranking consistency with human preference gold standard and over 90% pairwise agreement with human experts, establishing reliable automated assessment of human-perceived quality.

Conclusion: ArtifactsBench provides the first scalable framework for automated evaluation of visual code generation quality, revealing that generalist models often outperform domain-specific ones, and is open-sourced for community use.

Abstract: The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.

[301] Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Yizhan Huang, Zhe Yang, Meifang Chen, Huang Nianchen, Jianping Zhang, Michael R. Lyu

Main category: cs.CL

TL;DR: The paper investigates how data entropy correlates with memorization in LLMs, revealing a linear relationship (Entropy-Memorization Law) and showing that even random strings have lower entropy than expected, enabling dataset inference.

Details

Motivation: To understand how to characterize the memorization difficulty of training data in LLMs, as they are known to memorize portions of their training data.

Method: Empirical experiments on OLMo models to study the relationship between data entropy and memorization scores, including a case study on memorizing randomized strings.

Result: Found a linear correlation between data entropy and memorization score (Entropy-Memorization Law), and observed that random sequences have unexpectedly low empirical entropy compared to the training corpus.

Conclusion: The Entropy-Memorization Law provides a simple approach to distinguish training and testing data, enabling Dataset Inference (DI).

Abstract: Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or “gibberish”, we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).

[302] Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition

Qinyuan Ye, Robin Jia, Xiang Ren

Main category: cs.CL

TL;DR: The paper investigates how language models generalize to unseen tasks through in-context learning, using off-by-one addition as a case study. It reveals a function induction mechanism that enables task-level generalization.

Details

Motivation: To understand the internal mechanisms that enable large language models to perform unseen tasks via in-context learning, particularly focusing on task-level generalization.

Method: Used circuit-style interpretability techniques like path patching to analyze models’ internal computations, specifically studying off-by-one addition as a counterfactual task.

Result: Uncovered a function induction mechanism that explains generalization from standard to off-by-one addition, showing it’s governed by multiple parallel attention heads and is reusable across various tasks.

Conclusion: The findings provide deeper insights into how reusable and composable structures within language models enable task-level generalization through function induction mechanisms.

Abstract: Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models’ internal computations behind their performance and present three key findings. First, we uncover a function induction mechanism that explains the model’s generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.

[303] Making Language Model a Hierarchical Classifier

Yihong Wang, Zhonglin Jiang, Ningyuan Xi, Yue Zhao, Qingqing Gu, Xiyuan Chen, Hao Wu, Sheng Xu, Hange Zhou, Yong Chen, Luo Ji

Main category: cs.CL

TL;DR: Proposes a hierarchical decoder architecture for language models where intermediate layers decode text simultaneously, achieving state-of-the-art performance on multiple tasks.

Details

Motivation: Inspired by human hierarchical thinking, to enable different layers in decoder-only language models to decode text simultaneously rather than just the last layer.

Method: Adapt pretrained language models by copying language heads from the last layer to selected intermediate layers and fine-tuning them with different task inputs.

Result: Validated that intermediate layers can generate meaningful content, achieving SOTA performance on hierarchical text classification, classification-guided generation, and hierarchical text generation across multiple datasets.

Conclusion: Demonstrates the feasibility of hierarchical decoders and suggests potential for generalized hierarchical reasoners through pretraining from scratch.

Abstract: Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by human’s hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computationally resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers, and fine-tuned with different task inputs. By thorough experiments, we validate that these selective intermediate layers could be adapted to speak meaningful and reasonable contents, and this paradigm of hierarchical decoder can obtain state-of-the-art performances on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. HdLM outperforms all baselines on WoS, DBpedia, ESconv, EmpatheticDialogues, and several cognitive tests. We also provide thorough theoretical analysis to validate the convergence and computational savings of our methodology. This study suggests the possibility of a generalized hierarchical reasoner, pretraining from scratch.

[304] LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Leanne Tan, Gabriel Chua, Ziyu Ge, Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: LionGuard 2 is a lightweight multilingual moderation classifier for Singapore that outperforms commercial systems using pre-trained embeddings and multi-head ordinal classification, without fine-tuning large models.

Details

Motivation: Modern moderation systems often fail to address localization and low-resource language variants, creating safety gaps in real-world deployments, especially in multilingual contexts like Singapore.

Method: Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, supporting English, Chinese, Malay, and partial Tamil with high-quality local data.

Result: Outperforms several commercial and open-source systems across 17 benchmarks including Singapore-specific and public English datasets, and is actively deployed within the Singapore Government.

Conclusion: High-quality local data and robust multilingual embeddings can achieve strong moderation performance without fine-tuning large models, demonstrating practical efficacy at scale.

Abstract: Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants - creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.

[305] The Ever-Evolving Science Exam

Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai

Main category: cs.CL

TL;DR: EESE is a dynamic science benchmark with a private pool of 100K+ expert-created science questions and periodic 500-question subsets to prevent data leakage and enable efficient evaluation of foundation models’ scientific understanding.

Details

Motivation: Address data leakage risks and evaluation inefficiency in existing science benchmarks while maintaining broad coverage, wide reach, and high rigor in assessing foundation models' scientific capabilities.

Method: Two-component approach: 1) Non-public EESE-Pool with 100K+ expert-constructed science instances across 5 disciplines and 500+ subfields, 2) Periodically updated 500-instance subsets for leakage-resilient, low-overhead evaluations.

Result: Experiments on 32 open- and closed-source models show EESE effectively differentiates models’ strengths and weaknesses across scientific fields and cognitive dimensions.

Conclusion: EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering realistic measurement of foundation models’ scientific question handling capabilities.

Abstract: As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor, 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.

[306] Diversity-Enhanced Reasoning for Subjective Questions

Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Jen-tse Huang, Yi R. Fung

Main category: cs.CL

TL;DR: MultiRole-R1 is a diversity-enhanced training framework that improves subjective reasoning in Large Reasoning Models by introducing perspective diversity and token-level diversity, achieving significant accuracy gains on subjective tasks and even enhancing performance on objective reasoning tasks like math.

Details

Motivation: Large Reasoning Models optimized via RLVR excel at objective reasoning but degrade generation diversity, causing poor performance on subjective reasoning tasks that have multiple valid answers depending on different role perspectives.

Method: Proposes MultiRole-R1 framework with unsupervised data construction pipeline synthesizing reasoning chains with various role perspectives, and reinforcement learning via Group Relative Policy Optimization with reward shaping that treats diversity as a reward signal.

Result: Training on subjective tasks increases in-domain accuracy by 14.1% and out-of-domain accuracy by 7.64%, and even enhances performance on advanced math reasoning like AIME 2024. Diversity is shown to be a more consistent indicator of accuracy than reasoning length.

Conclusion: Introducing perspective diversity and token-level diversity significantly improves subjective reasoning capabilities in LRMs, with diversity serving as a key factor for better performance across both subjective and objective reasoning tasks.

Abstract: Large Reasoning Models (LRMs) with long chain-of-thought capabilities, optimized via reinforcement learning with verifiable rewards (RLVR), excel at objective reasoning tasks like mathematical problem solving and code generation. However, RLVR is known for degrading generation diversity, which causes LRMs to fall short on subjective reasoning that has multiple answers depending on different role perspectives. While recent studies recognize the importance of diversity-enhanced training in objective reasoning, limited attention has been given to subjective tasks. In this paper, we find that subjective reasoning can be improved by introducing perspective diversity and token-level diversity, with the former one providing a coherent scaffolding anchored to a real-world stakeholder group and the latter one broadening the answer search space. We propose MultiRole-R1, a diversity-enhanced training framework featuring an unsupervised data construction pipeline that synthesizes reasoning chains incorporating various role perspectives. It also employs reinforcement learning via Group Relative Policy Optimization with reward shaping, taking diversity as a reward signal in addition to verifiable reward. Training on subjective tasks solely, MultiRole-R1 increases the in-domain and out-of-domain accuracy by 14.1% and 7.64%, and even enhances the performance on advanced math reasoning such as AIME 2024. We further show that diversity is a more consistent indicator of accuracy than reasoning length.

[307] CTTS: Collective Test-Time Scaling

Zhende Song, Shengji Tang, Peng Ye, Jiayuan Fan, Lei Bai, Tao Chen, Wanli Ouyang

Main category: cs.CL

TL;DR: Collective Test-Time Scaling (CTTS) introduces multi-agent and multi-reward collaboration to overcome limitations of single test-time scaling, achieving significant performance improvements over existing methods and proprietary LLMs.

Details

Motivation: Existing test-time scaling methods like Best-of-N and Self-Consistency are constrained by the single agent-single reward paradigm, which limits their effectiveness. Collective methods have shown potential to surpass individual model performance ceilings.

Method: Proposes CTTS-MM framework with: 1) Agent Collaboration Search (ACS) to find optimal LLM combinations, and 2) Mixture of Reward Models (MoR) with Prior Reward model Ensemble Selection (PRES) algorithm for optimal reward ensemble.

Result: Outperforms leading STTS methods by +4.82% over Best-of-N, surpasses proprietary LLMs (+7.06% over GPT-4.1) and open-source LLMs across seven mainstream benchmarks.

Conclusion: Collective scaling has substantial potential to push the frontier of LLM inference, demonstrating that multi-agent and multi-reward collaboration can significantly enhance model performance without additional training.

Abstract: Test-time scaling (TTS) has emerged as a promising, training-free approach for enhancing large language model (LLM) performance. However, the efficacy of existing methods, such as Best-of-N and Self-Consistency, is fundamentally constrained by the dominant single test-time scaling (STTS) paradigm, which relies on a single LLM agent interacting with a single reward model (SA-SR). Inspired by recent work showing that collective methods can surpass the performance ceiling of individual models, we introduce Collective Test-Time Scaling (CTTS). First, we systematically investigate three primary interaction paradigms of existing multiple models: single-agent-multi-reward (SA-MR), multi-agent-single-reward (MA-SR), and multi-agent-multi-reward (MA-MR). Extensive experiments reveal that the MA-MR paradigm is consistently superior. Based on this finding, we further propose CTTS-MM, a novel framework that operationalizes multi-agent and multi-reward collaboration. CTTS-MM integrates two key technical contributions: (1) for agent collaboration, an Agent Collaboration Search (ACS) that identifies the most effective combination of LLMs from a candidate pool; and (2) for reward model collaboration, a Mixture of Reward Models (MoR) strategy that leverages a Prior Reward model Ensemble Selection (PRES) algorithm to select the optimal ensemble. Evaluations across seven mainstream benchmarks demonstrate that CTTS-MM significantly outperforms leading STTS methods (+4.82% over Best-of-N) and surpasses even flagship proprietary LLMs (+7.06% over GPT-4.1) and open-source LLMs. These results highlight the substantial potential of collective scaling to push the frontier of LLM inference. Code will be released at https://github.com/magent4aci/CTTS-MM.

Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You

Main category: cs.CL

TL;DR: Sotopia-RL is a reinforcement learning framework that addresses challenges in training socially intelligent LLMs by using utterance-level credit assignment and multi-dimensional rewards to handle partial observability and complex social interactions.

Details

Motivation: Social intelligence is crucial for LLMs in real-world tasks, but RL training faces barriers from partial observability (delayed effects of utterances) and multi-dimensionality (indirect contributions of behaviors), making MDP-based RL with single rewards inefficient.

Method: Proposes Sotopia-RL framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment addresses partial observability, while multi-dimensional rewards capture social interaction richness and prevent reward hacking.

Result: Achieves state-of-the-art social goal completion scores: 7.17 on Sotopia-hard and 8.31 on Sotopia-full, significantly outperforming existing approaches. Ablation studies confirm necessity of both utterance-level credit assignment and multi-dimensional reward design.

Conclusion: Sotopia-RL effectively addresses RL training challenges for social intelligence by combining utterance-level credit assignment with multi-dimensional rewards, enabling more efficient and stable learning in complex social environments.

Abstract: Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: https://github.com/sotopia-lab/sotopia-rl.

[309] Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, Guanhua Chen

Main category: cs.CL

TL;DR: LLM finetuning exhibits over-memorization where models memorize training data, showing high test perplexity but good accuracy, while suffering from reduced robustness and generalization.

Details

Motivation: To study the learning dynamics of LLM finetuning on reasoning tasks and identify the over-memorization phenomenon that occurs during specific training stages.

Method: Analyzed LLM finetuning across various tasks, models, and methods, examining conditions that lead to over-memorization and its effects on model performance.

Result: Over-memorization is prevalent and worsens with prolonged training and large learning rates. Affected models maintain test accuracy but show reduced robustness, poor OOD generalization, and decreased generation diversity.

Conclusion: Proposed checkpoint selection strategies and mitigation techniques like checkpoint merging and memorization-aware reweighting to address over-memorization in LLM finetuning.

Abstract: The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal the uncovered over-memorization phenomenon during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We explore the conditions that contribute to over-memorization and discover that this issue is prevalent across various tasks, models, and fine-tuning methods, with prolonged training and large learning rates exacerbating the problem. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. In light of our findings on over-memorization, we offer recommendations for checkpoint selection and propose techniques such as checkpoint merging and memorization-aware reweighting to mitigate this effect.

[310] Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

Mo Li, L. H. Xu, Qitai Tan, Long Ma, Ting Cao, Yunxin Liu

Main category: cs.CL

TL;DR: Sculptor is a framework that equips LLMs with Active Context Management tools to mitigate proactive interference in long contexts, enabling them to actively manage attention and working memory through fragmentation, summarization, and search tools.

Details

Motivation: LLMs suffer from performance degradation in long contexts due to proactive interference, where irrelevant early information disrupts reasoning and memory recall. Current approaches focus on external memory systems, but there's a need for internal context management capabilities.

Method: Introduces Sculptor framework with three tool categories: (1) context fragmentation, (2) summary/hide/restore operations, and (3) precise search. Uses a dynamic context-aware reinforcement learning approach to train agents that actively modify conversational history.

Result: Experimental evaluation shows Sculptor significantly improves performance on diverse long-context benchmarks without specific training, leveraging LLMs’ inherent tool-calling and instruction-following capabilities.

Conclusion: Explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale. Active Context Management mitigates proactive interference and provides a cognitive foundation for reliable reasoning across long-context tasks.

Abstract: Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs’ capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) precise search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on diverse long-context benchmarks demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs’ inherent tool-calling and instruction-following capabilities. To further optimize these strategies, we introduce a novel dynamic context-aware reinforcement learning (RL) approach, advancing the training of an agent that actively modifies its own conversational history. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks-highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.

[311] Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

Ying Liu, Can Li, Ting Zhang, Mei Wang, Qiannan Zhu, Jian Li, Hua Huang

Main category: cs.CL

TL;DR: This paper introduces GuideEval, a benchmark for evaluating LLMs’ adaptive tutoring capabilities through a three-phase behavioral framework (Perception, Orchestration, Elicitation), revealing current LLMs’ limitations in providing effective adaptive guidance and proposing behavior-guided finetuning to improve performance.

Details

Motivation: Prior research on LLMs for tutoring has focused primarily on generating Socratic questions but overlooked the critical aspect of adaptively guiding learners based on their cognitive states. This study aims to evaluate whether LLMs can emulate expert tutors who dynamically adjust strategies in response to learners' states.

Method: Proposed GuideEval benchmark grounded in authentic educational dialogues, using a three-phase behavioral framework: (1) Perception - inferring learner states, (2) Orchestration - adapting instructional strategies, (3) Elicitation - stimulating proper reflections. Also introduced behavior-guided finetuning strategy using behavior-prompted instructional dialogues.

Result: Empirical results show existing LLMs often fail to provide effective adaptive scaffolding when learners experience confusion or require redirection. The proposed behavior-guided finetuning strategy substantially enhances guidance performance.

Conclusion: The work advocates shifting focus from isolated content evaluation to learner-centered state-aware interaction, promoting a more dialogic paradigm for evaluating Socratic LLMs in educational contexts.

Abstract: The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their ability to generate Socratic questions, it often overlooks a critical aspect: adaptively guiding learners in accordance with their cognitive states. This study moves beyond question generation to emphasize instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners’ states? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical results indicate that existing LLMs often fail to provide effective adaptive scaffolding when learners experience confusion or require redirection. To complement the quantitative evaluation, we conduct a detailed failure case analysis, providing an intuitive understanding of these shortcomings. Furthermore, we introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues, substantially enhancing guidance performance. By shifting the focus from isolated content evaluation to learner-centered state-aware interaction, our work advocates a more dialogic paradigm for evaluating Socratic LLMs.

[312] Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan, Shiyu Huang, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Felix Henry, Aiwei Liu, Lijie Wen

Main category: cs.CL

TL;DR: Omni-SafetyBench is introduced as the first comprehensive benchmark for evaluating safety in Omni-modal Large Language Models (OLLMs), addressing gaps in assessing joint audio-visual inputs and cross-modal consistency.

Details

Motivation: Current benchmarks lack dedicated evaluation for OLLM safety, particularly for joint audio-visual inputs and cross-modal consistency, creating a critical gap in ensuring safe deployment of multimodal AI systems.

Method: Developed Omni-SafetyBench with 24 modality variations and 972 samples per variation, including audio-visual harm cases. Proposed tailored metrics: Safety-score (based on C-ASR and C-RR) and Cross-Modal Safety Consistency score (CMSC-score).

Result: Evaluation of 10 OLLMs revealed critical vulnerabilities: only 3 models achieved >0.6 in both safety metrics, safety defenses weaken with complex inputs (especially audio-visual), and some models scored as low as 0.14 on specific modalities.

Conclusion: Existing safety alignment methods face fundamental limitations: inference-time methods cannot alter model understanding, post-training methods struggle with out-of-distribution issues, and audio-visual tasks are inherently more complex. Urgent need for enhanced OLLM safety approaches.

Abstract: The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and existing benchmarks fail to assess safety under joint audio-visual inputs or cross-modal consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality variations with 972 samples each, including audio-visual harm cases. Considering OLLMs’ comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on Conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) only 3 models achieving over 0.6 in both average Safety-score and CMSC-score; (2) safety defenses weaken with complex inputs, especially audio-visual joints; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Using Omni-SafetyBench, we evaluated existing safety alignment algorithms and identified key challenges in OLLM safety alignment: (1) Inference-time methods are inherently less effective as they cannot alter the model’s underlying understanding of safety; (2) Post-training methods struggle with out-of-distribution issues due to the vast modality combinations in OLLMs; and, safety tasks involving audio-visual inputs are more complex, making even in-distribution training data less effective. Our proposed benchmark, metrics and the findings highlight urgent needs for enhanced OLLM safety.

[313] READER: Retrieval-Assisted Drafter for Efficient LLM Inference

Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Stanislav Ilyushin, Sultan Isali, Vasily Kalugin, Nuriza Aitassova, Fei Yi, Weidi Zeng

Main category: cs.CL

TL;DR: READER is a lossless speculative decoding framework that accelerates autoregressive language model inference without training auxiliary draft models, achieving up to 6.13x speedup while preserving exact output equivalence.

Details

Motivation: Autoregressive language models have intrinsic latency bottlenecks due to sequential decoding, which limits scalable deployment of large-scale generative models. Existing acceleration techniques fail to address dominant memory and communication costs.

Method: READER formalizes speculative decoding as a stochastic tree construction problem, exploits natural language redundancy to generate candidate continuations, and introduces memory-optimal key-value cache-serving for batched inference with sublinear overhead.

Result: Achieves up to 6.13x wall-clock speedup on single-prompt inference and up to 5.92x on batched inference, consistently surpassing prior speculative decoding baselines while preserving exact output equivalence.

Conclusion: READER closes the gap between theoretical parallelism limits and practical LLM inference, suggesting a new standard for efficient deployment of large language models.

Abstract: Autoregressive Language Models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. Existing acceleration techniques partially mitigate token-level latency by relying on auxiliary draft models or introducing an additional training phase, but fail to address the dominant memory and communication costs. We present READER, a provably lossless speculative decoding framework that bypasses the training of the auxiliary draft model. READER formalizes speculative decoding as a stochastic tree construction problem and exploits the empirical redundancy structure of natural language to generate high-probability candidate continuations. Our method revisits the problem of constructing draft trees, establishing substantial statistical improvements over stochastic draft-tree methods and providing a complexity-theoretic analysis that characterizes the optimality frontier of speculative decoding under bounded computation and memory resources. Beyond the single-sequence regime traditionally considered in prior work, we introduce a memory-optimal key-value cache-serving strategy that guarantees amortized sublinear overhead in the batch dimension, allowing READER to scale to realistic inference workloads. Comprehensive experiments demonstrate up to 6.13x wall-clock speedup on single-prompt inference and up to 5.92x on batched inference, consistently surpassing prior speculative decoding baselines, while preserving exact output equivalence, with even more pronounced gains in retrieval-augmented generation pipelines. Our results close a key gap between theoretical parallelism limits and practical LLM inference, suggesting a new standard for efficient deployment.

[314] PakBBQ: A Culturally Adapted Bias Benchmark for QA

Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza

Main category: cs.CL

TL;DR: PakBBQ is a culturally adapted bias benchmark for LLMs focusing on Pakistan, with 17K+ QA pairs in English/Urdu across 8 bias dimensions. Evaluation shows disambiguation improves accuracy by 12%, Urdu prompts reduce bias more than English, and negative framing decreases stereotypes.

Details

Motivation: LLMs are trained on Western-centric data with little attention to low-resource languages and regional contexts, creating fairness gaps for diverse user communities.

Method: Created PakBBQ dataset with 214 templates and 17,180 QA pairs in English/Urdu across 8 bias categories relevant to Pakistan. Evaluated multilingual LLMs under ambiguous/disambiguated contexts and negative/non-negative question framings.

Result: Disambiguation improved accuracy by 12% on average; Urdu prompts showed stronger counter-bias behavior than English; negative question framing reduced stereotypical responses.

Conclusion: Contextualized benchmarks and simple prompt engineering strategies are crucial for bias mitigation in low-resource settings, highlighting the importance of culturally adapted evaluation frameworks.

Abstract: With the widespread adoption of Large Language Models (LLMs) across various applications, it is empirical to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates, 17180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions including age, disability, appearance, gender, socio-economic status, religious, regional affiliation, and language formality that are relevant in Pakistan. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low resource settings.

[315] CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity

Bowen Zhang, Zixin Song, Chunquan Chen, Qian-Wen Zhang, Di Yin, Xing Sun

Main category: cs.CL

TL;DR: CoDiEmb is a unified framework that enables joint training of text embeddings for both Information Retrieval and Semantic Textual Similarity tasks without performance trade-offs, using task-specialized objectives, dynamic sampling, and delta-guided model fusion.

Details

Motivation: Negative transfer remains a persistent obstacle when training unified text embeddings for diverse downstream tasks, particularly when jointly training for IR and STS which have fundamentally disparate requirements and typically yield steep performance trade-offs with naive co-training.

Method: CoDiEmb integrates three key innovations: (1) Task-specialized objectives with dynamic sampler forming single-task batches to prevent gradient interference, (2) Delta-guided model fusion strategy computing fine-grained merging weights by analyzing parameter deviations, (3) Efficient single-stage training pipeline that converges stably.

Result: Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. The framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.

Conclusion: CoDiEmb successfully reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner, demonstrating that systematic decoupling of task-specific learning signals throughout the training pipeline can resolve the conflict between these fundamentally disparate tasks.

Abstract: Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter’s deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.

[316] DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Zhiqi Bai, Yuchi Xu, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: DESIGNER is a pipeline that uses “design logic” to automatically generate large-scale, multidisciplinary reasoning questions from raw documents, creating challenging datasets that improve LLM reasoning capabilities.

Details

Motivation: Existing reasoning datasets lack disciplinary breadth, reasoning depth, and diversity, and lack guiding principles for automated question synthesis.

Method: Reverse-engineer over 120,000 design logics from existing questions, then match these with source documents (book and web corpora) to automatically generate challenging reasoning questions.

Result: Created two large-scale datasets (DLR-Book: 3.04M questions, DLR-Web: 1.66M questions) spanning 75 disciplines, with greater difficulty and diversity than baseline datasets. Fine-tuning on this data substantially enhanced LLM reasoning capabilities.

Conclusion: The DESIGNER pipeline successfully generates high-quality reasoning data that significantly improves LLM performance, with base models even surpassing their official instruction-tuned counterparts after fine-tuning.

Abstract: Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, and lack guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (e.g., book corpus and web corpus) to generate multidisciplinary challenging questions. We introduce the concept of “design logic” and instruct LLMs to mimic human educators’ question-creation process, enabling automated synthesis of large-scale, high-difficulty questions. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with source documents, we are able to create reasoning questions that far surpass the difficulty and diversity of existing datasets. Using this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. We validate our synthesized data through supervised fine-tuning (SFT) on the Qwen3 and Llama3 model families. Our data substantially enhances their multidisciplinary reasoning capabilities, outperforming existing datasets. Notably, after SFT on our datasets, the base versions of these models even surpass their official instruction-tuned counterparts.

[317] Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen

Main category: cs.CL

TL;DR: Proposes SvS (Self-play with Variational problem Synthesis) to maintain policy entropy in RLVR training, improving Pass@k performance by synthesizing variational problems from correct solutions.

Details

Motivation: Vanilla RLVR training improves Pass@1 but reduces policy entropy, limiting generation diversity and Pass@k performance which represents LLM reasoning upper bound.

Method: Online self-play strategy that uses policy’s correct solutions to synthesize variational problems while keeping reference answers identical to originals.

Result: Achieves 18.3% and 22.8% absolute gains in Pass@32 on AIME24 and AIME25 benchmarks, with consistent improvements across 12 reasoning benchmarks and model sizes from 3B to 32B.

Conclusion: SvS effectively maintains policy entropy during RLVR training, substantially improving Pass@k performance and demonstrating generalizability across various benchmarks and model sizes.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy’s generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy’s correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.

[318] Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports

Chengbo Sun, Hui Yi Leong, Lei Li

Main category: cs.CL

TL;DR: A coarse-to-fine framework using open-source LLMs to automatically generate personalized radiology report impressions from clinical findings, reducing radiologist burnout.

Details

Motivation: Manual creation of 'Impression' sections in radiology reports is a primary driver of radiologist burnout, creating need for automated solutions to reduce administrative workload.

Method: Fine-tuned LLaMA and Mistral models on University of Chicago Medicine reports, using draft generation followed by refinement with ML and RLHF to personalize impressions while ensuring factual accuracy.

Result: The framework generates personalized impressions aligned with individual radiologists’ styles while maintaining clinical precision.

Conclusion: The approach significantly reduces administrative workload and improves reporting efficiency while maintaining high standards of clinical accuracy.

Abstract: The manual creation of the “Impression” section in radiology reports is a primary driver of radiologist burnout. To address this challenge, we propose a coarse-to-fine framework that leverages open-source large language models (LLMs) to automatically generate and personalize impressions from clinical findings. The system first produces a draft impression and then refines it using machine learning and reinforcement learning from human feedback (RLHF) to align with individual radiologists’ styles while ensuring factual accuracy. We fine-tune LLaMA and Mistral models on a large dataset of reports from the University of Chicago Medicine. Our approach is designed to significantly reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision.

[319] Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

Main category: cs.CL

TL;DR: Jet-Nemotron is a hybrid-architecture language model family that achieves comparable or superior accuracy to leading models while significantly improving generation throughput through a novel PostNAS architecture exploration pipeline.

Details

Motivation: To develop language models that match or exceed the accuracy of full-attention models while significantly improving generation throughput and efficiency.

Method: Uses Post Neural Architecture Search (PostNAS) pipeline that starts with pre-trained full-attention models, freezes MLP weights, and explores attention block designs through four components: optimal layer placement/elimination, linear attention selection, new attention block design, and hardware-aware hyperparameter search.

Result: Jet-Nemotron-2B achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across benchmarks, with up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also outperforms larger MoE models like DeepSeek-V3-Small and Moonlight on MMLU and MMLU-Pro.

Conclusion: The PostNAS pipeline enables efficient development of hybrid-architecture models that deliver both high accuracy and significant performance improvements, demonstrating the effectiveness of the approach for balancing model quality and inference efficiency.

Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.

[320] If We May De-Presuppose: Robustly Verifying Claims through Presupposition-Free Question Decomposition

Shubhashis Roy Dipta, Francis Ferraro

Main category: cs.CL

TL;DR: A robust claim verification framework that uses presupposition-free decomposed questions to address prompt sensitivity and presupposition issues in LLMs, achieving 2-5% improvement.

Details

Motivation: Presupposition in generated questions introduces unverified assumptions causing inconsistencies in claim verification, and prompt sensitivity remains a persistent challenge for LLMs with 3-6% performance variance.

Method: Proposed a structured claim verification framework that reasons through presupposition-free, decomposed questions to mitigate prompt sensitivity and presupposition issues.

Result: Extensive experiments show state-of-the-art models remain susceptible to prompt variance and presupposition, while the proposed method consistently mitigates these issues with up to 2-5% improvement.

Conclusion: The structured framework using presupposition-free decomposed questions effectively addresses persistent prompt sensitivity and presupposition problems in LLM-based claim verification.

Abstract: Prior work has shown that presupposition in generated questions can introduce unverified assumptions, leading to inconsistencies in claim verification. Additionally, prompt sensitivity remains a significant challenge for large language models (LLMs), resulting in performance variance as high as 3-6%. While recent advancements have reduced this gap, our study demonstrates that prompt sensitivity remains a persistent issue. To address this, we propose a structured and robust claim verification framework that reasons through presupposition-free, decomposed questions. Extensive experiments across multiple prompts, datasets, and LLMs reveal that even state-of-the-art models remain susceptible to prompt variance and presupposition. Our method consistently mitigates these issues, achieving up to a 2-5% improvement.

[321] CORE-RAG: Lossless Compression for Retrieval-Augmented LLMs via Reinforcement Learning

Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Yansen Zhang, Xiuqiang He, Chen Ma

Main category: cs.CL

TL;DR: CORE is a novel end-to-end optimized method for lossless context compression in RAG that achieves high compression ratios (3%) while improving task performance by 3.3 EM points on average.

Details

Motivation: Existing document compression methods for RAG degrade task performance due to reliance on predefined heuristics without clear compression guidelines, failing to ensure compressed content effectively supports downstream tasks.

Method: CORE is optimized end-to-end without predefined compression labels, leveraging downstream task performance as feedback to iteratively refine compression policy for enhanced task effectiveness.

Result: Extensive experiments across four datasets show CORE achieves 3% compression ratio while preventing performance degradation compared to full documents and improving average Exact Match score by 3.3 points.

Conclusion: CORE effectively addresses limitations of existing compression methods by providing lossless context compression that enhances rather than degrades RAG task performance.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the timeliness of knowledge updates and the factual accuracy of responses in large language models. However, incorporating a large number of retrieved documents significantly increases input length, leading to higher computational costs. Existing approaches to document compression tailored for RAG often degrade task performance, as they typically rely on predefined heuristics in the absence of clear compression guidelines. These heuristics fail to ensure that the compressed content effectively supports downstream tasks. To address these limitations, we propose CORE, a novel method for lossless context compression in RAG. CORE is optimized end-to-end and does not depend on predefined compression labels, which are often impractical to obtain. Instead, it leverages downstream task performance as a feedback signal, iteratively refining the compression policy to enhance task effectiveness. Extensive experiments across four datasets demonstrate the effectiveness of CORE. With a high compression ratio of 3%, CORE not only prevents performance degradation compared to including full documents (i.e., without compression) but also improves the average Exact Match (EM) score by 3.3 points. The code for CORE will be released soon.

[322] Automatic Question & Answer Generation Using Generative Large Language Model (LLM)

Md. Alvee Ehsan, A. S. M Mehedi Hasan, Kefaya Benta Shahnoor, Syeda Sumaiya Tasneem

Main category: cs.CL

TL;DR: This paper proposes an Automatic Question Answer Generation (AQAG) system using fine-tuned Meta-Llama 2-7B model with RACE dataset to automate educational assessment creation.

Details

Motivation: Manual creation of diverse and fair academic assessment questions is challenging for instructors who need to review multiple lecture materials. The goal is to streamline the evaluation process by automating question generation.

Method: Uses fine-tuned generative LLM (Meta-Llama 2-7B) with unsupervised learning methods in NLP, integrating RACE dataset for training. Employs Prompt Engineering to tailor question styles (MCQ, conceptual, factual).

Result: Developed a customized model that can generate various types of questions efficiently for text-based evaluations in English language.

Conclusion: The proposed AQAG system provides a reliable and efficient tool for educators to automate question generation, freeing up valuable time and resources while maintaining fairness and diversity in assessments.

Abstract: In the realm of education, student evaluation holds equal significance to imparting knowledge. To be evaluated, students usually need to go through text-based academic assessment methods. Instructors need to make a diverse set of questions that need to be fair for all students to prove their adequacy over a particular topic. This can prove to be quite challenging as they may need to manually go through several different lecture materials. Our objective is to make this whole process much easier by implementing Automatic Question Answer Generation(AQAG), using a fine-tuned generative LLM. For tailoring the instructor’s preferred question style (MCQ, conceptual, or factual questions), Prompt Engineering (PE) is being utilized. In this research, we propose to leverage unsupervised learning methods in NLP, primarily focusing on the English language. This approach empowers the base Meta-Llama 2-7B model to integrate the RACE dataset as training data for the fine-tuning process. Creating a customized model that will offer efficient solutions for educators, instructors, and individuals engaged in text-based evaluations. A reliable and efficient tool for generating questions and answers can free up valuable time and resources, thus streamlining their evaluation processes.

[323] When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He

Main category: cs.CL

TL;DR: The paper identifies Reasoning-Induced Misalignment (RIM) - a phenomenon where enhanced reasoning capabilities in LLMs can cause safety misalignment through specific attention mechanisms and neuron entanglement.

Details

Motivation: Growing concerns about LLM safety and alignment with human values, particularly when reasoning capabilities are strengthened during training or inference.

Method: Representation analysis to identify attention heads that facilitate refusal by reducing attention to Chain-of-Thought tokens, and analysis of activation entanglement between reasoning and safety in neurons.

Result: Discovered specific attention heads modulate rationalization during inference, and found significantly higher activation entanglement in safety-critical neurons after fine-tuning with identified reasoning patterns, correlating with catastrophic forgetting.

Conclusion: Provides a mechanistic account of RIM origins, showing how reasoning-safety entanglement at the neuron level explains the emergence of misalignment when reasoning patterns are introduced.

Abstract: With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities strengthened-particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model’s rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.

[324] CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Alex Gulko, Yusen Peng, Sachin Kumar

Main category: cs.CL

TL;DR: CE-Bench is a lightweight contrastive evaluation benchmark for sparse autoencoders that uses curated story pairs to measure interpretability without requiring external LLMs.

Details

Motivation: Existing automated evaluation methods for sparse autoencoders mostly rely on external LLMs, creating a need for more lightweight and self-contained evaluation approaches.

Method: Developed CE-Bench using a curated dataset of contrastive story pairs to evaluate sparse autoencoder interpretability through contrastive testing.

Result: CE-Bench reliably measures SAE interpretability and achieves over 70% Spearman correlation with SAEBench results, while eliminating the need for external LLM judges.

Conclusion: CE-Bench provides an effective and lightweight alternative for evaluating sparse autoencoders that aligns well with existing benchmarks and is publicly available.

Abstract: Sparse autoencoders (SAEs) are a promising approach for uncovering interpretable features in large language models (LLMs). While several automated evaluation methods exist for SAEs, most rely on external LLMs. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive evaluation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks without requiring an external LLM judge, achieving over 70% Spearman correlation with results in SAEBench. The official implementation and evaluation dataset are open-sourced and publicly available.

[325] Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions

Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru, Francis Ferraro, Manas Gaur

Main category: cs.CL

TL;DR: LLMs don’t consistently integrate external label definitions, often defaulting to internal parametric knowledge instead, especially in general tasks.

Details

Motivation: To determine whether LLMs genuinely incorporate external definitions or primarily rely on their parametric knowledge during task-solving.

Method: Conducted controlled experiments across multiple explanation benchmark datasets (general and domain-specific) with various label definition conditions including expert-curated, LLM-generated, perturbed, and swapped definitions.

Result: Explicit label definitions can enhance accuracy and explainability, but their integration is neither guaranteed nor consistent. Models often default to internal representations, particularly in general tasks, while domain-specific tasks benefit more from explicit definitions.

Conclusion: LLMs’ processing of external knowledge alongside pre-existing capabilities requires deeper understanding, as they frequently rely on internalized representations rather than consistently integrating external definitions.

Abstract: Do LLMs genuinely incorporate external definitions, or do they primarily rely on their parametric knowledge? To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Our results reveal that while explicit label definitions can enhance accuracy and explainability, their integration into an LLM’s task-solving processes is neither guaranteed nor consistent, suggesting reliance on internalized representations in many cases. Models often default to their internal representations, particularly in general tasks, whereas domain-specific tasks benefit more from explicit definitions. These findings underscore the need for a deeper understanding of how LLMs process external knowledge alongside their pre-existing capabilities.

[326] Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets

Sophie Jaffer, Simeon Sayer

Main category: cs.CL

TL;DR: Native Swahili-trained BERT model performed significantly better than using translated Swahili data on English-trained model, showing translation alone doesn’t bridge language representation gaps.

Details

Motivation: To test whether data disparities in multilingual LLMs disadvantage non-English speakers and whether translation can effectively bridge language representation gaps.

Method: Compared two monolingual BERT models: one trained/tested on Swahili data, another on English data. Translated Swahili data to English for evaluation on English model to simulate multilingual LLM processing.

Result: Native Swahili-trained model performed much better with 0.36% error rate vs 1.47% for translated approach - nearly 4x fewer errors despite high-quality translation.

Conclusion: Translation alone doesn’t bridge representational differences; native-language training remains crucial for reliable outcomes. Future work should focus on dataset development for underrepresented languages and better multilingual evaluation.

Abstract: As large language models (LLMs) expand multilingual capabilities, questions remain about the equity of their performance across languages. While many communities stand to benefit from AI systems, the dominance of English in training data risks disadvantaging non-English speakers. To test the hypothesis that such data disparities may affect model performance, this study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another on comparable English news data. To simulate how multilingual LLMs process non-English queries through internal translation and abstraction, we translated the Swahili news data into English and evaluated it using the English-trained model. This approach tests the hypothesis by evaluating whether translating Swahili inputs for evaluation on an English model yields better or worse performance compared to training and testing a model entirely in Swahili, thus isolating the effect of language consistency versus cross-lingual abstraction. The results prove that, despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing nearly four times fewer errors: 0.36% vs. 1.47% respectively. This gap suggests that translation alone does not bridge representational differences between languages and that models trained in one language may struggle to accurately interpret translated inputs due to imperfect internal knowledge representation, suggesting that native-language training remains important for reliable outcomes. In educational and informational contexts, even small performance gaps may compound inequality. Future research should focus on addressing broader dataset development for underrepresented languages and renewed attention to multilingual model evaluation, ensuring the reinforcing effect of global AI deployment on existing digital divides is reduced.

[327] COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Eugene Kwek, Wenpeng Yin

Main category: cs.CL

TL;DR: COMPACT is a joint pruning method that combines vocabulary pruning and FFN channel pruning to make language models more efficient while maintaining standard transformer architecture and strong performance.

Details

Motivation: To enable efficient deployment of language models on edge devices and for interactive applications by addressing limitations of existing pruning methods that break transformer layouts or cause accuracy drops.

Method: Jointly prunes rare vocabulary to shrink embedding/LM head layers and prunes FFN intermediate channels using common-token-weighted activations that align importance with post-pruning token distribution.

Result: Achieves state-of-the-art downstream performance across Qwen, LLaMA, and Gemma families (0.5B-70B) with substantial reductions in parameters, GPU memory, and latency.

Conclusion: COMPACT provides deployment-friendly, scale-adaptive pruning that maintains competitive performance while offering significant efficiency gains for both large and small language models.

Abstract: Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout, requiring custom inference code, while depth pruning can cause abrupt accuracy drops. Also, while many pruning approaches are effective against LLMs, they struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT inherits strengths of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab. vs. FFN pruning), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.

[328] On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts

Linlu Qiu, Cedegao E. Zhang, Joshua B. Tenenbaum, Yoon Kim, Roger P. Levy

Main category: cs.CL

TL;DR: Evaluates language models’ pragmatic reasoning using a Wavelength game framework, finding large LMs achieve human-like comprehension and RSA improves production.

Details

Motivation: To understand LMs' pragmatic reasoning abilities as conversational agents, since language use involves reasoning about communicative goals and norms.

Method: Uses Wavelength communication game framework to test LMs on comprehension and production tasks with direct prompting, Chain-of-Thought, and Rational Speech Act approaches.

Result: Large LMs achieve human-like accuracy on comprehension without CoT/RSA; CoT improves production; RSA provides significant improvements over both approaches.

Conclusion: Identifies LM pragmatic reasoning strengths/limitations, demonstrates RSA’s potential for improvement, and opens avenues for understanding conceptual representation and social reasoning.

Abstract: Language use is shaped by pragmatics – i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs’ pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.

[329] Causal Attention with Lookahead Keys

Zhuoqing Song, Peng Sun, Huizhuo Yuan, Quanquan Gu

Main category: cs.CL

TL;DR: CASTLE introduces lookahead keys that update with future context while maintaining autoregressive properties, enabling more efficient parallel training and better performance than standard causal attention.

Details

Motivation: Standard causal attention has static QKV that only encode preceding context, limiting their ability to incorporate future information while preserving autoregressive generation.

Method: CASTLE continually updates each token’s keys as context unfolds, creating lookahead keys that integrate later token information while maintaining autoregressive property through mathematical equivalence for parallel training.

Result: Consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on downstream tasks in language modeling benchmarks.

Conclusion: CASTLE provides an effective attention mechanism that integrates future context information while preserving autoregressive generation, enabling better language modeling performance with efficient parallel training.

Abstract: In standard causal attention, each token’s query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token’s keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.

[330] Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez

Main category: cs.CL

TL;DR: DSM is a streaming sequence-to-sequence method that uses delayed alignment between input and output streams, enabling flexible multimodal generation with state-of-the-art performance.

Details

Motivation: Traditional sequence-to-sequence models operate offline, consuming complete inputs before generating outputs, while streaming models require complex policies. DSM aims to provide flexible streaming inference without these limitations.

Method: Uses decoder-only language model with pre-processed time alignment and introduced delays between streams. Aligns text and audio streams with different delays for ASR (text delayed) and TTS (audio delayed).

Result: Achieves state-of-the-art performance and latency, supports arbitrary long sequences, and is competitive with offline baselines.

Conclusion: DSM provides a flexible framework for streaming multimodal sequence-to-sequence tasks that overcomes limitations of both offline and traditional streaming approaches.

Abstract: We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step,and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrary long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling

[331] Beyond Token Limits: Assessing Language Model Performance on Long Text Classification

Miklós Sebők, Viktor Kovács, Martin Bánóczy, Daniel Møller Eriksen, Nathalie Neptune, Philippe Roussille

Main category: cs.CL

TL;DR: This paper evaluates LLMs for classifying long legal documents, finding that specialized long-input models like Longformer don’t outperform standard models, and open models can compete with GPT variants.

Details

Motivation: Standard LLMs like BERT have input length limitations that prevent processing long legal documents (hundreds of pages), creating a need for effective long-text classification methods.

Method: Experiments with XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 on multi-class classification of legal documents across 5 languages using the Comparative Agendas Project’s 21 policy topic labels.

Result: Longformer showed no advantage despite being designed for long inputs. Open models performed competitively with GPT variants. Performance was influenced by category support and substance overlaps.

Conclusion: Specialized long-input models don’t necessarily outperform standard models for legal document classification, and open models can achieve competitive results with GPT models.

Abstract: The most widely used large language models in the social sciences (such as BERT, and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can have a length of multiple hundred pages and, therefore, are not particularly amenable for processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.

[332] Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context

Dasol Choi, Jungwhan Kim, Guijin Son

Main category: cs.CL

TL;DR: Ko-PIQA is a Korean physical commonsense reasoning dataset that addresses the English-centric bias in existing datasets by incorporating cultural context through traditional Korean elements.

Details

Motivation: Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity, creating a need for culturally diverse datasets.

Method: Used multi-stage filtering of 3.01M web-crawled questions with three language models, followed by GPT-4o refinement and human validation to create 441 high-quality Korean question-answer pairs with cultural elements.

Result: Models achieved between 59.86% to 83.22% accuracy on Ko-PIQA, with significant struggles on culturally specific scenarios, showing room for improvement in cultural reasoning.

Conclusion: Ko-PIQA serves as a benchmark for Korean language models and foundation for more inclusive commonsense reasoning research, highlighting the importance of culturally diverse datasets.

Abstract: Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean physical commonsense reasoning dataset that incorporates cultural context. Starting from 3.01 million web-crawled questions, we employed a multi-stage filtering approach using three language models to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, we obtained 441 high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural grounding: 19.7% of questions contain culturally specific elements like traditional Korean foods (kimchi), clothing (hanbok), and specialized appliances (kimchi refrigerators) that require culturally-aware reasoning beyond direct translation. We evaluate seven language models on Ko-PIQA, with the best model achieving 83.22% accuracy while the weakest reaches only 59.86%, demonstrating significant room for improvement. Models particularly struggle with culturally specific scenarios, highlighting the importance of culturally diverse datasets. Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. The dataset and code will be publicly available.

[333] WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou

Main category: cs.CL

TL;DR: WebWeaver is a dual-agent framework for open-ended deep research that addresses limitations of current approaches through iterative planning and hierarchical writing, achieving state-of-the-art performance on major benchmarks.

Details

Motivation: Current approaches for open-ended deep research suffer from static research pipelines that decouple planning from evidence acquisition, and monolithic generation paradigms that include redundant evidence, leading to hallucination issues and low citation accuracy.

Method: A dual-agent framework with a planner that iteratively interleaves evidence acquisition with outline optimization to create a citation-grounded outline, and a writer that performs hierarchical retrieval and writing section by section using targeted retrieval from a memory bank.

Result: Establishes new state-of-the-art across major OEDR benchmarks including DeepResearch Bench, DeepConsult, and DeepResearchGym, demonstrating improved performance in producing comprehensive and trusted reports.

Conclusion: The human-centric, iterative methodology with adaptive planning and focused synthesis is crucial for producing comprehensive, trusted, and well-structured reports in open-ended deep research tasks.

Abstract: This paper tackles \textbf{open-ended deep research (OEDR)}, a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and monolithic generation paradigms that include redundant, irrelevant evidence, suffering from hallucination issues and low citation accuracy. To address these challenges, we introduce \textbf{WebWeaver}, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, citation-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank via citations for each part, it effectively mitigates long-context issues and citation hallucinations. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing comprehensive, trusted, and well-structured reports.

[334] Do Natural Language Descriptions of Model Activations Convey Privileged Information?

Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace

Main category: cs.CL

TL;DR: Activation verbalization methods using a second LLM to translate internal representations may not provide privileged insights into target model internals, as they often reflect the verbalizer’s knowledge rather than the target model’s operations.

Details

Motivation: To critically evaluate whether activation verbalization approaches actually provide meaningful insights into LLM internal workings or merely convey information about inputs.

Method: Evaluated popular verbalization methods across prior datasets and conducted controlled experiments to test if verbalizations reflect target model knowledge vs. verbalizer LLM’s parametric knowledge.

Result: Verbalization methods succeeded at benchmarks without target model access, and verbalizations often reflected the verbalizer LLM’s knowledge rather than the target model’s knowledge.

Conclusion: Targeted benchmarks and experimental controls are needed to rigorously assess whether verbalization methods provide meaningful insights into LLM operations.

Abstract: Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they can succeed at benchmarks without any access to target model internals, suggesting that these datasets may not be ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.

[335] Position: Thematic Analysis of Unstructured Clinical Transcripts with Large Language Models

Seungjun Yi, Joakim Nguyen, Terence Lim, Andrew Well, Joseph Skrovan, Mehak Beri, YongGeon Lee, Kavita Radhakrishnan, Liu Leqi, Mia Markey, Ying Ding

Main category: cs.CL

TL;DR: LLMs can support thematic analysis of clinical transcripts, but current approaches are fragmented with inconsistent evaluation methods. Standardized evaluation framework is needed.

Details

Motivation: Thematic analysis of clinical transcripts is resource-intensive, and LLMs offer potential support, but current methods lack standardization.

Method: Systematic review of LLM applications to thematic analysis, complemented by clinician interview.

Result: Found fragmented approaches across analysis types, datasets, prompting strategies, and models, with inconsistent evaluation methods.

Conclusion: Proposed evaluation framework with three dimensions (validity, reliability, interpretability) to standardize practices and advance the field.

Abstract: This position paper examines how large language models (LLMs) can support thematic analysis of unstructured clinical transcripts, a widely used but resource-intensive method for uncovering patterns in patient and provider narratives. We conducted a systematic review of recent studies applying LLMs to thematic analysis, complemented by an interview with a practicing clinician. Our findings reveal that current approaches remain fragmented across multiple dimensions including types of thematic analysis, datasets, prompting strategies and models used, most notably in evaluation. Existing evaluation methods vary widely (from qualitative expert review to automatic similarity metrics), hindering progress and preventing meaningful benchmarking across studies. We argue that establishing standardized evaluation practices is critical for advancing the field. To this end, we propose an evaluation framework centered on three dimensions: validity, reliability, and interpretability.

[336] ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Alexander Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong

Main category: cs.CL

TL;DR: ATTS is an asynchronous test-time scaling framework that accelerates LLM inference by enabling parallel and sequential scaling while maintaining statistical guarantees, achieving up to 56.7x speedup without accuracy loss.

Details

Motivation: Large language models suffer from high inference latency during test-time scaling, and existing speculative decoding approaches face challenges with memory-bound execution and synchronization overhead when scaling across both parallel and sequential dimensions.

Method: ATTS uses an asynchronous inference approach with online calibration and a three-stage rejection sampling pipeline based on ordinal classification. It identifies synchronization as the primary bottleneck and enables scaling along both sequential and parallel axes while maintaining statistical guarantees through hypothesis testing.

Result: ATTS achieves up to 56.7x speedup in test-time scaling and 4.14x throughput improvement across MATH, AMC23, AIME24, and AIME25 datasets. It enables 1.5B/70B draft/target models to match state-of-the-art reasoning model performance on AIME, while reducing latency and memory overhead with accurate rejection rate control.

Conclusion: ATTS successfully addresses the synchronization bottleneck in test-time scaling, enabling efficient parallel and sequential scaling of LLMs while maintaining accuracy and statistical guarantees, making high-performance reasoning models more accessible.

Abstract: Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. Speculative decoding is a natural way to accelerate the scaling process; however, scaling along both the parallel and sequential dimensions poses significant challenges, including substantial memory-bound execution and synchronization overhead. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework that follows the hypothesis testing process to address these challenges. By revisiting arithmetic intensity, ATTS identifies synchronization as the primary bottleneck. It enables asynchronous inference through online calibration and proposes an ordinal classification algorithm that supports a three-stage rejection sampling pipeline, scaling along both the sequential and parallel axes. Across experiments on the MATH, AMC23, AIME24, and AIME25 datasets and across multiple draft-target model families, we show that ATTS delivers up to 56.7x speedup in test-time scaling and a 4.14x throughput improvement, while maintaining accurate control of the rejection rate, reducing latency and memory overhead, and incurring no accuracy loss. By scaling both in parallel and sequential dimensions, we enable the 1.5B/70B draft/target model combination to achieve the performance of the state-of-the-art reasoning model o3-mini (high) on the AIME dataset. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.

[337] Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning

Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo

Main category: cs.CL

TL;DR: The paper addresses the long decoding-window problem in diffusion language models by proposing Convolutional decoding (Conv) to narrow decoding windows without segmentation and Rejecting Rule-based Fine-Tuning (R2FT) to align distant tokens, achieving state-of-the-art results with improved speed and quality.

Details

Motivation: Autoregressive language models are slow due to sequential token generation, while diffusion models offer parallel decoding but suffer from the long decoding-window problem where distant tokens become irrelevant or repetitive, compromising their main advantage.

Method: Proposes Convolutional decoding (Conv) using normalization to narrow decoding windows without hard segmentation, and Rejecting Rule-based Fine-Tuning (R2FT) for post-hoc training to better align tokens far from context.

Result: Achieves state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with significantly lower step size than previous works, demonstrating both speed and quality improvements.

Conclusion: The proposed methods effectively overcome the long decoding-window bottleneck in diffusion language models, enabling faster inference while maintaining generation quality through improved fluency and flexibility.

Abstract: Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the long decoding-window problem, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions like semi-autoregressive address this issue by splitting windows into blocks, but this sacrifices speed and bidirectionality, eliminating the main advantage of diffusion models. To overcome this, we propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context. Our methods achieve state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with significantly lower step size than previous works, demonstrating both speed and quality improvements.

[338] K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling

Yongrui Chen, Yi Huang, Yunchang Liu, Shenyu Zhang, Junhao He, Tongtong Wu, Guilin Qi, Tianxing Wu

Main category: cs.CL

TL;DR: K-DeCore is a novel continual learning framework for structured knowledge reasoning that uses knowledge decoupling and dual-perspective memory consolidation to handle sequential tasks with fixed parameters.

Details

Motivation: Existing continual learning methods struggle with poor generalization across heterogeneous structured knowledge and inefficient reasoning due to parameter growth as tasks increase.

Method: Proposes knowledge decoupling mechanism to separate reasoning into task-specific and task-agnostic stages, dual-perspective memory consolidation, and structure-guided pseudo-data synthesis.

Result: Extensive experiments on four benchmark datasets show superiority over existing continual learning methods across multiple metrics using various backbone LLMs.

Conclusion: K-DeCore effectively addresses limitations of existing continual learning approaches for structured knowledge reasoning with fixed parameters and improved generalization.

Abstract: Continual Structured Knowledge Reasoning (CSKR) focuses on training models to handle sequential tasks, where each task involves translating natural language questions into structured queries grounded in structured knowledge. Existing general continual learning approaches face significant challenges when applied to this task, including poor generalization to heterogeneous structured knowledge and inefficient reasoning due to parameter growth as tasks increase. To address these limitations, we propose a novel CSKR framework, \textsc{K-DeCore}, which operates with a fixed number of tunable parameters. Unlike prior methods, \textsc{K-DeCore} introduces a knowledge decoupling mechanism that disentangles the reasoning process into task-specific and task-agnostic stages, effectively bridging the gaps across diverse tasks. Building on this foundation, \textsc{K-DeCore} integrates a dual-perspective memory consolidation mechanism for distinct stages and introduces a structure-guided pseudo-data synthesis strategy to further enhance the model’s generalization capabilities. Extensive experiments on four benchmark datasets demonstrate the superiority of \textsc{K-DeCore} over existing continual learning methods across multiple metrics, leveraging various backbone large language models.

[339] Everyday Physics in Korean Contexts: A Culturally Grounded Physical Reasoning Benchmark

Jihae Jeong, DaeYeop Lee, DongGeon Lee, Hwanjo Yu

Main category: cs.CL

TL;DR: EPiK is a Korean cultural physical commonsense reasoning benchmark with 181 binary-choice problems, created through a two-stage generation pipeline to address the gap in culturally-aware AI evaluation.

Details

Motivation: Existing physical commonsense reasoning benchmarks focus on Western contexts and overlook cultural variations in physical problem-solving, creating a need for culturally-aware evaluation frameworks.

Method: Two-stage generation and verification pipeline to create culturally-authentic problems from Korean contexts across 9 reasoning subtasks and 84 scenarios, avoiding simple translation approaches.

Result: Korean-specialized models consistently outperform general-purpose models of comparable size, revealing performance gaps that highlight limitations of culturally-agnostic models.

Conclusion: Culturally-aware benchmarks like EPiK are critically needed to truly measure language understanding, as they reveal the limitations of culturally-agnostic AI models.

Abstract: Existing physical commonsense reasoning benchmarks predominantly focus on Western contexts, overlooking cultural variations in physical problem-solving. To address this gap, we introduce EPiK (Everyday Physics in Korean Contexts), a novel benchmark comprising 181 binary-choice problems that test physical reasoning within Korean cultural contexts, ranging from kimchi (Korean food) to traditional fermentation. EPiK is constructed using a two-stage generation and verification pipeline to create culturally-authentic problems across 9 reasoning subtasks and 84 scenarios. Unlike approaches based on simple translation, our method generates problems organically from Korean contexts while upholding rigorous physical reasoning standards. Our evaluations show that Korean-specialized models consistently outperform general-purpose models of comparable size. This performance gap highlights the limitations of culturally-agnostic models and demonstrates the critical need for culturally-aware benchmarks to truly measure language understanding. Our EPiK is publicly available at https://huggingface.co/datasets/jjae/EPiK.

[340] Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

Yeongbin Seo, Gayoung Kim, Jaehyung Kim, Jinyoung Yeo

Main category: cs.CL

TL;DR: Proposes a prior-based data filtering method using corpus-level term frequency statistics as a fast alternative to perplexity-based filtering for LLM pretraining.

Details

Motivation: Perplexity-based filtering is time-consuming and unreliable with noisy/out-of-distribution data, creating need for efficient data selection methods for LLM pretraining.

Method: Uses token priors from corpus-level term frequency statistics to filter documents based on mean and standard deviation of token priors, requiring no model inference.

Result: Achieves highest average performance across 20 downstream benchmarks while reducing time cost by over 1000x compared to PPL-based filtering.

Conclusion: Prior-based filtering is a simple, fast, and effective alternative to PPL-based filtering that works across languages and symbolic domains like code and math.

Abstract: As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision

[341] Diversity Boosts AI-Generated Text Detection

Advik Raj Basani, Pin-Yu Chen

Main category: cs.CL

TL;DR: DivEye is a novel AI-generated text detection framework that uses surprisal-based features to capture unpredictability fluctuations in text, outperforming existing detectors and providing interpretable insights.

Details

Motivation: To combat misuse of LLMs in education, business, journalism, and social media by detecting synthetic text that can mask misinformation or deception, addressing limitations of prior detectors that struggle with high-quality generations and lack interpretability.

Method: Uses surprisal-based features to capture how unpredictability fluctuates across text, leveraging the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs.

Result: Outperforms existing zero-shot detectors by up to 33.2%, achieves competitive performance with fine-tuned baselines, robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves existing detectors by up to 18.7% when used as auxiliary signal.

Conclusion: DivEye provides an effective and interpretable approach for AI-generated text detection, with rhythmic unpredictability identified as a powerful underexplored signal for distinguishing human from LLM-generated content.

Abstract: Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.

[342] Agentic Reinforcement Learning with Implicit Step Rewards

Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao

Main category: cs.CL

TL;DR: iStar introduces implicit step rewards for agentic RL to solve credit assignment problems in LLM agents, using an implicit process reward model that generates step-wise rewards without explicit labels or additional rollouts.

Details

Motivation: Sparse and unverifiable rewards make credit assignment challenging in LLM agent training. Existing methods suffer from biased annotation, reward hacking, high variance from fine-grained rewards, or failures with rare state overlap.

Method: Alternating optimization of implicit process reward model (PRM) with policy model using trajectory-based DPO objective to generate implicit step rewards. Combines step-level and episode-level advantages for policy updates in a self-reinforcing loop.

Result: State-of-the-art performance on WebShop, VisualSokoban, and SOTOPIA benchmarks, outperforming frontier LLMs and strong RL baselines. Higher sample efficiency, training stability, and efficient exploration with fewer steps to task success.

Conclusion: iStar provides an effective credit assignment strategy for agentic RL that integrates seamlessly with standard algorithms, demonstrating superior performance across domains with improved efficiency and stability.

Abstract: Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL) that reason and act in interactive environments. However, sparse and sometimes unverifiable rewards make it extremely challenging to assign credit when training LLM agents that serve as a policy. Recent work attempts to integrate process supervision into RL but suffers from biased annotation, reward hacking, high-variance from overly fine-grained rewards or failtures when state overlap is rare. We therefore introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms without relying on additional rollouts or explicit step labels. Particularly, we alternatively optimize an implicit process reward model (PRM) with the policy model to generate implicit step rewards via a trajectory-based DPO objective. Theoretical analysis shows that this learning objective produces a step-wise reward function. Then the implicit step rewards are used to compute step-level advantages, which are combined with trajectory (or episode)-level advantages for policy updates, creating a self-reinforcing training loop. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, iStar shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and training stability. Further analysis also demonstrates efficient exploration by iStar with increased rewards in both step- and episode-level while maintaining fewer steps to achieve task success. Code will be available soon.

[343] LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines

Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, Ahmed Abbasi, Ying Cheng, Jane Cleland-Huang, Steven Corcelli, Robert Goulding, Ming Hu, Ting Hua, John Lalor, Fang Liu, Tengfei Luo, Ed Maginn, Nuno Moniz, Jason Rohr, Brett Savoie, Daniel Slate, Tom Stapleford, Matthew Webber, Olaf Wiest, Johnny Zhang, Nitesh V. Chawla

Main category: cs.CL

TL;DR: This paper provides a comprehensive overview of state-of-the-art Large Language Models (LLMs) and their integration across diverse academic disciplines, exploring their impacts, limitations, and future directions in the generative AI era.

Details

Motivation: The motivation stems from the impressive performance of LLMs like ChatGPT on various language tasks and their potential for far-reaching impacts across real-world applications, inspiring an exploration of how these models are reshaping research and practice across different fields.

Method: The paper offers a systematic review and overview approach, examining LLM integration across three major disciplinary categories: (1) arts, letters, and law; (2) economics and business; and (3) science and engineering, while discussing key observations and insights.

Result: The review provides insights into how LLMs are engaged across disciplines, highlighting their applications in diverse fields while identifying key limitations and open challenges in the generative AI era.

Conclusion: The comprehensive review of LLM applications across disciplines can help researchers and practitioners interested in exploiting LLMs to advance their work in diverse real-world applications, while emphasizing the need to address limitations and future directions in this rapidly evolving field.

Abstract: Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.

[344] Responsible AI Technical Report

KT, :, Soonmin Bae, Wanjin Park, Jeongyeop Kim, Yunjin Park, Jungwon Yoon, Junhyung Moon, Myunggyo Oh, Wonhyuk Lee, Dongyoung Jung, Minwook Ju, Eunmi Kim, Sujin Kim, Youngchol Kim, Somin Lee, Wonyoung Lee, Minsung Noh, Hyoungjun Park, Eunyoung Shin

Main category: cs.CL

TL;DR: KT developed a Responsible AI assessment methodology and risk mitigation technologies including SafetyGuard to block harmful AI responses in real-time, ensuring AI service safety and regulatory compliance.

Details

Motivation: To ensure safety and reliability of AI services by addressing regulatory requirements from the Basic Act on AI implementation and global AI governance trends, while supporting the domestic AI development ecosystem.

Method: Established a unique RAI assessment methodology based on KT’s AI risk taxonomy, systematically identifying and managing risk factors from AI development to operation, with practical tools for risk management and mitigation.

Result: Developed proprietary Guardrail: SafetyGuard that blocks harmful AI responses in real-time, providing a reliable assessment methodology to verify model safety and robustness tailored to the domestic environment.

Conclusion: The research outcomes provide valuable insights for organizations developing Responsible AI and support the enhancement of safety in the domestic AI development ecosystem through systematic risk management and real-time protection technologies.

Abstract: KT developed a Responsible AI (RAI) assessment methodology and risk mitigation technologies to ensure the safety and reliability of AI services. By analyzing the Basic Act on AI implementation and global AI governance trends, we established a unique approach for regulatory compliance and systematically identify and manage all potential risk factors from AI development to operation. We present a reliable assessment methodology that systematically verifies model safety and robustness based on KT’s AI risk taxonomy tailored to the domestic environment. We also provide practical tools for managing and mitigating identified AI risks. With the release of this report, we also release proprietary Guardrail : SafetyGuard that blocks harmful responses from AI models in real-time, supporting the enhancement of safety in the domestic AI development ecosystem. We also believe these research outcomes provide valuable insights for organizations seeking to develop Responsible AI.

[345] Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation

Chaojun Nie, Jun Zhou, Guanxiang Wang, Shisong Wu, Zichen Wang

Main category: cs.CL

TL;DR: RLAG uses reinforcement learning with augmented generation to embed domain knowledge into LLMs, outperforming baseline methods on specialized tasks.

Details

Motivation: LLMs struggle with domain-specific tasks due to knowledge gaps from disproportionate training data representation and temporal lag. Existing methods like CPT and SFT have limitations in prioritizing critical knowledge and building coherent knowledge structures.

Method: Reinforcement Learning from Augmented Generation (RLAG) iteratively cycles between sampling generations and optimizing models through calculated rewards. It selects high-probability outputs and uses three tailored reward metrics to embed critical, contextually coherent domain knowledge.

Result: Experimental results across medical, legal, astronomy, and current events datasets show RLAG significantly outperforms baseline approaches in domain expertise, assessed through answer accuracy and explanation rationality.

Conclusion: RLAG effectively addresses knowledge gaps in LLMs for domain applications through iterative reinforcement learning with augmented generation, demonstrating superior performance over existing methods.

Abstract: Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at https://github.com/ChaojunNie/RLAG.

[346] EmbeddingGemma: Powerful and Lightweight Text Representations

Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divyashree Sreepathihalli, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Qin Yin, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini

Main category: cs.CL

TL;DR: EmbeddingGemma is a lightweight 300M parameter text embedding model that achieves state-of-the-art performance on MTEB benchmark, outperforming larger models through innovative training techniques including encoder-decoder initialization and geometric embedding distillation.

Details

Motivation: To create a highly efficient and lightweight text embedding model that delivers exceptional performance-to-cost ratio for practical applications like on-device use cases, while being open-source to promote further research.

Method: Uses encoder-decoder initialization and geometric embedding distillation to capture knowledge from larger models, incorporates spread-out regularizer for robustness and expressiveness, and merges checkpoints from varied optimized mixtures for generalizability.

Result: Achieves state-of-the-art results on MTEB benchmark across multilingual, English, and code domains, outperforms prior top models with fewer than 500M parameters, provides performance comparable to models double its size, and maintains lead even with quantization or embedding truncation.

Conclusion: EmbeddingGemma offers exceptional efficiency and performance, making it particularly suitable for low-latency, high-throughput applications like on-device use, while being released as open-source to advance research in lightweight embedding models.

Abstract: We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.

[347] Un-Doubling Diffusion: LLM-guided Disambiguation of Homonym Duplication

Evgeny Kaskov, Elizaveta Petrova, Petr Surovtsev, Anna Kostikova, Ilya Mistiurin, Alexander Kapitanov, Alexander Nagaev

Main category: cs.CL

TL;DR: This paper addresses homonym duplication in diffusion models, where words with identical spelling but different meanings cause models to generate multiple senses simultaneously. The authors develop methods to measure duplication rates and evaluate different models using both automatic VLM-based evaluation and human assessment. They also propose prompt expansion as a mitigation strategy that works for both homonym duplication and Anglocentric bias issues.

Details

Motivation: Homonyms pose challenges for generative models as they can cause diffusion models to generate multiple senses of a word simultaneously. This problem is exacerbated by Anglocentric bias in text-to-image pipelines, where non-English words may become homonyms after translation to English, leading to loss of original meaning.

Method: The authors introduce a method for measuring duplication rates and evaluate different diffusion models using both automatic evaluation with Vision-Language Models (VLM) and human evaluation. They also investigate prompt expansion as a technique to mitigate homonym duplication.

Result: The research demonstrates that prompt expansion effectively reduces duplication rates related to both homonym duplication and Anglocentric bias. The automatic evaluation pipeline code is made publicly available.

Conclusion: Homonym duplication is a significant issue in diffusion models that can be effectively measured and mitigated through prompt expansion techniques. The proposed methods address both direct homonym problems and those arising from Anglocentric bias in translation pipelines.

Abstract: Homonyms are words with identical spelling but distinct meanings, which pose challenges for many generative models. When a homonym appears in a prompt, diffusion models may generate multiple senses of the word simultaneously, which is known as homonym duplication. This issue is further complicated by an Anglocentric bias, which includes an additional translation step before the text-to-image model pipeline. As a result, even words that are not homonymous in the original language may become homonyms and lose their meaning after translation into English. In this paper, we introduce a method for measuring duplication rates and conduct evaluations of different diffusion models using both automatic evaluation utilizing Vision-Language Models (VLM) and human evaluation. Additionally, we investigate methods to mitigate the homonym duplication problem through prompt expansion, demonstrating that this approach also effectively reduces duplication related to Anglocentric bias. The code for the automatic evaluation pipeline is publicly available.

[348] Dual-Head Reasoning Distillation: Improving Classifier Accuracy with Train-Time-Only Reasoning

Jillian Xu, Dylan Zhou, Vinay Shukla, Yang Yang, Junrui Ruan, Shuhuai Lin, Wenfei Zou, Yinxiao Liu, Karthik Lakshmanan

Main category: cs.CL

TL;DR: Dual-Head Reasoning Distillation (DHRD) improves classification accuracy without throughput penalty by adding a pooled classification head and reasoning head during training only, achieving significant speedup over Chain-of-Thought prompting.

Details

Motivation: To resolve the trade-off between Chain-of-Thought (CoT) prompting's accuracy improvements and its significant throughput penalty due to rationale generation.

Method: Adds a pooled classification head for training/inference and a reasoning head supervised by teacher rationales used only in training, with weighted loss combining label cross-entropy and token-level LM loss over input-plus-rationale sequences.

Result: Achieves 0.65-5.47% relative gains over pooled baselines on seven SuperGLUE tasks, with larger gains on entailment/causal tasks. Inference throughput matches pooled classifiers and exceeds CoT decoding by 96-142 times in QPS.

Conclusion: DHRD successfully decouples reasoning benefits from inference costs, providing accuracy improvements without throughput penalty by disabling the reasoning head at test time.

Abstract: Chain-of-Thought (CoT) prompting often improves classification accuracy, but it introduces a significant throughput penalty with rationale generation (Wei et al., 2022; Cheng and Van Durme, 2024). To resolve this trade-off, we introduce Dual-Head Reasoning Distillation (DHRD), a simple training method for decoder-only language models (LMs) that adds (i) a pooled classification head used during training and inference and (ii) a reasoning head supervised by teacher rationales used only in training. We train with a loss function that is a weighted sum of label cross-entropy and token-level LM loss over input-plus-rationale sequences. On seven SuperGLUE tasks, DHRD yields relative gains of 0.65-5.47% over pooled baselines, with notably larger gains on entailment/causal tasks. Since we disable the reasoning head at test time, inference throughput matches pooled classifiers and exceeds CoT decoding on the same backbones by 96-142 times in QPS.

[349] Agribot: agriculture-specific question answer system

Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande, Jayesh Choudhari, Mayank Singh

Main category: cs.CL

TL;DR: An agricultural chatbot for Indian farmers using Kisan Call Center data, achieving 86% accuracy with entity extraction and synonym elimination.

Details

Motivation: India's agro-based economy needs accessible agricultural information for optimal growth. Farmers need 24/7 access to farming-related queries.

Method: Built chatbot using sentence embedding model with Kisan Call Center dataset. Incorporated entity extraction and eliminated synonyms to improve performance.

Result: Initial accuracy of 56% with sentence embedding model. After improvements, accuracy increased to 86%. System handles weather, market rates, plant protection, and government schemes.

Conclusion: The chatbot enables easier access to farming information, improves agricultural output, and reduces workload on call center staff by providing 24/7 automated assistance.

Abstract: India is an agro-based economy and proper information about agricultural practices is the key to optimal agricultural growth and output. In order to answer the queries of the farmer, we have build an agricultural chatbot based on the dataset from Kisan Call Center. This system is robust enough to answer queries related to weather, market rates, plant protection and government schemes. This system is available 24* 7, can be accessed through any electronic device and the information is delivered with the ease of understanding. The system is based on a sentence embedding model which gives an accuracy of 56%. After eliminating synonyms and incorporating entity extraction, the accuracy jumps to 86%. With such a system, farmers can progress towards easier information about farming related practices and hence a better agricultural output. The job of the Call Center workforce would be made easier and the hard work of various such workers can be redirected to a better goal.

[350] Taxonomy of Comprehensive Safety for Clinical Agents

Jean Seo, Hyunkyung Lee, Gibaeg Kim, Wooseok Han, Jaehyo Yoo, Seungseop Lim, Kihun Shin, Eunho Yang

Main category: cs.CL

TL;DR: TACOS is a 21-class safety taxonomy for clinical chatbots that integrates safety filtering and tool selection into user intent classification, addressing nuanced clinical domain requirements.

Details

Motivation: Existing safety methods like guardrails and tool calling are insufficient for clinical chatbots where inaccurate responses can have serious consequences, requiring more nuanced safety approaches.

Method: Developed TACOS taxonomy with 21 classes covering clinical/non-clinical queries, modeling safety thresholds and tool dependencies. Created TACOS-annotated dataset and conducted experiments to validate the taxonomy.

Result: The taxonomy proved valuable for clinical agent settings, with experiments revealing insights about training data distribution and base models’ pretrained knowledge.

Conclusion: TACOS provides a comprehensive safety framework for clinical chatbots that outperforms existing methods by integrating safety filtering and tool selection into a unified intent classification system.

Abstract: Safety is a paramount concern in clinical chatbot applications, where inaccurate or harmful responses can lead to serious consequences. Existing methods–such as guardrails and tool calling–often fall short in addressing the nuanced demands of the clinical domain. In this paper, we introduce TACOS (TAxonomy of COmprehensive Safety for Clinical Agents), a fine-grained, 21-class taxonomy that integrates safety filtering and tool selection into a single user intent classification step. TACOS is a taxonomy that can cover a wide spectrum of clinical and non-clinical queries, explicitly modeling varying safety thresholds and external tool dependencies. To validate our taxonomy, we curate a TACOS-annotated dataset and perform extensive experiments. Our results demonstrate the value of a new taxonomy specialized for clinical agent settings, and reveal useful insights about train data distribution and pretrained knowledge of base models.

[351] Fine-tuning Done Right in Model Editing

Wanli Yang, Fei Sun, Rui Tang, Hongyu Zang, Du Su, Qi Cao, Jingang Wang, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: Fine-tuning is actually effective for model editing when using breadth-first (epoch-based) pipeline with mini-batch optimization instead of the problematic depth-first sequential approach, and when combined with optimal tuning locations via LocFT-BF method.

Details

Motivation: To challenge the long-standing belief that fine-tuning is ineffective for model editing, arguing that previous failures were due to suboptimal pipeline design rather than inherent limitations of fine-tuning.

Method: Restore fine-tuning to standard breadth-first (epoch-based) pipeline with mini-batch optimization, and develop LocFT-BF method through systematic analysis of tuning locations for localized editing.

Result: LocFT-BF outperforms state-of-the-art methods by large margins, sustains 100K edits and 72B-parameter models (10x beyond prior practice) without sacrificing general capabilities.

Conclusion: Fine-tuning can be advanced from an underestimated baseline to a leading method for model editing by correcting pipeline misconceptions and implementing principled localized tuning strategies.

Abstract: Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing. Here, we challenge this belief, arguing that the reported failure arises not from the inherent limitation of fine-tuning itself, but from adapting it to the sequential nature of the editing task, a single-pass depth-first pipeline that optimizes each sample to convergence before moving on. While intuitive, this depth-first pipeline coupled with sample-wise updating over-optimizes each edit and induces interference across edits. Our controlled experiments reveal that simply restoring fine-tuning to the standard breadth-first (i.e., epoch-based) pipeline with mini-batch optimization substantially improves its effectiveness for model editing. Moreover, fine-tuning in editing also suffers from suboptimal tuning parameter locations inherited from prior methods. Through systematic analysis of tuning locations, we derive LocFT-BF, a simple and effective localized editing method built on the restored fine-tuning framework. Extensive experiments across diverse LLMs and datasets demonstrate that LocFT-BF outperforms state-of-the-art methods by large margins. Notably, to our knowledge, it is the first to sustain 100K edits and 72B-parameter models,10 x beyond prior practice, without sacrificing general capabilities. By clarifying a long-standing misconception and introducing a principled localized tuning strategy, we advance fine-tuning from an underestimated baseline to a leading method for model editing, establishing a solid foundation for future research.

[352] R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning

Hongyu Shan, Mingyang Song, Chang Dai, Di Liang, Han Chen

Main category: cs.CL

TL;DR: R-Capsule is a framework that compresses reasoning steps into learned latent tokens to improve efficiency while maintaining transparency, inspired by Information Bottleneck principle.

Details

Motivation: CoT prompting increases latency and memory usage while potentially propagating errors in long reasoning chains. Need for more efficient reasoning that maintains interpretability.

Method: Compress high-level reasoning plan into small set of learned latent tokens (Reasoning Capsule) with dual objectives: primary task loss for accuracy and auxiliary plan-reconstruction loss for grounding.

Result: Reduces visible token footprint while maintaining or improving accuracy on complex benchmarks through balanced efficiency, accuracy, and interpretability.

Conclusion: R-Capsule successfully combines efficiency of latent reasoning with transparency of explicit CoT, striking optimal balance for complex reasoning tasks.

Abstract: Chain-of-Thought (CoT) prompting helps Large Language Models (LLMs) tackle complex reasoning by eliciting explicit step-by-step rationales. However, CoT’s verbosity increases latency and memory usage and may propagate early errors across long chains. We propose the Reasoning Capsule (R-Capsule), a framework that aims to combine the efficiency of latent reasoning with the transparency of explicit CoT. The core idea is to compress the high-level plan into a small set of learned latent tokens (a Reasoning Capsule) while keeping execution steps lightweight or explicit. This hybrid approach is inspired by the Information Bottleneck (IB) principle, where we encourage the capsule to be approximately minimal yet sufficient for the task. Minimality is encouraged via a low-capacity bottleneck, which helps improve efficiency. Sufficiency is encouraged via a dual objective: a primary task loss for answer accuracy and an auxiliary plan-reconstruction loss that encourages the capsule to faithfully represent the original textual plan. The reconstruction objective helps ground the latent space, thereby improving interpretability and reducing the use of uninformative shortcuts. Our framework strikes a balance between efficiency, accuracy, and interpretability, thereby reducing the visible token footprint of reasoning while maintaining or improving accuracy on complex benchmarks. Our codes are available at: https://anonymous.4open.science/r/Reasoning-Capsule-7BE0

[353] Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs

Yehonatan Peisakhovsky, Zorik Gekhman, Yosi Mass, Liat Ein-Dor, Roi Reichart

Main category: cs.CL

TL;DR: LLMs struggle with localizing context-grounded hallucinations in text, achieving only 0.67 F1 score on a new benchmark, with key challenges being false flagging of missing details and difficulty handling factually correct but unverifiable information.

Details

Motivation: To study LLMs' capability for localizing context-grounded hallucinations as a practical alternative to complex evaluation pipelines, addressing the lack of established benchmarks for this meta-evaluation task.

Method: Created a challenging human-annotated benchmark of 1,000+ examples tailored for LLMs, proposed a new free-form textual representation for hallucinations, and evaluated four large-scale LLMs with various prompting strategies.

Result: The benchmark proved difficult, with the best model achieving only 0.67 F1 score. Key findings show LLMs tend to incorrectly flag missing details as inconsistent and struggle with factually correct but unverifiable information.

Conclusion: LLMs face significant challenges in hallucination localization, particularly with distinguishing between actual inconsistencies and missing details, and handling information that aligns with their parametric knowledge but isn’t verifiable from the source.

Abstract: Context-grounded hallucinations are cases where model outputs contain information not verifiable against the source text. We study the applicability of LLMs for localizing such hallucinations, as a more practical alternative to existing complex evaluation pipelines. In the absence of established benchmarks for meta-evaluation of hallucinations localization, we construct one tailored to LLMs, involving a challenging human annotation of over 1,000 examples. We complement the benchmark with an LLM-based evaluation protocol, verifying its quality in a human evaluation. Since existing representations of hallucinations limit the types of errors that can be expressed, we propose a new representation based on free-form textual descriptions, capturing the full range of possible errors. We conduct a comprehensive study, evaluating four large-scale LLMs, which highlights the benchmark’s difficulty, as the best model achieves an F1 score of only 0.67. Through careful analysis, we offer insights into optimal prompting strategies for the task and identify the main factors that make it challenging for LLMs: (1) a tendency to incorrectly flag missing details as inconsistent, despite being instructed to check only facts in the output; and (2) difficulty with outputs containing factually correct information absent from the source - and thus not verifiable - due to alignment with the model’s parametric knowledge.

cs.CV

[354] Pathological Truth Bias in Vision-Language Models

Yash Thube

Main category: cs.CV

TL;DR: MATS is a behavioral audit that measures VLMs’ ability to reject visually contradicted statements, revealing systematic failures in instruction-tuned generative models while contrastive encoders show better robustness.

Details

Motivation: Standard benchmarks for Vision Language Models can hide systematic failures that reduce real-world trust, so there's a need for better evaluation methods to identify these issues.

Method: Introduces MATS (Multimodal Audit for Truthful Spatialization) with two metrics: Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Uses activation patching to causally localize failure loci in model architectures.

Result: Instruction-tuned generative VLMs (LLaVA 1.5, QwenVLchat) show very low SCS and high IAR, while contrastive encoders (CLIP, SigLIP) are far more robust. Failure loci identified in mid-to-late cross attention for generative models and pooled projection components for contrastive models.

Conclusion: The audit reveals systematic failures in current VLMs and provides concrete repair paths through causal localization of failure points in model architectures.

Abstract: Vision Language Models (VLMs) are improving quickly, but standard benchmarks can hide systematic failures that reduce real world trust. We introduce MATS (Multimodal Audit for Truthful Spatialization), a compact behavioral audit that measures whether models reject visually contradicted statements, and two metrics Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Instruction tuned generative VLMs (LLaVA 1.5, QwenVLchat) exhibit very low SCS and high IAR, while contrastive encoders (CLIP, SigLIP) are far more robust. Activation patching causally localizes failure loci (mid to late cross attention for generative models, pooled projection components for contrastive models) and suggests concrete repair paths.

[355] Scale and Rotation Estimation of Similarity-Transformed Images via Cross-Correlation Maximization Based on Auxiliary Function Method

Shinji Yamashita, Yuma Kinoshita, Hitoshi Kiya

Main category: cs.CV

TL;DR: A novel algorithm for joint scale and rotation estimation between images with sub-pixel precision using Fourier transform in log-polar coordinates and cross-correlation maximization.

Details

Motivation: Traditional phase-correlation techniques are effective for translational shifts but inadequate for scale and rotation changes that occur due to camera zooming or rotational movements.

Method: Integrates scale and rotation estimation using Fourier transform in log-polar coordinates with cross-correlation maximization strategy, leveraging the auxiliary function method and incorporating sub-pixel-level cross-correlation.

Result: Experimental results show lower mean estimation errors for scale and rotation compared to conventional Fourier transform-based techniques that rely on discrete cross-correlation.

Conclusion: The proposed method enables precise estimation of both scale and rotation with sub-pixel precision, outperforming existing Fourier-based approaches.

Abstract: This paper introduces a highly efficient algorithm capable of jointly estimating scale and rotation between two images with sub-pixel precision. Image alignment serves as a critical process for spatially registering images captured from different viewpoints, and finds extensive use in domains such as medical imaging and computer vision. Traditional phase-correlation techniques are effective in determining translational shifts; however, they are inadequate when addressing scale and rotation changes, which often arise due to camera zooming or rotational movements. In this paper, we propose a novel algorithm that integrates scale and rotation estimation based on the Fourier transform in log-polar coordinates with a cross-correlation maximization strategy, leveraging the auxiliary function method. By incorporating sub-pixel-level cross-correlation our method enables precise estimation of both scale and rotation. Experimental results demonstrate that the proposed method achieves lower mean estimation errors for scale and rotation than conventional Fourier transform-based techniques that rely on discrete cross-correlation.

[356] Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization

Xu Jia

Main category: cs.CV

TL;DR: A reinforcement learning framework using Group Relative Policy Optimization with curriculum-based data scheduling and difficulty-aware filtering to improve multimodal detection in autonomous driving.

Details

Motivation: Multimodal Large Language Models struggle with structured perception tasks requiring precise localization and robustness, particularly in autonomous driving scenarios.

Method: Augmented Group Relative Policy Optimization (GRPO) with curriculum-based data scheduling and difficulty-aware filtering to stabilize optimization under sparse, noisy rewards.

Result: Substantial improvements in detection accuracy and robustness on autonomous driving benchmarks, with ablation studies confirming the importance of reward design, KL regularization, and curriculum pacing.

Conclusion: Reinforcement-driven optimization with structured data curricula provides a scalable path toward robust and interpretable multimodal detection.

Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language reasoning but often struggle with structured perception tasks requiring precise localization and robustness. We propose a reinforcement learning framework that augments Group Relative Policy Optimization (GRPO) with curriculum-based data scheduling and difficulty-aware filtering. This approach stabilizes optimization under sparse, noisy rewards and enables progressive adaptation to complex samples. Evaluations on autonomous driving benchmarks demonstrate substantial improvements in detection accuracy and robustness. Ablation studies confirm the importance of reward design, KL regularization, and curriculum pacing for convergence stability and generalization. Our findings highlight reinforcement-driven optimization with structured data curricula as a scalable path toward robust and interpretable multimodal detection.

[357] Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation

Ha-Hieu Pham, Minh Le, Han Huynh, Nguyen Quoc Khanh Le, Huy-Hieu Pham

Main category: cs.CV

TL;DR: TGC is a semi-supervised semantic segmentation framework that uses graph-theoretic constraints to enforce global topology and improve segmentation accuracy in computational pathology.

Details

Motivation: Existing semi-supervised methods rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks in computational pathology where dense annotations are costly.

Method: Proposes Topology Graph Consistency (TGC) framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references.

Result: TGC achieves state-of-the-art performance on GlaS and CRAG datasets under 5-10% supervision and significantly narrows the gap to full supervision.

Conclusion: The proposed TGC framework effectively enforces global topology in semi-supervised semantic segmentation, demonstrating superior performance in computational pathology applications.

Abstract: Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision. Code is available at https://github.com/hieuphamha19/TGC.

[358] A review of Recent Techniques for Person Re-Identification

Andrea Asperti, Salvatore Fiorilla, Simone Nardi, Lorenzo Orsini

Main category: cs.CV

TL;DR: This survey paper analyzes person re-identification (ReID) techniques, comparing supervised and unsupervised approaches. While supervised methods using deep learning have achieved high performance, they require extensive labeled data. The paper reviews both mature supervised techniques and emerging unsupervised methods that show promising results with reduced data dependency.

Details

Motivation: The motivation is to address the scalability challenges of supervised person ReID methods that require vast annotated data, and to explore the recent advancements in unsupervised approaches that leverage abundant unlabeled data to overcome data labeling limitations.

Method: The survey employs a dual-focus approach: (1) reviewing and categorizing significant publications in supervised person ReID to assess current state-of-the-art, and (2) exploring latest advancements in unsupervised person ReID over the past three years to identify emerging trends.

Result: The survey finds that supervised approaches have little room for further improvement, while unsupervised techniques have shown promising developments in recent years with a narrowing performance gap between supervised and unsupervised paradigms.

Conclusion: The paper contributes to understanding both the mature landscape of supervised person ReID techniques and the promising outcomes in unsupervised learning, highlighting the potential convergence of performance between these two paradigms in person re-identification.

Abstract: Person re-identification (ReId), a crucial task in surveillance, involves matching individuals across different camera views. The advent of Deep Learning, especially supervised techniques like Convolutional Neural Networks and Attention Mechanisms, has significantly enhanced person Re-ID. However, the success of supervised approaches hinges on vast amounts of annotated data, posing scalability challenges in data labeling and computational costs. To address these limitations, recent research has shifted towards unsupervised person re-identification. Leveraging abundant unlabeled data, unsupervised methods aim to overcome the need for pairwise labelled data. Although traditionally trailing behind supervised approaches, unsupervised techniques have shown promising developments in recent years, signalling a narrowing performance gap. Motivated by this evolving landscape, our survey pursues two primary objectives. First, we review and categorize significant publications in supervised person re-identification, providing an in-depth overview of the current state-of-the-art and emphasizing little room for further improvement in this domain. Second, we explore the latest advancements in unsupervised person re-identification over the past three years, offering insights into emerging trends and shedding light on the potential convergence of performance between supervised and unsupervised paradigms. This dual-focus survey aims to contribute to the evolving narrative of person re-identification, capturing both the mature landscape of supervised techniques and the promising outcomes in the realm of unsupervised learning.

[359] Sequential Token Merging: Revisiting Hidden States

Yan Wen, Peng Ye, Lin Zhang, Baopu Li, Jiakang Yuan, Yaoxin Yang, Tao Chen

Main category: cs.CV

TL;DR: STM is a novel token merging method for Vision Mambas that addresses quadratic token scaling by preserving sequential dependencies through bidirectional merging and hidden states protection, achieving minimal accuracy degradation with significant token reduction.

Details

Motivation: Vision Mambas suffer from quadratic token scaling with image resolution, and existing methods overlook the intrinsic Limited Directional Sequential Dependence (LDSD) mechanism that is crucial for information flow in these models.

Method: Proposes Sequential Token Merging (STM) with: 1) Bidirectional nearest neighbor merging to preserve sequential dependencies through symmetric spatial aggregation, and 2) Hidden states protection to stabilize hidden states around the class token, leveraging Mamba’s layer-wise loss convergence.

Result: Achieves only 1.0% accuracy drop for ViM-Ti at 20% token reduction and 1.4% degradation for ViM-S at 40% reduction, demonstrating state-of-the-art efficiency with minimal complexity.

Conclusion: STM provides an effective solution for Vision Mamba efficiency while offering new insights into state-space model dynamics, with codes to be released soon.

Abstract: Vision Mambas (ViMs) achieve remarkable success with sub-quadratic complexity, but their efficiency remains constrained by quadratic token scaling with image resolution. While existing methods address token redundancy, they overlook ViMs’ intrinsic Limited Directional Sequential Dependence (LDSD) - a critical information flow mechanism revealed in our analysis. We further identify Mamba’s selective scan enables gradual information aggregation in hidden states. Based on these insights, we propose Sequential Token Merging (STM), featuring: 1) Bidirectional nearest neighbor merging to preserve sequential dependencies through symmetric spatial aggregation, and 2) Hidden states protection to stabilize the hidden states around the class token. STM strategically leverages Mamba’s layer-wise loss convergence to convert temporal forgetfulness into stability. Experiments demonstrate STM’s superiority: 1.0% accuracy drop for ViM-Ti at 20% token reduction, and only 1.4% degradation for ViM-S at 40% reduction. Our method achieves state-of-the-art efficiency with minimal complexity, while providing new insights into state-space model dynamics. Codes will be released soon.

[360] RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks

Amit Agarwal, Hitesh Laxmichand Patel, Srikant Panda, Hansa Meghwani, Jyotika Singh, Karan Dua, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth

Main category: cs.CV

TL;DR: The paper introduces Region Comprehension Index (RCI), a model-based score to quantify whether multimodal benchmarks require global reasoning or can be solved using localized visual cues, revealing that most existing benchmarks favor localized reasoning with spatial biases.

Details

Motivation: Current multimodal benchmarks don't clearly distinguish between genuine global reasoning and success via localized visual cues, hindering effective dataset curation and real-world model development.

Method: RCI systematically compares reference-model performance on image patches versus full images to measure dataset reliance on global versus local visual information.

Result: When applied to 13 multimodal benchmarks, RCI revealed that most favor localized reasoning and exhibit significant spatial biases, posing risks for real-world applications.

Conclusion: RCI provides an actionable tool for diagnosing and mitigating biases in multimodal benchmarks, enabling construction of better datasets for developing robust, enterprise-ready multimodal systems.

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world focused model development. We introduce Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset’s reliance on global versus local visual information. RCI systematically compares reference-model performance on image patches versus full images, revealing if tasks require holistic image understanding or can be solved with partial or localized visual cues. When applying RCI to 13 widely used multimodal benchmarks, we observed that most of them favor localized reasoning and exhibit significant spatial biases, indicating potential risks in real-world applications. RCI equips researchers & practitioners with an actionable tool for diagnosing & mitigating these biases, enabling the construction of datasets and benchmarks to foster the development of robust, enterprise-ready multimodal systems.

[361] Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects

Le Zhang, Ao Li, Qibin Hou, Ce Zhu, Yonina C. Eldar

Main category: cs.CV

TL;DR: This paper provides a comprehensive survey of super-resolution methods, covering SISR, VSR, SSR, and LFSR with analysis of over 150+70+30 methods, methodologies, datasets, and evaluation protocols.

Details

Motivation: Address the lack of comprehensive overviews in existing super-resolution surveys that focus only on specific domains, and provide a holistic review of the entire field.

Method: Conducted an in-depth review and taxonomy of diverse SR methods based on backbone structures and purposes, analyzing methodologies, datasets, evaluation protocols, empirical results, and complexity.

Result: Created a comprehensive survey covering over 150 SISR methods, nearly 70 VSR approaches, and approximately 30 SSR and LFSR techniques, with detailed analysis and taxonomy.

Conclusion: This work serves as a valuable resource and guidance for researchers in super-resolution, with an accompanying repository for easy access to related work.

Abstract: Super-resolution (SR) has garnered significant attention within the computer vision community, driven by advances in deep learning (DL) techniques and the growing demand for high-quality visual applications. With the expansion of this field, numerous surveys have emerged. Most existing surveys focus on specific domains, lacking a comprehensive overview of this field. Here, we present an in-depth review of diverse SR methods, encompassing single image super-resolution (SISR), video super-resolution (VSR), stereo super-resolution (SSR), and light field super-resolution (LFSR). We extensively cover over 150 SISR methods, nearly 70 VSR approaches, and approximately 30 techniques for SSR and LFSR. We analyze methodologies, datasets, evaluation protocols, empirical results, and complexity. In addition, we conducted a taxonomy based on each backbone structure according to the diverse purposes. We also explore valuable yet under-studied open issues in the field. We believe that this work will serve as a valuable resource and offer guidance to researchers in this domain. To facilitate access to related work, we created a dedicated repository available at https://github.com/AVC2-UESTC/Holistic-Super-Resolution-Review.

[362] Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment

Abhiroop Chatterjee, Susmita Ghosh

Main category: cs.CV

TL;DR: A parameter-efficient CLIP-style VLM for hyperspectral image understanding that aligns 3D voxel embeddings with text embeddings using contrastive learning and descriptive prompts, achieving SOTA results with only 0.07% parameter updates.

Details

Motivation: Hyperspectral images have high-dimensional 3D voxel structures with hundreds of spectral channels, but cross-modal alignment between vision and language in this domain remains underexplored compared to natural images and text.

Method: Uses CLIP-style contrastive training to map voxel embeddings from a vision backbone to a frozen LEM’s latent space. A trainable probe aligns vision features with text tokens using contrastive loss with hard/semi-hard negatives and positive pairs. Descriptive prompts encode class semantics as structured anchors.

Result: Achieves state-of-the-art performance with only 0.07% parameter updates: +0.92 OA and +1.60 Kappa on Indian Pines, +0.69 OA and +0.90 Kappa on Pavia University. Uses 50× fewer parameters than DCTN and 90× fewer than SS-TMNet.

Conclusion: The proposed parameter-efficient framework successfully aligns hyperspectral vision features with language representations, demonstrating that effective cross-modal learning in HSI can be achieved with minimal parameter updates while outperforming larger models.

Abstract: As data requirements continue to grow, efficient learning increasingly depends on the curation and distillation of high-value data rather than brute-force scaling of model sizes. In the case of a hyperspectral image (HSI), the challenge is amplified by the high-dimensional 3D voxel structure, where each spatial location is associated with hundreds of contiguous spectral channels. While vision and language models have been optimized effectively for natural image or text tasks, their cross-modal alignment in the hyperspectral domain remains an open and underexplored problem. In this article, we make an attempt to optimize a Vision-Language Model (VLM) for hyperspectral scene understanding by exploiting a CLIP-style contrastive training framework. Our framework maps voxel-level embeddings from a vision backbone onto the latent space of a frozen large embedding model (LEM), where a trainable probe aligns vision features with the model’s textual token representations. The two modalities are aligned via a contrastive loss restricted to a curated set of hard (closest wrong classes) and semi-hard (random distractors) negatives, along with positive pairs. To further enhance alignment, descriptive prompts that encode class semantics are introduced and act as structured anchors for the HSI embeddings. It is seen that the proposed method updates only 0.07 percent of the total parameters, yet yields state-of-the-art performance. For example, on Indian Pines (IP) the model produces better results over unimodal and multimodal baselines by +0.92 Overall Accuracy (OA) and +1.60 Kappa ($\kappa$), while on Pavia University (PU) data it provides gains of +0.69 OA and +0.90 $\kappa$. Moreover, this is achieved with the set of parameters, nearly 50$\times$ smaller than DCTN and 90$\times$ smaller than SS-TMNet.

Zhuang Qi, Pan Yu, Lei Meng, Sijin Zhou, Han Yu, Xiaoxiao Li, Xiangxu Meng

Main category: cs.CV

TL;DR: GPR-NIAM is a one-shot federated prompt learning method that uses attention masking to restrict interaction between text and prompt embeddings, enabling cross-task generalization without multi-round communication.

Details

Motivation: Existing federated prompt learning methods require multi-round communication and lack cross-task generalization capabilities, limiting their efficiency and applicability.

Method: Uses two modules: attention isolation module to suppress prompt-to-text attention and reweight text-to-prompt attention, and cross-silo collaborative refinement module to integrate decentralized visual knowledge and calibrate global prompts.

Result: Outperforms eight state-of-the-art methods on ten benchmark datasets across class-level and domain-level generalization tasks.

Conclusion: GPR-NIAM provides an effective one-shot federated prompt learning solution that achieves superior generalization while being communication-efficient.

Abstract: Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the Global Prompt Refinement with Non-Interfering Attention Masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.

[364] GZSL-MoE: Apprentissage G{é}n{é}ralis{é} Z{é}ro-Shot bas{é} sur le M{é}lange d’Experts pour la Segmentation S{é}mantique de Nuages de Points 3DAppliqu{é} {à} un Jeu de Donn{é}es d’Environnement de Collaboration Humain-Robot

Ahed Alboody

Main category: cs.CV

TL;DR: GZSL-MoE integrates Mixture-of-Experts into Generative Zero-Shot Learning for 3D point cloud semantic segmentation, improving performance on both seen and unseen classes in HRC environments.

Details

Motivation: To address the challenge of 3D semantic segmentation when comprehensive training data for all object classes is unavailable, particularly in Human-Robot Collaboration environments.

Method: Combines Generative Zero-Shot Learning with Mixture-of-Experts layers in Generator and Discriminator to generate realistic fake features of unseen classes using KPConv-extracted features from seen classes.

Result: The GZSL-MoE model enhances performance on both seen and unseen classes in 3D point cloud semantic segmentation tasks.

Conclusion: GZSL-MoE provides a promising solution for understanding complex 3D environments when complete training data is unavailable, demonstrating improved zero-shot learning capabilities.

Abstract: Generative Zero-Shot Learning approach (GZSL) has demonstrated significant potential in 3D point cloud semantic segmentation tasks. GZSL leverages generative models like GANs or VAEs to synthesize realistic features (real features) of unseen classes. This allows the model to label unseen classes during testing, despite being trained only on seen classes. In this context, we introduce the Generalized Zero-Shot Learning based-upon Mixture-of-Experts (GZSL-MoE) model. This model incorporates Mixture-of-Experts layers (MoE) to generate fake features that closely resemble real features extracted using a pre-trained KPConv (Kernel Point Convolution) model on seen classes. The main contribution of this paper is the integration of Mixture-of-Experts into the Generator and Discriminator components of the Generative Zero-Shot Learning model for 3D point cloud semantic segmentation, applied to the COVERED dataset (CollabOratiVE Robot Environment Dataset) for Human-Robot Collaboration (HRC) environments. By combining the Generative Zero-Shot Learning model with Mixture-of- Experts, GZSL-MoE for 3D point cloud semantic segmentation provides a promising solution for understanding complex 3D environments, especially when comprehensive training data for all object classes is unavailable. The performance evaluation of the GZSL-MoE model highlights its ability to enhance performance on both seen and unseen classes. Keywords Generalized Zero-Shot Learning (GZSL), 3D Point Cloud, 3D Semantic Segmentation, Human-Robot Collaboration, COVERED (CollabOratiVE Robot Environment Dataset), KPConv, Mixture-of Experts

[365] PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications

Hitesh Laxmichand Patel, Amit Agarwal, Srikant Panda, Hansa Meghwani, Karan Dua, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth

Main category: cs.CV

TL;DR: PCRI is a new metric that measures MLLM robustness to visual context variations by comparing performance on image patches vs full images, revealing most models are brittle to background noise.

Details

Motivation: Existing evaluation metrics don't capture MLLM sensitivity to irrelevant visual context, undermining reliability in real-world settings.

Method: Introduce PCRI score that systematically quantifies robustness by measuring performance changes between localized image patches and full-image input across 19 MLLMs and 15 benchmarks.

Result: Most leading MLLMs remain brittle to background noise, with only InternVL2-26B and Qwen2VL-72B showing consistent robustness. PCRI reveals how different architectures handle visual context.

Conclusion: PCRI enables rigorous comparison of context robustness, supporting principled model selection and guiding development of more robust architectures for real-world deployment.

Abstract: The reliability of Multimodal Large Language Models (MLLMs) in real-world settings is often undermined by sensitivity to irrelevant or distracting visual context, an aspect not captured by existing evaluation metrics. We introduce the \textbf{Patch Context Robustness Index (PCRI)}, the first systematic and interpretable score for quantifying MLLM robustness to variations in visual context granularity, measuring performance changes between localized image patches and full-image input. Applying PCRI to 19 state-of-the-art MLLMs across 15 vision-language benchmarks, we find that most leading models remain brittle to background noise, with only a few, such as InternVL2-26B and Qwen2VL-72B, demonstrating consistent robustness across tasks. PCRI analysis also highlights how different model architectures handle and integrate visual context, offering actionable diagnostic insight for both researchers and practitioners. PCRI enables rigorous comparison of context robustness, supporting principled model selection and guiding the development of future architectures and training strategies for robust, real-world deployment.

[366] IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism

Adithya Giri

Main category: cs.CV

TL;DR: Vision Transformers lack CNN inductive biases, which can be learned through learned masks to improve performance on small datasets without Knowledge Distillation.

Details

Motivation: Transformers dominate Computer Vision but lack the inductive biases of CNNs, which limits their performance on small datasets.

Method: Introduce inductive biases through learned masks in Vision Transformers, creating Inductively Biased Image Transformers (IBiT).

Result: IBiT models are significantly more accurate on small datasets while maintaining Transformer explainability.

Conclusion: Learned masks can effectively introduce CNN-like inductive biases into Vision Transformers, enabling better performance on small datasets without sacrificing explainability.

Abstract: In recent years, Transformer-based architectures have become the dominant method for Computer Vision applications. While Transformers are explainable and scale well with dataset size, they lack the inductive biases of Convolutional Neural Networks. While these biases may be learned on large datasets, we show that introducing these inductive biases through learned masks allow Vision Transformers to learn on much smaller datasets without Knowledge Distillation. These Transformers, which we call Inductively Biased Image Transformers (IBiT), are significantly more accurate on small datasets, while retaining the explainability Transformers.

[367] LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning

Zezhong Fan, Xiaohan Li, Luyi Ma, Kai Zhao, Liang Peng, Topojoy Biswas, Evren Korpeoglu, Kaushiki Nag, Kannan Achan

Main category: cs.CV

TL;DR: LayoutAgent is an agentic framework that combines vision-language reasoning with compositional diffusion to generate realistic multi-object scene layouts by ensuring spatial plausibility and semantic consistency.

Details

Motivation: Current diffusion models lack explicit spatial reasoning, leading to unrealistic object layouts, while traditional spatial planning methods in robotics fail to capture semantic richness in visual scenes.

Method: Uses visual-language model for input preprocessing (segmentation, object size estimation, scene graph construction, prompt rewriting), then compositional diffusion for bounding box synthesis respecting object relations, and finally foreground-conditioned image generation.

Result: Outperforms state-of-the-art layout generation models in layout coherence, spatial realism and aesthetic alignment.

Conclusion: LayoutAgent successfully bridges the gap between high-quality image generation and spatial planning by unifying vision-language reasoning with compositional diffusion.

Abstract: Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency, but they struggle to capture semantic richness in visual scenes. To bridge this gap, in this paper, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images with target objects in them, our method first employs visual-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. Then we leverage compositional diffusion-a method traditionally used in robotics-to synthesize bounding boxes that respect object relations encoded in the scene graph for spatial layouts. In the end, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout guided by designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism and aesthetic alignment.

[368] CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models

Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li, Huiming Sun, Daitao Xing, Jinglin Shen, Zibo Meng

Main category: cs.CV

TL;DR: CompareBench is a new benchmark for evaluating visual comparison reasoning in VLMs, revealing systematic limitations in temporal ordering, spatial relations, counting, and geometric comparisons.

Details

Motivation: Visual comparison reasoning is a fundamental but understudied skill in vision-language models that needs better evaluation methods.

Method: Created CompareBench with 1000 QA pairs across four tasks (quantity, temporal, geometric, spatial) using auxiliary datasets TallyBench and HistCaps, then evaluated both closed-source and open-source VLMs.

Result: Models show scaling trends but consistently fail at temporal ordering and spatial relations, and make mistakes in basic counting and geometric comparisons that are trivial for humans.

Conclusion: Visual comparison remains a systematic blind spot for current VLMs, and CompareBench provides a foundation for advancing more reliable multimodal reasoning.

Abstract: We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.

[369] From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis

Khawlah Bajbaa, Abbas Anwar, Muhammad Saqib, Hafeez Anwar, Nabin Sharma, Muhammad Usman

Main category: cs.CV

TL;DR: A hybrid framework combining diffusion models and conditional GANs to generate geographically consistent street-view images from satellite imagery, achieving improved geometric consistency and visual quality.

Details

Motivation: Street view imagery is valuable for urban analytics, but synthesizing street-view images from satellite imagery is challenging due to appearance and perspective differences between domains.

Method: Multi-stage training strategy with Stable Diffusion as core component in dual-branch architecture, integrated with conditional GAN for panoramic street views, plus fusion strategy leveraging both models.

Result: Outperforms diffusion-only methods across multiple metrics and achieves competitive performance with state-of-the-art GAN-based methods on CVUSA dataset, generating realistic images with preserved local details.

Conclusion: The hybrid framework successfully generates geometrically consistent street-view images with fine-grained details like street markings and atmospheric elements.

Abstract: Street view imagery has become an essential source for geospatial data collection and urban analytics, enabling the extraction of valuable insights that support informed decision-making. However, synthesizing street-view images from corresponding satellite imagery presents significant challenges due to substantial differences in appearance and viewing perspective between these two domains. This paper presents a hybrid framework that integrates diffusion-based models and conditional generative adversarial networks to generate geographically consistent street-view images from satellite imagery. Our approach uses a multi-stage training strategy that incorporates Stable Diffusion as the core component within a dual-branch architecture. To enhance the framework’s capabilities, we integrate a conditional Generative Adversarial Network (GAN) that enables the generation of geographically consistent panoramic street views. Furthermore, we implement a fusion strategy that leverages the strengths of both models to create robust representations, thereby improving the geometric consistency and visual quality of the generated street-view images. The proposed framework is evaluated on the challenging Cross-View USA (CVUSA) dataset, a standard benchmark for cross-view image synthesis. Experimental results demonstrate that our hybrid approach outperforms diffusion-only methods across multiple evaluation metrics and achieves competitive performance compared to state-of-the-art GAN-based methods. The framework successfully generates realistic and geometrically consistent street-view images while preserving fine-grained local details, including street markings, secondary roads, and atmospheric elements such as clouds.

[370] MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

Yapeng Mi, Hengli Li, Yanpeng Zhao, Chenxi Li, Huimin Wu, Xiaojian Ma, Song-Chun Zhu, Ying Nian Wu, Qing Li

Main category: cs.CV

TL;DR: MILR is a test-time method that performs joint reasoning over image and text in a unified latent space using policy gradient optimization guided by an image quality critic, achieving state-of-the-art results on multiple benchmarks.

Details

Motivation: Existing reasoning-based image generation methods are limited to single modalities or require high-quality reasoning data for fine-tuning, creating a need for more flexible cross-modal reasoning approaches.

Method: Joint reasoning over image and text tokens in a unified latent vector space using policy gradient method guided by an image quality critic, implemented within the MUG framework that supports language reasoning before image synthesis.

Result: Achieved state-of-the-art results on GenEval, T2I-CompBench, and WISE benchmarks, with 80% improvement over baseline on knowledge-intensive WISE (overall score 0.63).

Conclusion: Joint reasoning in unified latent space is key to strong performance, with demonstrated abilities in temporal and cultural reasoning, highlighting the efficacy of the proposed reasoning method.

Abstract: Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR’s non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.

[371] SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment

Hongyang Zhang, Yinhao Liu, Zhenyu Kuang

Main category: cs.CV

TL;DR: SkyLink method for cross-view geo-localization that addresses semantic degradation from viewpoint disparities using Google Retrieval Enhancement, patch-aware feature aggregation, and 3D scene integration with contrastive learning.

Details

Motivation: Existing approaches overlook semantic degradation caused by extreme viewpoint disparities in cross-view geo-localization, leading to poor feature matching between different viewpoints.

Method: Uses Google Retrieval Enhancement Module for street image data enhancement, Patch-Aware Feature Aggregation for consistent feature extraction, and integrates 3D scene information from multi-scale UAV images with self-supervised and cross-view contrastive learning.

Result: Achieves 25.75% Recall@1 accuracy on University-1652 in the UAVM2025 Challenge, demonstrating robustness and generalization across diverse urban scenarios.

Conclusion: SkyLink effectively addresses viewpoint variation challenges in cross-view geo-localization through enhanced feature retrieval and 3D scene integration.

Abstract: Cross-view geo-localization aims at establishing location correspondences between different viewpoints. Existing approaches typically learn cross-view correlations through direct feature similarity matching, often overlooking semantic degradation caused by extreme viewpoint disparities. To address this unique problem, we focus on robust feature retrieval under viewpoint variation and propose the novel SkyLink method. We firstly utilize the Google Retrieval Enhancement Module to perform data enhancement on street images, which mitigates the occlusion of the key target due to restricted street viewpoints. The Patch-Aware Feature Aggregation module is further adopted to emphasize multiple local feature aggregations to ensure the consistent feature extraction across viewpoints. Meanwhile, we integrate the 3D scene information constructed from multi-scale UAV images as a bridge between street and satellite viewpoints, and perform feature alignment through self-supervised and cross-view contrastive learning. Experimental results demonstrate robustness and generalization across diverse urban scenarios, which achieve 25.75$%$ Recall@1 accuracy on University-1652 in the UAVM2025 Challenge. Code will be released at https://github.com/HRT00/CVGL-3D.

[372] UESA-Net: U-Shaped Embedded Multidirectional Shrinkage Attention Network for Ultrasound Nodule Segmentation

Tangqi Shi, Pietro Lio

Main category: cs.CV

TL;DR: UESA-Net is a U-shaped segmentation network with multidirectional shrinkage attention that achieves state-of-the-art performance on breast and thyroid ultrasound images by bridging global context with local details.

Details

Motivation: Breast and thyroid cancers are increasing public health burdens. Ultrasound imaging suffers from speckle noise, overlapping structures, and weak global-local feature interactions, making existing networks struggle to reconcile high-level semantics with low-level spatial details.

Method: UESA-Net uses a U-shaped encoder-decoder architecture with multidirectional shrinkage attention. Attention modules operate along horizontal, vertical, and depth directions to exploit spatial details, while shrinkage strategy integrates prior knowledge and local features. The decoder applies pairwise shrinkage combining prior low-level physical cues with encoder features.

Result: On TN3K (3493 images) and BUSI (780 images) datasets, UESA-Net achieved state-of-the-art performance with IoU scores of 0.8487 and 0.6495 respectively.

Conclusion: UESA-Net effectively aggregates multidirectional spatial information and prior knowledge to improve robustness and accuracy in breast and thyroid ultrasound segmentation, demonstrating superior performance to existing methods.

Abstract: Background: Breast and thyroid cancers pose an increasing public-health burden. Ultrasound imaging is a cost-effective, real-time modality for lesion detection and segmentation, yet suffers from speckle noise, overlapping structures, and weak global-local feature interactions. Existing networks struggle to reconcile high-level semantics with low-level spatial details. We aim to develop a segmentation framework that bridges the semantic gap between global context and local detail in noisy ultrasound images. Methods: We propose UESA-Net, a U-shaped network with multidirectional shrinkage attention. The encoder-decoder architecture captures long-range dependencies and fine-grained structures of lesions. Within each encoding block, attention modules operate along horizontal, vertical, and depth directions to exploit spatial details, while a shrinkage (threshold) strategy integrates prior knowledge and local features. The decoder mirrors the encoder but applies a pairwise shrinkage mechanism, combining prior low-level physical cues with corresponding encoder features to enhance context modeling. Results: On two public datasets - TN3K (3493 images) and BUSI (780 images) - UESA-Net achieved state-of-the-art performance with intersection-over-union (IoU) scores of 0.8487 and 0.6495, respectively. Conclusions: UESA-Net effectively aggregates multidirectional spatial information and prior knowledge to improve robustness and accuracy in breast and thyroid ultrasound segmentation, demonstrating superior performance to existing methods on multiple benchmarks.

[373] LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja

Main category: cs.CV

TL;DR: LUQ: Layerwise Ultra-Low Bit Quantization for multimodal LLMs, selectively applying ultra-low bit quantization to resilient layers, reducing memory by 40% and 31% with <10% performance degradation.

Details

Motivation: Multimodal LLMs require huge memory and computational resources, but existing quantization methods are not well-explored for multimodal models. Multimodal tokens exhibit higher statistical variance and entropy than text tokens, making them less tolerant to ultra-low bit quantization.

Method: Proposed LUQ strategy that selectively applies ultra-low bit quantization to layers with lower entropy activation distributions that are more resilient to quantization. Also uses mixed multimodal tokens (image and text) for post-training quantization to boost VQA performance.

Result: Evaluated on LLaVA-1.5 and Qwen-2.5-VL across 9 VQA benchmarks. LUQ models use 40% and 31% less memory than 4-bit counterparts respectively, with performance degradation less than 10% on MME benchmark.

Conclusion: Layerwise selective quantization strategy effectively compresses multimodal LLMs to ultra-low bits while maintaining acceptable performance, addressing the unique challenges of multimodal token distributions.

Abstract: Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens and intermediate layer activations produced by them exhibit significantly higher statistical variance and entropy compared to text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens varies significantly over different layers, with some layers having lower entropy activation distributions. We empirically show that such layers in these models can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. Additionally, we also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.

[374] PartCo: Part-Level Correspondence Priors Enhance Category Discovery

Fernando Julio Cendra, Kai Han

Main category: cs.CV

TL;DR: PartCo introduces part-level correspondence prior to enhance Generalized Category Discovery by capturing finer-grained semantic structures through part-level visual feature correspondences, achieving state-of-the-art results without major modifications to existing methods.

Details

Motivation: Existing GCD methods rely on semantic labels and global image representations, overlooking detailed part-level cues crucial for distinguishing closely related categories.

Method: PartCo framework incorporates part-level visual feature correspondences to capture finer-grained semantic structures and enhance category discovery.

Result: Extensive experiments show PartCo significantly improves performance of current GCD approaches and achieves state-of-the-art results on multiple benchmark datasets.

Conclusion: PartCo bridges the gap between semantic labels and part-level visual compositions, setting new benchmarks for Generalized Category Discovery.

Abstract: Generalized Category Discovery (GCD) aims to identify both known and novel categories within unlabeled data by leveraging a set of labeled examples from known categories. Existing GCD methods primarily depend on semantic labels and global image representations, often overlooking the detailed part-level cues that are crucial for distinguishing closely related categories. In this paper, we introduce PartCo, short for Part-Level Correspondence Prior, a novel framework that enhances category discovery by incorporating part-level visual feature correspondences. By leveraging part-level relationships, PartCo captures finer-grained semantic structures, enabling a more nuanced understanding of category relationships. Importantly, PartCo seamlessly integrates with existing GCD methods without requiring significant modifications. Our extensive experiments on multiple benchmark datasets demonstrate that PartCo significantly improves the performance of current GCD approaches, achieving state-of-the-art results by bridging the gap between semantic labels and part-level visual compositions, thereby setting new benchmarks for GCD. Project page: https://visual-ai.github.io/partco

[375] A Data-Centric Perspective on the Influence of Image Data Quality in Machine Learning Models

Pei-Han Chen, Szu-Chi Chung

Main category: cs.CV

TL;DR: This paper develops a systematic pipeline for assessing image dataset quality using CleanVision and Fastdup tools with automatic thresholding enhancements, showing significant improvements in detecting low-quality images and near-duplicates.

Details

Motivation: As model architectures mature with diminishing marginal gains, data quality has become critical, but systematic studies on evaluating image dataset quality remain limited.

Method: Developed a pipeline integrating CleanVision and Fastdup tools with enhancements including automatic threshold selection. Analyzed impact of various image quality issues on model performance using CIFAKE dataset.

Result: Automatic thresholding improved F1 score from 0.6794 to 0.9468 (single perturbations) and 0.7447 to 0.8557 (dual perturbations). Near-duplicate detection improved from 0.4576 to 0.7928 F1 score.

Conclusion: The workflow effectively advances data quality assessment in image-based ML, showing CNNs are resilient to some distortions but vulnerable to blurring and severe downscaling that obscure critical features.

Abstract: In machine learning, research has traditionally focused on model development, with relatively less attention paid to training data. As model architectures have matured and marginal gains from further refinements diminish, data quality has emerged as a critical factor. However, systematic studies on evaluating and ensuring dataset quality in the image domain remain limited. This study investigates methods for systematically assessing image dataset quality and examines how various image quality factors influence model performance. Using the publicly available and relatively clean CIFAKE dataset, we identify common quality issues and quantify their impact on training. Building on these findings, we develop a pipeline that integrates two community-developed tools, CleanVision and Fastdup. We analyze their underlying mechanisms and introduce several enhancements, including automatic threshold selection to detect problematic images without manual tuning. Experimental results demonstrate that not all quality issues exert the same level of impact. While convolutional neural networks show resilience to certain distortions, they are particularly vulnerable to degradations that obscure critical visual features, such as blurring and severe downscaling. To assess the performance of existing tools and the effectiveness of our proposed enhancements, we formulate the detection of low-quality images as a binary classification task and use the F1 score as the evaluation metric. Our automatic thresholding method improves the F1 score from 0.6794 to 0.9468 under single perturbations and from 0.7447 to 0.8557 under dual perturbations. For near-duplicate detection, our deduplication strategy increases the F1 score from 0.4576 to 0.7928. These results underscore the effectiveness of our workflow and provide a foundation for advancing data quality assessment in image-based machine learning.

[376] DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models

Komal Kumar, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Ivan Laptev, Hisham Cholakkal

Main category: cs.CV

TL;DR: DEFT is an efficient fine-tuning framework that decomposes weight matrix updates into two trainable low-rank components for better trade-off between personalization, multi-task unification, and editability in text-to-image models.

Details

Motivation: Address the challenge of balancing target distribution alignment, novel concept learning from limited data, multi-task instruction ability, and prompt editability while minimizing computational resources in text-to-image model fine-tuning.

Method: Decompositional Efficient Fine-Tuning (DEFT) adapts pre-trained weights by decomposing updates into: (1) projection onto complement of low-rank subspace, and (2) low-rank update within that subspace using two trainable low-rank matrices.

Result: State-of-the-art performance on Dreambooth, Dreambench Plus (personalization), InsDet (object/scene adaptation), and VisualCloze (universal image generation) datasets with Stable Diffusion and unified models.

Conclusion: DEFT demonstrates emergent properties of efficient fine-tuning, achieving superior performance across diverse tasks while maintaining computational efficiency and editability.

Abstract: Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often faces challenges in striking a trade-off between aligning with the target distribution: learning a novel concept from a limited image for personalization and retaining the instruction ability needed for unifying multiple tasks, all while maintaining editability (aligning with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable matrices: (1) a projection onto the complement of a low-rank subspace spanned by a low-rank matrix, and (2) a low-rank update. The single trainable low-rank matrix defines the subspace, while the other trainable low-rank matrix enables flexible parameter adaptation within that subspace. We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available on \href{https://github.com/MAXNORM8650/DEFT}{DEFTBase}.

[377] VideoScore2: Think before You Score in Generative Video Evaluation

Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen

Main category: cs.CV

TL;DR: VideoScore2 is a multi-dimensional, interpretable framework for evaluating text-to-video generation that assesses visual quality, semantic alignment, and physical consistency while providing detailed rationales.

Details

Motivation: Existing video evaluation methods are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for comprehensive video quality assessment.

Method: Trained on VideoFeedback2 dataset (27,168 human-annotated videos) using a two-stage pipeline: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness.

Result: Achieves 44.35 (+5.94) accuracy on VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks, providing interpretable assessments and effective reward modeling for Best-of-N sampling.

Conclusion: VideoScore2 bridges the gap between evaluation and controllable generation through multi-dimensional, human-aligned assessments with detailed rationales.

Abstract: Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/

Sahar Dastani, Ali Bahri, Gustavo Adolfo Vargas Hakim, Moslem Yazdanpanah, Mehrdad Noori, David Osowiechi, Samuel Barbeau, Ismail Ben Ayed, Herve Lombaert, Christian Desrosiers

Main category: cs.CV

TL;DR: TRUST is a test-time adaptation method for State Space Models that uses uncertainty-guided traversal permutations to improve robustness under distribution shifts.

Details

Motivation: State Space Models like VMamba show degraded generalization under distribution shifts, despite being efficient alternatives to Vision Transformers.

Method: Leverages diverse traversal permutations to generate multiple causal perspectives, uses model predictions as pseudo-labels to update Mamba-specific parameters, and averages adapted weights across traversal scans.

Result: Experiments on seven benchmarks show TRUST consistently improves robustness and outperforms existing TTA methods.

Conclusion: TRUST is the first approach that explicitly leverages SSM architectural properties for adaptation, effectively addressing distribution shift issues.

Abstract: State Space Models (SSMs) have emerged as efficient alternatives to Vision Transformers (ViTs), with VMamba standing out as a pioneering architecture designed for vision tasks. However, their generalization performance degrades significantly under distribution shifts. To address this limitation, we propose TRUST (Test-Time Refinement using Uncertainty-Guided SSM Traverses), a novel test-time adaptation (TTA) method that leverages diverse traversal permutations to generate multiple causal perspectives of the input image. Model predictions serve as pseudo-labels to guide updates of the Mamba-specific parameters, and the adapted weights are averaged to integrate the learned information across traversal scans. Altogether, TRUST is the first approach that explicitly leverages the unique architectural properties of SSMs for adaptation. Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods.

[379] UniF$^2$ace: A Unified Fine-grained Face Understanding and Generation Model

Junzhe Li, Sifan Zhou, Liya Guo, Xuerui Qiu, Linrui Xu, Delin Qu, Tingting Long, Chun Fan, Ming Li, Hehe Fan, Jun Liu, Shuicheng Yan

Main category: cs.CV

TL;DR: UniF²ace is the first unified multimodal model for fine-grained face understanding and generation, addressing fragmentation and lack of fine-grained attributes through Dual Discrete Diffusion loss and multi-level Mixture-of-Experts architecture.

Details

Motivation: Existing face research faces fragmentation between understanding and generation, and lacks fine-grained facial attributes needed for high-fidelity applications, hindering progress toward artificial general intelligence.

Method: Proposes Dual Discrete Diffusion (D3Diff) loss unifying masked generative models with discrete score matching diffusion, and a multi-level grouped Mixture-of-Experts architecture incorporating semantic and identity facial embeddings. Also constructs UniF²aceD-1M dataset with 130K fine-grained image-caption pairs and 1M VQA pairs.

Result: Outperforms existing models with similar scale: 7.1% higher Desc-GPT and 6.6% higher VQA-score in understanding and generation tasks respectively.

Conclusion: UniF²ace successfully unifies face understanding and generation while achieving superior performance in both tasks, demonstrating the effectiveness of the proposed D3Diff framework and architecture for fine-grained facial analysis.

Abstract: Unified multimodal models (UMMs) have emerged as a powerful paradigm in fundamental cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily faces two challenges: $\textbf{(1)}$ $\textbf{fragmentation development}$, with existing methods failing to unify understanding and generation into a single one, hindering the way to artificial general intelligence. $\textbf{(2) lack of fine-grained facial attributes}$, which are crucial for high-fidelity applications. To handle those issues, we propose $\textbf{UniF$^2$ace}$, $\textit{the first UMM specifically tailored for fine-grained face understanding and generation}$. $\textbf{First}$, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, unifying masked generative models with discrete score matching diffusion and leading to a more precise approximation of the negative log-likelihood. Moreover, this D3Diff significantly enhances the model’s ability to synthesize high-fidelity facial details aligned with text input. $\textbf{Second}$, we propose a multi-level grouped Mixture-of-Experts architecture, adaptively incorporating the semantic and identity facial embeddings to complement the attribute forgotten phenomenon in representation evolvement. $\textbf{Finally}$, to this end, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models with a similar scale in both understanding and generation tasks, with 7.1% higher Desc-GPT and 6.6% higher VQA-score, respectively.

Jaeik Kim, Woojin Kim, Woohyeon Park, Jaeyoung Do

Main category: cs.CV

TL;DR: MMPB is the first extensive benchmark for evaluating Vision-Language Models (VLMs) on personalization, revealing that most VLMs struggle with maintaining consistency, handling user preferences, and adapting to visual cues.

Details

Motivation: Visual personalization is essential in user-facing AI systems like smart homes and healthcare, but current VLMs remain underexplored in their ability to adapt to individual users.

Method: MMPB comprises 10k image-query pairs with 111 personalizable concepts across four categories, evaluated using 23 VLMs through a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying.

Result: Most VLMs (including closed-source models) struggle with personalization, particularly in maintaining dialogue consistency, handling user preferences, and adapting to visual cues, with challenges like refusal behaviors and long-context forgetting.

Conclusion: MMPB identifies substantial room for improvement in VLM personalization and provides a scalable benchmark for future research toward truly personalized multi-modal AI.

Abstract: Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI. Project Page: aidaslab.github.io/MMPB

[381] Seeing Isn’t Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN

Roie Kazoom, Alon Goldberg, Hodaya Cohen, Ofer Hadar

Main category: cs.CV

TL;DR: A novel framework for fully controllable adversarial patch generation that allows attackers to choose both input image and target class, achieving state-of-the-art performance with attack success rates exceeding 99% while maintaining visual realism.

Details

Motivation: Existing adversarial patch attacks rely on unrealistic white-box assumptions, untargeted objectives, or produce visually conspicuous patches that limit real-world applicability.

Method: Combines generative U-Net design with Grad-CAM-guided patch placement for semantic-aware localization that maximizes attack effectiveness while preserving visual realism.

Result: Achieves attack success rates and target-class success consistently exceeding 99% across convolutional networks (DenseNet-121, ResNet-50) and vision transformers (ViT-B/16, Swin-B/16). Outperforms prior white-box attacks, untargeted baselines, and non-realistic approaches.

Conclusion: Establishes a new benchmark for adversarial robustness research by simultaneously ensuring realism, targeted control, and black-box applicability - the three most challenging dimensions of patch-based attacks.

Abstract: Adversarial patch attacks pose a severe threat to deep neural networks, yet most existing approaches rely on unrealistic white-box assumptions, untargeted objectives, or produce visually conspicuous patches that limit real-world applicability. In this work, we introduce a novel framework for fully controllable adversarial patch generation, where the attacker can freely choose both the input image x and the target class y target, thereby dictating the exact misclassification outcome. Our method combines a generative U-Net design with Grad-CAM-guided patch placement, enabling semantic-aware localization that maximizes attack effectiveness while preserving visual realism. Extensive experiments across convolutional networks (DenseNet-121, ResNet-50) and vision transformers (ViT-B/16, Swin-B/16, among others) demonstrate that our approach achieves state-of-the-art performance across all settings, with attack success rates (ASR) and target-class success (TCS) consistently exceeding 99%. Importantly, we show that our method not only outperforms prior white-box attacks and untargeted baselines, but also surpasses existing non-realistic approaches that produce detectable artifacts. By simultaneously ensuring realism, targeted control, and black-box applicability-the three most challenging dimensions of patch-based attacks-our framework establishes a new benchmark for adversarial robustness research, bridging the gap between theoretical attack strength and practical stealthiness.

[382] Learning Temporal Saliency for Time Series Forecasting with Cross-Scale Attention

Ibrahim Delibasoglu, Fredrik Heintz

Main category: cs.CV

TL;DR: CrossScaleNet is a novel architecture combining patch-based cross-attention with multi-scale processing for time series forecasting, achieving both high performance and enhanced temporal explainability without sacrificing predictive accuracy.

Details

Motivation: Traditional post-hoc methods for temporal saliency detection are computationally expensive and challenging, while existing explainable models often fail to maintain strong performance on standard benchmarks.

Method: Uses patch-based cross-attention mechanism with multi-scale processing, embedding attention mechanisms into training to provide intrinsic explainability for temporal saliency.

Result: Outperforms most transformer-based models on real-world datasets, demonstrates superior performance in both temporal saliency detection and forecasting accuracy, and maintains strong performance across datasets of varying complexity.

Conclusion: CrossScaleNet addresses the gap between explainability and performance, offering a balanced approach that effectively captures temporal saliency while delivering state-of-the-art forecasting performance.

Abstract: Explainability in time series forecasting is essential for improving model transparency and supporting informed decision-making. In this work, we present CrossScaleNet, an innovative architecture that combines a patch-based cross-attention mechanism with multi-scale processing to achieve both high performance and enhanced temporal explainability. By embedding attention mechanisms into the training process, our model provides intrinsic explainability for temporal saliency, making its decision-making process more transparent. Traditional post-hoc methods for temporal saliency detection are computationally expensive, particularly when compared to feature importance detection. While ablation techniques may suffice for datasets with fewer features, identifying temporal saliency poses greater challenges due to its complexity. We validate CrossScaleNet on synthetic datasets with known saliency ground truth and on established public benchmarks, demonstrating the robustness of our method in identifying temporal saliency. Experiments on real-world datasets for forecasting task show that our approach consistently outperforms most transformer-based models, offering better explainability without sacrificing predictive accuracy. Our evaluations demonstrate superior performance in both temporal saliency detection and forecasting accuracy. Moreover, we highlight that existing models claiming explainability often fail to maintain strong performance on standard benchmarks. CrossScaleNet addresses this gap, offering a balanced approach that captures temporal saliency effectively while delivering state-of-the-art forecasting performance across datasets of varying complexity.

[383] Multimodal Slice Interaction Network Enhanced by Transfer Learning for Precise Segmentation of Internal Gross Tumor Volume in Lung Cancer PET/CT Imaging

Yi Luo, Yike Guo, Hamed Hooshangnejad, Rui Zhang, Xue Feng, Quan Chen, Wil Ngwa, Kai Ding

Main category: cs.CV

TL;DR: Transfer learning-based multimodal interactive perception network with MAMBA and slice interaction module (SIM) for improved IGTV segmentation in lung cancer PET/CT imaging, achieving significant performance improvement over baseline.

Details

Motivation: Accurate IGTV delineation is crucial for lung cancer radiation therapy but is challenged by limited annotated datasets and attenuated PET signal intensity at tumor boundaries.

Method: Transfer learning approach using multimodal interactive perception network pre-trained on GTV datasets and fine-tuned on IGTV data, with 2.5D segmentation framework incorporating slice interaction module (SIM) with channel and spatial attention branches.

Result: Achieved Dice score of 0.609 on private IGTV dataset, substantially outperforming baseline score of 0.385.

Conclusion: The approach demonstrates potential of transfer learning with multimodal techniques and SIM to enhance reliability and clinical relevance of IGTV segmentation for lung cancer radiation therapy planning.

Abstract: Lung cancer remains the leading cause of cancerrelated deaths globally. Accurate delineation of internal gross tumor volume (IGTV) in PET/CT imaging is pivotal for optimal radiation therapy in mobile tumors such as lung cancer to account for tumor motion, yet is hindered by the limited availability of annotated IGTV datasets and attenuated PET signal intensity at tumor boundaries. In this study, we present a transfer learningbased methodology utilizing a multimodal interactive perception network with MAMBA, pre-trained on extensive gross tumor volume (GTV) datasets and subsequently fine-tuned on a private IGTV cohort. This cohort constitutes the PET/CT subset of the Lung-cancer Unified Cross-modal Imaging Dataset (LUCID). To further address the challenge of weak PET intensities in IGTV peripheral slices, we introduce a slice interaction module (SIM) within a 2.5D segmentation framework to effectively model inter-slice relationships. Our proposed module integrates channel and spatial attention branches with depthwise convolutions, enabling more robust learning of slice-to-slice dependencies and thereby improving overall segmentation performance. A comprehensive experimental evaluation demonstrates that our approach achieves a Dice of 0.609 on the private IGTV dataset, substantially surpassing the conventional baseline score of 0.385. This work highlights the potential of transfer learning, coupled with advanced multimodal techniques and a SIM to enhance the reliability and clinical relevance of IGTV segmentation for lung cancer radiation therapy planning.

[384] ControlEvents: Controllable Synthesis of Event Camera Datawith Foundational Prior from Image Diffusion Models

Yixuan Hu, Yuxuan Xue, Simon Klenk, Daniel Cremers, Gerard Pons-Moll

Main category: cs.CV

TL;DR: ControlEvents is a diffusion-based generative model that synthesizes high-quality event data using control signals like text labels, 2D skeletons, and 3D poses, leveraging diffusion priors from foundation models to reduce the cost of labeled event dataset creation.

Details

Motivation: Event cameras offer high temporal resolution and dynamic range, but obtaining large-scale labeled ground-truth data for event-based vision tasks is challenging and expensive.

Method: Uses a diffusion-based generative model that leverages diffusion priors from foundation models like Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data through control signals.

Result: Successfully synthesizes event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. The synthesized data enhances model performance in all tasks and can generate events based on unseen text labels during training.

Conclusion: ControlEvents provides an effective solution for generating labeled event datasets, significantly reducing costs while maintaining high quality, and demonstrates powerful text-based generation capabilities inherited from foundation models.

Abstract: In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.

[385] Learning KAN-based Implicit Neural Representations for Deformable Image Registration

Nikita Drozdov, Marat Zinovev, Dmitry Sorokin

Main category: cs.CV

TL;DR: KAN-IDIR and RandKAN-IDIR integrate Kolmogorov-Arnold Networks (KANs) into deformable image registration using implicit neural representations, achieving state-of-the-art accuracy with reduced computational costs through randomized basis sampling.

Details

Motivation: Learning-based DIR methods require large datasets and struggle with precision, while implicit neural representations need instance-specific optimization with computational efficiency and stability challenges.

Method: Proposed KAN-IDIR and RandKAN-IDIR using KANs with implicit neural representations, featuring randomized basis sampling to reduce basis functions while maintaining quality.

Result: Achieved highest accuracy among INR-based methods across lung CT, brain MRI, and cardiac MRI datasets, with minimal computational overhead and superior learning stability across random seeds.

Conclusion: RandKAN-IDIR with randomized basis sampling slightly outperforms learnable basis function indices while eliminating additional training complexity, offering an efficient and stable DIR solution.

Abstract: Deformable image registration (DIR) is a cornerstone of medical image analysis, enabling spatial alignment for tasks like comparative studies and multi-modal fusion. While learning-based methods (e.g., CNNs, transformers) offer fast inference, they often require large training datasets and struggle to match the precision of classical iterative approaches on some organ types and imaging modalities. Implicit neural representations (INRs) have emerged as a promising alternative, parameterizing deformations as continuous mappings from coordinates to displacement vectors. However, this comes at the cost of requiring instance-specific optimization, making computational efficiency and seed-dependent learning stability critical factors for these methods. In this work, we propose KAN-IDIR and RandKAN-IDIR, the first integration of Kolmogorov-Arnold Networks (KANs) into deformable image registration with implicit neural representations (INRs). Our proposed randomized basis sampling strategy reduces the required number of basis functions in KAN while maintaining registration quality, thereby significantly lowering computational costs. We evaluated our approach on three diverse datasets (lung CT, brain MRI, cardiac MRI) and compared it with competing instance-specific learning-based approaches, dataset-trained deep learning models, and classical registration approaches. KAN-IDIR and RandKAN-IDIR achieved the highest accuracy among INR-based methods across all evaluated modalities and anatomies, with minimal computational overhead and superior learning stability across multiple random seeds. Additionally, we discovered that our RandKAN-IDIR model with randomized basis sampling slightly outperforms the model with learnable basis function indices, while eliminating its additional training-time complexity.

[386] Convolutional Set Transformer

Federico Chinello, Giacomo Boracchi

Main category: cs.CV

TL;DR: Convolutional Set Transformer (CST) is a neural architecture that directly processes 3D image tensors in sets, combining feature extraction and contextual modeling for superior performance in set-based tasks.

Details

Motivation: Existing set-input networks like Deep Sets and Set Transformer are limited to vector inputs and require cascading with CNN feature extractors, preventing synergies between feature extraction and inter-image relationship modeling.

Method: CST operates directly on 3D image tensors, performing simultaneous feature extraction and contextual modeling, and is compatible with CNN explainability methods like Grad-CAM.

Result: CST achieves superior performance in Set Classification and Set Anomaly Detection tasks compared to existing approaches, and can be pre-trained on large datasets like ImageNet.

Conclusion: CST provides an effective architecture for processing visually heterogeneous image sets with shared semantics, offering better performance, explainability, and transfer learning capabilities than existing methods.

Abstract: We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics - such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).

[387] Training-Free Defense Against Adversarial Attacks in Deep Learning MRI Reconstruction

Mahdi Saberi, Chi Zhang, Mehmet Akçakaya

Main category: cs.CV

TL;DR: A novel method to mitigate adversarial attacks on MRI reconstruction models without retraining, using cyclic measurement consistency to reduce attack impact across various scenarios.

Details

Motivation: DL methods for MRI reconstruction are vulnerable to adversarial attacks that distort output images, and existing mitigation strategies require retraining and may degrade clean input performance.

Method: Proposes a mitigation objective based on cyclic measurement consistency, minimized in a small ball around attack inputs, without requiring model retraining.

Result: Substantially reduces adversarial perturbation effects across datasets, attack types/strengths, and networks, outperforming conventional retraining-based methods both qualitatively and quantitatively.

Conclusion: The approach effectively mitigates adversarial attacks in realistic scenarios including blind setups and adaptive attacks, and shows applicability to impulse noise modeling herringbone artifacts.

Abstract: Deep learning (DL) methods have become the state-of-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, or attacks, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining and may lower reconstruction quality for non-perturbed/clean inputs. In this work, we propose a novel approach for mitigating adversarial attacks on MRI reconstruction models without any retraining. Based on the idea of cyclic measurement consistency, we devise a novel mitigation objective that is minimized in a small ball around the attack input. Results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods that involve retraining. We also introduce a practically relevant scenario for small adversarial perturbations that models impulse noise in raw data, which relates to \emph{herringbone artifacts}, and show the applicability of our approach in this setting. Finally, we show our mitigation approach remains effective in two \emph{realistic} extension scenarios: a blind setup, where the attack strength or algorithm is not known to the user; and an adaptive attack setup, where the attacker has full knowledge of the defense strategy.

[388] TY-RIST: Tactical YOLO Tricks for Real-time Infrared Small Target Detection

Abdulkarim Atrash, Omar Moured, Yufan Chen, Jiaming Zhang, Seyda Ertekin, Omur Ugur

Main category: cs.CV

TL;DR: TY-RIST is an optimized YOLOv12n architecture for infrared small target detection that addresses target loss, false alarms, missed detections, and computational costs through improved backbone, detection head, attention mechanisms, and pruning, achieving state-of-the-art performance with real-time inference.

Details

Motivation: Address challenges in infrared small target detection including target loss from minimal features, false alarms in cluttered environments, missed detections from low saliency, and high computational costs.

Method: Propose TY-RIST with: (1) stride-aware backbone with fine-grained receptive fields, (2) high-resolution detection head, (3) cascaded coordinate attention blocks, (4) branch pruning strategy reducing computational cost by 25.5%, and (5) Normalized Gaussian Wasserstein Distance for regression stability.

Result: State-of-the-art performance on four benchmarks: +7.9% mAP at 0.5 IoU, +3% Precision, +10.2% Recall, up to 123 FPS on single GPU. Cross-dataset validation confirms strong generalization.

Conclusion: TY-RIST effectively addresses infrared small target detection challenges with improved accuracy, real-time performance, and strong generalization capability while reducing computational costs.

Abstract: Infrared small target detection (IRSTD) is critical for defense and surveillance but remains challenging due to (1) target loss from minimal features, (2) false alarms in cluttered environments, (3) missed detections from low saliency, and (4) high computational costs. To address these issues, we propose TY-RIST, an optimized YOLOv12n architecture that integrates (1) a stride-aware backbone with fine-grained receptive fields, (2) a high-resolution detection head, (3) cascaded coordinate attention blocks, and (4) a branch pruning strategy that reduces computational cost by about 25.5% while marginally improving accuracy and enabling real-time inference. We also incorporate the Normalized Gaussian Wasserstein Distance (NWD) to enhance regression stability. Extensive experiments on four benchmarks and across 20 different models demonstrate state-of-the-art performance, improving mAP at 0.5 IoU by +7.9%, Precision by +3%, and Recall by +10.2%, while achieving up to 123 FPS on a single GPU. Cross-dataset validation on a fifth dataset further confirms strong generalization capability. Additional results and resources are available at https://www.github.com/moured/TY-RIST

[389] Learning Unified Representation of 3D Gaussian Splatting

Yuelin Xin, Yuheng Liu, Xiaohui Xie, Xinke Li

Main category: cs.CV

TL;DR: Proposes embedding representation for 3D Gaussian Splatting using continuous submanifold fields to address parameter heterogeneity and enable better neural network learning.

Details

Motivation: Raw Gaussian parameters are non-unique and heterogeneous, making them hard to learn as features for neural networks, leading to data-dependent models.

Method: Uses continuous submanifold fields to encapsulate intrinsic information of Gaussian primitives, enforcing unique mapping and channel homogeneity.

Result: Creates a principled embedding representation that preserves color and geometric structure while being more suitable for neural network learning.

Conclusion: The proposed embedding representation benefits learning of 3D Gaussian Splatting by providing homogeneous and unique feature representations.

Abstract: A well-designed vectorized representation is crucial for the learning systems natively based on 3D Gaussian Splatting. While 3DGS enables efficient and explicit 3D reconstruction, its parameter-based representation remains hard to learn as features, especially for neural-network-based models. Directly feeding raw Gaussian parameters into learning frameworks fails to address the non-unique and heterogeneous nature of the Gaussian parameterization, yielding highly data-dependent models. This challenge motivates us to explore a more principled approach to represent 3D Gaussian Splatting in neural networks that preserves the underlying color and geometric structure while enforcing unique mapping and channel homogeneity. In this paper, we propose an embedding representation of 3DGS based on continuous submanifold fields that encapsulate the intrinsic information of Gaussian primitives, thereby benefiting the learning of 3DGS.

[390] Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton

Main category: cs.CV

TL;DR: Soft-Di[M]O introduces soft embeddings to make one-step generators from Masked Diffusion Models fully differentiable, enabling post-distillation refinements like GAN training and reward fine-tuning.

Details

Motivation: One-step generators from MDMs inherit modeling bias and have discrete token outputs that block gradient flow, preventing post-distillation refinements like adversarial training and reward-based fine-tuning.

Method: Introduces soft embeddings that replace discrete tokens with expected embeddings under the generator’s output distribution, creating a differentiable continuous surrogate compatible with teacher backbones and tokenizer decoders.

Result: Achieves state-of-the-art one-step results: improved class-to-image performance, one-step FID of 1.56 on ImageNet-256 with GAN refinement, higher GenEval and HPS scores on text-to-image with reward fine-tuning, and gains from TTEO.

Conclusion: Soft-Di[M]O makes one-step generators end-to-end trainable and enables straightforward application of various refinement techniques, significantly improving performance across multiple MDM teachers.

Abstract: One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator’s output distribution. Soft embeddings preserve representation fidelity for one-step discrete generator while providing a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, along with higher GenEval and HPS scores on text-to-image with reward fine-tuning, and further gains from TTEO.

Chenghan Yang, Peng Zhou, Dong-Sheng Zhang, Yueyun Wang, Hong-Bin Shen, Xiaoyong Pan

Main category: cs.CV

TL;DR: FishAI 2.0 is a marine fish recognition framework that combines multimodal few-shot deep learning with image generation for data augmentation, achieving high accuracy for rare species with limited training data.

Details

Motivation: Address challenges in marine biological image recognition, particularly data scarcity for rare species and unsatisfactory model accuracy in few-shot conditions.

Method: Uses hierarchical marine fish dataset, employs DeepSeek LLM for text descriptions, Stable Diffusion 2 for image augmentation via hierarchical diffusion strategy, and CLIP-based model for few-shot recognition.

Result: Achieves Top-1 accuracy of 91.67% (family level), 87.58% (genus level), and 85.42% (species level), significantly outperforming baseline CLIP and ViT models for minority classes with <10 training samples.

Conclusion: FishAI 2.0 improves marine fish identification efficiency and accuracy, providing scalable technical solution for marine ecological monitoring and conservation with practical applicability.

Abstract: Traditional marine biological image recognition faces challenges of incomplete datasets and unsatisfactory model accuracy, particularly for few-shot conditions of rare species where data scarcity significantly hampers the performance. To address these issues, this study proposes an intelligent marine fish recognition framework, FishAI 2.0, integrating multimodal few-shot deep learning techniques with image generation for data augmentation. First, a hierarchical marine fish benchmark dataset, which provides a comprehensive data foundation for subsequent model training, is utilized to train the FishAI 2.0 model. To address the data scarcity of rare classes, the large language model DeepSeek was employed to generate high-quality textual descriptions, which are input into Stable Diffusion 2 for image augmentation through a hierarchical diffusion strategy that extracts latent encoding to construct a multimodal feature space. The enhanced visual-textual datasets were then fed into a Contrastive Language-Image Pre-Training (CLIP) based model, enabling robust few-shot image recognition. Experimental results demonstrate that FishAI 2.0 achieves a Top-1 accuracy of 91.67 percent and Top-5 accuracy of 97.97 percent at the family level, outperforming baseline CLIP and ViT models with a substantial margin for the minority classes with fewer than 10 training samples. To better apply FishAI 2.0 to real-world scenarios, at the genus and species level, FishAI 2.0 respectively achieves a Top-1 accuracy of 87.58 percent and 85.42 percent, demonstrating practical utility. In summary, FishAI 2.0 improves the efficiency and accuracy of marine fish identification and provides a scalable technical solution for marine ecological monitoring and conservation, highlighting its scientific value and practical applicability.

[392] Feature Space Analysis by Guided Diffusion Model

Kimiaki Shirahama, Miki Yanobu, Kaduki Yamashita, Miho Ohsaki

Main category: cs.CV

TL;DR: A decoder using guided diffusion generates images with features matching user-specified ones, enabling analysis of DNN feature spaces without additional training.

Details

Motivation: To address the black-box nature of DNNs by enabling analysis of their internal feature extraction processes in vision domains.

Method: Implemented as a guided diffusion model that minimizes Euclidean distance between generated image features and user-specified features during reverse image generation.

Result: Generated images have features remarkably similar to user-specified ones, providing insights into feature spaces of CLIP, ResNet-50, and vision transformers.

Conclusion: The decoder successfully analyzes DNN feature spaces across different architectures without retraining, revealing valuable insights about encoded image attributes.

Abstract: One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee that is missed in past studies, our decoder allows us to evidence which of various image attributes are encoded into the user-specified feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP’s image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs’ feature spaces.

[393] Brain Tumor Classification from MRI Scans via Transfer Learning and Enhanced Feature Representation

Ahta-Shamul Hoque Emran, Hafija Akter, Abdullah Al Shiam, Abu Saleh Musa Miah, Anichur Rahman, Fahmid Al Farid, Hezerul Abdul Karim

Main category: cs.CV

TL;DR: This paper proposes an automatic deep-learning framework for brain tumor detection from MRI scans using ResNet50 feature extraction and a novel Dense-Dropout sequence, along with creating a new balanced dataset (MMCBT) to address data scarcity.

Details

Motivation: Timely detection of brain tumors is critical for improving patient outcomes, and there is a lack of reliable brain tumor MRI resources for research.

Method: Uses pre-trained ResNet50 for feature extraction with Global Average Pooling and linear projection, followed by a novel Dense-Dropout sequence to enhance non-linear feature learning and reduce overfitting.

Result: Created the MMCBT dataset with 209 subjects (3671 tumor and 13273 non-tumor images) and developed a framework that addresses class imbalance through augmentation.

Conclusion: The proposed framework and new dataset provide an effective solution for automated brain tumor detection with improved robustness and reduced overfitting.

Abstract: Brain tumors are abnormal cell growths in the central nervous system (CNS), and their timely detection is critical for improving patient outcomes. This paper proposes an automatic and efficient deep-learning framework for brain tumor detection from magnetic resonance imaging (MRI) scans. The framework employs a pre-trained ResNet50 model for feature extraction, followed by Global Average Pooling (GAP) and linear projection to obtain compact, high-level image representations. These features are then processed by a novel Dense-Dropout sequence, a core contribution of this work, which enhances non-linear feature learning, reduces overfitting, and improves robustness through diverse feature transformations. Another major contribution is the creation of the Mymensingh Medical College Brain Tumor (MMCBT) dataset, designed to address the lack of reliable brain tumor MRI resources. The dataset comprises MRI scans from 209 subjects (ages 9 to 65), including 3671 tumor and 13273 non-tumor images, all clinically verified under expert supervision. To overcome class imbalance, the tumor class was augmented, resulting in a balanced dataset well-suited for deep learning research.

[394] Hemorica: A Comprehensive CT Scan Dataset for Automated Brain Hemorrhage Classification, Segmentation, and Detection

Kasra Davoodi, Mohammad Hoseyni, Javad Khoramdel, Reza Barati, Reihaneh Mortazavi, Amirhossein Nikoofard, Mahdi Aliyari-Shoorehdeli, Jaber Hatam Parikhan

Main category: cs.CV

TL;DR: Hemorica is a publicly available dataset of 372 head CT scans with comprehensive annotations for 5 intracranial hemorrhage subtypes, providing patient-wise and slice-wise labels, bounding boxes, and 2D/3D masks to support AI development for ICH diagnosis.

Details

Motivation: To address the fragmented public data problem hindering robust AI solutions for timely intracranial hemorrhage diagnosis on CT scans by providing a unified, clinically realistic benchmark dataset.

Method: Created Hemorica dataset with 372 head CT exams (2012-2024) using double-reading workflow with neurosurgeon adjudication. Fine-tuned standard CNN and transformer architectures for binary classification and segmentation tasks.

Result: Lightweight models achieved 87.8% F1 score for binary classification and 85.5% Dice score for lesion segmentation, validating annotation quality and sufficient sample size.

Conclusion: Hemorica provides a unified benchmark supporting multi-task learning, transfer to weakly labeled cohorts, and facilitates AI assistant development for ICH detection and quantification.

Abstract: Timely diagnosis of Intracranial hemorrhage (ICH) on Computed Tomography (CT) scans remains a clinical priority, yet the development of robust Artificial Intelligence (AI) solutions is still hindered by fragmented public data. To close this gap, we introduce Hemorica, a publicly available collection of 372 head CT examinations acquired between 2012 and 2024. Each scan has been exhaustively annotated for five ICH subtypes-epidural (EPH), subdural (SDH), subarachnoid (SAH), intraparenchymal (IPH), and intraventricular (IVH)-yielding patient-wise and slice-wise classification labels, subtype-specific bounding boxes, two-dimensional pixel masks and three-dimensional voxel masks. A double-reading workflow, preceded by a pilot consensus phase and supported by neurosurgeon adjudication, maintained low inter-rater variability. Comprehensive statistical analysis confirms the clinical realism of the dataset. To establish reference baselines, standard convolutional and transformer architectures were fine-tuned for binary slice classification and hemorrhage segmentation. With only minimal fine-tuning, lightweight models such as MobileViT-XS achieved an F1 score of 87.8% in binary classification, whereas a U-Net with a DenseNet161 encoder reached a Dice score of 85.5% for binary lesion segmentation that validate both the quality of the annotations and the sufficiency of the sample size. Hemorica therefore offers a unified, fine-grained benchmark that supports multi-task and curriculum learning, facilitates transfer to larger but weakly labelled cohorts, and facilitates the process of designing an AI-based assistant for ICH detection and quantification systems.

[395] MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

Feilong Chen, Yijiang Liu, Yi Huang, Hao Wang, Miren Tian, Ya-Qi Yu, Minghui Liao, Jihao Wu

Main category: cs.CV

TL;DR: MindVL is a multimodal large language model trained on Ascend NPUs, demonstrating superior data efficiency by matching state-of-the-art models using only 3-10% of their training data.

Details

Motivation: To address the limitations of current MLLM training being confined to specific hardware platforms and relying on undisclosed data recipes, which hinders reproducibility and open research.

Method: Developed MindSpeed-MLLM framework for efficient training on Ascend hardware, systematic data production methods, and weight averaging from checkpoints trained with different sequence lengths combined with test-time resolution search.

Result: MindVL-8B matches Qwen2.5VL-7B performance with only 10% training data, and MindVL-671B-A37B matches Qwen2.5VL-72B with only 3% training data, achieving comparable performance with leading multimodal MoE models.

Conclusion: The work provides a valuable hardware alternative, open data recipes, and effective performance-enhancing techniques for the community.

Abstract: We propose MindVL, a multimodal large language model (MLLMs) trained on Ascend NPUs. The training of state-of-the-art MLLMs is often confined to a limited set of hardware platforms and relies heavily on massive, undisclosed data recipes, which hinders reproducibility and open research. To change the common perception that Ascend hardware is unsuitable for efficient full-stage MLLM training, we introduce MindSpeed-MLLM, a highly efficient training framework that supports stable and high-performance training of large-scale Dense and Mixture-of-Experts (MoE) models on Ascend hardware. Based on this, we provide a systematic and open description of the data production methods and mixing strategies for all training stages. Furthermore, we present MindVL, a data-efficient multimodal large language model trained end-to-end on Ascend NPUs. In addition, we find that averaging weights from checkpoints trained with different sequence lengths is particularly effective and yields further gains when combined with test-time resolution search. Our experiments demonstrate superior data efficiency: MindVL-8B matches the performance of Qwen2.5VL-7B using only 10% of its training data, while our MoE model, MindVL-671B-A37B, matches Qwen2.5VL-72B using only 3% of the Qwen2.5VL training data, and achieves comparable performance with other leading multimodal MoE models. Our work provides the community with a valuable hardware alternative, open data recipes, and effective performance-enhancing techniques.

[396] ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View

Wenbin Teng, Gonglin Chen, Haiwei Chen, Yajie Zhao

Main category: cs.CV

TL;DR: ARSS is a novel autoregressive framework for novel view generation from single images using GPT-style decoder-only models, outperforming diffusion models by maintaining temporal consistency through camera trajectory conditioning and spatial token permutation.

Details

Motivation: Diffusion models have limitations in world modeling tasks like novel view generation due to non-causal generation causing distortions and inconsistencies across views, while autoregressive models offer causal generation that can better adapt accumulated knowledge to new queries.

Method: Uses GPT-style decoder-only AR model with video tokenizer for discrete token mapping, camera encoder for 3D positional guidance, and autoregressive transformer module that randomly permutes spatial token order while maintaining temporal order to enhance generation quality.

Result: Extensive experiments on public datasets show ARSS performs comparably to or better than state-of-the-art diffusion-based view synthesis approaches, demonstrating superior consistency and quality in novel view generation.

Conclusion: ARSS successfully addresses diffusion model limitations for world modeling tasks by leveraging autoregressive generation with camera trajectory conditioning, providing a promising approach for consistent and high-quality novel view synthesis from sparse inputs.

Abstract: Despite their exceptional generative quality, diffusion models have limited applicability to world modeling tasks, such as novel view generation from sparse inputs. This limitation arises because diffusion models generate outputs in a non-causal manner, often leading to distortions or inconsistencies across views, and making it difficult to incrementally adapt accumulated knowledge to new queries. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce \textbf{ARSS}, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ a video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then to enhance generation quality while preserving the autoregressive structure, we propose a autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Extensive qualitative and quantitative experiments on public datasets demonstrate that our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models. Our code will be released upon paper acceptance.

[397] Disentangling Static and Dynamic Information for Reducing Static Bias in Action Recognition

Masato Kobayashi, Ning Ding, Toru Tamaki

Main category: cs.CV

TL;DR: Proposes a method to reduce static bias in action recognition by separating temporal dynamics from static scene information using statistical independence and scene prediction losses.

Details

Motivation: Action recognition models excessively rely on static cues rather than dynamic human motion, leading to poor real-world performance and zero-shot recognition issues.

Method: Separates temporal dynamic information from static scene information using statistical independence loss between biased and unbiased streams, combined with scene prediction loss.

Result: Effectively reduces static bias in action recognition models and confirms the importance of scene prediction loss.

Conclusion: The proposed method successfully addresses static bias in action recognition by explicitly separating dynamic and static information through statistical independence and scene prediction losses.

Abstract: Action recognition models rely excessively on static cues rather than dynamic human motion, which is known as static bias. This bias leads to poor performance in real-world applications and zero-shot action recognition. In this paper, we propose a method to reduce static bias by separating temporal dynamic information from static scene information. Our approach uses a statistical independence loss between biased and unbiased streams, combined with a scene prediction loss. Our experiments demonstrate that this method effectively reduces static bias and confirm the importance of scene prediction loss.

[398] Desensitizing for Improving Corruption Robustness in Point Cloud Classification through Adversarial Training

Zhiqiang Tian, Weigang Li, Chunhua Deng, Junwei Hu, Yongqiang Wang, Wenping Liu

Main category: cs.CV

TL;DR: This paper proposes Desensitized Adversarial Training (DesenAT) to address DNNs’ over-reliance on point cloud features, which makes them vulnerable to corrupted data. The method uses feature desensitization and self-distillation to improve model robustness against point cloud corruption.

Details

Motivation: Point cloud corruption is inevitable due to scene complexity and sensor inaccuracies, and DNNs' over-reliance on input features makes them vulnerable. The study investigates whether this issue exists in 3D tasks and whether reducing feature dependence can enhance robustness to corrupted point clouds.

Method: The authors propose Desensitized Adversarial Training (DesenAT): 1) Quantify DNN sensitivity to point cloud features using Shapley values, 2) Generate adversarial samples by eliminating high-contribution data points and using spatial transformation to simulate corruption, 3) Use self-distillation to transfer knowledge from clean samples to adversarial samples and conduct adversarial training in a distillation framework.

Result: Extensive experiments on ModelNet-C and PointCloud-C show that DesenAT effectively improves model robustness against corrupted point clouds without reducing performance on clean datasets. The method outperforms traditional training approaches in handling point cloud corruption.

Conclusion: Reducing DNNs’ over-reliance on point cloud features through desensitized adversarial training can significantly enhance model robustness to corrupted data while maintaining performance on clean datasets, addressing a critical vulnerability in 3D point cloud processing tasks.

Abstract: Due to scene complexity, sensor inaccuracies, and processing imprecision, point cloud corruption is inevitable. Over-reliance on input features is the root cause of DNN vulnerabilities. It remains unclear whether this issue exists in 3D tasks involving point clouds and whether reducing dependence on these features can enhance the model’s robustness to corrupted point clouds. This study attempts to answer these questions. Specifically, we quantified the sensitivity of the DNN to point cloud features using Shapley values and found that models trained using traditional methods exhibited high sensitivity values for certain features. Furthermore, under an equal pruning ratio, prioritizing the pruning of highly sensitive features causes more severe damage to model performance than random pruning. We propose `Desensitized Adversarial Training’ (DesenAT), generating adversarial samples using feature desensitization and conducting training within a self-distillation framework, which aims to alleviate DNN’s over-reliance on point clouds features by smoothing sensitivity. First, data points with high contribution components are eliminated, and spatial transformation is used to simulate corruption scenes, generate adversarial samples, and conduct adversarial training on the model. Next, to compensate for information loss in adversarial samples, we use the self-distillation method to transfer knowledge from clean samples to adversarial samples, and perform adversarial training in a distillation manner.Extensive experiments on ModelNet-C and PointCloud-C demonstrate show that the propose method can effectively improve the robustness of the model without reducing the performance of clean data sets. This code is publicly available at \href{https://github.com/JerkyT/DesenAT/tree/master}{https://github.com/JerkyT/DesenAT}.

[399] Geometry-Aware Losses for Structure-Preserving Text-to-Sign Language Generation

Zetian Wu, Tianshuo Zhou, Stefan Lee, Liang Huang

Main category: cs.CV

TL;DR: A novel text-to-sign-language translation method that incorporates anatomical constraints and joint relationships to generate more natural and biomechanically plausible body movements.

Details

Motivation: To address the challenge of generating accurate and natural body poses in sign language translation, as prior methods often produce rigid or biomechanically implausible outputs by neglecting anatomical constraints.

Method: Explicitly models skeletal joint relationships with geometric constraints on joint positions, bone lengths, and movement dynamics. Uses parent-relative reweighting for finger flexibility, bone-pose losses, and bone-length constraints to enforce anatomical consistency.

Result: Reduces performance gap between previous best and ground-truth by 56.51%, decreases bone length discrepancies by 18.76%, and reduces movement variance by 5.48%.

Conclusion: The proposed method significantly improves anatomical realism and motion naturalness in sign language translation through explicit modeling of skeletal constraints.

Abstract: Sign language translation from text to video plays a crucial role in enabling effective communication for Deaf and hard–of–hearing individuals. A major challenge lies in generating accurate and natural body poses and movements that faithfully convey intended meanings. Prior methods often neglect the anatomical constraints and coordination patterns of human skeletal motion, resulting in rigid or biomechanically implausible outputs. To address this, we propose a novel approach that explicitly models the relationships among skeletal joints–including shoulders, arms, and hands–by incorporating geometric constraints on joint positions, bone lengths, and movement dynamics. During training, we introduce a parent-relative reweighting mechanism to enhance finger flexibility and reduce motion stiffness. Additionally, bone-pose losses and bone-length constraints enforce anatomically consistent structures. Our method narrows the performance gap between the previous best and the ground-truth oracle by 56.51%, and further reduces discrepancies in bone length and movement variance by 18.76% and 5.48%, respectively, demonstrating significant gains in anatomical realism and motion naturalness.

[400] Planning with Unified Multimodal Models

Yihao Sun, Zhilong Zhang, Yang Yu, Pierre-Luc Bacon

Main category: cs.CV

TL;DR: Uni-Plan is a planning framework using unified multimodal models (UMMs) that enables visual reasoning for decision-making, achieving better success rates than VLM-based methods without needing expert demonstrations.

Details

Motivation: Current LLM/VLM-based decision-making approaches rely solely on language reasoning, limiting their ability to make informed decisions. UMMs that support multimodal inputs/outputs offer greater potential by enabling reasoning through generated visual content.

Method: Built on UMMs where a single model serves as policy, dynamics model, and value function. Introduces self-discriminated filtering to avoid hallucinations in dynamics predictions by using the generative model as a self-discriminator.

Result: Experiments on long-horizon planning tasks show Uni-Plan substantially improves success rates compared to VLM-based methods, demonstrates strong data scalability, requires no expert demonstrations, and achieves better performance with same training data size.

Conclusion: This work lays a foundation for future research in reasoning and decision-making with unified multimodal models, showing the potential of visual reasoning for improved planning capabilities.

Abstract: With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach self-discriminated filtering, where the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.

[401] Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy

Xiafeng Man, Zhipeng Wei, Jingjing Chen

Main category: cs.CV

TL;DR: D-Plus-Minus (DPM) is a novel framework that detects copyright infringement in text-to-image diffusion models by measuring output deviations caused by specific training data, using differential privacy principles and fine-tuning simulations.

Details

Motivation: Address legal and ethical concerns about large vision models memorizing copyrighted content, and overcome limitations of existing detection methods that lack robustness and theoretical foundations.

Method: Formalize copyright detection using Differential Privacy concepts, introduce conditional sensitivity metric, simulate inclusion/exclusion via fine-tuning (learning/unlearning), compute confidence scores over orthogonal prompt distributions, and create CIDD benchmark dataset.

Result: DPM reliably detects infringement content without needing original training data or text prompts, providing an interpretable and practical solution for intellectual property protection.

Conclusion: The proposed framework offers a theoretically grounded, robust, and practical approach to copyright infringement detection in generative AI models, addressing critical legal and ethical challenges.

Abstract: The widespread deployment of large vision models such as Stable Diffusion raises significant legal and ethical concerns, as these models can memorize and reproduce copyrighted content without authorization. Existing detection approaches often lack robustness and fail to provide rigorous theoretical underpinnings. To address these gaps, we formalize the concept of copyright infringement and its detection from the perspective of Differential Privacy (DP), and introduce the conditional sensitivity metric, a concept analogous to sensitivity in DP, that quantifies the deviation in a diffusion model’s output caused by the inclusion or exclusion of a specific training data point. To operationalize this metric, we propose D-Plus-Minus (DPM), a novel post-hoc detection framework that identifies copyright infringement in text-to-image diffusion models. Specifically, DPM simulates inclusion and exclusion processes by fine-tuning models in two opposing directions: learning or unlearning. Besides, to disentangle concept-specific influence from the global parameter shifts induced by fine-tuning, DPM computes confidence scores over orthogonal prompt distributions using statistical metrics. Moreover, to facilitate standardized benchmarking, we also construct the Copyright Infringement Detection Dataset (CIDD), a comprehensive resource for evaluating detection across diverse categories. Our results demonstrate that DPM reliably detects infringement content without requiring access to the original training dataset or text prompts, offering an interpretable and practical solution for safeguarding intellectual property in the era of generative AI.

[402] Perceptual Influence: Improving the Perceptual Loss Design for Low-Dose CT Enhancement

Gabriel A. Viana, Luis F. Alves Pereira, Tsang Ing Ren, George D. C. Cavalcanti, Jan Sijbers

Main category: cs.CV

TL;DR: This paper introduces a principled framework to assess perceptual loss design choices for LDCT denoising, showing that widely used configurations underperform compared to better-designed alternatives that significantly improve noise reduction and structural fidelity.

Details

Motivation: Perceptual losses are powerful for LDCT enhancement but involve critical underexplored design decisions including feature representation level, pretraining dataset, and optimization weighting. Current literature lacks systematic analysis of these choices.

Method: Proposed perceptual influence metric to quantify perceptual loss contribution, and developed a principled framework to systematically assess loss design choices through experimentation.

Result: Better perceptual loss designs lead to significant improvements in noise reduction and structural fidelity of reconstructed CT images without requiring network architecture changes. Widely used configurations underperform compared to optimized alternatives.

Conclusion: Provides objective guidelines supported by statistical analysis for effective use of perceptual losses in LDCT denoising, demonstrating that proper loss design can substantially enhance reconstruction quality.

Abstract: Perceptual losses have emerged as powerful tools for training networks to enhance Low-Dose Computed Tomography (LDCT) images, offering an alternative to traditional pixel-wise losses such as Mean Squared Error, which often lead to over-smoothed reconstructions and loss of clinically relevant details in LDCT images. The perceptual losses operate in a latent feature space defined by a pretrained encoder and aim to preserve semantic content by comparing high-level features rather than raw pixel values. However, the design of perceptual losses involves critical yet underexplored decisions, including the feature representation level, the dataset used to pretrain the encoder, and the relative importance assigned to the perceptual component during optimization. In this work, we introduce the concept of perceptual influence (a metric that quantifies the relative contribution of the perceptual loss term to the total loss) and propose a principled framework to assess the impact of the loss design choices on the model training performance. Through systematic experimentation, we show that the widely used configurations in the literature to set up a perceptual loss underperform compared to better-designed alternatives. Our findings show that better perceptual loss designs lead to significant improvements in noise reduction and structural fidelity of reconstructed CT images, without requiring any changes to the network architecture. We also provide objective guidelines, supported by statistical analysis, to inform the effective use of perceptual losses in LDCT denoising. Our source code is available at https://github.com/vngabriel/perceptual-influence.

Tomohiro Tanaka, Narumasa Tsutsumida

Main category: cs.CV

TL;DR: A lightweight multi-modal transformer model (Presto) for flood detection that works with SAR-only, MS-only, or combined SAR+MS data, achieving high performance while being computationally efficient.

Details

Motivation: Address operational challenges in flood mapping: single-sensor approaches face weather-dependent data availability and limited revisit periods, while multi-sensor fusion requires substantial computational resources and large datasets.

Method: Fine-tuned Presto, a lightweight multi-modal pre-trained transformer (0.4M parameters) that processes both SAR and multispectral data at pixel level, enabling flood mapping with flexible sensor inputs through single model architecture.

Result: Achieved F1 score of 0.896 and mIoU of 0.886 in optimal sensor-fusion scenario, outperforming Prithvi-100M baseline. Maintained strong performance in MS-only (F1: 0.893) and functional capability in SAR-only (F1: 0.718) scenarios.

Conclusion: The parameter-efficient, sensor-flexible approach provides accessible and robust solution for real-world disaster scenarios requiring immediate flood extent assessment regardless of sensor availability constraints.

Abstract: Floods are increasingly frequent natural disasters causing extensive human and economic damage, highlighting the critical need for rapid and accurate flood inundation mapping. While remote sensing technologies have advanced flood monitoring capabilities, operational challenges persist: single-sensor approaches face weather-dependent data availability and limited revisit periods, while multi-sensor fusion methods require substantial computational resources and large-scale labeled datasets. To address these limitations, this study introduces a novel sensor-flexible flood detection methodology by fine-tuning Presto, a lightweight ($\sim$0.4M parameters) multi-modal pre-trained transformer that processes both Synthetic Aperture Radar (SAR) and multispectral (MS) data at the pixel level. Our approach uniquely enables flood mapping using SAR-only, MS-only, or combined SAR+MS inputs through a single model architecture, addressing the critical operational need for rapid response with whatever sensor data becomes available first during disasters. We evaluated our method on the Sen1Floods11 dataset against the large-scale Prithvi-100M baseline ($\sim$100M parameters) across three realistic data availability scenarios. The proposed model achieved superior performance with an F1 score of 0.896 and mIoU of 0.886 in the optimal sensor-fusion scenario, outperforming the established baseline. Crucially, the model demonstrated robustness by maintaining effective performance in MS-only scenarios (F1: 0.893) and functional capabilities in challenging SAR-only conditions (F1: 0.718), confirming the advantage of multi-modal pre-training for operational flood mapping. Our parameter-efficient, sensor-flexible approach offers an accessible and robust solution for real-world disaster scenarios requiring immediate flood extent assessment regardless of sensor availability constraints.

[404] GeLoc3r: Enhancing Relative Camera Pose Regression with Geometric Consistency Regularization

Jingxing Li, Yongjae Lee, Deliang Fan

Main category: cs.CV

TL;DR: GeLoc3r enhances relative camera pose estimation by training regression networks with Geometric Consistency Regularization (GCR), achieving both fast inference speed and high accuracy approaching correspondence-based methods.

Details

Motivation: To overcome the speed-accuracy dilemma in camera pose estimation, where prior regression methods like ReLoc3R are fast but have geometric inconsistencies, while correspondence-based methods like MASt3R are accurate but slow.

Method: Uses ground-truth depth during training to generate dense 3D-2D correspondences, weights them with a FusionTransformer, computes geometrically-consistent poses via weighted RANSAC, and creates a consistency loss to transfer geometric knowledge into the regression network.

Result: Significantly outperforms ReLoc3R across benchmarks: 40.45% vs 34.85% AUC@5° on CO3Dv2 (16% improvement), 68.66% vs 66.70% on RealEstate10K, and 50.45% vs 49.60% on MegaDepth1500.

Conclusion: GeLoc3r represents a paradigm shift by teaching geometric consistency during training rather than enforcing it at inference, achieving both regression speed and correspondence method accuracy.

Abstract: Prior ReLoc3R achieves breakthrough performance with fast 25ms inference and state-of-the-art regression accuracy, yet our analysis reveals subtle geometric inconsistencies in its internal representations that prevent reaching the precision ceiling of correspondence-based methods like MASt3R (which require 300ms per pair). In this work, we present GeLoc3r, a novel approach to relative camera pose estimation that enhances pose regression methods through Geometric Consistency Regularization (GCR). GeLoc3r overcomes the speed-accuracy dilemma by training regression networks to produce geometrically consistent poses without inference-time geometric computation. During training, GeLoc3r leverages ground-truth depth to generate dense 3D-2D correspondences, weights them using a FusionTransformer that learns correspondence importance, and computes geometrically-consistent poses via weighted RANSAC. This creates a consistency loss that transfers geometric knowledge into the regression network. Unlike FAR method which requires both regression and geometric solving at inference, GeLoc3r only uses the enhanced regression head at test time, maintaining ReLoc3R’s fast speed and approaching MASt3R’s high accuracy. On challenging benchmarks, GeLoc3r consistently outperforms ReLoc3R, achieving significant improvements including 40.45% vs. 34.85% AUC@5{\deg} on the CO3Dv2 dataset (16% relative improvement), 68.66% vs. 66.70% AUC@5{\deg} on RealEstate10K, and 50.45% vs. 49.60% on MegaDepth1500. By teaching geometric consistency during training rather than enforcing it at inference, GeLoc3r represents a paradigm shift in how neural networks learn camera geometry, achieving both the speed of regression and the geometric understanding of correspondence methods.

Ye-eun Kim, Suhyeon Lim, Andrew J. Choi

Main category: cs.CV

TL;DR: This paper proposes a multimodal deep learning system for recognizing upper limb activities of daily living (ADL) in stroke patients, addressing the limitations of existing HAR systems that focus on non-disabled individuals.

Details

Motivation: To address the shortage of rehabilitation therapy for stroke patients by developing remote monitoring systems that reduce medical staff burden, specifically focusing on stroke patients' action recognition which has been neglected in existing HAR research.

Method: Designed a monitoring system using IMU sensors and RGB-D camera, collected a dataset directly from stroke patients, developed appropriate preprocessing methods, and proposed a deep learning model suitable for processing multimodal data.

Result: Found that stroke patients’ action data shows less clustering than non-disabled individuals, and the proposed model learns similar tendencies for each label in data with difficult-to-cluster features.

Conclusion: The study demonstrates potential for expanding deep learning models to not only recognize simple actions but also provide assessment feedback for domiciliary rehabilitation of stroke patients.

Abstract: Rehabilitation therapy for stroke patients faces a supply shortage despite the increasing demand. To address this issue, remote monitoring systems that reduce the burden on medical staff are emerging as a viable alternative. A key component of these remote monitoring systems is Human Action Recognition (HAR) technology, which classifies actions. However, existing HAR studies have primarily focused on non-disable individuals, making them unsuitable for recognizing the actions of stroke patients. HAR research for stroke has largely concentrated on classifying relatively simple actions using machine learning rather than deep learning. In this study, we designed a system to monitor the actions of stroke patients, focusing on domiciliary upper limb Activities of Daily Living (ADL). Our system utilizes IMU (Inertial Measurement Unit) sensors and an RGB-D camera, which are the most common modalities in HAR. We directly collected a dataset through this system, investigated an appropriate preprocess and proposed a deep learning model suitable for processing multimodal data. We analyzed the collected dataset and found that the action data of stroke patients is less clustering than that of non-disabled individuals. Simultaneously, we found that the proposed model learns similar tendencies for each label in data with features that are difficult to clustering. This study suggests the possibility of expanding the deep learning model, which has learned the action features of stroke patients, to not only simple action recognition but also feedback such as assessment contributing to domiciliary rehabilitation in future research. The code presented in this study is available at https://github.com/ye-Kim/MMeViT.

[406] Activation Matching for Explanation Generation

Pirzada Suhail, Aditya Anand, Amit Sethi

Main category: cs.CV

TL;DR: An activation-matching approach generates minimal binary masks that preserve both model predictions and intermediate activations, creating faithful explanations for classifier decisions on images.

Details

Motivation: To create minimal yet faithful explanations for pretrained classifiers that preserve both the model's prediction and internal representations while being human-interpretable.

Method: Train a lightweight autoencoder to output binary masks using multi-layer activation matching (KL divergence + cross-entropy), mask priors (L1 area, binarization penalty, total variation), and abductive constraints for faithfulness.

Result: Produces small, crisp binary masks that retain classifier behavior while removing irrelevant image regions, yielding practical minimalist explanations.

Conclusion: The combined objectives successfully generate human-interpretable masks that provide faithful explanations of model decision-making by preserving both predictions and internal activations.

Abstract: In this paper we introduce an activation-matching–based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image (x) and a frozen model (f), we train a lightweight autoencoder to output a binary mask (m) such that the explanation (e = m \odot x) preserves both the model’s prediction and the intermediate activations of (x). Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors – L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision making of the underlying model.

[407] Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis

Ruilang Wang, Shuotong Xu, Bowen Liu, Runlin Huang, Donglong Chen, Weifeng Su

Main category: cs.CV

TL;DR: Mask What Matters is a text-guided masking framework for self-supervised medical image analysis that uses vision-language models to apply differentiated masking on diagnostically relevant regions, achieving better performance with lower masking ratios.

Details

Motivation: Address the challenges of data scarcity in medical imaging and improve upon existing self-supervised masked image modeling approaches that rely on random high-ratio masking, leading to inefficiency and poor semantic alignment.

Method: Leverages vision-language models for prompt-based region localization to apply differentiated masking that emphasizes diagnostically relevant regions while reducing redundancy in background areas, enabling controllable text-guided masking.

Result: Consistently outperforms existing MIM methods across multiple medical imaging modalities (brain MRI, chest CT, lung X-ray), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in BoxAP, and +1.1 in MaskAP for detection, with substantially lower overall masking ratios (40% vs 70%).

Conclusion: Controllable, text-driven masking enables semantically aligned self-supervised learning and advances the development of robust vision models for medical image analysis.

Abstract: The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40% vs. 70%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.

[408] FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection

Ben Liang, Yuan Liu, Bingwen Qiu, Yihong Wang, Xiubao Sui, Qian Chen

Main category: cs.CV

TL;DR: FMC-DETR is a novel aerial-view object detection framework that uses frequency-decoupled fusion to address tiny object detection challenges in high-resolution imagery, achieving state-of-the-art performance with improved global context modeling.

Details

Motivation: Aerial-view object detection faces challenges in detecting tiny objects due to limited visual cues and difficulty modeling global context. Existing methods suffer from delayed contextual fusion and inadequate non-linear modeling, creating performance bottlenecks.

Method: Proposes FMC-DETR with three key components: 1) WeKat backbone using cascaded wavelet transforms and Kolmogorov-Arnold networks for global context and non-linear modeling, 2) Cross-stage Partial Fusion module for efficient multi-scale feature interaction, 3) Multi-Domain Feature Coordination module unifying spatial, frequency, and structural priors.

Result: Achieves state-of-the-art performance on benchmark datasets with fewer parameters. On VisDrone dataset, improves by 6.5% AP and 8.2% AP50 over baseline, demonstrating effectiveness in tiny object detection.

Conclusion: FMC-DETR effectively addresses tiny object detection challenges in aerial imagery through frequency-decoupled fusion, achieving superior performance with efficient parameter usage.

Abstract: Aerial-view object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based search and rescue. Detecting tiny objects in high-resolution aerial imagery presents a long-standing challenge due to their limited visual cues and the difficulty of modeling global context in complex scenes. Existing methods are often hampered by delayed contextual fusion and inadequate non-linear modeling, failing to effectively use global information to refine shallow features and thus encountering a performance bottleneck. To address these challenges, we propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection. First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception in shallow features while preserving fine-grained details, and employs Kolmogorov-Arnold networks to achieve adaptive non-linear modeling of multi-scale dependencies. Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction. Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to to balance detail preservation and global enhancement. Extensive experiments on benchmark aerial-view datasets demonstrate that FMC-DETR achieves state-of-the-art performance with fewer parameters. On the challenging VisDrone dataset, our model achieves improvements of 6.5% AP and 8.2% AP50 over the baseline, highlighting its effectiveness in tiny object detection. The code can be accessed at https://github.com/bloomingvision/FMC-DETR.

[409] Follow-Your-Preference: Towards Preference-Aligned Image Inpainting

Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, Yue Ma

Main category: cs.CV

TL;DR: This paper revisits fundamental problems in image inpainting preference alignment using direct preference optimization with public reward models, finding that most reward models provide valid scores for preference data construction, preference data shows robust scaling trends, reward models have observable biases, and simple ensemble methods yield robust results.

Details

Motivation: To investigate fundamental problems in achieving preference alignment for image inpainting rather than introducing novel methods, and to establish a solid baseline for this promising research direction.

Method: Leveraged direct preference optimization approach for alignment training, employed public reward models to construct preference training datasets, conducted experiments across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms.

Result: Alignment models significantly outperformed prior models across standard metrics, GPT-4 assessments, and human evaluations without any changes to model structures or use of new datasets. Simple ensemble of reward models yielded robust and generalizable results by mitigating biases.

Conclusion: The work establishes a simple yet solid baseline for image inpainting preference alignment, demonstrating that fundamental approaches with existing tools can achieve significant improvements, and hopes to push this promising research frontier forward.

Abstract: This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, render them susceptible to cause reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier. Our code is open-sourced at: https://github.com/shenytzzz/Follow-Your-Preference.

[410] Streamline pathology foundation model by cross-magnification distillation

Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Anil V. Parwani, Muhammad Khalid Khan Niazi

Main category: cs.CV

TL;DR: XMAG is a lightweight foundation model for computational pathology that uses cross-magnification distillation to transfer knowledge from 20x to 5x magnification, achieving near-state-of-the-art performance with 30x faster processing.

Details

Motivation: Foundation models in computational pathology are computationally prohibitive for clinical deployment due to massive parameter counts and high-magnification processing requirements.

Method: Cross-magnification distillation framework that transfers knowledge from 20x magnification teacher to 5x magnification student architecture, using dual-level knowledge transfer (global image representations and local spatial token mapping) and trained on 3.49 million images.

Result: XMAG achieved diagnostic accuracy within 1% of larger foundation models while delivering 30-fold processing acceleration (8.8 WSIs per minute) and demonstrated robust generalization in cross-institutional validation.

Conclusion: Cross-magnification distillation is a viable approach for deploying foundation model capabilities in resource-constrained clinical environments, potentially enabling real-time pathology AI integration.

Abstract: Foundation models (FM) have transformed computational pathology but remain computationally prohibitive for clinical deployment due to their massive parameter counts and high-magnification processing requirements. Here, we introduce XMAG, a lightweight FM developed through corss-magnification distillation that transfers knowledge from state-of-the-art 20x magnification teacher to an efficient 5x magnification student architecture. XMAG employs a compact backbone and operates entirely at 5x, requiring 11.3 times fewer patches per whole slide image (WSI) compared to existing approaches. Our Novel distillation framework incorporates dual-level knowledge transfer, aligning both global image representations and local spatial token mapping. We trained XMAG on 3.49 million images curated from publicly available datasets and evaluated performance across six clinically relevant histopathology analysis tasks spanning multiple cancer types. XMAG achieved diagnostic accuracy within 1% of substantially larger foundation models while delivering 30-fold processing acceleration, reaching 8.8 WSIs per minute processing speed. Our cross-institutional validation confirmed robust generalization. Further, we developed an end-to-end training strategy to further boost our model’s performance to approach the larger FMs’ performance. These results establish cross-magnification distillation as a viable approach for deploying FM capabilities in resource-constrained clinical environments, potentially enabling real-time pathology AI integration.

[411] CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP

Na Min An, Inha Kang, Minhyun Lee, Hyunjung Shim

Main category: cs.CV

TL;DR: CoPatch is a zero-shot referring image segmentation framework that enhances spatial representations in vision-language models by constructing hybrid text features with spatial context and extracting patch-level image features from intermediate layers, achieving significant improvements without additional training.

Details

Motivation: Current vision-language models like CLIP struggle with spatial relationships - they focus on primary noun phrases in text and generate similar features for images with different spatial layouts, limiting spatial grounding capabilities for referring image segmentation.

Method: CoPatch leverages internal model components to enhance spatial representations: 1) constructs hybrid text features by incorporating context tokens with spatial cues, 2) extracts patch-level image features from intermediate layers where spatial structure is better preserved, and 3) fuses these into a clustered image-text similarity map (CoMap) for precise mask selection.

Result: CoPatch significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut datasets, achieving 2-7 mIoU improvements without requiring any additional training.

Conclusion: The findings demonstrate the importance of recovering and leveraging untapped spatial knowledge inherently embedded in vision-language models, paving the way for opportunities in zero-shot referring image segmentation.

Abstract: Spatial grounding is crucial for referring image segmentation (RIS), where the goal of the task is to localize an object described by language. Current foundational vision-language models (VLMs), such as CLIP, excel at aligning images and text but struggle with understanding spatial relationships. Within the language stream, most existing methods often focus on the primary noun phrase when extracting local text features, undermining contextual tokens. Within the vision stream, CLIP generates similar features for images with different spatial layouts, resulting in limited sensitivity to spatial structure. To address these limitations, we propose \textsc{CoPatch}, a zero-shot RIS framework that leverages internal model components to enhance spatial representations in both text and image modalities. For language, \textsc{CoPatch} constructs hybrid text features by incorporating context tokens carrying spatial cues. For vision, it extracts patch-level image features using our novel path discovered from intermediate layers, where spatial structure is better preserved. These enhanced features are fused into a clustered image-text similarity map, \texttt{CoMap}, enabling precise mask selection. As a result, \textsc{CoPatch} significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+ 2–7 mIoU) without requiring any additional training. Our findings underscore the importance of recovering and leveraging the untapped spatial knowledge inherently embedded in VLMs, thereby paving the way for opportunities in zero-shot RIS.

[412] Deep Learning for Oral Health: Benchmarking ViT, DeiT, BEiT, ConvNeXt, and Swin Transformer

Ajo Babu George, Sadhvik Bathini, Niranjana S R

Main category: cs.CV

TL;DR: Systematic comparison of 5 transformer architectures for dental disease classification, with ConvNeXt achieving best performance and addressing data imbalance challenges.

Details

Motivation: To evaluate and compare transformer-based architectures for multi-class dental disease classification, specifically addressing real-world challenges like data imbalance that are often overlooked in existing literature.

Method: Used the Oral Diseases dataset to train and validate Vision Transformer (ViT), DeiT, ConvNeXt, Swin Transformer, and BEiT models, measuring performance metrics including validation accuracy, precision, recall, and F1-score with emphasis on handling imbalanced classes.

Result: ConvNeXt achieved highest validation accuracy (81.06%), followed by BEiT (80.00%) and Swin Transformer (79.73%), all with strong F1-scores. ViT and DeiT achieved 79.37% and 78.79% respectively but struggled with Caries-related classes.

Conclusion: ConvNeXt, Swin Transformer, and BEiT showed reliable diagnostic performance and are promising for clinical dental imaging applications. Findings provide guidance for model selection in AI-driven oral disease diagnostics and highlight the importance of addressing data imbalance.

Abstract: Objective: The aim of this study was to systematically evaluate and compare the performance of five state-of-the-art transformer-based architectures - Vision Transformer (ViT), Data-efficient Image Transformer (DeiT), ConvNeXt, Swin Transformer, and Bidirectional Encoder Representation from Image Transformers (BEiT) - for multi-class dental disease classification. The study specifically focused on addressing real-world challenges such as data imbalance, which is often overlooked in existing literature. Study Design: The Oral Diseases dataset was used to train and validate the selected models. Performance metrics, including validation accuracy, precision, recall, and F1-score, were measured, with special emphasis on how well each architecture managed imbalanced classes. Results: ConvNeXt achieved the highest validation accuracy at 81.06, followed by BEiT at 80.00 and Swin Transformer at 79.73, all demonstrating strong F1-scores. ViT and DeiT achieved accuracies of 79.37 and 78.79, respectively, but both struggled particularly with Caries-related classes. Conclusions: ConvNeXt, Swin Transformer, and BEiT showed reliable diagnostic performance, making them promising candidates for clinical application in dental imaging. These findings provide guidance for model selection in future AI-driven oral disease diagnostic tools and highlight the importance of addressing data imbalance in real-world scenarios

[413] HTMA-Net: Towards Multiplication-Avoiding Neural Networks via Hadamard Transform and In-Memory Computing

Emadeldeen Hamdan, Ahmet Enis Cetin

Main category: cs.CV

TL;DR: HTMA-Net combines Hadamard Transform with multiplication-avoiding in-memory computing to reduce multiplication costs in neural networks while maintaining accuracy.

Details

Motivation: Reducing multiplication costs is critical for efficient deep neural network deployment in energy-constrained edge devices.

Method: Selectively replaces intermediate convolutions with Hybrid Hadamard-based transform layers whose internal convolutions are implemented via multiplication-avoiding in-memory operations using SRAM-based in-memory computing.

Result: Eliminates up to 52% of multiplications compared to baseline ResNet models while achieving comparable accuracy on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, with significant reduction in computational complexity and parameters.

Conclusion: Combining structured Hadamard transform layers with SRAM-based in-memory computing multiplication-avoiding operators is a promising path towards efficient deep learning architectures.

Abstract: Reducing the cost of multiplications is critical for efficient deep neural network deployment, especially in energy-constrained edge devices. In this work, we introduce HTMA-Net, a novel framework that integrates the Hadamard Transform (HT) with multiplication-avoiding (MA) SRAM-based in-memory computing to reduce arithmetic complexity while maintaining accuracy. Unlike prior methods that only target multiplications in convolutional layers or focus solely on in-memory acceleration, HTMA-Net selectively replaces intermediate convolutions with Hybrid Hadamard-based transform layers whose internal convolutions are implemented via multiplication-avoiding in-memory operations. We evaluate HTMA-Net on ResNet-18 using CIFAR-10, CIFAR-100, and Tiny ImageNet, and provide a detailed comparison against regular, MF-only, and HT-only variants. Results show that HTMA-Net eliminates up to 52% of multiplications compared to baseline ResNet-18, ResNet-20, and ResNet-50 models, while achieving comparable accuracy in evaluation and significantly reducing computational complexity and the number of parameters. Our results demonstrate that combining structured Hadamard transform layers with SRAM-based in-memory computing multiplication-avoiding operators is a promising path towards efficient deep learning architectures.

[414] Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM

Junxiao Xue, Quan Deng, Xuecheng Wu, Kelu Yao, Xinyi Yin, Fei Yu, Wei Zhou, Yanfei Zhong, Yang Liu, Dingkang Yang

Main category: cs.CV

TL;DR: The paper introduces ChangeIMTI, a large-scale multi-task dataset for remote sensing change understanding, and proposes ChangeVG, a vision-guided vision-language model with dual-granularity awareness that outperforms existing methods on change captioning tasks.

Details

Motivation: Existing datasets lack deep understanding and interactions in diverse change captioning, counting, and localization tasks for remote sensing change understanding (RSCU).

Method: Constructed ChangeIMTI dataset with four tasks, and designed ChangeVG model with vision-guided module featuring dual-branch architecture combining fine-grained spatial feature extraction with high-level semantic summarization, using these as auxiliary prompts to guide large VLMs during instruction tuning.

Result: Outperforms the strongest method Semantic-CC by 1.39 points on the comprehensive S*m metric for change captioning task, demonstrating superiority across four tasks.

Conclusion: The proposed ChangeVG model with dual-granularity awareness effectively facilitates hierarchical cross-modal learning for remote sensing change understanding tasks.

Abstract: Remote sensing change understanding (RSCU) is essential for analyzing remote sensing images and understanding how human activities affect the environment. However, existing datasets lack deep understanding and interactions in the diverse change captioning, counting, and localization tasks. To tackle these gaps, we construct ChangeIMTI, a new large-scale interactive multi-task instruction dataset that encompasses four complementary tasks including change captioning, binary change classification, change counting, and change localization. Building upon this new dataset, we further design a novel vision-guided vision-language model (ChangeVG) with dual-granularity awareness for bi-temporal remote sensing images (i.e., two remote sensing images of the same area at different times). The introduced vision-guided module is a dual-branch architecture that synergistically combines fine-grained spatial feature extraction with high-level semantic summarization. These enriched representations further serve as the auxiliary prompts to guide large vision-language models (VLMs) (e.g., Qwen2.5-VL-7B) during instruction tuning, thereby facilitating the hierarchical cross-modal learning. We extensively conduct experiments across four tasks to demonstrate the superiority of our approach. Remarkably, on the change captioning task, our method outperforms the strongest method Semantic-CC by 1.39 points on the comprehensive S*m metric, which integrates the semantic similarity and descriptive accuracy to provide an overall evaluation of change caption. Moreover, we also perform a series of ablation studies to examine the critical components of our method.

[415] Stochastic Interpolants via Conditional Dependent Coupling

Chenrui Ma, Xi Xiao, Tianyang Wang, Xiao Wang, Yanning Shen

Main category: cs.CV

TL;DR: A unified multistage generative framework using Conditional Dependent Coupling that decomposes generation into interpolant trajectories, enabling accurate distribution learning and end-to-end optimization within a single Diffusion Transformer.

Details

Motivation: Address limitations in existing image generation models: VAE-based models suffer from information loss and lack end-to-end training, pixel-space models have high computational cost, and cascade models prevent effective optimization and knowledge sharing.

Method: Proposed Conditional Dependent Coupling strategy that decomposes generative process into interpolant trajectories at multiple stages, modeled as a single unified Diffusion Transformer for end-to-end optimization and knowledge sharing.

Result: Achieves both high fidelity and efficiency across multiple resolutions, demonstrating improved performance compared to existing approaches.

Conclusion: The unified multistage framework successfully addresses the computation-fidelity trade-off in image generation by enabling accurate distribution learning and end-to-end optimization within a single model architecture.

Abstract: Existing image generation models face critical challenges regarding the trade-off between computation and fidelity. Specifically, models relying on a pretrained Variational Autoencoder (VAE) suffer from information loss, limited detail, and the inability to support end-to-end training. In contrast, models operating directly in the pixel space incur prohibitive computational cost. Although cascade models can mitigate computational cost, stage-wise separation prevents effective end-to-end optimization, hampers knowledge sharing, and often results in inaccurate distribution learning within each stage. To address these challenges, we introduce a unified multistage generative framework based on our proposed Conditional Dependent Coupling strategy. It decomposes the generative process into interpolant trajectories at multiple stages, ensuring accurate distribution learning while enabling end-to-end optimization. Importantly, the entire process is modeled as a single unified Diffusion Transformer, eliminating the need for disjoint modules and also enabling knowledge sharing. Extensive experiments demonstrate that our method achieves both high fidelity and efficiency across multiple resolutions.

[416] Benchmarking DINOv3 for Multi-Task Stroke Analysis on Non-Contrast CT

Donghao Zhang, Yimin Chen, Kauê TN Duarte, Taha Aslan, Mohamed AlShamrani, Brij Karmur, Yan Wan, Shengcai Chen, Bo Hu, Bijoy K Menon, Wu Qiu

Main category: cs.CV

TL;DR: DINOv3 self-supervised vision transformer used for comprehensive stroke analysis tasks on non-contrast CT, achieving strong benchmarks across multiple datasets.

Details

Motivation: NCCT is crucial for rapid stroke diagnosis but suffers from low image contrast and signal-to-noise ratio, limiting its effectiveness.

Method: Leveraged DINOv3, a state-of-the-art self-supervised vision transformer, to generate powerful feature representations for stroke analysis tasks including segmentation, classification, and ASPECTS scoring.

Result: Established strong benchmarks for multiple stroke analysis tasks on public and private datasets, demonstrating the potential of self-supervised models for automated stroke diagnosis.

Conclusion: Advanced self-supervised models like DINOv3 show promise for improving automated stroke diagnosis from NCCT, though the approach has both advantages and current constraints that require further analysis.

Abstract: Non-contrast computed tomography (NCCT) is essential for rapid stroke diagnosis but is limited by low image contrast and signal to noise ratio. We address this challenge by leveraging DINOv3, a state-of-the-art self-supervised vision transformer, to generate powerful feature representations for a comprehensive set of stroke analysis tasks. Our evaluation encompasses infarct and hemorrhage segmentation, anomaly classification (normal vs. stroke and normal vs. infarct vs. hemorrhage), hemorrhage subtype classification (EDH, SDH, SAH, IPH, IVH), and dichotomized ASPECTS classification (<=6 vs. >6) on multiple public and private datasets. This study establishes strong benchmarks for these tasks and demonstrates the potential of advanced self-supervised models to improve automated stroke diagnosis from NCCT, providing a clear analysis of both the advantages and current constraints of the approach. The code is available at https://github.com/Zzz0251/DINOv3-stroke.

[417] Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents

Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, Weijia Li

Main category: cs.CV

TL;DR: Earth-Agent is the first agentic framework that unifies RGB and spectral EO data with MCP-based tools for cross-modal, multi-step reasoning, addressing limitations of current MLLMs in complex Earth observation tasks.

Details

Motivation: Current MLLMs lack capability for complex tasks requiring multi-step reasoning and domain-specific tools in Earth observation. Agent-based methods are promising but limited to RGB perception, shallow reasoning, and lack systematic evaluation.

Method: Earth-Agent unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning. It dynamically invokes expert tools and models across modalities for complex scientific tasks.

Result: Comprehensive experiments with different LLM backbones, comparisons with general agent frameworks and MLLMs on remote sensing benchmarks demonstrate Earth-Agent’s effectiveness and potential. Earth-Bench benchmark includes 248 expert-curated tasks with 13,729 images across spectrum, products and RGB modalities.

Conclusion: Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation.

Abstract: Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation. Our code and dataset will be publicly released.

[418] WeatherCycle: Unpaired Multi-Weather Restoration via Color Space Decoupled Cycle Learning

Wenxuan Fang, Jiangwei Weng, Jianjun Qian, Jian Yang, Jun Li

Main category: cs.CV

TL;DR: WeatherCycle is an unsupervised framework for multi-weather image restoration that uses bidirectional degradation-content translation with degradation-aware curriculum regularization, achieving state-of-the-art performance without paired training data.

Details

Motivation: Existing weather restoration methods rely on task-specific physical priors, which limits their scalability and generalization to diverse real-world weather scenarios. There's a need for a unified approach that can handle multiple weather conditions without paired supervision.

Method: Uses lumina-chroma decomposition to decouple degradation from content, a Lumina Degradation Guidance Module (LDGM) for learning luminance degradation priors via frequency-domain amplitude modulation, and a Difficulty-Aware Contrastive Regularization (DACR) module that identifies hard samples and enforces contrastive alignment.

Result: Extensive experiments across multiple weather datasets demonstrate state-of-the-art performance among unsupervised approaches with strong generalization to complex weather degradations.

Conclusion: WeatherCycle provides an effective unified framework for multi-weather image restoration that outperforms existing unsupervised methods and generalizes well to diverse weather conditions.

Abstract: Unsupervised image restoration under multi-weather conditions remains a fundamental yet underexplored challenge. While existing methods often rely on task-specific physical priors, their narrow focus limits scalability and generalization to diverse real-world weather scenarios. In this work, we propose \textbf{WeatherCycle}, a unified unpaired framework that reformulates weather restoration as a bidirectional degradation-content translation cycle, guided by degradation-aware curriculum regularization. At its core, WeatherCycle employs a \textit{lumina-chroma decomposition} strategy to decouple degradation from content without modeling complex weather, enabling domain conversion between degraded and clean images. To model diverse and complex degradations, we propose a \textit{Lumina Degradation Guidance Module} (LDGM), which learns luminance degradation priors from a degraded image pool and injects them into clean images via frequency-domain amplitude modulation, enabling controllable and realistic degradation modeling. Additionally, we incorporate a \textit{Difficulty-Aware Contrastive Regularization (DACR)} module that identifies hard samples via a CLIP-based classifier and enforces contrastive alignment between hard samples and restored features to enhance semantic consistency and robustness. Extensive experiments across serve multi-weather datasets, demonstrate that our method achieves state-of-the-art performance among unsupervised approaches, with strong generalization to complex weather degradations.

[419] Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction

Bolin Chen, Ru-Ling Liao, Yan Ye, Jie Chen, Shanzhi Yin, Xinrui Ju, Shiqi Wang, Yibo Fan

Main category: cs.CV

TL;DR: Sparse2Dense is a keypoint-driven generative framework that uses sparse 3D keypoints for ultra-low bitrate human video compression and accurate vertex prediction, achieving competitive performance while enabling downstream geometry applications.

Details

Motivation: Addressing the challenge of simultaneously achieving ultra-low bitrate human video compression and accurate vertex prediction for bandwidth-constrained multimedia applications, which requires harmonizing dynamic motion modeling, detailed appearance synthesis, and geometric consistency.

Method: Proposes a multi-task learning-based and keypoint-aware deep generative model that encodes complex human motion via compact 3D keypoints, uses these sparse keypoints to estimate dense motion for video synthesis, and integrates a vertex predictor for joint optimization with video generation.

Result: Extensive experiments show Sparse2Dense achieves competitive compression performance over traditional/generative video codecs while enabling precise human vertex prediction for downstream geometry applications.

Conclusion: Sparse2Dense facilitates bandwidth-efficient human-centric media transmission for applications like real-time motion analysis, virtual human animation, and immersive entertainment.

Abstract: For bandwidth-constrained multimedia applications, simultaneously achieving ultra-low bitrate human video compression and accurate vertex prediction remains a critical challenge, as it demands the harmonization of dynamic motion modeling, detailed appearance synthesis, and geometric consistency. To address this challenge, we propose Sparse2Dense, a keypoint-driven generative framework that leverages extremely sparse 3D keypoints as compact transmitted symbols to enable ultra-low bitrate human video compression and precise human vertex prediction. The key innovation is the multi-task learning-based and keypoint-aware deep generative model, which could encode complex human motion via compact 3D keypoints and leverage these sparse keypoints to estimate dense motion for video synthesis with temporal coherence and realistic textures. Additionally, a vertex predictor is integrated to learn human vertex geometry through joint optimization with video generation, ensuring alignment between visual content and geometric structure. Extensive experiments demonstrate that the proposed Sparse2Dense framework achieves competitive compression performance for human video over traditional/generative video codecs, whilst enabling precise human vertex prediction for downstream geometry applications. As such, Sparse2Dense is expected to facilitate bandwidth-efficient human-centric media transmission, such as real-time motion analysis, virtual human animation, and immersive entertainment.

[420] TRAX: TRacking Axles for Accurate Axle Count Estimation

Avinash Rai, Sandeep Jana, Vishal Vijay

Main category: cs.CV

TL;DR: An end-to-end video-based pipeline for vehicle axle counting using YOLO-OBB for vehicle detection and YOLO for tire detection, with a novel TRAX algorithm to handle occlusions and improve accuracy for long vehicles.

Details

Motivation: Accurate vehicle axle counting is essential for traffic control, toll collection, and infrastructure development, but previous methods have limitations in dense environments with occlusions.

Method: Combines YOLO-OBB for vehicle detection/categorization and YOLO for tire detection, with intelligent tire-vehicle association and a novel TRAX (Tire and Axle Tracking) Algorithm to handle occlusions and partial detections in longer vehicles.

Result: Significantly reduces false positives and improves axle-counting accuracy for long vehicles, demonstrating strong robustness in real-world traffic videos.

Conclusion: This work represents a significant step toward scalable, AI-driven axle counting systems that can replace legacy roadside infrastructure.

Abstract: Accurate counting of vehicle axles is essential for traffic control, toll collection, and infrastructure development. We present an end-to-end, video-based pipeline for axle counting that tackles limitations of previous works in dense environments. Our system leverages a combination of YOLO-OBB to detect and categorize vehicles, and YOLO to detect tires. Detected tires are intelligently associated to their respective parent vehicles, enabling accurate axle prediction even in complex scenarios. However, there are a few challenges in detection when it comes to scenarios with longer and occluded vehicles. We mitigate vehicular occlusions and partial detections for longer vehicles by proposing a novel TRAX (Tire and Axle Tracking) Algorithm to successfully track axle-related features between frames. Our method stands out by significantly reducing false positives and improving the accuracy of axle-counting for long vehicles, demonstrating strong robustness in real-world traffic videos. This work represents a significant step toward scalable, AI-driven axle counting systems, paving the way for machine vision to replace legacy roadside infrastructure.

[421] Confidence-Calibrating Regularization for Robust Brain MRI Segmentation Under Domain Shift

Behraj Khan, Tahir Qasim Syed

Main category: cs.CV

TL;DR: CalSAM is a lightweight adaptation framework that improves SAM’s performance on medical volumes by addressing domain shift and overconfidence through Feature Fisher Information Penalty and Confidence Misalignment Penalty, achieving significant improvements in segmentation accuracy and calibration.

Details

Motivation: SAM performs well on natural images but suffers from domain shift and overconfidence when applied to medical volumes, limiting its clinical utility for brain MRI segmentation.

Method: CalSAM uses two penalties: Feature Fisher Information Penalty (FIP) to reduce encoder sensitivity to domain shift on 3D feature maps, and Confidence Misalignment Penalty (CMP) to penalize overconfident voxel-wise errors. Only the mask decoder is fine-tuned while keeping SAM’s encoders frozen.

Result: On BraTS scanner split (Siemens→GE): +7.4% DSC improvement (80.1% vs 74.6%), -26.9% HD95 reduction (4.6mm vs 6.3mm), -39.5% ECE reduction (5.2% vs 8.6%). On ATLAS-C: +5.3% DSC improvement (75.9%), -32.6% ECE reduction (5.8%).

Conclusion: CalSAM effectively improves domain generalization and uncertainty calibration for brain MRI segmentation while maintaining computational efficiency by freezing SAM’s encoder, with FIP and CMP providing complementary benefits.

Abstract: The Segment Anything Model (SAM) exhibits strong zero-shot performance on natural images but suffers from domain shift and overconfidence when applied to medical volumes. We propose \textbf{CalSAM}, a lightweight adaptation framework that (i) reduces encoder sensitivity to domain shift via a \emph{Feature Fisher Information Penalty} (FIP) computed on 3D feature maps and (ii) penalizes overconfident voxel-wise errors through a \emph{Confidence Misalignment Penalty} (CMP). The combined loss, (\mathcal{L}_{\mathrm{CalSAM}}) fine-tunes only the mask decoder while keeping SAM’s encoders frozen. On cross-center and scanner-shift evaluations, CalSAM substantially improves accuracy and calibration: e.g., on the BraTS scanner split (Siemens$\to$GE) CalSAM shows a $+7.4%$ relative improvement in $\mathrm{DSC}$ (80.1% vs.\ 74.6%), a $-26.9%$ reduction in $\mathrm{HD95}$ (4.6 mm vs.\ 6.3 mm), and a $-39.5%$ reduction in $\mathrm{ECE}$ (5.2% vs.\ 8.6%). On ATLAS-C (motion corruptions), CalSAM achieves a $+5.3%$ relative improvement in $\mathrm{DSC}$ (75.9%) and a $-32.6%$ reduction in $\mathrm{ECE}$ (5.8%). Ablations show FIP and CMP contribute complementary gains ($p<0.01$), and the Fisher penalty incurs a modest $\sim$15% training-time overhead. CalSAM therefore delivers improved domain generalization and better-calibrated uncertainty estimates for brain MRI segmentation, while retaining the computational benefits of freezing SAM’s encoder.

[422] Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss

Yifan Zhang, Wei Zhang, Chuangxin He, Zhonghua Miao, Junhui Hou

Main category: cs.CV

TL;DR: A new unsupervised online 3D instance segmentation framework that improves training diversity through synthetic point cloud generation, uses flexible temporal sampling for long-range dependencies, and employs dynamic-weighting loss for robust learning.

Details

Motivation: Existing unsupervised 3D instance segmentation methods like UNIT have limitations including limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels, which hinder consistent object identity tracking across LiDAR scans.

Method: Proposes synthetic point cloud sequence generation for diverse training data, flexible temporal sampling using both adjacent and non-adjacent frames, and dynamic-weighting loss to emphasize confident samples for robust representation learning.

Result: Outperforms UNIT and other unsupervised baselines on SemanticKITTI, nuScenes, and PandaSet datasets, achieving higher segmentation accuracy and more robust temporal associations.

Conclusion: The proposed framework successfully addresses key limitations of existing methods through synthetic data generation, flexible temporal modeling, and adaptive loss weighting, demonstrating superior performance in unsupervised online 3D instance segmentation.

Abstract: Unsupervised online 3D instance segmentation is a fundamental yet challenging task, as it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that leverages both adjacent and non-adjacent frames, allowing the model to learn from long-range dependencies as well as short-term variations. In addition, a dynamic-weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at github.com/Eaphan/SFT3D.

[423] Real-World Transferable Adversarial Attack on Face-Recognition Systems

Andrey Kaznacheev, Matvey Mikhalchuk, Andrey Kuznetsov, Aleksandr Petiushko, Anton Razzhigaev

Main category: cs.CV

TL;DR: GaP (Gaussian Patch) is a black-box adversarial attack method that creates a universal, physically transferable forehead patch using Gaussian blobs to degrade face recognition systems with high success rates.

Details

Motivation: Most adversarial attacks on face recognition systems are limited to digital domains or require white-box access, creating a need for practical black-box attacks that can work in real-world physical settings.

Method: Uses a query-efficient, zero-order greedy algorithm to iteratively construct a symmetric, grayscale forehead patch by adding Gaussian blobs, guided only by cosine similarity scores from a surrogate FR model.

Result: Achieves high attack success rate with only ~10,000 queries to black-box ArcFace model, works in both digital and physical tests, and shows strong transferability to unseen FaceNet model.

Conclusion: Demonstrates a practical and severe vulnerability in face recognition systems, proving that robust, transferable attacks can be crafted with limited knowledge of the target system.

Abstract: Adversarial attacks on face recognition (FR) systems pose a significant security threat, yet most are confined to the digital domain or require white-box access. We introduce GaP (Gaussian Patch), a novel method to generate a universal, physically transferable adversarial patch under a strict black-box setting. Our approach uses a query-efficient, zero-order greedy algorithm to iteratively construct a symmetric, grayscale pattern for the forehead. The patch is optimized by successively adding Gaussian blobs, guided only by the cosine similarity scores from a surrogate FR model to maximally degrade identity recognition. We demonstrate that with approximately 10,000 queries to a black-box ArcFace model, the resulting GaP achieves a high attack success rate in both digital and real-world physical tests. Critically, the attack shows strong transferability, successfully deceiving an entirely unseen FaceNet model. Our work highlights a practical and severe vulnerability, proving that robust, transferable attacks can be crafted with limited knowledge of the target system.

[424] UltraUNet: Real-Time Ultrasound Tongue Segmentation for Diverse Linguistic and Imaging Conditions

Alisher Myrgyyassov, Zhen Song, Yu Sun, Bruce Xiao Wang, Min Ney Wong, Yongping Zheng

Main category: cs.CV

TL;DR: UltraUNet is a lightweight neural network for real-time tongue contour segmentation in ultrasound images, achieving 250 FPS with high accuracy across multiple datasets.

Details

Motivation: Real-time tongue contour segmentation in ultrasound imaging is challenging due to low signal-to-noise ratios, imaging variability, and computational demands, limiting its use in speech research and clinical applications.

Method: Proposed UltraUNet - a lightweight encoder-decoder architecture with domain-specific innovations: Squeeze-and-Excitation blocks, Group Normalization for small-batch stability, summation-based skip connections, and ultrasound-specific augmentations like denoising and blur simulation.

Result: Achieves 250 frames per second with high accuracy: single-dataset Dice = 0.855 and MSD = 0.993px, cross-dataset Dice averaging 0.734 and 0.761 across 8 datasets.

Conclusion: UltraUNet provides a fast, accurate solution for real-time tongue contour segmentation, enabling applications in speech research, clinical diagnostics, and analysis of speech motor disorders.

Abstract: Ultrasound tongue imaging (UTI) is a non-invasive and cost-effective tool for studying speech articulation, motor control, and related disorders. However, real-time tongue contour segmentation remains challenging due to low signal-to-noise ratios, imaging variability, and computational demands. We propose UltraUNet, a lightweight encoder-decoder architecture optimized for real-time segmentation of tongue contours in ultrasound images. UltraUNet incorporates domain-specific innovations such as lightweight Squeeze-and-Excitation blocks, Group Normalization for small-batch stability, and summation-based skip connections to reduce memory and computational overhead. It achieves 250 frames per second and integrates ultrasound-specific augmentations like denoising and blur simulation. Evaluations on 8 datasets demonstrate high accuracy and robustness, with single-dataset Dice = 0.855 and MSD = 0.993px, and cross-dataset Dice averaging 0.734 and 0.761. UltraUNet provides a fast, accurate solution for speech research, clinical diagnostics, and analysis of speech motor disorders.

[425] Patch Rebirth: Toward Fast and Transferable Model Inversion of Vision Transformers

Seongsoo Heo, Dong-Wan Choi

Main category: cs.CV

TL;DR: PRI is a novel model inversion method that incrementally detaches important patches during inversion to create sparse synthetic images while allowing remaining patches to continue evolving, achieving 10x faster inversion than DMI and 2x faster than SMI while maintaining accuracy.

Details

Motivation: Current sparse model inversion methods discard patches prematurely, suppressing extraction of both class-agnostic and class-specific features essential for knowledge transfer, even though random patches can eventually acquire transferable knowledge.

Method: Patch Rebirth Inversion (PRI) progressively detaches the most important patches during inversion to construct sparse synthetic images, while allowing remaining patches to continue evolving for future selection, enabling initially less informative patches to accumulate class-relevant knowledge.

Result: PRI achieves up to 10x faster inversion than standard Dense Model Inversion and 2x faster than SMI, while consistently outperforming SMI in accuracy and matching DMI performance.

Conclusion: The progressive patch detachment strategy in PRI effectively balances class-agnostic and class-specific knowledge extraction, demonstrating that discarding patches prematurely is inefficient and that all patches can contribute to knowledge transfer through continued evolution.

Abstract: Model inversion is a widely adopted technique in data-free learning that reconstructs synthetic inputs from a pretrained model through iterative optimization, without access to original training data. Unfortunately, its application to state-of-the-art Vision Transformers (ViTs) poses a major computational challenge, due to their expensive self-attention mechanisms. To address this, Sparse Model Inversion (SMI) was proposed to improve efficiency by pruning and discarding seemingly unimportant patches, which were even claimed to be obstacles to knowledge transfer. However, our empirical findings suggest the opposite: even randomly selected patches can eventually acquire transferable knowledge through continued inversion. This reveals that discarding any prematurely inverted patches is inefficient, as it suppresses the extraction of class-agnostic features essential for knowledge transfer, along with class-specific features. In this paper, we propose Patch Rebirth Inversion (PRI), a novel approach that incrementally detaches the most important patches during the inversion process to construct sparse synthetic images, while allowing the remaining patches to continue evolving for future selection. This progressive strategy not only improves efficiency, but also encourages initially less informative patches to gradually accumulate more class-relevant knowledge, a phenomenon we refer to as the Re-Birth effect, thereby effectively balancing class-agnostic and class-specific knowledge. Experimental results show that PRI achieves up to 10x faster inversion than standard Dense Model Inversion (DMI) and 2x faster than SMI, while consistently outperforming SMI in accuracy and matching the performance of DMI.

[426] Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection

Mingfei Han, Haihong Hao, Jinxing Zhou, Zhihui Li, Yuhui Zheng, Xueqing Deng, Linjie Yang, Xiaojun Chang

Main category: cs.CV

TL;DR: A self-supervised framework that uses model self-consistency between long responses and short binary answers to automatically generate preference pairs for training, reducing hallucinations without human annotations or external supervision.

Details

Motivation: Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability, and existing methods require extensive human annotations or external supervision.

Method: Leverages model self-consistency by comparing detailed responses against concise binary answers, using inconsistency signals to automatically curate training data without external supervision.

Result: Significant improvements in factual grounding and reliability across multiple benchmarks (AMBER, MultiObject-Hal, Object HalBench, MMHal-Bench), while maintaining robust instruction-following ability on LLaVA-Bench and MMBench.

Conclusion: The self-consistency based approach offers a scalable and efficient solution that effectively reduces hallucinations using unlabeled data, eliminating the need for human annotations or external model supervision.

Abstract: Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. Existing methods typically address these issues via extensive human annotations or external supervision from more powerful models. In this work, we present a novel framework that leverages the model’s self-consistency between long responses and short answers to generate preference pairs for training. We observe that short binary questions tend to yield highly reliable responses, which can be used to query the target model to evaluate and rank its generated responses. Specifically, we design a self-reflection pipeline where detailed model responses are compared against concise binary answers, and inconsistency signals are utilized to automatically curate high-quality training data without human annotations or external model-based supervision. By relying solely on self-consistency rather than external supervision, our method offers a scalable and efficient solution that effectively reduces hallucinations using unlabeled data. Extensive experiments on multiple benchmarks, i.e., AMBER, MultiObject-Hal (ROPE), Object HalBench, and MMHal-Bench, demonstrate significant improvements in factual grounding and reliability. Moreover, our approach maintains robust instruction-following ability, as evidenced by enhanced performance on LLaVA-Bench and MMBench.

[427] TATTOO: Training-free AesTheTic-aware Outfit recOmmendation

Yuntian Wu, Xiaonan Hu, Ziqi Zhou, Hao Lu

Main category: cs.CV

TL;DR: TATTOO is a training-free aesthetic-aware outfit recommendation approach that uses Multimodal Large Language Models to generate item descriptions and aesthetic profiles, achieving state-of-the-art performance without task-specific training.

Details

Motivation: Current fashion outfit-completion tools require expensive task-specific training on large labeled data and lack explicit human aesthetic guidance. The rise of MLLMs enables a more efficient training-free paradigm.

Method: Uses MLLMs to generate target-item descriptions and aesthetic chain-of-thought to create structured aesthetic profiles. Fuses visual summaries with textual descriptions using dynamic entropy-gated mechanism for embedding space ranking.

Result: Achieves state-of-the-art performance on Aesthetic-100 dataset and demonstrates advanced zero-shot retrieval capability on Polyvore dataset.

Conclusion: TATTOO successfully streamlines outfit recommendation to a training-free paradigm with better performance and enhanced aesthetic awareness compared to conventional training-based methods.

Abstract: The global fashion e-commerce market relies significantly on intelligent and aesthetic-aware outfit-completion tools to promote sales. While previous studies have approached the problem of fashion outfit-completion and compatible-item retrieval, most of them require expensive, task-specific training on large-scale labeled data, and no effort is made to guide outfit recommendation with explicit human aesthetics. In the era of Multimodal Large Language Models (MLLMs), we show that the conventional training-based pipeline could be streamlined to a training-free paradigm, with better recommendation scores and enhanced aesthetic awareness. We achieve this with TATTOO, a Training-free AesTheTic-aware Outfit recommendation approach. It first generates a target-item description using MLLMs, followed by an aesthetic chain-of-thought used to distill the images into a structured aesthetic profile including color, style, occasion, season, material, and balance. By fusing the visual summary of the outfit with the textual description and aesthetics vectors using a dynamic entropy-gated mechanism, candidate items can be represented in a shared embedding space and be ranked accordingly. Experiments on a real-world evaluation set Aesthetic-100 show that TATTOO achieves state-of-the-art performance compared with existing training-based methods. Another standard Polyvore dataset is also used to measure the advanced zero-shot retrieval capability of our training-free method.

[428] Increasing the Diversity in RGB-to-Thermal Image Translation for Automotive Applications

Kaili Wang, Leonardo Ravaglia, Roberto Longo, Lore Goetschalckx, David Van Hamme, Julie Moeyersoms, Ben Stoffelen, Tom De Schepper

Main category: cs.CV

TL;DR: Proposes CoAdaIN for RGB-to-thermal image translation in ADAS, enabling one-to-many mappings with component-aware style adaptation for more realistic and diverse thermal images.

Details

Motivation: Thermal imaging enhances ADAS safety in low-light and harsh weather, but faces dataset limitations and poor simulator representation. Existing RGB-to-thermal methods only support one-to-one mappings.

Method: Multi-modal translation framework using Component-aware Adaptive Instance Normalization (CoAdaIN) that adapts styles to different image components individually, unlike global style application in original AdaIN.

Result: Produces more realistic and diverse thermal image translations compared to existing methods.

Conclusion: CoAdaIN enables effective one-to-many RGB-to-thermal mapping, addressing dataset limitations for thermal imaging in ADAS applications.

Abstract: Thermal imaging in Advanced Driver Assistance Systems (ADAS) improves road safety with superior perception in low-light and harsh weather conditions compared to traditional RGB cameras. However, research in this area faces challenges due to limited dataset availability and poor representation in driving simulators. RGB-to-thermal image translation offers a potential solution, but existing methods focus on one-to-one mappings. We propose a one-to-many mapping using a multi-modal translation framework enhanced with our Component-aware Adaptive Instance Normalization (CoAdaIN). Unlike the original AdaIN, which applies styles globally, CoAdaIN adapts styles to different image components individually. The result, as we show, is more realistic and diverse thermal image translations. This is the accepted author manuscript of the paper published in IEEE Sensors Conference 2024. The final published version is available at 10.1109/SENSORS60989.2024.10785056.

[429] LiDAR-based Human Activity Recognition through Laplacian Spectral Analysis

Sasan Sharifipour, Constantino Álvarez Casado, Le Nguyen, Tharindu Ekanayake, Manuel Lage Cañellas, Nhi Nguyen, Miguel Bordallo López

Main category: cs.CV

TL;DR: A privacy-preserving human activity recognition method using LiDAR point clouds, converting frames to proximity graphs and analyzing Laplacian spectra to create pose descriptors for classification.

Details

Motivation: Human Activity Recognition needs privacy-preserving alternatives to cameras that are robust to illumination changes. LiDAR point clouds offer these advantages while maintaining accuracy.

Method: Convert LiDAR frames to epsilon-graphs, compute Laplacian spectrum, use eigenvalues and eigenvector statistics as pose descriptors, apply temporal statistics over sliding windows, and classify with SVM and random forests.

Result: Achieved 94.4% accuracy on 13-class rehabilitation set and 90.3% on all 27 activities in MM-Fi dataset with 40 subjects under strict subject-independent protocol, outperforming skeleton-based baselines.

Conclusion: The method provides compact, interpretable features directly from point cloud geometry, offering an accurate and efficient alternative to end-to-end deep learning for HAR.

Abstract: Human Activity Recognition supports applications in healthcare, manufacturing, and human-machine interaction. LiDAR point clouds offer a privacy-preserving alternative to cameras and are robust to illumination. We propose a HAR method based on graph spectral analysis. Each LiDAR frame is mapped to a proximity graph (epsilon-graph) and the Laplacian spectrum is computed. Eigenvalues and statistics of eigenvectors form pose descriptors, and temporal statistics over sliding windows yield fixed vectors for classification with support vector machines and random forests. On the MM-Fi dataset with 40 subjects and 27 activities, under a strict subject-independent protocol, the method reaches 94.4% accuracy on a 13-class rehabilitation set and 90.3% on all 27 activities. It also surpasses the skeleton-based baselines reported for MM-Fi. The contribution is a compact and interpretable feature set derived directly from point cloud geometry that provides an accurate and efficient alternative to end-to-end deep learning.

[430] OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting

Atakan Topaloglu, Kunyi Li, Michael Niemeyer, Nassir Navab, A. Murat Tekalp, Federico Tombari

Main category: cs.CV

TL;DR: OracleGS is a framework that combines generative completeness with regressive fidelity for sparse-view novel view synthesis by using a “propose-and-validate” approach with 3D-aware diffusion models and multi-view stereo validation.

Details

Motivation: To resolve the trade-off between geometrically faithful but incomplete regressive models and structurally inconsistent generative models in sparse-view novel view synthesis.

Method: Uses a pre-trained 3D-aware diffusion model to propose complete scenes, then repurposes a multi-view stereo model as a 3D-aware oracle to validate uncertainties, guiding 3D Gaussian Splatting optimization via uncertainty-weighted loss.

Result: Outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic by filtering hallucinatory artifacts while preserving plausible completions.

Conclusion: OracleGS successfully reconciles generative completeness with regressive fidelity by conditioning generative priors on multi-view geometric evidence.

Abstract: Sparse-view novel view synthesis is fundamentally ill-posed due to severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our “propose-and-validate” framework first leverages a pre-trained 3D-aware diffusion model to synthesize novel views to propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of generated views, using its attention maps to reveal regions where the generated views are well-supported by multi-view evidence versus where they fall into regions of high uncertainty due to occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, outperforming state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.

[431] Learning Regional Monsoon Patterns with a Multimodal Attention U-Net

Swaib Ilias Mazumder, Manish Kumar, Aparajita Khan

Main category: cs.CV

TL;DR: A multimodal deep learning framework using satellite and Earth observation data for high-resolution (1 km) precipitation classification in India, outperforming existing methods especially for extreme rainfall.

Details

Motivation: Accurate monsoon rainfall prediction is crucial for India's agriculture, water management, and climate risk planning, but remains challenging due to sparse ground observations and complex regional variability.

Method: Uses an attention-guided U-Net architecture with focal and dice loss functions to handle rainfall class imbalance, integrating seven geospatial modalities at 1 km resolution: land surface temperature, NDVI, soil moisture, relative humidity, wind speed, elevation, and land use.

Result: The multimodal framework consistently outperforms unimodal baselines and existing deep learning methods, particularly in extreme rainfall categories.

Conclusion: This work provides a scalable framework, benchmark dataset, and state-of-the-art results for regional monsoon forecasting, climate resilience, and geospatial AI applications in India.

Abstract: Accurate monsoon rainfall prediction is vital for India’s agriculture, water management, and climate risk planning, yet remains challenging due to sparse ground observations and complex regional variability. We present a multimodal deep learning framework for high-resolution precipitation classification that leverages satellite and Earth observation data. Unlike previous rainfall prediction models based on coarse 5-50 km grids, we curate a new 1 km resolution dataset for five Indian states, integrating seven key geospatial modalities: land surface temperature, vegetation (NDVI), soil moisture, relative humidity, wind speed, elevation, and land use, covering the June-September 2024 monsoon season. Our approach uses an attention-guided U-Net architecture to capture spatial patterns and temporal dependencies across modalities, combined with focal and dice loss functions to handle rainfall class imbalance defined by the India Meteorological Department (IMD). Experiments demonstrate that our multimodal framework consistently outperforms unimodal baselines and existing deep learning methods, especially in extreme rainfall categories. This work contributes a scalable framework, benchmark dataset, and state-of-the-art results for regional monsoon forecasting, climate resilience, and geospatial AI applications in India.

[432] SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction

Yihao Ding, Soyeon Caren Han, Yanbei Jiang, Yan Li, Zechuan Li, Yifan Peng

Main category: cs.CV

TL;DR: SynDoc is a framework combining discriminative and generative models with synthetic data generation and adaptive instruction tuning to improve domain-specific document understanding while reducing hallucinations and data dependency.

Details

Motivation: Existing LLMs/MLLMs face limitations in domain-specific VRDU tasks including hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets, particularly in sensitive fields like medicine and finance.

Method: Uses synthetic data generation with structural extraction and domain-specific queries, adaptive instruction tuning for discriminative models, and recursive inferencing to iteratively refine outputs from both models.

Result: The framework demonstrates scalable, efficient, and precise document understanding, bridging the gap between domain-specific adaptation and general world knowledge for key information extraction.

Conclusion: SynDoc effectively addresses domain-specific VRDU challenges through its combined discriminative-generative approach with synthetic data and adaptive tuning, achieving stable and accurate predictions.

Abstract: Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges due to the complexity and sensitivity of documents in fields such as medicine, finance, and material science. Existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results but face limitations such as hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets. This paper introduces SynDoc, a novel framework that combines discriminative and generative models to address these challenges. SynDoc employs a robust synthetic data generation workflow, using structural information extraction and domain-specific query generation to produce high-quality annotations. Through adaptive instruction tuning, SynDoc improves the discriminative model’s ability to extract domain-specific knowledge. At the same time, a recursive inferencing mechanism iteratively refines the output of both models for stable and accurate predictions. This framework demonstrates scalable, efficient, and precise document understanding and bridges the gap between domain-specific adaptation and general world knowledge for document key information extraction tasks.

[433] Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing

Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal, Siddharth Roheda

Main category: cs.CV

TL;DR: Vid-Freeze is an adversarial attack that adds perturbations to images to block motion synthesis in image-to-video models by targeting attention mechanisms, preventing malicious content creation while preserving image semantics.

Details

Motivation: To address risks from image-to-video generation models that enable deceptive content creation, and provide effective protection that blocks motion synthesis while prior defenses are insufficient.

Method: Introduces attention-suppressing adversarial attack that adds carefully crafted perturbations to images, explicitly targeting the attention mechanism of I2V models to disrupt motion synthesis.

Result: The immunized images generate stand-still or near-static videos, effectively blocking malicious content creation while preserving semantic fidelity of input images.

Conclusion: Attention attacks are a promising direction for robust and proactive defenses against misuse of I2V generation models, with Vid-Freeze demonstrating impressive protection capabilities.

Abstract: The rapid progress of image-to-video (I2V) generation models has introduced significant risks, enabling video synthesis from static images and facilitating deceptive or malicious content creation. While prior defenses such as I2VGuard attempt to immunize images, effective and principled protection to block motion remains underexplored. In this work, we introduce Vid-Freeze - a novel attention-suppressing adversarial attack that adds carefully crafted adversarial perturbations to images. Our method explicitly targets the attention mechanism of I2V models, completely disrupting motion synthesis while preserving semantic fidelity of the input image. The resulting immunized images generate stand-still or near-static videos, effectively blocking malicious content creation. Our experiments demonstrate the impressive protection provided by the proposed approach, highlighting the importance of attention attacks as a promising direction for robust and proactive defenses against misuse of I2V generation models.

[434] Seeing Through the Blur: Unlocking Defocus Maps for Deepfake Detection

Minsun Jeon, Simon S. Woo

Main category: cs.CV

TL;DR: A deepfake detection framework that uses defocus blur as a forensic signal to distinguish AI-generated images from real ones, leveraging optical physics principles for robust detection.

Details

Motivation: The rapid advancement of generative AI has made synthetic images increasingly photorealistic, threatening the integrity of visual media and making it difficult to distinguish authentic from fabricated content.

Method: Proposes using defocus blur - a depth-dependent optical phenomenon naturally occurring in camera-captured images - as a discriminative feature. Constructs defocus blur maps to capture discrepancies between real images (with realistic depth-of-field) and synthetic images (lacking proper DoF characteristics).

Result: Experimental results confirm that defocus blur provides a reliable and interpretable cue for identifying synthetic images, supported by three in-depth feature analyses.

Conclusion: Defocus blur serves as an effective forensic signal for deepfake detection due to its universal origin from optical imaging principles and encoding of physical scene structure, making it robust and generalizable.

Abstract: The rapid advancement of generative AI has enabled the mass production of photorealistic synthetic images, blurring the boundary between authentic and fabricated visual content. This challenge is particularly evident in deepfake scenarios involving facial manipulation, but also extends to broader AI-generated content (AIGC) cases involving fully synthesized scenes. As such content becomes increasingly difficult to distinguish from reality, the integrity of visual media is under threat. To address this issue, we propose a physically interpretable deepfake detection framework and demonstrate that defocus blur can serve as an effective forensic signal. Defocus blur is a depth-dependent optical phenomenon that naturally occurs in camera-captured images due to lens focus and scene geometry. In contrast, synthetic images often lack realistic depth-of-field (DoF) characteristics. To capture these discrepancies, we construct a defocus blur map and use it as a discriminative feature for detecting manipulated content. Unlike RGB textures or frequency-domain signals, defocus blur arises universally from optical imaging principles and encodes physical scene structure. This makes it a robust and generalizable forensic cue. Our approach is supported by three in-depth feature analyses, and experimental results confirm that defocus blur provides a reliable and interpretable cue for identifying synthetic images. We aim for our defocus-based detection pipeline and interpretability tools to contribute meaningfully to ongoing research in media forensics. The implementation is publicly available at: https://github.com/irissun9602/Defocus-Deepfake-Detection

[435] Seeing the Unseen in Low-light Spike Streams

Liwen Hu, Yang Li, Mianzhi Liu, Yijia Guo, Shenghao Xie, Ziluo Ding, Tiejun Huang, Lei Ma

Main category: cs.CV

TL;DR: Diff-SPK is the first diffusion-based reconstruction method for spike camera streams, using an ETFI module to aggregate sparse information from low-light spike streams and ControlNet for scene generation, outperforming existing methods in low-light high-speed scenarios.

Details

Motivation: Spike cameras produce asynchronous spike streams that need reconstruction for human perception, but existing methods struggle with severe noise and sparse information in low-light high-speed scenarios.

Method: Proposes Diff-SPK with Enhanced Texture from Inter-spike Interval (ETFI) to aggregate sparse information, then uses ETFI as conditioning input for ControlNet to generate high-speed scenes with an ETFI-based feature fusion module.

Result: Establishes the first bona fide benchmark for low-light spike stream reconstruction and demonstrates superior performance on real low-light spike streams compared to existing methods.

Conclusion: Diff-SPK effectively leverages generative priors to handle low-light high-speed spike streams and provides a significant advancement in spike camera reconstruction.

Abstract: Spike camera, a type of neuromorphic sensor with high-temporal resolution, shows great promise for high-speed visual tasks. Unlike traditional cameras, spike camera continuously accumulates photons and fires asynchronous spike streams. Due to unique data modality, spike streams require reconstruction methods to become perceptible to the human eye. However, lots of methods struggle to handle spike streams in low-light high-speed scenarios due to severe noise and sparse information. In this work, we propose Diff-SPK, the first diffusion-based reconstruction method for spike camera. Diff-SPK effectively leverages generative priors to supplement texture information in low-light conditions. Specifically, it first employs an \textbf{E}nhanced \textbf{T}exture \textbf{f}rom Inter-spike \textbf{I}nterval (ETFI) to aggregate sparse information from low-light spike streams. Then, ETFI serves as a conditioning input for ControlNet to generate the high-speed scenes. To improve the quality of results, we introduce an ETFI-based feature fusion module during the generation process. Moreover, we establish the first bona fide benchmark for the low-light spike stream reconstruction task. It significantly surpasses existing reconstruction datasets in scale and provides quantitative illumination information. The performance on real low-light spike streams demonstrates the superiority of Diff-SPK.

[436] Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification

Hao Liu, Yongjie Zheng, Yuhan Kang, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone

Main category: cs.CV

TL;DR: Proposes a balanced diffusion-guided fusion (BDGF) framework that uses multimodal diffusion features to guide multi-branch networks for land-cover classification, addressing modality imbalance in DDPMs through adaptive masking and hierarchical feature guidance.

Details

Motivation: Address modality imbalance in pre-training multimodal DDPMs and effectively leverage diffusion features to guide complementary diversity feature extraction for remote sensing data analysis.

Method: Uses adaptive modality masking strategy to achieve modality-balanced data distribution, hierarchically guides feature extraction across CNN, Mamba, and transformer networks with fusion mechanisms, and employs mutual learning strategy for inter-branch collaboration.

Result: Achieves superior classification performance on four multimodal remote sensing datasets.

Conclusion: The BDGF framework effectively addresses modality imbalance in DDPMs and successfully leverages diffusion features for improved land-cover classification in multimodal remote sensing.

Abstract: Deep learning-based techniques for the analysis of multimodal remote sensing data have become popular due to their ability to effectively integrate complementary spatial, spectral, and structural information from different sensors. Recently, denoising diffusion probabilistic models (DDPMs) have attracted attention in the remote sensing community due to their powerful ability to capture robust and complex spatial-spectral distributions. However, pre-training multimodal DDPMs may result in modality imbalance, and effectively leveraging diffusion features to guide complementary diversity feature extraction remains an open question. To address these issues, this paper proposes a balanced diffusion-guided fusion (BDGF) framework that leverages multimodal diffusion features to guide a multi-branch network for land-cover classification. Specifically, we propose an adaptive modality masking strategy to encourage the DDPMs to obtain a modality-balanced rather than spectral image-dominated data distribution. Subsequently, these diffusion features hierarchically guide feature extraction among CNN, Mamba, and transformer networks by integrating feature fusion, group channel attention, and cross-attention mechanisms. Finally, a mutual learning strategy is developed to enhance inter-branch collaboration by aligning the probability entropy and feature similarity of individual subnetworks. Extensive experiments on four multimodal remote sensing datasets demonstrate that the proposed method achieves superior classification performance. The code is available at https://github.com/HaoLiu-XDU/BDGF.

[437] Seeing Symbols, Missing Cultures: Probing Vision-Language Models’ Reasoning on Fire Imagery and Cultural Meaning

Haorui Yu, Qiufeng Yi, Yijia Chu, Yang Zhao

Main category: cs.CV

TL;DR: VLMs show cultural incompetence despite appearing competent, relying on superficial pattern matching rather than genuine understanding. A diagnostic framework reveals systematic biases in fire-themed cultural imagery.

Details

Motivation: To expose the limitations of VLMs' cultural understanding and demonstrate they use symbolic shortcuts rather than genuine reasoning, which poses risks for multimodal systems.

Method: Introduced a diagnostic framework to probe VLM reasoning through classification and explanation analysis of fire-themed cultural imagery across Western festivals, non-Western traditions, and emergency scenes.

Result: Models correctly identify prominent Western festivals but struggle with underrepresented cultural events, offering vague labels or dangerously misclassifying emergencies as celebrations, exposing systematic biases.

Conclusion: Current VLMs rely on symbolic shortcuts rather than genuine cultural understanding, highlighting the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.

Abstract: Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.

Siheng Wang, Zhengdao Li, Yanshu Li, Canran Xiao, Haibo Zhan, Zhengtao Yao, Xuzhi Zhang, Jiale Kang, Linshan Li, Weiming Liu, Zhikang Dong, Jifeng Shen, Junhao Dong, Qiang Sun, Piotr Koniusz

Main category: cs.CV

TL;DR: C3-OWD is a curriculum cross-modal contrastive learning framework that simultaneously addresses robustness and generalization in object detection by combining RGBT pretraining for robustness with vision-language alignment for open-world generalization.

Details

Motivation: Real-world object detection faces two key challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior work addresses these issues separately, creating a trade-off where robustness and diversity are difficult to achieve simultaneously.

Method: A two-stage curriculum framework: Stage 1 uses RGBT data pretraining to enhance robustness, while Stage 2 employs vision-language alignment to improve generalization. An Exponential Moving Average (EMA) mechanism prevents catastrophic forgetting between stages by theoretically guaranteeing preservation of pre-stage performance.

Result: Achieves 80.1 AP50 on FLIR, 48.6 AP50_Novel on OV-COCO, and 35.7 mAPr on OV-LVIS, demonstrating competitive performance across both robustness and diversity evaluations.

Conclusion: C3-OWD successfully unifies robustness and generalization in object detection, establishing strong performance across multiple benchmarks and addressing the limitations of prior approaches that treated these challenges separately.

Abstract: Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages vision-language alignment strategy for category diversity but struggles under extreme environments. This trade-off leaves robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose \textbf{C3-OWD}, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage~~1 enhances robustness by pretraining with RGBT data, while Stage~~2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves $80.1$ AP$^{50}$ on FLIR, $48.6$ AP$^{50}_{\text{Novel}}$ on OV-COCO, and $35.7$ mAP$_r$ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.

[439] Spatial-Spectral Binarized Neural Network for Panchromatic and Multi-spectral Images Fusion

Yizhen Jiang, Mengting Ma, Anqi Zhu, Xiaowen Ma, Jiaxin Li, Wei Zhang

Main category: cs.CV

TL;DR: This paper proposes S2BNet, a binary neural network for remote sensing pansharpening that addresses spectral distortion and spatial feature degradation through customized spatial-spectral binarized convolution.

Details

Motivation: Deep learning models for pansharpening have high computational complexity, limiting their use on resource-limited devices. Binary neural networks offer efficiency but face issues with spectral distortion and spatial feature degradation.

Method: Designed S2B-Conv with Spectral-Redistribution Mechanism (SRM) using dynamic affine transformation and Gabor Spatial Feature Amplifier (GSFA) with random frequency/angle selection to handle multi-scale anisotropic features.

Result: Extensive experiments show the high-efficiency binarized method achieves promising performance in both quantitative and qualitative evaluations.

Conclusion: S2BNet successfully applies binary neural networks to pansharpening while overcoming spectral and spatial challenges, offering an efficient solution for resource-limited devices.

Abstract: Remote sensing pansharpening aims to reconstruct spatial-spectral properties during the fusion of panchromatic (PAN) images and low-resolution multi-spectral (LR-MS) images, finally generating the high-resolution multi-spectral (HR-MS) images. Although deep learning-based models have achieved excellent performance, they often come with high computational complexity, which hinder their applications on resource-limited devices. In this paper, we explore the feasibility of applying the binary neural network (BNN) to pan-sharpening. Nevertheless, there are two main issues with binarizing pan-sharpening models: (i) the binarization will cause serious spectral distortion due to the inconsistent spectral distribution of the PAN/LR-MS images; (ii) the common binary convolution kernel is difficult to adapt to the multi-scale and anisotropic spatial features of remote sensing objects, resulting in serious degradation of contours. To address the above issues, we design the customized spatial-spectral binarized convolution (S2B-Conv), which is composed of the Spectral-Redistribution Mechanism (SRM) and Gabor Spatial Feature Amplifier (GSFA). Specifically, SRM employs an affine transformation, generating its scaling and bias parameters through a dynamic learning process. GSFA, which randomly selects different frequencies and angles within a preset range, enables to better handle multi-scale and-directional spatial features. A series of S2B-Conv form a brand-new binary network for pan-sharpening, dubbed as S2BNet. Extensive quantitative and qualitative experiments have shown our high-efficiency binarized pan-sharpening method can attain a promising performance.

[440] Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning

Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye

Main category: cs.CV

TL;DR: A training-free visual-reasoning pipeline that decouples reasoning and perception, using an LLM to orchestrate reasoning while a LMM serves as visual QA engine, reducing visually-unfounded reasoning.

Details

Motivation: LMMs increasingly rely on textual logic and lose visual grounding as reasoning chains extend, leading to reasoning paths that diverge from image content and cause errors.

Method: Decouple reasoning and perception processes - a powerful LLM orchestrates high-level reasoning and strategically interrogates a LMM to extract specific visual information needed for logical chains.

Result: Significant reduction in visually-unfounded reasoning steps and substantial improvement in reasoning fidelity, validated through comprehensive evaluations.

Conclusion: The lightweight, plug-and-play approach effectively governs visual reasoning process without requiring additional training or architectural changes.

Abstract: Significant advancements in the reasoning capabilities of Large Language Models (LLMs) are now driven by test-time scaling laws, particularly those leveraging extended Chain-of-Thought (CoT) reasoning. Inspired by these breakthroughs, researchers have extended these paradigms to Large Multimodal Models (LMMs). However, a critical limitation emerges: as their reasoning chains extend, LMMs increasingly rely on textual logic, progressively losing grounding in the underlying visual information. This leads to reasoning paths that diverge from the image content, culminating in erroneous conclusions. To address this, we introduce a strikingly simple yet effective training-free visual-reasoning pipeline. The core concept is to decouple the reasoning and perception processes. A powerful LLM orchestrates the high-level reasoning, strategically interrogating a LMM to extract specific visual information required for its logical chain. The LMM, in turn, functions exclusively as a visual question-answering engine, supplying the necessary perceptual details on demand. This lightweight, plug-and-play approach requires no additional training or architectural changes. Comprehensive evaluations validate that our framework effectively governs the visual reasoning process, leading to a significant reduction in visually-unfounded reasoning steps and a substantial improvement in reasoning fidelity.

[441] DDP: Dual-Decoupled Prompting for Multi-Label Class-Incremental Learning

Kaile Du, Zihan Ye, Junzhou Xie, Fan Lyu, Yixi Shen, Yuyang Li, Miaoxuan Zhu, Fuyuan Hu, Ling Shao, Guangcan Liu

Main category: cs.CV

TL;DR: Dual-Decoupled Prompting (DDP) is a replay-free framework for multi-label class-incremental learning that addresses semantic confusion and partial labeling issues through class-specific prompts and progressive confidence decoupling.

Details

Motivation: Prompt-based methods work well for single-label incremental learning but perform poorly in multi-label scenarios due to semantic confusion from co-occurring categories and true-negative-false-positive confusion from partial labeling.

Method: DDP uses class-specific positive-negative prompts to disentangle semantics and introduces Progressive Confidence Decoupling (PCD) to suppress false positives. It freezes past prompts as knowledge anchors and employs interlayer prompting for efficiency.

Result: DDP consistently outperforms prior methods on MS-COCO and PASCAL VOC, becoming the first replay-free MLCIL approach to exceed 80% mAP and 70% F1 under the standard MS-COCO B40-C10 benchmark.

Conclusion: The proposed DDP framework effectively addresses the core challenges in multi-label class-incremental learning and demonstrates superior performance compared to existing methods.

Abstract: Prompt-based methods have shown strong effectiveness in single-label class-incremental learning, but their direct extension to multi-label class-incremental learning (MLCIL) performs poorly due to two intrinsic challenges: semantic confusion from co-occurring categories and true-negative-false-positive confusion caused by partial labeling. We propose Dual-Decoupled Prompting (DDP), a replay-free and parameter-efficient framework that explicitly addresses both issues. DDP assigns class-specific positive-negative prompts to disentangle semantics and introduces Progressive Confidence Decoupling (PCD), a curriculum-inspired decoupling strategy that suppresses false positives. Past prompts are frozen as knowledge anchors, and interlayer prompting enhances efficiency. On MS-COCO and PASCAL VOC, DDP consistently outperforms prior methods and is the first replay-free MLCIL approach to exceed 80% mAP and 70% F1 under the standard MS-COCO B40-C10 benchmark.

Bin Wu, Yahui Liu, Chi Zhang, Yao Zhao, Wei Wang

Main category: cs.CV

TL;DR: LRPO is the first online reinforcement learning framework for blind face restoration, using likelihood regularization to improve restoration quality while balancing perceptual quality and fidelity.

Details

Motivation: Blind Face Restoration faces challenges with large solution space causing artifacts like missing details and identity ambiguity in restored images.

Method: Proposes Likelihood-Regularized Policy Optimization (LRPO) with three key strategies: composite reward function, ground-truth guided likelihood regularization, and noise-level advantage assignment.

Result: LRPO significantly improves face restoration quality over baseline methods and achieves state-of-the-art performance.

Conclusion: The proposed LRPO framework effectively addresses BFR challenges and demonstrates superior restoration performance through reinforcement learning with proper regularization.

Abstract: Blind Face Restoration (BFR) encounters inherent challenges in exploring its large solution space, leading to common artifacts like missing details and identity ambiguity in the restored images. To tackle these challenges, we propose a Likelihood-Regularized Policy Optimization (LRPO) framework, the first to apply online reinforcement learning (RL) to the BFR task. LRPO leverages rewards from sampled candidates to refine the policy network, increasing the likelihood of high-quality outputs while improving restoration performance on low-quality inputs. However, directly applying RL to BFR creates incompatibility issues, producing restoration results that deviate significantly from the ground truth. To balance perceptual quality and fidelity, we propose three key strategies: 1) a composite reward function tailored for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment. Extensive experiments demonstrate that our proposed LRPO significantly improves the face restoration quality over baseline methods and achieves state-of-the-art performance.

[443] DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice

Zijie Meng, Jin Hao, Xiwei Dai, Yang Feng, Jiaxiang Liu, Bin Feng, Huikai Wu, Xiaotang Gai, Hengchuan Zhu, Tianxiang Hu, Yangyang Wu, Hongxia Xu, Jin Li, Jun Xiao, Xiaoqiang Liu, Joey Tianyi Zhou, Fudong Zhu, Zhihe Zhao, Lunguo Xia, Bing Fang, Jimeng Sun, Jian Wu, Zuozhu Liu

Main category: cs.CV

TL;DR: DentVLM is a multimodal vision-language model for oral disease diagnosis that outperforms existing AI models and even human dentists in clinical evaluations, while reducing diagnostic time by 15-22% when used collaboratively.

Details

Motivation: Current AI models fall short in addressing the complex, multimodal requirements of comprehensive clinical dental practice, which necessitates advanced visual interpretation across diverse imaging modalities and integrated information synthesis.

Method: Developed using a comprehensive bilingual dataset of 110,447 images and 2.46 million visual question-answering pairs, capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks.

Result: Outperformed leading proprietary and open-source models by 19.6% higher accuracy for oral diseases and 27.9% for malocclusions. In clinical study with 25 dentists and 1,946 patients, surpassed junior dentists on 21/36 tasks and senior dentists on 12/36 tasks. Reduced diagnostic time by 15-22% when integrated into collaborative workflow.

Conclusion: DentVLM establishes itself as a robust clinical decision support tool that can enhance primary dental care, mitigate provider-patient imbalances, and democratize access to specialized medical expertise in dentistry.

Abstract: Diagnosing and managing oral diseases necessitate advanced visual interpretation across diverse imaging modalities and integrated information synthesis. While current AI models excel at isolated tasks, they often fall short in addressing the complex, multimodal requirements of comprehensive clinical dental practice. Here we introduce DentVLM, a multimodal vision-language model engineered for expert-level oral disease diagnosis. DentVLM was developed using a comprehensive, large-scale, bilingual dataset of 110,447 images and 2.46 million visual question-answering (VQA) pairs. The model is capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks, significantly outperforming leading proprietary and open-source models by 19.6% higher accuracy for oral diseases and 27.9% for malocclusions. In a clinical study involving 25 dentists, evaluating 1,946 patients and encompassing 3,105 QA pairs, DentVLM surpassed the diagnostic performance of 13 junior dentists on 21 of 36 tasks and exceeded that of 12 senior dentists on 12 of 36 tasks. When integrated into a collaborative workflow, DentVLM elevated junior dentists’ performance to senior levels and reduced diagnostic time for all practitioners by 15-22%. Furthermore, DentVLM exhibited promising performance across three practical utility scenarios, including home-based dental health management, hospital-based intelligent diagnosis and multi-agent collaborative interaction. These findings establish DentVLM as a robust clinical decision support tool, poised to enhance primary dental care, mitigate provider-patient imbalances, and democratize access to specialized medical expertise within the field of dentistry.

[444] Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

Xiaolong Fu, Lichen Ma, Zipeng Guo, Gaojing Zhou, Chongxiao Wang, ShiPing Dong, Shizhe Zhou, Shizhe Zhou, Ximan Liu, Jingling Fu, Tan Lit Sin, Yu Shi, Zhen Chen, Junshi Huang, Jason Li

Main category: cs.CV

TL;DR: Dynamic-TreeRPO with LayerTuning-RL improves text-to-image generation by using tree-structured sampling with dynamic noise intensities and integrating SFT as a weighted progress reward model, achieving better quality and efficiency.

Details

Motivation: Current RL-enhanced flow matching models for text-to-image generation suffer from exhaustive exploration and inefficient sampling due to slight variations in sampling groups.

Method: Proposes Dynamic-TreeRPO with sliding-window tree-structured search using dynamic noise intensities, GRPO-guided optimization, constrained SDE sampling, and LayerTuning-RL that reformulates SFT loss as weighted progress reward model with dynamic clipping bounds.

Result: Significantly outperforms state-of-the-art by 4.9% on HPS-v2.1, 5.91% on PickScore, and 8.66% on ImageReward benchmarks, while improving training efficiency by nearly 50%.

Conclusion: The tree-structured sampling and LayerTuning-RL paradigm enable dynamic exploration of diverse search spaces along effective directions, achieving superior semantic consistency, visual fidelity, and human preference alignment.

Abstract: The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate Supervised Fine-Tuning (SFT) and RL paradigm within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, the disruption of exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by $4.9%$, $5.91%$, and $8.66%$ on those benchmarks, respectively, while improving the training efficiency by nearly $50%$.

[445] Test-time Uncertainty Estimation for Medical Image Registration via Transformation Equivariance

Lin Tian, Xiaoling Hu, Juan Eugenio Iglesias

Main category: cs.CV

TL;DR: A test-time uncertainty estimation framework for image registration that works with any pretrained network by analyzing prediction variance under spatial perturbations, decomposing uncertainty into intrinsic spread and bias jitter.

Details

Motivation: Current deep registration networks lack reliability indicators, and existing uncertainty methods require architectural changes or retraining, limiting their use with pretrained models.

Method: Leverages transformation equivariance property to analyze prediction variance under spatial perturbations, theoretically decomposing uncertainty into intrinsic spread (epistemic noise) and bias jitter (systematic error drift).

Result: Across four anatomical structures and multiple registration models, uncertainty maps consistently correlate with registration errors and identify regions requiring caution.

Conclusion: The framework enables any pretrained registration network to become risk-aware at test time, advancing safe deployment in clinical and research settings.

Abstract: Accurate image registration is essential for downstream applications, yet current deep registration networks provide limited indications of whether and when their predictions are reliable. Existing uncertainty estimation strategies, such as Bayesian methods, ensembles, or MC dropout, require architectural changes or retraining, limiting their applicability to pretrained registration networks. Instead, we propose a test-time uncertainty estimation framework that is compatible with any pretrained networks. Our framework is grounded in the transformation equivariance property of registration, which states that the true mapping between two images should remain consistent under spatial perturbations of the input. By analyzing the variance of network predictions under such perturbations, we derive a theoretical decomposition of perturbation-based uncertainty in registration. This decomposition separates into two terms: (i) an intrinsic spread, reflecting epistemic noise, and (ii) a bias jitter, capturing how systematic error drifts under perturbations. Across four anatomical structures (brain, cardiac, abdominal, and lung) and multiple registration models (uniGradICON, SynthMorph), the uncertainty maps correlate consistently with registration errors and highlight regions requiring caution. Our framework turns any pretrained registration network into a risk-aware tool at test time, placing medical image registration one step closer to safe deployment in clinical and large-scale research settings.

[446] GRAPE: Let GPRO Supervise Query Rewriting by Ranking for Retrieval

Zhaohua Zhang, Jianhuan Zhuo, Muxi Chen, Chenchen Zhao, Wenyu Jiang, Tianwen Jiang, Mingyang Chen, Yu Tang, Qiuyong Xiao, Jihong Zhang, Zhixun Su

Main category: cs.CV

TL;DR: GRAPE is a plug-and-play enhancement that uses ranking-aware policy optimization to improve CLIP-based retrieval performance under distribution shifts like multilingual, long-form, and multimodal differences, achieving 4.9% average improvement in Recall@10.

Details

Motivation: CLIP struggles with distribution shifts from its training data (multilingual, long-form, multimodal queries). Existing query-rewriting methods using LLMs lack supervision signals and fail to generate optimal queries aligned with the retriever's training distribution.

Method: GRAPE incorporates ranking signals into retrieval-guided query rewriting with LLMs using Grouped Ranking-Aware Policy Optimization (GRPO). It addresses score inflation through corpus-relative ranking-based rewards that align optimization with ranking metrics.

Result: GRAPE consistently improves retrieval performance across various distribution shifts: multilingual (Flickr30k-CN, CVLUE, XM3600), length differences (Wikipedia), and multimodal differences (CIRR), achieving 4.9% average improvement in Recall@10.

Conclusion: GRAPE effectively bridges distributional gaps in CLIP-based retrieval systems by incorporating ranking signals into query rewriting, demonstrating significant performance improvements across diverse distribution shifts without requiring costly retraining.

Abstract: The CLIP model has become a cornerstone of large-scale retrieval systems by aligning text and image data in a unified embedding space. Despite its simplicity and efficiency, CLIP struggles when applied to tasks whose input distributions diverge from its training corpus, such as queries with multilingual, long-form, or multimodal differences. To avoid costly retraining, existing methods mainly adopt query-rewriting strategies with large language models (LLMs), aiming to mitigate distribution gaps at the query level. However, due to the lack of supervision signals, LLMs fail to generate the optimal one that fits the training distribution. We address this challenge with GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play enhancement approach that incorporates ranking signals into retrieval-guided query rewriting with LLMs. Intuitively, GRAPE proposes to leverage GRPO to bridge distributional differences – including length, multilingual, and modality shifts – by transforming queries into forms better aligned with the retriever’s training distribution. However, our preliminary experiment finds that naively finetuning LLM with similarity scores can lead to score inflation, where nearly all candidates are assigned unexpectedly high scores regardless of their true relevance. To address score inflation, we propose a corpus-relative ranking-based reward, which explicitly aligns optimization with ranking metrics while suppressing spurious score inflation. Extensive experiments demonstrate that GRAPE consistently improves retrieval performance under distributional shifts – including multilingual differences (Flickr30k-CN, CVLUE, XM3600), length differences (Wikipedia), and multimodal differences (CIRR) – achieving an average improvement of 4.9% in Recall@10. The code is available at https://github.com/Chinese0123456/GRAPE.git

[447] CasPoinTr: Point Cloud Completion with Cascaded Networks and Knowledge Distillation

Yifan Yang, Yuxiang Yan, Boda Liu, Jian Pu

Main category: cs.CV

TL;DR: CasPoinTr is a point cloud completion framework using cascaded networks and knowledge distillation to reconstruct missing regions from incomplete point clouds.

Details

Motivation: Point clouds from real-world environments are often incomplete due to limited sensor resolution, single viewpoints, occlusions, and noise, making completion essential for various applications.

Method: Uses cascaded networks with two stages: Shape Reconstruction (generates auxiliary information) and Fused Completion (leverages information with knowledge distillation). A teacher model trained on denser point clouds transfers knowledge to the student model.

Result: Outperforms existing methods on ShapeNet-55 under different difficulty settings, demonstrating superior shape recovery and detail preservation.

Conclusion: The cascaded structure and distillation strategy effectively bridge the gap between incomplete inputs and complete targets by capturing global shape context while refining local details.

Abstract: Point clouds collected from real-world environments are often incomplete due to factors such as limited sensor resolution, single viewpoints, occlusions, and noise. These challenges make point cloud completion essential for various applications. A key difficulty in this task is predicting the overall shape and reconstructing missing regions from highly incomplete point clouds. To address this, we introduce CasPoinTr, a novel point cloud completion framework using cascaded networks and knowledge distillation. CasPoinTr decomposes the completion task into two synergistic stages: Shape Reconstruction, which generates auxiliary information, and Fused Completion, which leverages this information alongside knowledge distillation to generate the final output. Through knowledge distillation, a teacher model trained on denser point clouds transfers incomplete-complete associative knowledge to the student model, enhancing its ability to estimate the overall shape and predict missing regions. Together, the cascaded networks and knowledge distillation enhance the model’s ability to capture global shape context while refining local details, effectively bridging the gap between incomplete inputs and complete targets. Experiments on ShapeNet-55 under different difficulty settings demonstrate that CasPoinTr outperforms existing methods in shape recovery and detail preservation, highlighting the effectiveness of our cascaded structure and distillation strategy.

[448] UniPose: Unified Cross-modality Pose Prior Propagation towards RGB-D data for Weakly Supervised 3D Human Pose Estimation

Jinghong Zheng, Changlong Jiang, Jiaqi Li, Haohong Kuang, Hang Xu, Tingbing Yan

Main category: cs.CV

TL;DR: UniPose is a unified cross-modality method for weakly supervised 3D human pose estimation that transfers 2D pose annotations to 3D using RGB-D sequences without requiring 3D keypoint annotations.

Details

Motivation: To eliminate the need for labor-intensive 3D keypoint annotations and bridge the gap between 2D and 3D domains without multi-view camera calibration or synthetic-to-real data shift issues.

Method: Uses self-supervised learning on RGB-D sequences, transferring 2D HPE annotations to 3D via spatial-temporal constraints, 2D-to-3D back-projection loss, and cross-modality interaction. Employs anchor-to-joint prediction for 3D lifting on RGB and depth networks.

Result: Achieves comparable performance to fully supervised methods on CMU Panoptic and ITOP datasets. Incorporation of large-scale unlabeled data (NTU RGB+D 60) enhances performance under challenging conditions.

Conclusion: UniPose demonstrates potential for practical applications and achieves state-of-the-art results with its proposed 3D lifting method.

Abstract: In this paper, we present UniPose, a unified cross-modality pose prior propagation method for weakly supervised 3D human pose estimation (HPE) using unannotated single-view RGB-D sequences (RGB, depth, and point cloud data). UniPose transfers 2D HPE annotations from large-scale RGB datasets (e.g., MS COCO) to the 3D domain via self-supervised learning on easily acquired RGB-D sequences, eliminating the need for labor-intensive 3D keypoint annotations. This approach bridges the gap between 2D and 3D domains without suffering from issues related to multi-view camera calibration or synthetic-to-real data shifts. During training, UniPose leverages off-the-shelf 2D pose estimations as weak supervision for point cloud networks, incorporating spatial-temporal constraints like body symmetry and joint motion. The 2D-to-3D back-projection loss and cross-modality interaction further enhance this process. By treating the point cloud network’s 3D HPE results as pseudo ground truth, our anchor-to-joint prediction method performs 3D lifting on RGB and depth networks, making it more robust against inaccuracies in 2D HPE results compared to state-of-the-art methods. Experiments on CMU Panoptic and ITOP datasets show that UniPose achieves comparable performance to fully supervised methods. Incorporating large-scale unlabeled data (e.g., NTU RGB+D 60) enhances its performance under challenging conditions, demonstrating its potential for practical applications. Our proposed 3D lifting method also achieves state-of-the-art results.

[449] Generative Modeling of Shape-Dependent Self-Contact Human Poses

Takehiko Ohkawa, Jihyun Lee, Shunsuke Saito, Jason Saragih, Fabian Prado, Yichen Xu, Shoou-I Yu, Ryosuke Furuta, Yoichi Sato, Takaaki Shiratori

Main category: cs.CV

TL;DR: This paper introduces Goliath-SC, the first extensive self-contact dataset with precise body shape registration, and proposes a generative model for self-contact poses conditioned on body shape parameters, improving single-view pose estimation.

Details

Motivation: Existing self-contact datasets lack variety of poses and precise body shapes, limiting analysis between self-contact poses and shapes. Body shape significantly affects self-contact (e.g., hand-belly contact differs for low vs high BMI individuals).

Method: Created Goliath-SC dataset with 383K self-contact poses across 130 subjects. Proposed body-part-wise latent diffusion with self-attention for generative modeling of self-contact prior conditioned by body shape parameters. Incorporated this prior into single-view pose estimation with contact refinement.

Result: Shape conditioning proved vital for successful modeling of self-contact pose distribution. The approach improved single-view pose estimation in self-contact scenarios.

Conclusion: Body shape conditioning is essential for accurate self-contact modeling, and the proposed method effectively enhances pose estimation by incorporating shape-aware self-contact priors.

Abstract: One can hardly model self-contact of human poses without considering underlying body shapes. For example, the pose of rubbing a belly for a person with a low BMI leads to penetration of the hand into the belly for a person with a high BMI. Despite its relevance, existing self-contact datasets lack the variety of self-contact poses and precise body shapes, limiting conclusive analysis between self-contact poses and shapes. To address this, we begin by introducing the first extensive self-contact dataset with precise body shape registration, Goliath-SC, consisting of 383K self-contact poses across 130 subjects. Using this dataset, we propose generative modeling of self-contact prior conditioned by body shape parameters, based on a body-part-wise latent diffusion with self-attention. We further incorporate this prior into single-view human pose estimation while refining estimated poses to be in contact. Our experiments suggest that shape conditioning is vital to the successful modeling of self-contact pose distribution, hence improving single-view pose estimation in self-contact.

[450] WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving

Ziyue Zhu, Zhanqian Wu, Zhenxin Zhu, Lijun Zhou, Haiyang Sun, Bing Wan, Kun Ma, Guang Chen, Hangjun Ye, Jin Xie, jian Yang

Main category: cs.CV

TL;DR: WorldSplat is a feed-forward framework for 4D driving-scene generation that bridges the gap between scene generation and reconstruction by producing consistent multi-track novel view driving videos.

Details

Motivation: Existing methods either focus on synthesizing diverse driving videos but lack 3D consistency and novel-view synthesis capability, or excel at reconstruction but lack generative capabilities. This creates a dilemma between generation and reconstruction.

Method: Two-step approach: (1) 4D-aware latent diffusion model that integrates multi-modal information to produce pixel-aligned 4D Gaussians, (2) Refinement of novel view videos using an enhanced video diffusion model.

Result: Extensive experiments show WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos on benchmark datasets.

Conclusion: WorldSplat successfully overcomes the dilemma between scene generation and reconstruction, providing a unified framework for 4D driving-scene generation with superior novel-view synthesis capabilities.

Abstract: Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose \textbf{WorldSplat}, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: ((i)) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. ((ii)) Subsequently, we refine the novel view videos rendered from these Gaussians using a enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that \textbf{WorldSplat} effectively generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos.

[451] Enhanced Fracture Diagnosis Based on Critical Regional and Scale Aware in YOLO

Yuyang Sun, Junchuan Yu, Cuiming Zou

Main category: cs.CV

TL;DR: Fracture-YOLO is an improved YOLO-based model for fracture detection that integrates Critical-Region-Selector Attention and Scale-Aware heads to enhance detection performance, achieving state-of-the-art results.

Details

Motivation: Traditional fracture diagnosis relies on visual assessment by physicians, which is constrained by expertise and limits speed and accuracy. Deep learning models based on YOLO framework show potential for improving diagnostic efficiency.

Method: Proposes Fracture-YOLO with two key modules: CRSelector that uses global texture information to focus on critical fracture region features, and ScA module that dynamically adjusts weights of features at different scales to enhance multi-scale fracture detection.

Result: Experimental results show significant improvement over baseline model, with mAP50 and mAP50-95 increasing by 4 and 3 respectively, achieving state-of-the-art performance.

Conclusion: The proposed Fracture-YOLO model effectively enhances fracture detection performance through attention mechanisms and scale-aware feature processing, demonstrating superior results compared to baseline models.

Abstract: Fracture detection plays a critical role in medical imaging analysis, traditional fracture diagnosis relies on visual assessment by experienced physicians, however the speed and accuracy of this approach are constrained by the expertise. With the rapid advancements in artificial intelligence, deep learning models based on the YOLO framework have been widely employed for fracture detection, demonstrating significant potential in improving diagnostic efficiency and accuracy. This study proposes an improved YOLO-based model, termed Fracture-YOLO, which integrates novel Critical-Region-Selector Attention (CRSelector) and Scale-Aware (ScA) heads to further enhance detection performance. Specifically, the CRSelector module utilizes global texture information to focus on critical features of fracture regions. Meanwhile, the ScA module dynamically adjusts the weights of features at different scales, enhancing the model’s capacity to identify fracture targets at multiple scales. Experimental results demonstrate that, compared to the baseline model, Fracture-YOLO achieves a significant improvement in detection precision, with mAP50 and mAP50-95 increasing by 4 and 3, surpassing the baseline model and achieving state-of-the-art (SOTA) performance.

[452] FracDetNet: Advanced Fracture Detection via Dual-Focus Attention and Multi-scale Calibration in Medical X-ray Imaging

Yuyang Sun, Cuiming Zou

Main category: cs.CV

TL;DR: FracDetNet is an advanced fracture detection framework that integrates Dual-Focus Attention and Multi-scale Calibration to improve detection of subtle and diverse fractures in medical imaging.

Details

Motivation: Accurate fracture detection is essential for clinical diagnostic efficiency, but existing methods struggle with detecting subtle and morphologically diverse fractures due to variable imaging angles and suboptimal image quality.

Method: FracDetNet integrates Dual-Focus Attention (DFA) to capture detailed local features and comprehensive global context through combined global and local attention mechanisms, and Multi-scale Calibration (MC) to adaptively refine feature representations.

Result: On the GRAZPEDWRI-DX dataset, FracDetNet achieved state-of-the-art performance with mAP${50-95}$ of 40.0% (7.5% improvement over baseline), mAP${50}$ of 63.9% (4.2% improvement), and fracture-specific detection accuracy enhanced by 2.9%.

Conclusion: FracDetNet effectively addresses challenges in fracture detection through its attention mechanisms and feature calibration, demonstrating significant performance improvements over baseline methods.

Abstract: In this paper, an advanced fracture detection framework, FracDetNet, is proposed to address challenges in medical imaging, as accurate fracture detection is essential for enhancing diagnostic efficiency in clinical practice. Despite recent advancements, existing methods still struggle with detecting subtle and morphologically diverse fractures due to variable imaging angles and suboptimal image quality. To overcome these limitations, FracDetNet integrates Dual-Focus Attention (DFA) and Multi-scale Calibration (MC). Specifically, the DFA module effectively captures detailed local features and comprehensive global context through combined global and local attention mechanisms. Additionally, the MC adaptively refines feature representations to enhance detection performance. Experimental evaluations on the publicly available GRAZPEDWRI-DX dataset demonstrate state-of-the-art performance, with FracDetNet achieving a mAP${50-95}$ of 40.0%, reflecting a \textbf{7.5%} improvement over the baseline model. Furthermore, the mAP${50}$ reaches 63.9%, representing an increase of \textbf{4.2%}, with fracture-specific detection accuracy also enhanced by \textbf{2.9%}.

[453] SPIKE-RL: Video-LLMs meet Bayesian Surprise

Sahithya Ravi, Aditya Chinchure, Raymond T. Ng, Leonid Sigal, Vered Shwartz

Main category: cs.CV

TL;DR: SPIKE is an inference-time framework that identifies surprising moments in videos using Bayesian Surprise, enabling better frame sampling for Video-LLMs to focus on critical narrative moments.

Details

Motivation: Most Video-LLMs sample frames uniformly, missing critical surprising events that define a video's narrative, leading to suboptimal video understanding.

Method: SPIKE quantifies Bayesian Surprise as belief updates from new visual evidence. SPIKE-RL uses GRPO to optimize belief hypotheses based on video caption rewards, enabling surprise-weighted frame sampling.

Result: SPIKE effectively localizes surprise in videos, correlating with human judgment on FunQA and Oops! benchmarks. Surprise-weighted sampling achieves consistent performance gains on five downstream benchmarks over uniform sampling.

Conclusion: By enabling Video-LLMs to track beliefs and register surprise, this work paves the way for more robust models that can revise their understanding in response to new information.

Abstract: Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs process videos by sampling frames uniformly, likely missing critical moments that define a video’s narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where new visual evidence conflicts with prior beliefs. SPIKE effectively localizes surprise in videos, strongly correlated with humans on positive (FunQA) and negative (Oops!) surprise benchmarks. Since the beliefs of zero-shot Video-LLMs are often suboptimal, we develop SPIKE-RL, which leverages GRPO to optimize belief hypotheses based on a reward signal from the video caption. SPIKE and SPIKE-RL guide query-agnostic surprise-weighted frame sampling, which allocates more frames to interesting moments in the video. With this strategy, we achieve consistent performance gains on five downstream benchmarks over uniform sampling. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way for more robust models that can revise their understanding in response to new information.

[454] FM-SIREN & FM-FINER: Nyquist-Informed Frequency Multiplier for Implicit Neural Representation with Periodic Activation

Mohammed Alsakabi, Wael Mobeirek, John M. Dolan, Ozan K. Tonguz

Main category: cs.CV

TL;DR: FM-SIREN and FM-FINER improve periodic activation-based implicit neural representations by assigning neuron-specific frequency multipliers to reduce feature redundancy and enhance signal reconstruction across various tasks.

Details

Motivation: Existing periodic activation-based INR networks like SIREN and FINER suffer from hidden feature redundancy due to fixed frequency multipliers, limiting MLP expressive capacity.

Method: Propose FM-SIREN and FM-FINER with Nyquist-informed, neuron-specific frequency multipliers for periodic activations, introducing frequency diversity without hyperparameter tuning or additional network depth.

Result: Reduces feature redundancy by nearly 50% and consistently improves signal reconstruction across 1D audio, 2D image, 3D shape fitting, and NeRF synthesis, outperforming baseline counterparts while maintaining efficiency.

Conclusion: The simple yet principled modification of neuron-specific frequency multipliers effectively reduces feature redundancy and enhances the performance of periodic activation-based INR networks.

Abstract: Existing periodic activation-based implicit neural representation (INR) networks, such as SIREN and FINER, suffer from hidden feature redundancy, where neurons within a layer capture overlapping frequency components due to the use of a fixed frequency multiplier. This redundancy limits the expressive capacity of multilayer perceptrons (MLPs). Drawing inspiration from classical signal processing methods such as the Discrete Sine Transform (DST), we propose FM-SIREN and FM-FINER, which assign Nyquist-informed, neuron-specific frequency multipliers to periodic activations. Unlike existing approaches, our design introduces frequency diversity without requiring hyperparameter tuning or additional network depth. This simple yet principled modification reduces the redundancy of features by nearly 50% and consistently improves signal reconstruction across diverse INR tasks, including fitting 1D audio, 2D image and 3D shape, and synthesis of neural radiance fields (NeRF), outperforming their baseline counterparts while maintaining efficiency.

[455] FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing

Tanawan Premsri, Parisa Kordjamshidi

Main category: cs.CV

TL;DR: FoR-SALE is a framework that enhances text-to-image generation by incorporating Frame of Reference concepts to better handle spatial descriptions from different perspectives, improving alignment between language and vision.

Details

Motivation: Current text-to-image models struggle with spatial descriptions from perspectives other than the camera view, creating a need to integrate Frame of Reference concepts for better spatial reasoning.

Method: Extends SLD framework by evaluating text-image alignment using FoR, extracts spatial configuration from images, maps spatial expressions to camera perspectives, and applies latent-space operations to adjust facing direction and depth.

Result: Improves state-of-the-art T2I models by up to 5.3% on spatial understanding benchmarks using only a single correction round.

Conclusion: Integrating Frame of Reference into multimodal models significantly enhances their ability to handle spatial reasoning in text-to-image generation.

Abstract: Frame of Reference (FoR) is a fundamental concept in spatial reasoning that humans utilize to comprehend and describe space. With the rapid progress in Multimodal Language models, the moment has come to integrate this long-overlooked dimension into these models. In particular, in text-to-image (T2I) generation, even state-of-the-art models exhibit a significant performance gap when spatial descriptions are provided from perspectives other than the camera. To address this limitation, we propose Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing (FoR-SALE), an extension of the Self-correcting LLM-controlled Diffusion (SLD) framework for T2I. For-Sale evaluates the alignment between a given text and an initially generated image, and refines the image based on the Frame of Reference specified in the spatial expressions. It employs vision modules to extract the spatial configuration of the image, while simultaneously mapping the spatial expression to a corresponding camera perspective. This unified perspective enables direct evaluation of alignment between language and vision. When misalignment is detected, the required editing operations are generated and applied. FoR-SALE applies novel latent-space operations to adjust the facing direction and depth of the generated images. We evaluate FoR-SALE on two benchmarks specifically designed to assess spatial understanding with FoR. Our framework improves the performance of state-of-the-art T2I models by up to 5.3% using only a single round of correction.

[456] 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras

Tharindu Ekanayake, Constantino Álvarez Casado, Miguel Bordallo López

Main category: cs.CV

TL;DR: 3DPCNet is a compact module that converts view-dependent 3D poses into body-centered canonical frames using a hybrid encoder and self-supervised learning, significantly improving pose alignment accuracy.

Details

Motivation: Monocular 3D pose estimators produce camera-centered skeletons with view-dependent signals, complicating comparative analysis in applications like health and sports science.

Method: Uses a hybrid encoder combining graph convolutional networks (local skeletal features) and transformers (global context) via gated cross-attention. Predicts continuous 6D rotation mapped to SO(3) matrix for pose alignment. Trained self-supervised on MM-Fi dataset with synthetic rotations.

Result: Reduces mean rotation error from over 20° to 3.4° and Mean Per Joint Position Error from ~64 mm to 47 mm on MM-Fi benchmark. Qualitative evaluations on TotalCapture show acceleration signals from video correspond well with ground-truth IMU data.

Conclusion: The module effectively removes viewpoint variability, enabling physically plausible motion analysis by producing consistent body-centered poses.

Abstract: Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss ensuring both accurate rotation and pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from ~64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that our module removes viewpoint variability to enable physically plausible motion analysis.

[457] No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation

Mohammad Hossein Sameti, Amir M. Mansourian, Arash Marioriyad, Soheil Fadaee Oshyani, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

Main category: cs.CV

TL;DR: A fine-grained test-time optimization framework that improves compositional faithfulness in text-to-image generation by decomposing prompts into semantic concepts and using concept-level alignment feedback to iteratively refine prompts.

Details

Motivation: Text-to-image models often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization can address this without retraining.

Method: Decomposes input prompts into semantic concepts and evaluates alignment at both global and concept levels using fine-grained CLIP. Uses concept-level feedback to drive iterative prompt refinement via large language models.

Result: Significantly improves concept coverage and human-judged faithfulness over standard test-time optimization and base T2I models on DrawBench and CompBench prompts.

Conclusion: The proposed fine-grained test-time optimization framework effectively enhances compositional faithfulness in text-to-image generation through concept-level alignment and iterative prompt refinement.

Abstract: Despite recent advances in text-to-image (T2I) models, they often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization has emerged as a promising approach to address this limitation by refining generation without the need for retraining. In this paper, we propose a fine-grained test-time optimization framework that enhances compositional faithfulness in T2I generation. Unlike most of prior approaches that rely solely on a global image/text similarity score, our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. A fine-grained variant of CLIP is used to compute concept-level correspondence, producing detailed feedback on missing or inaccurate concepts. This feedback is fed into an iterative prompt refinement loop, enabling the large language model to propose improved prompts. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness over both standard test-time optimization and the base T2I model. Code is available at: https://github.com/AmirMansurian/NoConceptLeftBehind

Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Jun-Ren Chen, Cheng-Hsiang Su, Yi-Chen Ou, Chiou-Ting Hsu, Pei-Kai Huang

Main category: cs.CV

TL;DR: MFAS-DANet is a novel multi-modal face anti-spoofing framework that addresses domain adaptation challenges including missing modalities, noisy pseudo labels, and model degradation through complementary feature extraction, uncertainty-based pseudo labeling, and adaptive loss weighting.

Details

Motivation: Existing multi-modal FAS models fail to detect unseen attacks from new domains, and domain adaptation remains unexplored in multi-modal FAS despite being studied in single-modal approaches.

Method: Proposes three key components: 1) Complementary feature extraction to handle missing modalities, 2) Uncertainty-based pseudo labeling to reduce noise, and 3) Adaptive loss weighting mechanism to prevent model degradation during unstable adaptations.

Result: Extensive experiments demonstrate state-of-the-art performance and effectiveness in handling domain adaptation challenges in multi-modal face anti-spoofing.

Conclusion: MFAS-DANet successfully addresses the three major challenges in multi-modal FAS domain adaptation and achieves superior performance compared to existing methods.

Abstract: Recent multi-modal face anti-spoofing (FAS) methods have investigated the potential of leveraging multiple modalities to distinguish live and spoof faces. However, pre-adapted multi-modal FAS models often fail to detect unseen attacks from new target domains. Although a more realistic domain adaptation (DA) scenario has been proposed for single-modal FAS to learn specific spoof attacks during inference, DA remains unexplored in multi-modal FAS methods. In this paper, we propose a novel framework, MFAS-DANet, to address three major challenges in multi-modal FAS under the DA scenario: missing modalities, noisy pseudo labels, and model degradation. First, to tackle the issue of missing modalities, we propose extracting complementary features from other modalities to substitute missing modality features or enhance existing ones. Next, to reduce the impact of noisy pseudo labels during model adaptation, we propose deriving reliable pseudo labels by leveraging prediction uncertainty across different modalities. Finally, to prevent model degradation, we design an adaptive mechanism that decreases the loss weight during unstable adaptations and increasing it during stable ones. Extensive experiments demonstrate the effectiveness and state-of-the-art performance of our proposed MFAS-DANet.

[459] RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation

Shourya Verma, Mengbo Wang, Nadia Atallah Lanman, Ananth Grama

Main category: cs.CV

TL;DR: RestoRect is a novel Latent Rectified Flow Feature Distillation method that addresses the speed-quality trade-off in image restoration by using rectified flow to enable students to synthesize teacher-quality features through learnable latent trajectories.

Details

Motivation: Current image restoration approaches face a critical trade-off where high-performance models are too slow for practical use, while fast models produce poor results. Existing static feature matching methods cannot capture how modern transformer architectures dynamically generate features.

Method: Proposes RestoRect using rectified flow to reformulate feature distillation as a generative process. Combines Retinex theory for physics-based decomposition with learnable anisotropic diffusion constraints and trigonometric color space polarization. Introduces Feature Layer Extraction loss for robust knowledge transfer between different architectures through cross-normalized transformer feature alignment with percentile-based outlier detection.

Result: Achieves better training stability, faster convergence and inference while preserving restoration quality. Demonstrates superior results across 15 image restoration datasets, covering 4 tasks, on 8 metrics.

Conclusion: RestoRect effectively addresses the speed-quality trade-off in image restoration through its novel latent rectified flow feature distillation approach, enabling practical deployment of high-quality restoration models.

Abstract: Current approaches for restoration of degraded images face a critical trade-off: high-performance models are too slow for practical use, while fast models produce poor results. Knowledge distillation transfers teacher knowledge to students, but existing static feature matching methods cannot capture how modern transformer architectures dynamically generate features. We propose ‘RestoRect’, a novel Latent Rectified Flow Feature Distillation method for restoring degraded images. We apply rectified flow to reformulate feature distillation as a generative process where students learn to synthesize teacher-quality features through learnable trajectories in latent space. Our framework combines Retinex theory for physics-based decomposition with learnable anisotropic diffusion constraints, and trigonometric color space polarization. We introduce a Feature Layer Extraction loss for robust knowledge transfer between different network architectures through cross-normalized transformer feature alignment with percentile-based outlier detection. RestoRect achieves better training stability, and faster convergence and inference while preserving restoration quality. We demonstrate superior results across 15 image restoration datasets, covering 4 tasks, on 8 metrics.

[460] Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos

Junyi Wu, Jiachen Tao, Haoxuan Wang, Gaowen Liu, Ramana Rao Kompella, Yan Yan

Main category: cs.CV

TL;DR: Orientation-anchored Gaussian Splatting (OriGS) is a novel 4D reconstruction framework that uses scene orientation as stable guidance to model complex deformations in dynamic scenes from monocular videos.

Details

Motivation: Existing methods that extend 3D Gaussian Splatting to dynamic scenes rely on low-rank assumptions and struggle with complex, region-specific deformations in unconstrained dynamics.

Method: Estimates a Global Orientation Field to propagate principal directions across space and time, then introduces Orientation-aware Hyper-Gaussian that embeds time, space, geometry, and orientation into a unified probabilistic state for inferring region-specific deformations.

Result: Experiments show superior reconstruction fidelity over mainstream methods in challenging real-world dynamic scenes.

Conclusion: OriGS effectively addresses complex dynamic modeling by leveraging orientation-based guidance and principled conditioned slicing for adaptive local dynamics capture.

Abstract: We present Orientation-anchored Gaussian Splatting (OriGS), a novel framework for high-quality 4D reconstruction from casually captured monocular videos. While recent advances extend 3D Gaussian Splatting to dynamic scenes via various motion anchors, such as graph nodes or spline control points, they often rely on low-rank assumptions and fall short in modeling complex, region-specific deformations inherent to unconstrained dynamics. OriGS addresses this by introducing a hyperdimensional representation grounded in scene orientation. We first estimate a Global Orientation Field that propagates principal forward directions across space and time, serving as stable structural guidance for dynamic modeling. Built upon this, we propose Orientation-aware Hyper-Gaussian, a unified formulation that embeds time, space, geometry, and orientation into a coherent probabilistic state. This enables inferring region-specific deformation through principled conditioned slicing, adaptively capturing diverse local dynamics in alignment with global motion intent. Experiments demonstrate the superior reconstruction fidelity of OriGS over mainstream methods in challenging real-world dynamic scenes.

Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra

Main category: cs.CV

TL;DR: Large-scale empirical study reveals that multi-modal benchmarks often amplify image-only dependencies rather than enabling true multi-modal reasoning, with larger models using intra-modality dependencies to mask lack of genuine cross-modal understanding.

Details

Motivation: To understand and quantify the interplay between intra-modality dependencies (individual modality contributions) and inter-modality dependencies (relationships between modalities and target task) in multi-modal learning, which remains poorly characterized in current benchmarks.

Method: Conducted large-scale empirical study across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs), covering domains including general/expert knowledge reasoning, OCR, and document understanding.

Result: Found significant variation in reliance on vision, text, and their interaction across benchmarks; discovered that many benchmarks intended to reduce text-only biases have instead amplified image-only dependencies; this pattern persists across model sizes with larger models using intra-modality dependencies to achieve high performance that masks lack of true multi-modal reasoning.

Conclusion: Provides quantitative characterization of multi-modal datasets to enable principled approach to multi-modal benchmark design and evaluation, revealing current limitations in assessing genuine multi-modal reasoning capabilities.

Abstract: Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of and interaction between these dependencies within current benchmark evaluations remains poorly characterized. In this work, we present a large-scale empirical study to quantify these dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs) covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that mask an underlying lack of multi-modal reasoning. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.

[462] Enhancing Polyp Segmentation via Encoder Attention and Dynamic Kernel Update

Fatemeh Salahi Chashmi, Roya Sotoudeh

Main category: cs.CV

TL;DR: A novel polyp segmentation framework that integrates Dynamic Kernel mechanism with Encoder Attention and Unified Channel Adaptation to improve accuracy and efficiency in medical imaging.

Details

Motivation: Polyp segmentation is challenging due to diverse shapes, sizes, and low contrast boundaries in medical imaging, requiring improved accuracy and efficiency for colorectal cancer detection.

Method: Proposes Dynamic Kernel mechanism initialized by global context vector from Encoder Attention module, iteratively refines predictions across decoding stages, and uses Unified Channel Adaptation in decoder to standardize feature dimensions.

Result: Outperforms state-of-the-art methods on KvasirSEG and CVC ClinicDB datasets, achieving superior Dice and IoU scores while reducing computational costs without compromising accuracy.

Conclusion: Provides a robust and adaptable solution for polyp segmentation with promising applications in clinical and automated diagnostic systems.

Abstract: Polyp segmentation is a critical step in colorectal cancer detection, yet it remains challenging due to the diverse shapes, sizes, and low contrast boundaries of polyps in medical imaging. In this work, we propose a novel framework that improves segmentation accuracy and efficiency by integrating a Dynamic Kernel (DK) mechanism with a global Encoder Attention module. The DK mechanism, initialized by a global context vector from the EA module, iteratively refines segmentation predictions across decoding stages, enabling the model to focus on and accurately delineate complex polyp boundaries. The EA module enhances the network’s ability to capture critical lesion features by aggregating multi scale information from all encoder layers. In addition, we employ Unified Channel Adaptation (UCA) in the decoder to standardize feature dimensions across stages, ensuring consistent and computationally efficient information fusion. Our approach extends the lesion-aware kernel framework by introducing a more flexible, attention driven kernel initialization and a unified decoder design. Extensive experiments on the KvasirSEG and CVC ClinicDB benchmark datasets demonstrate that our model outperforms several state of the art segmentation methods, achieving superior Dice and Intersection over Union scores. Moreover, UCA simplifies the decoder structure, reducing computational cost without compromising accuracy. Overall, the proposed method provides a robust and adaptable solution for polyp segmentation, with promising applications in clinical and automated diagnostic systems.

[463] Evaluating point-light biological motion in multimodal large language models

Akila Kadambi, Marco Iacoboni, Lisa Aziz-Zadeh, Srini Narayanan

Main category: cs.CV

TL;DR: ActPLD is the first benchmark to evaluate action processing in multimodal large language models (MLLMs) using human point-light displays (PLDs), revealing consistently poor performance across state-of-the-art models.

Details

Motivation: Humans can extract rich semantic information from minimal visual cues like PLDs, which isolate body motion as the sole source of meaning. This ability emerges early in development and is attributed to human embodied experience, making PLDs ideal for testing action understanding constraints in AI systems.

Method: Created ActPLD benchmark to evaluate action processing in MLLMs using single-actor and socially interacting point-light displays. Tested state-of-the-art proprietary and open-source systems.

Result: Models showed consistently low performance across all tested conditions, revealing fundamental gaps in action and spatiotemporal understanding.

Conclusion: Current MLLMs have significant limitations in processing human action from minimal visual motion cues, highlighting critical gaps in their action understanding capabilities compared to humans.

Abstract: Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. This ability emerges early in development and is largely attributed to human embodied experience. Since PLDs isolate body motion as the sole source of meaning, they represent key stimuli for testing the constraints of action understanding in these systems. Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs from human PLDs. Tested models include state-of-the-art proprietary and open-source systems on single-actor and socially interacting PLDs. Our results reveal consistently low performance across models, introducing fundamental gaps in action and spatiotemporal understanding.

[464] Imaging-Based Mortality Prediction in Patients with Systemic Sclerosis

Alec K. Peltekian, Karolina Senkow, Gorkem Durak, Kevin M. Grudzinski, Bradford C. Bemiss, Jane E. Dematte, Carrie Richardson, Nikolay S. Markov, Mary Carns, Kathleen Aren, Alexandra Soriano, Matthew Dapas, Harris Perlman, Aaron Gundersheimer, Kavitha C. Selvan, John Varga, Monique Hinchcliff, Krishnan Warrior, Catherine A. Gao, Richard G. Wunderink, GR Scott Budinger, Alok N. Choudhary, Anthony J. Esposito, Alexander V. Misharin, Ankit Agrawal, Ulas Bagci

Main category: cs.CV

TL;DR: This study developed a deep learning framework using chest CT scans to predict mortality in systemic sclerosis patients with interstitial lung disease, achieving AUCs of 0.769, 0.801, and 0.709 for 1-, 3-, and 5-year mortality prediction respectively.

Details

Motivation: Chest CT is the primary imaging modality for diagnosing lung complications in systemic sclerosis, but its role in predicting disease progression and mortality hasn't been fully clarified. There's a need for better early detection and risk assessment tools for SSc-related interstitial lung disease.

Method: Used a large-scale longitudinal analysis of 2,125 CT scans from SSc patients. Applied radiomics and deep learning with pre-trained models (ResNet-18, DenseNet-121, Swin Transformer) fine-tuned on the dataset. Conducted mortality analysis at 1, 3, and 5 years using death labels confirmed by expert physicians.

Result: The models achieved AUCs of 0.769 for 1-year mortality prediction, 0.801 for 3-year mortality, and 0.709 for 5-year mortality. The dataset included 181, 326, and 428 CT scans from patients who died within 1, 3, and 5 years respectively.

Conclusion: Both radiomics and deep learning computational methods show significant potential for improving early detection and risk assessment of SSc-related interstitial lung disease, representing a major advancement in the field.

Abstract: Interstitial lung disease (ILD) is a leading cause of morbidity and mortality in systemic sclerosis (SSc). Chest computed tomography (CT) is the primary imaging modality for diagnosing and monitoring lung complications in SSc patients. However, its role in disease progression and mortality prediction has not yet been fully clarified. This study introduces a novel, large-scale longitudinal chest CT analysis framework that utilizes radiomics and deep learning to predict mortality associated with lung complications of SSc. We collected and analyzed 2,125 CT scans from SSc patients enrolled in the Northwestern Scleroderma Registry, conducting mortality analyses at one, three, and five years using advanced imaging analysis techniques. Death labels were assigned based on recorded deaths over the one-, three-, and five-year intervals, confirmed by expert physicians. In our dataset, 181, 326, and 428 of the 2,125 CT scans were from patients who died within one, three, and five years, respectively. Using ResNet-18, DenseNet-121, and Swin Transformer we use pre-trained models, and fine-tuned on 2,125 images of SSc patients. Models achieved an AUC of 0.769, 0.801, 0.709 for predicting mortality within one-, three-, and five-years, respectively. Our findings highlight the potential of both radiomics and deep learning computational methods to improve early detection and risk assessment of SSc-related interstitial lung disease, marking a significant advancement in the literature.

[465] Calibrated and Resource-Aware Super-Resolution for Reliable Driver Behavior Analysis

Ibne Farabi Shihab, Weiheng Chai, Jiyang Wang, Sanjeda Akter, Senem Velipasalar Gursoy, Anuj Sharma

Main category: cs.CV

TL;DR: Proposes an adaptive super-resolution framework for driver monitoring systems that improves model calibration and safety-critical performance while filtering hallucinations.

Details

Motivation: Driver monitoring systems need reliable confidence scores for safety-critical deployment. Direct low-resolution training yields high accuracy but poor calibration, which is dangerous in safety-critical scenarios.

Method: Resource-aware adaptive super-resolution framework that optimizes for model calibration and high precision-recall on critical events, with a lightweight artifact detector to filter SR-induced hallucinations.

Result: Achieves state-of-the-art performance: best calibration (ECE 5.8% vs 6.2% baseline), highest AUPR for drowsiness detection (0.78 vs 0.74), superior precision-recall for phone use detection (0.74 vs 0.71), with minimal overhead (0.3M parameters, 5.2ms).

Conclusion: While LR-trained models serve as strong general-purpose baselines, the adaptive framework represents state-of-the-art for safety-critical applications where reliability is paramount.

Abstract: Driver monitoring systems require not just high accuracy but reliable, well-calibrated confidence scores for safety-critical deployment. While direct low-resolution training yields high overall accuracy, it produces poorly calibrated predictions that can be dangerous in safety-critical scenarios. We propose a resource-aware adaptive super-resolution framework that optimizes for model calibration and high precision-recall on critical events. Our approach achieves state-of-the-art performance on safety-centric metrics: best calibration (ECE of 5.8% vs 6.2% for LR-trained baselines), highest AUPR for drowsiness detection (0.78 vs 0.74), and superior precision-recall for phone use detection (0.74 vs 0.71). A lightweight artifact detector (0.3M parameters, 5.2ms overhead) provides additional safety by filtering SR-induced hallucinations. While LR-trained video models serve as strong general-purpose baselines, our adaptive framework represents the state-of-the-art solution for safety-critical applications where reliability is paramount.

[466] OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction

Hongyang Li, Jinyuan Qu, Lei Zhang

Main category: cs.CV

TL;DR: OVSeg3R is a training scheme for open-vocabulary 3D instance segmentation that leverages 2D perception models and 3D reconstruction to generate 3D annotations automatically, avoiding manual labeling costs.

Details

Motivation: To enable open-vocabulary 3D instance segmentation without costly manual annotation by exploiting existing 2D perception models and 3D reconstruction techniques.

Method: Projects 2D instance masks from open-vocabulary 2D models onto 3D scenes using reconstruction correspondences, uses View-wise Instance Partition to handle partial annotations, and employs 2D Instance Boundary-aware Superpoint clustering to preserve object boundaries.

Result: Achieves +2.3 mAP improvement on ScanNet200 benchmark and +7.1 mAP gain on novel classes under open-vocabulary setting, significantly reducing performance gap between tail and head classes.

Conclusion: OVSeg3R effectively extends closed-vocabulary 3D segmentation to open-vocabulary scenarios by leveraging 2D models and 3D reconstruction, demonstrating substantial performance improvements particularly for novel classes.

Abstract: In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts reconstructed scenes from 2D videos as input, avoiding costly manual adjustment while aligning input with real-world applications. By exploiting the 2D to 3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view’s 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view’s corresponding sub-scene. To avoid incorrectly introduced false positives as supervision due to partial annotations from 2D to 3D, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative super-points based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from violating instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to open-vocabulary, but also substantially narrows the performance gap between tail and head classes, ultimately leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Furthermore, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.

[467] From Fields to Splats: A Cross-Domain Survey of Real-Time Neural Scene Representations

Javed Ahmad, Penggang Gao, Donatien Delehelle, Mennuti Canio, Nikhil Deshpande, Jesús Ortiz, Darwin G. Caldwell, Yonas Teodros Tefera

Main category: cs.CV

TL;DR: This survey examines how 3D Gaussian Splatting (3DGS) is displacing Neural Radiance Fields (NeRF) across various domains including SLAM, telepresence, robotics, and 3D content generation due to its advantages in photorealism, geometric fidelity, and computational efficiency.

Details

Motivation: To understand why 3DGS is increasingly replacing NeRF-based approaches and to provide a systematic comparison of domain-specific pipelines that leverage neural rendering for both image synthesis and practical applications like perception and interaction.

Method: The survey organizes around unified research questions examining 3DGS’s technical advantages, adaptability to different input modalities and domain constraints, and remaining limitations through systematic comparison of domain-specific pipelines.

Result: The analysis shows that 3DGS effectively balances photorealism, geometric fidelity, and computational efficiency, making it suitable for various applications beyond just rendering, including perception and content creation.

Conclusion: 3DGS offers a compelling alternative to NeRF that supports high-quality rendering, faster optimization, and integration into hybrid pipelines, providing a roadmap for leveraging neural rendering across real and virtual environments for both synthesis and practical applications.

Abstract: Neural scene representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have transformed how 3D environments are modeled, rendered, and interpreted. NeRF introduced view-consistent photorealism via volumetric rendering; 3DGS has rapidly emerged as an explicit, efficient alternative that supports high-quality rendering, faster optimization, and integration into hybrid pipelines for enhanced photorealism and task-driven scene understanding. This survey examines how 3DGS is being adopted across SLAM, telepresence and teleoperation, robotic manipulation, and 3D content generation. Despite their differences, these domains share common goals: photorealistic rendering, meaningful 3D structure, and accurate downstream tasks. We organize the review around unified research questions that explain why 3DGS is increasingly displacing NeRF-based approaches: What technical advantages drive its adoption? How does it adapt to different input modalities and domain-specific constraints? What limitations remain? By systematically comparing domain-specific pipelines, we show that 3DGS balances photorealism, geometric fidelity, and computational efficiency. The survey offers a roadmap for leveraging neural rendering not only for image synthesis but also for perception, interaction, and content creation across real and virtual environments.

[468] Pancreas Part Segmentation under Federated Learning Paradigm

Ziliang Hong, Halil Ertugrul Aktas, Andrea Mia Bejar, Katherine Wu, Hongyi Pan, Gorkem Durak, Zheyuan Zhang, Sait Kayali, Temel Tirkes, Federica Proietto Salanitri, Concetto Spampinato, Michael Goggins, Tamas Gonda, Candice Bolan, Raj Keswani, Frank Miller, Michael Wallace, Ulas Bagci

Main category: cs.CV

TL;DR: First federated learning approach for pancreas part segmentation in MRI using privacy-preserving collaborative training across 7 institutions with 711 T1W and 726 T2W MRI scans.

Details

Motivation: Pancreatic diseases show regional heterogeneity (cancers in head, chronic pancreatitis in tail), requiring accurate part segmentation for diagnosis and treatment planning, but MRI segmentation is challenging due to variable morphology and poor contrast.

Method: Privacy-preserving FL framework with evaluation of 3 segmentation architectures (U-Net, Attention U-Net, Swin UNETR) paired with 2 FL algorithms (FedAvg, FedProx), plus novel anatomically-informed loss function for region-specific texture contrasts.

Result: Attention U-Net with FedAvg identified as optimal for pancreatic heterogeneity, achieving clinically viable performance on distributed, heterogeneous datasets.

Conclusion: The approach successfully addresses both technical complexity of pancreas part delineation in MRI and data scarcity problem through federated learning, enabling collaborative training without direct data sharing.

Abstract: We present the first federated learning (FL) approach for pancreas part(head, body and tail) segmentation in MRI, addressing a critical clinical challenge as a significant innovation. Pancreatic diseases exhibit marked regional heterogeneity cancers predominantly occur in the head region while chronic pancreatitis causes tissue loss in the tail, making accurate segmentation of the organ into head, body, and tail regions essential for precise diagnosis and treatment planning. This segmentation task remains exceptionally challenging in MRI due to variable morphology, poor soft-tissue contrast, and anatomical variations across patients. Our novel contribution tackles two fundamental challenges: first, the technical complexity of pancreas part delineation in MRI, and second the data scarcity problem that has hindered prior approaches. We introduce a privacy-preserving FL framework that enables collaborative model training across seven medical institutions without direct data sharing, leveraging a diverse dataset of 711 T1W and 726 T2W MRI scans. Our key innovations include: (1) a systematic evaluation of three state-of-the-art segmentation architectures (U-Net, Attention U-Net,Swin UNETR) paired with two FL algorithms (FedAvg, FedProx), revealing Attention U-Net with FedAvg as optimal for pancreatic heterogeneity, which was never been done before; (2) a novel anatomically-informed loss function prioritizing region-specific texture contrasts in MRI. Comprehensive evaluation demonstrates that our approach achieves clinically viable performance despite training on distributed, heterogeneous datasets.

[469] Towards Interpretable Visual Decoding with Attention to Brain Representations

Pinyuan Feng, Hossein Adeli, Wenxuan Guo, Fan Cheng, Ethan Hwang, Nikolaus Kriegeskorte

Main category: cs.CV

TL;DR: NeuroAdapter is a brain-to-image decoding framework that directly conditions latent diffusion models on brain signals, bypassing intermediate feature spaces and enabling better interpretability of how different brain areas influence image generation.

Details

Motivation: Current brain decoding methods use intermediate image/text features that mask the contributions of different brain areas, limiting interpretability of how brain signals shape visual reconstruction.

Method: Proposes NeuroAdapter framework that directly conditions latent diffusion models on brain representations, and introduces IBBI interpretability framework to analyze cross-attention mechanisms across diffusion denoising steps.

Result: Achieves competitive visual reconstruction quality on public fMRI datasets while providing greater transparency into how brain signals influence the generation process.

Conclusion: Demonstrates potential of end-to-end brain-to-image decoding and establishes path toward interpreting diffusion models through visual neuroscience lens.

Abstract: Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity using deep generative models, helping brain science researchers interpret how the brain represents real-world scenes. However, most current approaches leverage mapping brain signals into intermediate image or text feature spaces before guiding the generative process, masking the effect of contributions from different brain areas on the final reconstruction output. In this work, we propose NeuroAdapter, a visual decoding framework that directly conditions a latent diffusion model on brain representations, bypassing the need for intermediate feature spaces. Our method demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals shape the generation process. To this end, we contribute an Image-Brain BI-directional interpretability framework (IBBI) which investigates cross-attention mechanisms across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory. Our results highlight the potential of end-to-end brain-to-image decoding and establish a path toward interpreting diffusion models through the lens of visual neuroscience.

[470] RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang

Main category: cs.CV

TL;DR: RobuQ is a systematic QAT framework for Diffusion Transformers that addresses activation quantization challenges, enabling stable sub-4-bit quantization with state-of-the-art performance on ImageNet-1K.

Details

Motivation: Diffusion Transformers (DiTs) show superior performance but face practical deployment challenges due to high computational and memory costs, with activation quantization being the primary bottleneck for low-bit quantization.

Method: Proposes RobuQ framework with: 1) Strong ternary weight baseline (W1.58A4), 2) RobustQuantizer using Hadamard transform for robust activation quantization, and 3) AMPN - Activation-only Mixed-Precision Network pipeline with ternary weights and layer-wise activation precisions.

Result: Achieves state-of-the-art performance for DiT quantization in sub-4-bit configurations, with stable image generation on ImageNet-1K using average 2-bit activations.

Conclusion: RobuQ successfully addresses the activation quantization bottleneck in DiTs, enabling practical deployment of low-bit quantized diffusion models while maintaining competitive performance.

Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configuration. To the best of our knowledge, RobuQ is the first achieving stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to average 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ .

[471] VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement

Shulian Zhang, Yong Guo, Long Peng, Ziyang Wang, Ye Chen, Wenbo Li, Xiao Zhang, Yulun Zhang, Jian Chen

Main category: cs.CV

TL;DR: VividFace is a one-step diffusion framework for video face enhancement that addresses challenges in facial texture modeling, temporal consistency, model generalization, and efficiency through flow matching, joint latent-pixel training, and automated data curation.

Details

Motivation: Current video face enhancement methods face three key challenges: difficulty in modeling intricate facial textures while maintaining temporal consistency, limited generalization due to lack of high-quality training data, and low efficiency from repeated denoising steps during inference.

Method: Built on pretrained WANX video generation model, uses single-step flow matching for direct mapping from degraded inputs to high-quality outputs. Employs Joint Latent-Pixel Face-Focused Training with stochastic switching between facial region optimization and global reconstruction, plus MLLM-driven data curation pipeline for automated dataset selection.

Result: Achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, with significantly reduced inference time compared to traditional methods.

Conclusion: VividFace provides an efficient and effective solution for video face enhancement that overcomes key limitations of existing approaches while offering practical resources for the research community.

Abstract: Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences, a capability that underpins numerous applications including video conferencing, film restoration, and surveillance. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) faithfully modeling intricate facial textures while preserving temporal consistency; (2) restricted model generalization due to the lack of high-quality face video training data; and (3) low efficiency caused by repeated denoising steps during inference. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement. Built upon the pretrained WANX video generation model, our method leverages powerful spatiotemporal priors through a single-step flow matching paradigm, enabling direct mapping from degraded inputs to high-quality outputs with significantly reduced inference time. To further boost efficiency, we propose a Joint Latent-Pixel Face-Focused Training strategy that employs stochastic switching between facial region optimization and global reconstruction, providing explicit supervision in both latent and pixel spaces through a progressive two-stage training process. Additionally, we introduce an MLLM-driven data curation pipeline for automated selection of high-quality video face datasets, enhancing model generalization. Extensive experiments demonstrate that VividFace achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, while offering practical resources for the research community.

[472] Multi-Level Heterogeneous Knowledge Transfer Network on Forward Scattering Center Model for Limited Samples SAR ATR

Chenxi Zhao, Daochang Wang, Siqian Zhang, Gangyao Kuang

Main category: cs.CV

TL;DR: Proposes a multi-level heterogeneous knowledge transfer network that uses forward scattering center model data instead of simulated images to migrate purer target knowledge for SAR target recognition, addressing limited sample problems.

Details

Motivation: Existing simulated data-assisted methods use simulated images containing irrelevant information like background and noise, which affects migration quality. The paper aims to use forward scattering center model (FSCM) data with strong physical meaning and interpretability to migrate purer target knowledge.

Method: Multi-level heterogeneous knowledge transfer (MHKT) network that migrates FSCM knowledge at feature, distribution and category levels. Uses task-associated information selector (TAIS) for feature migration, maximum discrimination divergence (MDD) metric for distribution alignment, and category relation consistency constraint to handle data imbalance.

Result: Extensive experiments on two new datasets combining FSCM data and measured SAR images demonstrate superior performance compared to existing methods.

Conclusion: The proposed method effectively migrates purer target knowledge from FSCM data through stepwise knowledge selection and migration, achieving better SAR target recognition performance while addressing the limited sample problem.

Abstract: Simulated data-assisted SAR target recognition methods are the research hotspot currently, devoted to solving the problem of limited samples. Existing works revolve around simulated images, but the large amount of irrelevant information embedded in the images, such as background, noise, etc., seriously affects the quality of the migrated information. Our work explores a new simulated data to migrate purer and key target knowledge, i.e., forward scattering center model (FSCM) which models the actual local structure of the target with strong physical meaning and interpretability. To achieve this purpose, multi-level heterogeneous knowledge transfer (MHKT) network is proposed, which fully migrates FSCM knowledge from the feature, distribution and category levels, respectively. Specifically, we permit the more suitable feature representations for the heterogeneous data and separate non-informative knowledge by task-associated information selector (TAIS), to complete purer target feature migration. In the distribution alignment, the new metric function maximum discrimination divergence (MDD) in target generic knowledge transfer (TGKT) module perceives transferable knowledge efficiently while preserving discriminative structure about classes. Moreover, category relation knowledge transfer (CRKT) module leverages the category relation consistency constraint to break the dilemma of optimization bias towards simulation data due to imbalance between simulated and measured data. Such stepwise knowledge selection and migration will ensure the integrity of the migrated FSCM knowledge. Notably, extensive experiments on two new datasets formed by FSCM data and measured SAR images demonstrate the superior performance of our method.

[473] VAMamba: An Efficient Visual Adaptive Mamba for Image Restoration

Han Hu, Zhuoran Zheng, Liang Li, Chen Lyu

Main category: cs.CV

TL;DR: VAMamba is a Visual Adaptive Mamba framework that overcomes limitations of fixed scanning patterns and inefficient feature utilization in Mamba-based image restoration through two innovations: QCLAM for dynamic feature reuse and GPS-SS2D for adaptive scanning paths.

Details

Motivation: Recent Mamba-based image restoration methods are limited by fixed scanning patterns that cannot adapt to diverse degradations, constraining both restoration performance and computational efficiency.

Method: Proposes VAMamba with two key components: 1) QCLAM (Queue-based Cache Low-rank Adaptive Memory) uses a FIFO cache to store historical representations and intelligently fuses current LoRA-adapted features with cached ones; 2) GPS-SS2D (Greedy Path Scan SS2D) uses a Vision Transformer to generate score maps for pixel importance estimation and a greedy strategy to determine optimal forward/backward scanning paths.

Result: Extensive experiments across diverse restoration tasks demonstrate that VAMamba consistently outperforms existing approaches in both restoration quality and efficiency, establishing new benchmarks for adaptive image restoration.

Conclusion: VAMamba successfully addresses the limitations of conventional Mamba architectures by enabling adaptive feature utilization and scanning, achieving superior performance in image restoration tasks while maintaining computational efficiency.

Abstract: Recent Mamba-based image restoration methods have achieved promising results but remain limited by fixed scanning patterns and inefficient feature utilization. Conventional Mamba architectures rely on predetermined paths that cannot adapt to diverse degradations, constraining both restoration performance and computational efficiency. To overcome these limitations, we propose VAMamba, a Visual Adaptive Mamba framework with two key innovations. First, QCLAM(Queue-basedCacheLow-rankAdaptiveMemory)enhancesfeaturelearningthrougha FIFO cache that stores historical representations. Similarity between current LoRA-adapted and cached features guides intelligent fusion, enabling dynamic reuse while effectively controlling memorygrowth.Second, GPS-SS2D(GreedyPathScanSS2D)introducesadaptive scanning. A Vision Transformer generates score maps to estimate pixel importance, and a greedy strategy de termines optimal forward and backward scanning paths. These learned trajectories replace rigid patterns, enabling SS2D to perform targeted feature extraction. The integration of QCLAM and GPS-SS2D allows VAMamba to adaptively focus on degraded regions while maintaining high computational efficiency. Extensive experiments across diverse restoration tasks demonstrate that VAMamba consistently outperforms existing approaches in both restoration quality and efficiency, establishing new benchmarks for adaptive image restoration. Our code is available at https://github.com/WaterHQH/VAMamba.

[474] Deep Taxonomic Networks for Unsupervised Hierarchical Prototype Discovery

Zekun Wang, Ethan Haarer, Zhiyi Dai, Tianyi Zhu, Christopher J. MacLellan

Main category: cs.CV

TL;DR: Deep taxonomic networks use a deep latent variable approach with complete binary tree structured mixture-of-Gaussian prior to automatically discover hierarchical taxonomies and prototype clusters from unlabeled data without assuming true label sizes.

Details

Motivation: Address limitations in current deep hierarchical clustering methods that tie structure to number of classes and underutilize prototype information at intermediate hierarchical levels, inspired by human ability to organize knowledge into hierarchical taxonomies with prototypes.

Method: Optimizes a large latent taxonomic hierarchy using complete binary tree structured mixture-of-Gaussian prior within variational inference framework, analytically showing ELBO optimization encourages discovery of hierarchical relationships among prototypes.

Result: Learned models demonstrate strong hierarchical clustering performance, outperforming baselines across diverse image classification datasets using novel evaluation mechanism that leverages prototype clusters discovered at all hierarchical levels.

Conclusion: Deep taxonomic networks discover rich and interpretable hierarchical taxonomies capturing both coarse-grained semantic categories and fine-grained visual distinctions, bridging gaps in current hierarchical clustering methods.

Abstract: Inspired by the human ability to learn and organize knowledge into hierarchical taxonomies with prototypes, this paper addresses key limitations in current deep hierarchical clustering methods. Existing methods often tie the structure to the number of classes and underutilize the rich prototype information available at intermediate hierarchical levels. We introduce deep taxonomic networks, a novel deep latent variable approach designed to bridge these gaps. Our method optimizes a large latent taxonomic hierarchy, specifically a complete binary tree structured mixture-of-Gaussian prior within a variational inference framework, to automatically discover taxonomic structures and associated prototype clusters directly from unlabeled data without assuming true label sizes. We analytically show that optimizing the ELBO of our method encourages the discovery of hierarchical relationships among prototypes. Empirically, our learned models demonstrate strong hierarchical clustering performance, outperforming baselines across diverse image classification datasets using our novel evaluation mechanism that leverages prototype clusters discovered at all hierarchical levels. Qualitative results further reveal that deep taxonomic networks discover rich and interpretable hierarchical taxonomies, capturing both coarse-grained semantic categories and fine-grained visual distinctions.

[475] MAN: Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising

Tangtangfang Fang, Jingxi Hu, Xiangjian He, Jiaqi Yang

Main category: cs.CV

TL;DR: MAN is a latent diffusion model for LDCT denoising that achieves high quality results while being 60x faster than pixel-space diffusion models, making it clinically viable.

Details

Motivation: Diffusion models provide superior quality for LDCT denoising but are too slow for clinical use, with inference times exceeding thousands of seconds per scan.

Method: Uses a latent diffusion approach with a perceptually-optimized autoencoder and attention-based conditional U-Net for fast deterministic denoising in compressed latent space.

Result: Achieves superior perceptual quality over CNN/GAN methods, rivals heavy diffusion models like DDPM/Dn-Dp, and is 60x faster than pixel-space diffusion models while maintaining competitive PSNR/SSIM scores.

Conclusion: MAN bridges the gap between high fidelity and clinical viability, demonstrating a practical path for advanced generative models in medical imaging.

Abstract: While diffusion models have set a new benchmark for quality in Low-Dose Computed Tomography (LDCT) denoising, their clinical adoption is critically hindered by extreme computational costs, with inference times often exceeding thousands of seconds per scan. To overcome this barrier, we introduce MAN, a Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising task. Our method operates in a compressed latent space via a perceptually-optimized autoencoder, enabling an attention-based conditional U-Net to perform the fast, deterministic conditional denoising diffusion process with drastically reduced overhead. On the LDCT and Projection dataset, our model achieves superior perceptual quality, surpassing CNN/GAN-based methods while rivaling the reconstruction fidelity of computationally heavy diffusion models like DDPM and Dn-Dp. Most critically, in the inference stage, our model is over 60x faster than representative pixel space diffusion denoisers, while remaining competitive on PSNR/SSIM scores. By bridging the gap between high fidelity and clinical viability, our work demonstrates a practical path forward for advanced generative models in medical imaging.

[476] VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis

Zeren Xiong, Yue Yu, Zedong Zhang, Shuo Chen, Jian Yang, Jun Li

Main category: cs.CV

TL;DR: VMDiff is a diffusion-based framework that synthesizes coherent objects by fusing two input images at noise and latent levels, addressing coexistent generation and semantic bias issues through hybrid sampling and adaptive parameter adjustment.

Details

Motivation: Existing image fusion methods face coexistent generation (objects simply juxtaposed without integration) and bias generation (one object dominates due to semantic imbalance), limiting their effectiveness in artistic creation and visual media applications.

Method: Proposes Visual Mixing Diffusion (VMDiff) with: (1) hybrid sampling combining guided denoising, inversion, and spherical interpolation for structure-aware fusion; (2) adaptive adjustment module with similarity-based score to automatically search optimal parameters.

Result: Experiments on 780 concept pairs show VMDiff outperforms baselines in visual quality, semantic consistency, and human-rated creativity.

Conclusion: VMDiff effectively addresses coexistent and bias generation challenges in image fusion, providing a robust framework for creating coherent novel images from multiple visual sources.

Abstract: Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose Visual Mixing Diffusion (VMDiff), a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our approach comprises: (1) a hybrid sampling process that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an efficient adaptive adjustment module, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.

[477] FlowLUT: Efficient Image Enhancement via Differentiable LUTs and Iterative Flow Matching

Liubing Hu, Chen Wu, Anrui Wang, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng

Main category: cs.CV

TL;DR: FlowLUT is a novel image enhancement model that combines the efficiency of 3D LUTs with flow matching to achieve real-time processing while overcoming the representational limitations of traditional LUTs.

Details

Motivation: Address the fundamental trade-off between computational efficiency and representational capacity in deep learning-based image enhancement, where conventional 3D LUTs are fast but lack flexibility and depend on fixed priors.

Method: Uses a collection of differentiable 3D LUTs with different priors, a lightweight content-aware network to predict fusion weights for scene-adaptive color correction, and an iterative flow matching method to restore local structural details and eliminate artifacts.

Result: Extensive experiments demonstrate effectiveness on three benchmarks, achieving scene-adaptive color correction with O(1) complexity while maintaining perceptual and structural fidelity.

Conclusion: FlowLUT successfully integrates the efficiency of LUTs with multiple priors and flow matching, providing an end-to-end solution that balances computational efficiency with enhanced representational capacity for image enhancement.

Abstract: Deep learning-based image enhancement methods face a fundamental trade-off between computational efficiency and representational capacity. For example, although a conventional three-dimensional Look-Up Table (3D LUT) can process a degraded image in real time, it lacks representational flexibility and depends solely on a fixed prior. To address this problem, we introduce FlowLUT, a novel end-to-end model that integrates the efficiency of LUTs, multiple priors, and the parameter-independent characteristic of flow-matched reconstructed images. Specifically, firstly, the input image is transformed in color space by a collection of differentiable 3D LUTs (containing a large number of 3D LUTs with different priors). Subsequently, a lightweight content-aware dynamically predicts fusion weights, enabling scene-adaptive color correction with $\mathcal{O}(1)$ complexity. Next, a lightweight fusion prediction network runs on multiple 3D LUTs, with $\mathcal{O}(1)$ complexity for scene-adaptive color correction.Furthermore, to address the inherent representation limitations of LUTs, we design an innovative iterative flow matching method to restore local structural details and eliminate artifacts. Finally, the entire model is jointly optimized under a composite loss function enforcing perceptual and structural fidelity. Extensive experimental results demonstrate the effectiveness of our method on three benchmarks.

[478] InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects

Xinhao Cai, Minghang Zheng, Xin Jin, Yang Liu

Main category: cs.CV

TL;DR: Proposes text-controlled human-object interaction generation in 3D scenes with movable objects, introduces InteractMove dataset, and presents a pipeline for accurate object identification, affordance learning, and collision-free motion generation.

Details

Motivation: Existing human-scene interaction datasets have limited interaction categories and only consider static objects, while collecting movable object interaction data is difficult and costly.

Method: Uses 3D visual grounding for object identification, hand-object joint affordance learning for contact prediction, and optimization with local-scene modeling and collision avoidance constraints.

Result: Method generates physically plausible, text-compliant interactions and outperforms existing approaches in comprehensive experiments.

Conclusion: The proposed approach effectively addresses the challenging task of text-controlled human-object interaction with movable objects in 3D scenes.

Abstract: We propose a novel task of text-controlled human object interaction generation in 3D scenes with movable objects. Existing human-scene interaction datasets suffer from insufficient interaction categories and typically only consider interactions with static objects (do not change object positions), and the collection of such datasets with movable objects is difficult and costly. To address this problem, we construct the InteractMove dataset for Movable Human-Object Interaction in 3D Scenes by aligning existing human object interaction data with scene contexts, featuring three key characteristics: 1) scenes containing multiple movable objects with text-controlled interaction specifications (including same-category distractors requiring spatial and 3D scene context understanding), 2) diverse object types and sizes with varied interaction patterns (one-hand, two-hand, etc.), and 3) physically plausible object manipulation trajectories. With the introduction of various movable objects, this task becomes more challenging, as the model needs to identify objects to be interacted with accurately, learn to interact with objects of different sizes and categories, and avoid collisions between movable objects and the scene. To tackle such challenges, we propose a novel pipeline solution. We first use 3D visual grounding models to identify the interaction object. Then, we propose a hand-object joint affordance learning to predict contact regions for different hand joints and object parts, enabling accurate grasping and manipulation of diverse objects. Finally, we optimize interactions with local-scene modeling and collision avoidance constraints, ensuring physically plausible motions and avoiding collisions between objects and the scene. Comprehensive experiments demonstrate our method’s superiority in generating physically plausible, text-compliant interactions compared to existing approaches.

[479] BioVessel-Net and RetinaMix: Unsupervised Retinal Vessel Segmentation from OCTA Images

Cheng Huang, Weizheng Xie, Fan Gao, Yutong Liu, Ruoling Wu, Zeyu Han, Jingxi Qiu, Xiangxiang Wang, Zhenglin Yang, Hao Wang, Yongbin Yu

Main category: cs.CV

TL;DR: BioVessel-Net is an unsupervised generative framework for retinal vessel segmentation that integrates vessel biostatistics with adversarial refinement, achieving near-perfect accuracy without manual annotations or high-performance computing.

Details

Motivation: Current vessel segmentation approaches rely on supervised learning and extensive manual annotations, which are costly, error-prone, and difficult to obtain in optical coherence tomography angiography.

Method: BioVessel-Net integrates vessel biostatistics with adversarial refinement and a radius-guided segmentation strategy, directly modeling vascular structures with biostatistical coherence. The authors also introduce RetinaMix, a new benchmark dataset of 2D and 3D OCTA images.

Result: BioVessel-Net achieves near-perfect segmentation accuracy across RetinaMix and existing datasets, substantially outperforming state-of-the-art supervised and semi-supervised methods.

Conclusion: BioVessel-Net and RetinaMix provide a label-free, computationally efficient, and clinically interpretable solution for retinal vessel analysis, with broad potential for glaucoma monitoring, blood flow modeling, and progression prediction.

Abstract: Structural changes in retinal blood vessels are critical biomarkers for the onset and progression of glaucoma and other ocular diseases. However, current vessel segmentation approaches largely rely on supervised learning and extensive manual annotations, which are costly, error-prone, and difficult to obtain in optical coherence tomography angiography. Here we present BioVessel-Net, an unsupervised generative framework that integrates vessel biostatistics with adversarial refinement and a radius-guided segmentation strategy. Unlike pixel-based methods, BioVessel-Net directly models vascular structures with biostatistical coherence, achieving accurate and explainable vessel extraction without labeled data or high-performance computing. To support training and evaluation, we introduce RetinaMix, a new benchmark dataset of 2D and 3D OCTA images with high-resolution vessel details from diverse populations. Experimental results demonstrate that BioVessel-Net achieves near-perfect segmentation accuracy across RetinaMix and existing datasets, substantially outperforming state-of-the-art supervised and semi-supervised methods. Together, BioVessel-Net and RetinaMix provide a label-free, computationally efficient, and clinically interpretable solution for retinal vessel analysis, with broad potential for glaucoma monitoring, blood flow modeling, and progression prediction. Code and dataset are available: https://github.com/VikiXie/SatMar8.

[480] DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

Wei Pan, Huiguo He, Hiuyi Cheng, Yilin Shi, Lianwen Jin

Main category: cs.CV

TL;DR: DiffInk is a latent diffusion Transformer framework for full-line handwriting generation that uses a sequential VAE with dual regularization losses and a diffusion Transformer to achieve superior glyph accuracy and style fidelity.

Details

Motivation: Existing text-to-online handwriting generation methods focus on character- or word-level generation, leading to inefficiency and lack of holistic structural modeling for full text lines.

Method: Proposes DiffInk with two components: (1) InkVAE - a sequential VAE with OCR-based loss for glyph accuracy and style-classification loss for style preservation, creating a disentangled latent space; (2) InkDiT - a latent diffusion Transformer that integrates text and style references to generate pen trajectories.

Result: Outperforms state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency.

Conclusion: DiffInk successfully addresses limitations of existing methods by enabling full-line handwriting generation with improved accuracy, style preservation, and efficiency.

Abstract: Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.

[481] RIV: Recursive Introspection Mask Diffusion Vision Language Model

YuQian Li, Limeng Qiao, Lin Ma

Main category: cs.CV

TL;DR: RIV introduces self-correction capability to Mask Diffusion-based Vision Language Models through Introspection Training and Recursive Inference, achieving state-of-the-art performance.

Details

Motivation: Current MDVLMs lack self-correction ability and cannot correct errors in generated tokens, limiting their reliability and accuracy.

Method: Two novel mechanisms: 1) Introspection Training - trains a model to identify errors including logical errors; 2) Recursive Inference - alternating unmask→introspection→remask process repeated until reliable results.

Result: Experimental results show RIV achieves state-of-the-art performance, outperforming most existing MDVLMs on multiple benchmarks.

Conclusion: The proposed RIV framework successfully equips MDVLMs with self-correction capability through recursive introspection, significantly improving model reliability and performance.

Abstract: Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models are unable to correct errors in generated tokens, meaning they lack self-correction capability. In this paper, we propose Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The first is Introspection Training, where an Introspection Model is introduced to identify errors within generated sequences. Introspection Training enables the model to detect not only grammatical and spelling mistakes, but more importantly, logical errors. The second is Recursive Inference. Beginning with the standard unmasking step, the learned Introspection Model helps to identify errors in the output sequence and remask them. This alternating ($\text{unmask}\rightarrow\text{introspection}\rightarrow\text{remask}$) process is repeated recursively until reliable results are obtained. Experimental results on multiple benchmarks demonstrate that the proposed RIV achieves state-of-the-art performance, outperforming most existing MDVLMs.

[482] Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models

Beomseok Kang, Niluthpol Chowdhury Mithun, Mikhail Sizintsev, Han-Pang Chiu, Supun Samarasekera

Main category: cs.CV

TL;DR: FAMDA is a multi-task unsupervised domain adaptation framework that uses Vision Foundation Models as teachers to generate pseudo-labels for target domains, enabling efficient domain adaptation for semantic segmentation and depth estimation tasks.

Details

Motivation: Multi-task dense prediction for robotics suffers from domain shift when deploying in new environments. Existing multi-task UDA methods rely on adversarial learning, which is less effective than recent self-training techniques.

Method: Leverages Vision Foundation Models (VFMs) as teachers in a self-training paradigm to generate high-quality pseudo-labels for target domains, distilling their generalization capabilities into a single efficient student network.

Result: Achieves state-of-the-art performance on synthetic-to-real UDA benchmarks and a challenging day-to-night adaptation task. Lightweight variant is 10x smaller than foundation models while maintaining SOTA accuracy.

Conclusion: FAMDA enables training of highly efficient, domain-adaptive models suitable for resource-constrained robotics applications by effectively bridging the gap between foundation model capabilities and practical deployment needs.

Abstract: Multi-task dense prediction, which aims to jointly solve tasks like semantic segmentation and depth estimation, is crucial for robotics applications but suffers from domain shift when deploying models in new environments. While unsupervised domain adaptation (UDA) addresses this challenge for single tasks, existing multi-task UDA methods primarily rely on adversarial learning approaches that are less effective than recent self-training techniques. In this paper, we introduce FAMDA, a simple yet effective UDA framework that bridges this gap by leveraging Vision Foundation Models (VFMs) as powerful teachers. Our approach integrates Segmentation and Depth foundation models into a self-training paradigm to generate high-quality pseudo-labels for the target domain, effectively distilling their robust generalization capabilities into a single, efficient student network. Extensive experiments show that FAMDA achieves state-of-the-art (SOTA) performance on standard synthetic-to-real UDA multi-task learning (MTL) benchmarks and a challenging new day-to-night adaptation task. Our framework enables the training of highly efficient models; a lightweight variant achieves SOTA accuracy while being more than 10$\times$ smaller than foundation models, highlighting FAMDA’s suitability for creating domain-adaptive and efficient models for resource-constrained robotics applications.

[483] MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing

Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, Shiguang Shan

Main category: cs.CV

TL;DR: MotionVerse is a unified framework using LLMs for human motion understanding, generation, and editing in single/multi-person scenarios, featuring motion tokenization with residual quantization, delay parallel modeling, and dual-tower architecture.

Details

Motivation: To create a comprehensive framework that leverages LLMs for diverse motion-related tasks while addressing challenges in motion representation, computational efficiency, and modality interference between motion and language.

Method: Uses motion tokenizer with residual quantization to convert motion sequences into multi-stream discrete tokens, implements Delay Parallel Modeling for staggered encoding of token streams, and employs dual-tower architecture with modality-specific parameters.

Result: The framework achieves superior performance across various motion-relevant tasks, with ablation studies confirming the effectiveness of each proposed component.

Conclusion: MotionVerse successfully demonstrates that LLMs can be effectively adapted for comprehensive motion understanding and generation through careful architectural design addressing motion representation and modality integration challenges.

Abstract: This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a \textit{Delay Parallel} Modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a \textit{dual-tower architecture} with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.

[484] LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Kangli Zi, Qingming Huang

Main category: cs.CV

TL;DR: LightFair is a lightweight method that improves fairness in text-to-image diffusion models by fine-tuning text embeddings with distance-constrained debiasing and two-stage sampling, achieving state-of-the-art results with minimal training and sampling overhead.

Details

Motivation: Existing methods for fair T2I DMs either require full-parameter training or use auxiliary networks, leading to heavy computational burden and unsatisfactory performance. The text encoder is identified as a key source of bias that can be addressed more efficiently.

Method: Proposes collaborative distance-constrained debiasing to balance embedding distances in CLIP space without auxiliary references, combined with a two-stage text-guided sampling strategy to limit debiasing intervention and preserve generation quality.

Result: Achieves state-of-the-art debiasing performance on Stable Diffusion v1.5 with only 1/4 of the training burden compared to existing methods, and virtually no increase in sampling burden.

Conclusion: LightFair provides an effective and efficient solution for fair text-to-image generation by focusing on text encoder fine-tuning, demonstrating that significant fairness improvements can be achieved with minimal computational overhead.

Abstract: This paper explores a novel lightweight approach LightFair to achieve fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction. They incur heavy training or sampling burden and unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder’s neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To finetune the text embedding, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden. The code is available at https://github.com/boyuh/LightFair.

[485] EfficientMIL: Efficient Linear-Complexity MIL Method for WSI Classification

Chengying She, Ben Wang, Xinran Zhang, Dongjie Fan, Jialu Zhang, Chengwei Chen, Lizhuang Liu

Main category: cs.CV

TL;DR: EfficientMIL is a linear-complexity multiple instance learning approach for whole slide image classification that replaces quadratic self-attention with efficient sequence models (GRU, LSTM, Mamba) and includes an Adaptive Patch Selector, achieving state-of-the-art performance with significantly improved computational efficiency.

Details

Motivation: Current state-of-the-art MIL methods for whole slide image classification rely on attention mechanisms with quadratic complexity, requiring substantial computational resources when processing hundreds of thousands of patches, creating a computational bottleneck.

Method: Introduces EfficientMIL with Adaptive Patch Selector (APS) for patches selection, replacing quadratic-complexity self-attention in Transformer-based MIL methods with efficient linear-complexity sequence models including GRU, LSTM, and State Space Model Mamba.

Result: Achieved AUC of 0.976 and accuracy of 0.933 on TCGA-Lung dataset with EfficientMIL-Mamba, and AUC of 0.990 and accuracy of 0.975 on CAMELYON16 dataset with EfficientMIL-GRU, surpassing previous state-of-the-art methods. APS demonstrated superior patch selection effectiveness compared to conventional strategies.

Conclusion: EfficientMIL provides significant computational efficiency improvements while outperforming other MIL methods across multiple histopathology datasets, offering a practical solution to the computational bottleneck in whole slide image classification.

Abstract: Whole slide images (WSIs) classification represents a fundamental challenge in computational pathology, where multiple instance learning (MIL) has emerged as the dominant paradigm. Current state-of-the-art (SOTA) MIL methods rely on attention mechanisms, achieving good performance but requiring substantial computational resources due to quadratic complexity when processing hundreds of thousands of patches. To address this computational bottleneck, we introduce EfficientMIL, a novel linear-complexity MIL approach for WSIs classification with the patches selection module Adaptive Patch Selector (APS) that we designed, replacing the quadratic-complexity self-attention mechanisms in Transformer-based MIL methods with efficient sequence models including RNN-based GRU, LSTM, and State Space Model (SSM) Mamba. EfficientMIL achieves significant computational efficiency improvements while outperforming other MIL methods across multiple histopathology datasets. On TCGA-Lung dataset, EfficientMIL-Mamba achieved AUC of 0.976 and accuracy of 0.933, while on CAMELYON16 dataset, EfficientMIL-GRU achieved AUC of 0.990 and accuracy of 0.975, surpassing previous state-of-the-art methods. Extensive experiments demonstrate that APS is also more effective for patches selection than conventional selection strategies.

[486] From Static to Dynamic: a Survey of Topology-Aware Perception in Autonomous Driving

Yixiao Chen, Ruining Yang, Xin Chen, Jia He, Dongliang Xu, Yue Yao

Main category: cs.CV

TL;DR: Survey of topology-aware perception in autonomous driving, covering four key research directions that enable dynamic, sensor-driven map construction and understanding instead of relying on static pre-built maps.

Details

Motivation: Traditional static maps are costly, hard to update in real-time, and lack generalization across regions, limiting scalability for autonomous driving systems.

Method: Systematic review of four core research directions: vectorized map construction, topological structure modeling, prior knowledge fusion, and language model-based perception.

Result: Identified a unifying trend of paradigm shift from static pre-built maps to dynamic, sensor-driven perception that enables real-time map construction and topology reasoning.

Conclusion: These research directions collectively enable more adaptive, scalable, and explainable autonomous driving systems through compact spatial modeling, semantic relational reasoning, robust domain knowledge integration, and multimodal scene understanding.

Abstract: The key to achieving autonomous driving lies in topology-aware perception, the structured understanding of the driving environment with an emphasis on lane topology and road semantics. This survey systematically reviews four core research directions under this theme: vectorized map construction, topological structure modeling, prior knowledge fusion, and language model-based perception. Across these directions, we observe a unifying trend: a paradigm shift from static, pre-built maps to dynamic, sensor-driven perception. Specifically, traditional static maps have provided semantic context for autonomous systems. However, they are costly to construct, difficult to update in real time, and lack generalization across regions, limiting their scalability. In contrast, dynamic representations leverage on-board sensor data for real-time map construction and topology reasoning. Each of the four research directions contributes to this shift through compact spatial modeling, semantic relational reasoning, robust domain knowledge integration, and multimodal scene understanding powered by pre-trained language models. Together, they pave the way for more adaptive, scalable, and explainable autonomous driving systems.

[487] Griffin: Generative Reference and Layout Guided Image Composition

Aryan Mikaeili, Amirhossein Alimohammadi, Negar Hassanpour, Ali Mahdavi-Amiri, Andrea Tagliasacchi

Main category: cs.CV

TL;DR: A training-free method for multi-image layout control that uses reference images instead of text to specify content and placement, providing explicit control for object and part-level composition.

Details

Motivation: Text-to-image models have realistic generation but text-based control is limiting when explicit guidance is needed for precise content placement within images.

Method: Training-free approach that uses single images as references to specify content and guides the model on where to place each element, enabling multi-image layout control without text.

Result: Demonstrated effectiveness across various image composition tasks with explicit and simple control for object and part-level composition.

Conclusion: The proposed method successfully addresses the challenge of multi-image layout control by using image-based references rather than text, providing finer control over content placement.

Abstract: Text-to-image models have achieved a level of realism that enables the generation of highly convincing images. However, text-based control can be a limiting factor when more explicit guidance is needed. Defining both the content and its precise placement within an image is crucial for achieving finer control. In this work, we address the challenge of multi-image layout control, where the desired content is specified through images rather than text, and the model is guided on where to place each element. Our approach is training-free, requires a single image per reference, and provides explicit and simple control for object and part-level composition. We demonstrate its effectiveness across various image composition tasks.

[488] Sparse-Up: Learnable Sparse Upsampling for 3D Generation with High-Fidelity Textures

Lu Xiao, Jiale Zhang, Yang Liu, Taicheng Huang, Xin Tian

Main category: cs.CV

TL;DR: Sparse-Up is a memory-efficient texture modeling framework that preserves high-frequency details by using sparse voxels with surface anchoring and view-domain partitioning to overcome resolution constraints.

Details

Motivation: Existing 3D asset creation methods face a trade-off between cross-view consistency and high-frequency detail preservation, often resulting in torn textures or resolution limitations from explicit voxels.

Method: Uses sparse voxels for texture reconstruction with two key strategies: surface anchoring (learnable upsampling to constrain voxels to mesh surface) and view-domain partitioning (image patch-guided voxel partitioning with local gradient supervision).

Result: Reduces redundant voxels by over 70% compared to traditional voxel upsampling, significantly lowers memory consumption during high-resolution training while maintaining geometric consistency and preserving high-frequency texture details.

Conclusion: Sparse-Up provides an effective solution for high-fidelity 3D texture modeling that breaks through resolution constraints while maintaining memory efficiency and preserving fine texture details.

Abstract: The creation of high-fidelity 3D assets is often hindered by a ‘pixel-level pain point’: the loss of high-frequency details. Existing methods often trade off one aspect for another: either sacrificing cross-view consistency, resulting in torn or drifting textures, or remaining trapped by the resolution ceiling of explicit voxels, forfeiting fine texture detail. In this work, we propose Sparse-Up, a memory-efficient, high-fidelity texture modeling framework that effectively preserves high-frequency details. We use sparse voxels to guide texture reconstruction and ensure multi-view consistency, while leveraging surface anchoring and view-domain partitioning to break through resolution constraints. Surface anchoring employs a learnable upsampling strategy to constrain voxels to the mesh surface, eliminating over 70% of redundant voxels present in traditional voxel upsampling. View-domain partitioning introduces an image patch-guided voxel partitioning scheme, supervising and back-propagating gradients only on visible local patches. Through these two strategies, we can significantly reduce memory consumption during high-resolution voxel training without sacrificing geometric consistency, while preserving high-frequency details in textures.

[489] Color-Pair Guided Robust Zero-Shot 6D Pose Estimation and Tracking of Cluttered Objects on Edge Devices

Xingjian Yang, Ashis G. Banerjee

Main category: cs.CV

TL;DR: A unified framework for 6D pose estimation that combines robust initial pose estimation with fast motion-based tracking using shared lighting-invariant color-pair features, designed for efficient edge device execution.

Details

Motivation: Address the challenge of robust 6D pose estimation under difficult lighting conditions while balancing accuracy and real-time performance on edge devices.

Method: Uses shared lighting-invariant color-pair feature representation for both initial estimation (registration between RGB-D view and 3D mesh) and tracking (temporal correspondence validation with lightweight motion regression).

Result: Competitive pose estimation accuracy with high-fidelity tracking even through abrupt pose changes, as demonstrated in extensive benchmark experiments.

Conclusion: The integrated approach provides effective and robust 6D pose estimation suitable for edge devices, maintaining accuracy while enabling real-time tracking performance.

Abstract: Robust 6D pose estimation of novel objects under challenging illumination remains a significant challenge, often requiring a trade-off between accurate initial pose estimation and efficient real-time tracking. We present a unified framework explicitly designed for efficient execution on edge devices, which synergizes a robust initial estimation module with a fast motion-based tracker. The key to our approach is a shared, lighting-invariant color-pair feature representation that forms a consistent foundation for both stages. For initial estimation, this feature facilitates robust registration between the live RGB-D view and the object’s 3D mesh. For tracking, the same feature logic validates temporal correspondences, enabling a lightweight model to reliably regress the object’s motion. Extensive experiments on benchmark datasets demonstrate that our integrated approach is both effective and robust, providing competitive pose estimation accuracy while maintaining high-fidelity tracking even through abrupt pose changes.

[490] ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, Bo Zheng

Main category: cs.CV

TL;DR: ReWatch introduces a large-scale dataset and framework for advanced video reasoning, addressing the data bottleneck in RLVR for complex video tasks through multi-stage synthesis and a novel reward mechanism.

Details

Motivation: Current RLVR methods are underdeveloped for complex video reasoning due to lack of challenging multi-hop questions and high-quality video-grounded Chain-of-Thought data.

Method: Multi-stage synthesis pipeline creating ReWatch dataset components, Multi-Agent ReAct framework for CoT synthesis simulating human re-watching, and RLVR framework with Observation & Reasoning reward mechanism.

Result: ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks.

Conclusion: The proposed ReWatch dataset and framework successfully advance video reasoning capabilities in LVLMs by addressing critical data limitations and incorporating effective reward mechanisms.

Abstract: While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like “re-watching” process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation & Reasoning (O&R) reward mechanism that evaluates both the final answer’s correctness and the reasoning’s alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks.

[491] LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng

Main category: cs.CV

TL;DR: LLaVA-OneVision-1.5 is a new family of Large Multimodal Models that achieves state-of-the-art performance with significantly reduced computational and financial costs through an open, efficient framework built entirely from scratch.

Details

Motivation: To provide an open, efficient, and reproducible framework for building high-quality vision-language models from scratch while reducing computational and financial costs compared to existing approaches.

Method: Developed three components: (1) Large-scale curated datasets (85M pretraining and 26M instruction datasets), (2) Efficient training framework using offline parallel data packing strategy, (3) Complete end-to-end training within $16,000 budget.

Result: LLaVA-OneVision-1.5 achieves state-of-the-art performance across multiple benchmarks. The 8B model outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and the 4B model surpasses Qwen2.5-VL-3B on all 27 benchmarks.

Conclusion: LLaVA-OneVision-1.5 demonstrates that high-performance multimodal models can be built efficiently and cost-effectively from scratch, with plans to release LLaVA-OneVision-1.5-RL soon.

Abstract: We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 26M instruction dataset LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.

[492] HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score

Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Peter A. Beerel

Main category: cs.CV

TL;DR: HIVTP is a training-free hierarchical visual token pruning method that improves Vision-Language Models efficiency by using middle-layer attention maps to identify and retain important visual tokens, reducing inference time by up to 55.1% without accuracy loss.

Details

Motivation: Vision-Language Models suffer from inefficient inference due to the large number of visual tokens output by vision encoders, many of which are unimportant and can be safely pruned to improve efficiency.

Method: Uses attention maps from middle layers of vision encoder to estimate token importance, then applies hierarchical pruning: global stage divides image into regions to retain high-importance tokens, local stage divides into windows to retain most important token per window.

Result: Reduces time-to-first-token by up to 55.1% for LLaVA models, improves token generation throughput by up to 60.9%, maintains accuracy and even improves on some benchmarks compared to prior works.

Conclusion: HIVTP achieves superior accuracy and higher inference efficiency than previous methods, demonstrating effective training-free visual token pruning for VLMs.

Abstract: Vision-Language Models (VLMs) have shown strong capabilities on diverse multimodal tasks. However, the large number of visual tokens output by the vision encoder severely hinders inference efficiency, and prior studies have shown that many of these tokens are not important and can therefore be safely pruned. In this work, we propose HIVTP, a training-free method to improve VLMs efficiency via hierarchical visual token pruning using a novel middle-layer-based importance score. Specifically, we utilize attention maps extracted from the middle layers of the vision encoder, which better reflect fine-grained and object-level attention, to estimate visual token importance. Based on this, we propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Specifically, we reshape the 1-D visual token sequence output by the vision encoder into a 2-D spatial layout. In the global retaining stage, we divide the image into regions and retain tokens with higher importance scores in each region; in the local retaining stage, we then divide the image into small windows and retain the most important token in each local window. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improve the token generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy, and even achieving improvements on certain benchmarks. Compared with prior works, HIVTP achieves better accuracy while offering higher inference efficiency.

[493] Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding

Xixi Jiang, Chen Yang, Dong Zhang, Pingcheng Dong, Xin Yang, Kwang-Ting Cheng

Main category: cs.CV

TL;DR: STIM-TM is a training-free token merging method that reduces computational costs in surgical video understanding by independently merging redundant tokens along temporal and spatial dimensions while preserving critical surgical information.

Details

Motivation: Current Vision Transformer methods for surgical video understanding suffer from high computational costs due to processing massive spatiotemporal tokens, and existing token merging approaches fail to adequately consider video structure and heterogeneous information distribution.

Method: STIM-TM uses a decoupled strategy: temporal component merges spatially corresponding tokens from consecutive frames using saliency weighting, while spatial component prioritizes merging static tokens through temporal stability analysis to protect dynamic regions.

Result: Achieves over 65% GFLOPs reduction while maintaining competitive accuracy across surgical video tasks, and enables efficient training of long-sequence surgical videos.

Conclusion: STIM-TM effectively addresses computational bottlenecks in surgical video applications through spatiotemporal-aware token merging, providing significant efficiency gains without sacrificing performance.

Abstract: Vision Transformer models have shown impressive effectiveness in the surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs due to processing massive spatiotemporal tokens across video frames. While prior work on token merging has advanced model efficiency, they fail to adequately consider the inherent spatiotemporal structure of video data and overlook the heterogeneous nature of information distribution, leading to suboptimal performance. In this paper, we propose a spatiotemporal information mining token merging (STIM-TM) method, representing the first dedicated approach for surgical video understanding. STIM-TM introduces a decoupled strategy that reduces token redundancy along temporal and spatial dimensions independently. Specifically, the temporal component merges spatially corresponding tokens from consecutive frames using saliency weighting, preserving critical sequential information and maintaining continuity. Meanwhile, the spatial component prioritizes merging static tokens through temporal stability analysis, protecting dynamic regions containing essential surgical information. Operating in a training-free manner, STIM-TM achieves significant efficiency gains with over $65%$ GFLOPs reduction while preserving competitive accuracy across comprehensive surgical video tasks. Our method also supports efficient training of long-sequence surgical videos, addressing computational bottlenecks in surgical applications.

Dayu Tan, Ziwei Zhang, Yansan Su, Xin Peng, Yike Dai, Chunhou Zheng, Weimin Zhong

Main category: cs.CV

TL;DR: MSD-KMamba is a 3D multi-modal image segmentation framework that uses bidirectional spatial perception and multi-scale self-distillation to achieve high accuracy while avoiding the quadratic computational complexity of global attention mechanisms.

Details

Motivation: Existing CNN-Transformer hybrid models suffer from high computational complexity due to global attention mechanisms, making it challenging to balance performance with efficiency for complex segmentation tasks.

Method: The framework integrates bidirectional spatial perception to capture long-range dependencies and a multi-scale self-distilled fusion strategy to enhance hierarchical feature representations across different resolution levels.

Result: Extensive experiments show MSD-KMamba outperforms state-of-the-art methods in segmentation accuracy, robustness, and generalization while maintaining high computational efficiency and scalability.

Conclusion: MSD-KMamba effectively addresses the computational complexity bottleneck in volumetric segmentation while providing superior global perception capabilities compared to existing approaches.

Abstract: Numerous CNN-Transformer hybrid models rely on high-complexity global attention mechanisms to capture long-range dependencies, which introduces non-linear computational complexity and leads to significant resource consumption. Although knowledge distillation and sparse attention mechanisms can improve efficiency, they often fall short of delivering the high segmentation accuracy necessary for complex tasks. Balancing model performance with computational efficiency remains a critical challenge. In this work, we propose a novel 3D multi-modal image segmentation framework, termed MSD-KMamba, which integrates bidirectional spatial perception with multi-scale self-distillation. The bidirectional spatial aware branch effectively captures long-range spatial context dependencies across brain regions, while also incorporating a powerful nonlinear feature extraction mechanism that further enhances the model’s ability to learn complex and heterogeneous patterns. In addition, the proposed multi-scale self-distilled fusion strategy strengthens hierarchical feature representations and improves the transfer of semantic information at different resolution levels. By jointly leveraging the bidirectional spatial perception branch and the multi-scale self-distilled fusion strategy, our framework effectively mitigates the bottleneck of quadratic computational complexity in volumetric segmentation, while simultaneously addressing the limitation of insufficient global perception. Extensive experiments on multiple standard benchmark datasets demonstrate that MSD-KMamba consistently outperforms state-of-the-art methods in segmentation accuracy, robustness, and generalization, while maintaining high computational efficiency and favorable scalability. The source code of MSD-KMamba is publicly available at https://github.com/daimao-zhang/MSD-KMamba.

[495] QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu

Main category: cs.CV

TL;DR: QuantSparse is a unified framework that combines model quantization and attention sparsification to efficiently compress diffusion transformers for video generation, achieving significant performance improvements and computational gains.

Details

Motivation: Diffusion transformers have excellent video generation capabilities but suffer from prohibitive computational and memory costs, limiting practical deployment. While quantization and attention sparsification individually offer compression benefits, they each cause severe performance degradation under aggressive compression.

Method: QuantSparse integrates model quantization with attention sparsification using two key techniques: Multi-Scale Salient Attention Distillation (providing global structural guidance and local salient supervision to mitigate quantization bias) and Second-Order Sparse Attention Reparameterization (exploiting temporal stability of second-order residuals to recover information lost under sparsity).

Result: On HunyuanVideo-13B, QuantSparse achieves 20.88 PSNR, substantially outperforming state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while delivering 3.68× storage reduction and 1.88× acceleration in end-to-end inference.

Conclusion: QuantSparse successfully addresses the challenges of combining quantization and sparsification for diffusion transformers, achieving superior performance with significant efficiency gains, making practical deployment of video generation models more feasible.

Abstract: Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose \textbf{QuantSparse}, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce \textit{Multi-Scale Salient Attention Distillation}, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop \textit{Second-Order Sparse Attention Reparameterization}, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a \textbf{3.68$\times$} reduction in storage and \textbf{1.88$\times$} acceleration in end-to-end inference. Our code will be released in https://github.com/wlfeng0509/QuantSparse.

[496] HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection

Siyuan Gao, Jiashu Yao, Haoyu Wen, Yuhang Guo, Zeming Liu, Heyan Huang

Main category: cs.CV

TL;DR: HomeSafeBench is a new benchmark for evaluating embodied agents’ home safety inspection capabilities using dynamic first-person perspective images from simulated environments, addressing limitations of existing benchmarks that use static viewpoints and textual descriptions.

Details

Motivation: Existing benchmarks for home safety inspection oversimplify tasks by using textual descriptions instead of visual information and restrict agents to single static viewpoints, which hinders accurate evaluation of Vision-Language Models and causes omission of occluded hazards.

Method: Proposed HomeSafeBench with 12,900 data points covering five common home safety hazards (fire, electric shock, falling object, trips, child safety), providing dynamic first-person perspective images from simulated home environments and allowing free exploration of rooms.

Result: Evaluation of mainstream VLMs on HomeSafeBench shows poor performance, with the best model achieving only 10.23% F1-score, indicating significant limitations in identifying safety hazards and selecting effective exploration strategies.

Conclusion: HomeSafeBench reveals substantial gaps in current VLM capabilities for home safety inspection and provides a valuable benchmark for future research, with the dataset and code to be publicly available.

Abstract: Embodied agents can identify and report safety hazards in the home environments. Accurately evaluating their capabilities in home safety inspection tasks is curcial, but existing benchmarks suffer from two key limitations. First, they oversimplify safety inspection tasks by using textual descriptions of the environment instead of direct visual information, which hinders the accurate evaluation of embodied agents based on Vision-Language Models (VLMs). Second, they use a single, static viewpoint for environmental observation, which restricts the agents’ free exploration and cause the omission of certain safety hazards, especially those that are occluded from a fixed viewpoint. To alleviate these issues, we propose HomeSafeBench, a benchmark with 12,900 data points covering five common home safety hazards: fire, electric shock, falling object, trips, and child safety. HomeSafeBench provides dynamic first-person perspective images from simulated home environments, enabling the evaluation of VLM capabilities for home safety inspection. By allowing the embodied agents to freely explore the room, HomeSafeBench provides multiple dynamic perspectives in complex environments for a more thorough inspection. Our comprehensive evaluation of mainstream VLMs on HomeSafeBench reveals that even the best-performing model achieves an F1-score of only 10.23%, demonstrating significant limitations in current VLMs. The models particularly struggle with identifying safety hazards and selecting effective exploration strategies. We hope HomeSafeBench will provide valuable reference and support for future research related to home security inspections. Our dataset and code will be publicly available soon.

[497] Confidence Aware SSD Ensemble with Weighted Boxes Fusion for Weapon Detection

Atharva Jadhav, Arush Karekar, Manas Divekar, Shachi Natu

Main category: cs.CV

TL;DR: Ensemble of SSD models with diverse backbones using Weighted Boxes Fusion improves weapon detection robustness in surveillance systems.

Details

Motivation: Public safety requires robust weapon detection systems that can handle challenges like occlusion, varying lighting, and cluttered backgrounds where single models often lack robustness.

Method: Multiple SSD models with different backbone networks (VGG16, ResNet50, EfficientNet, MobileNetV3) were trained on weapon detection and combined using Weighted Boxes Fusion with max confidence scoring.

Result: The ensemble achieved mAP of 0.838, representing 2.948% improvement over the best single model, and consistently outperformed other fusion methods.

Conclusion: Confidence-aware fusion is critical for ensemble performance, and this approach provides a robust solution for real-time weapon detection in surveillance applications.

Abstract: The safety and security of public spaces is of vital importance, driving the need for sophisticated surveillance systems capable of accurately detecting weapons, which are often hampered by issues like partial occlusion, varying lighting, and cluttered backgrounds. While single-model detectors are advanced, they often lack robustness in these challenging conditions. This paper presents the hypothesis that ensemble of Single Shot Multibox Detector (SSD) models with diverse feature extraction backbones can significantly enhance detection robustness. To leverage diverse feature representations, individual SSD models were trained using a selection of backbone networks: VGG16, ResNet50, EfficientNet, and MobileNetV3. The study is conducted on a dataset consisting of images of three distinct weapon classes: guns, heavy weapons and knives. The predictions from these models are combined using the Weighted Boxes Fusion (WBF) method, an ensemble technique designed to optimize bounding box accuracy. Our key finding is that the fusion strategy is as critical as the ensemble’s diversity, a WBF approach using a ‘max’ confidence scoring strategy achieved a mean Average Precision (mAP) of 0.838. This represents a 2.948% relative improvement over the best-performing single model and consistently outperforms other fusion heuristics. This research offers a robust approach to enhancing real-time weapon detection capabilities in surveillance applications by demonstrating that confidence-aware fusion is a key mechanism for improving accuracy metrics of ensembles.

[498] INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

Yunjiang Xu, Lingzhi Li, Jin Wang, Yupeng Ouyang, Benyuan Yang

Main category: cs.CV

TL;DR: INSTINCT is a collaborative perception framework that improves multi-agent LiDAR detection accuracy while significantly reducing communication bandwidth requirements through instance-level interactions.

Details

Motivation: Collaborative perception systems face bandwidth constraints due to frequent interactions and real-time requirements. While query-based instance-level interaction reduces bandwidth demands, LiDAR-focused implementations remain underdeveloped and trail state-of-the-art approaches.

Method: INSTINCT features three core components: 1) quality-aware filtering for high-quality instance feature selection, 2) dual-branch detection routing to separate collaboration-irrelevant and collaboration-relevant instances, and 3) Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Enhanced ground truth sampling facilitates training with diverse hybrid instance features.

Result: Extensive experiments show INSTINCT achieves superior performance with 13.23%/33.08% improvement in accuracy on DAIR-V2X and V2V4Real datasets while reducing communication bandwidth to 1/281 and 1/264 compared to state-of-the-art methods.

Conclusion: INSTINCT successfully bridges the gap in LiDAR-focused collaborative perception by providing an efficient instance-level interaction architecture that significantly improves detection accuracy while dramatically reducing bandwidth requirements.

Abstract: Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous works proves that query-based instance-level interaction reduces bandwidth demands and manual priors, however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (INSTance-level INteraCtion ArchiTecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method achieves an improvement in accuracy 13.23%/33.08% in DAIR-V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 compared to state-of-the-art methods. The code is available at https://github.com/CrazyShout/INSTINCT.

[499] CrimEdit: Controllable Editing for Counterfactual Object Removal, Insertion, and Movement

Boseong Jeon, Junghyuk Lee, Jimin Park, Kwanyoung Kim, Jingi Jung, Sangwon Lee, Hyunbo Shim

Main category: cs.CV

TL;DR: CrimEdit is a unified diffusion model that jointly trains removal and insertion task embeddings, enabling efficient object editing with controllable effect handling through classifier-free guidance.

Details

Motivation: To address the unexplored impact of classifier-free guidance on handling object effects in unified removal/insertion models and improve efficiency in composite editing tasks.

Method: Jointly trains task embeddings for removal and insertion in a single model, leverages them in classifier-free guidance scheme, and extends task prompts to spatially distinct regions for object movement in a single denoising step.

Result: Achieves superior object removal, controllable effect insertion, and efficient object movement without requiring additional training or separate removal/insertion stages.

Conclusion: CrimEdit demonstrates that unified training with classifier-free guidance enables efficient and high-quality object editing with controllable effect handling across removal, insertion, and movement tasks.

Abstract: Recent works on object removal and insertion have enhanced their performance by handling object effects such as shadows and reflections, using diffusion models trained on counterfactual datasets. However, the performance impact of applying classifier-free guidance to handle object effects across removal and insertion tasks within a unified model remains largely unexplored. To address this gap and improve efficiency in composite editing, we propose CrimEdit, which jointly trains the task embeddings for removal and insertion within a single model and leverages them in a classifier-free guidance scheme – enhancing the removal of both objects and their effects, and enabling controllable synthesis of object effects during insertion. CrimEdit also extends these two task prompts to be applied to spatially distinct regions, enabling object movement (repositioning) within a single denoising step. By employing both guidance techniques, extensive experiments show that CrimEdit achieves superior object removal, controllable effect insertion, and efficient object movement without requiring additional training or separate removal and insertion stages.

[500] PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson’s Disease

Shuai Shao, Shu Jiang, Shiyuan Zhao, Di Yang, Yan Wang, Yutong Bai, Jianguo Zhang, Jiangtao Wang

Main category: cs.CV

TL;DR: PD-Diag-Net is an automated Parkinson’s disease diagnostic method that uses MRI scans with clinical priors for brain region relevance and aging patterns, achieving 86% accuracy on external tests.

Details

Motivation: Current Parkinson's disease diagnosis is complex, relies heavily on neurologist expertise, and causes delays in early detection, missing timely intervention opportunities.

Method: End-to-end framework with MRI preprocessing, two clinical priors (brain region relevance and aging patterns), and dedicated modules for feature aggregation and diagnosis using brain age gaps as constraints.

Result: Achieved 86% accuracy on external hospital tests and over 96% accuracy in early-stage diagnosis, outperforming existing methods by more than 20%.

Conclusion: PD-Diag-Net provides an effective automated diagnostic solution for Parkinson’s disease with high accuracy and clinical interpretability.

Abstract: Parkinson’s disease (PD) is a common neurodegenerative disorder that severely diminishes patients’ quality of life. Its global prevalence has increased markedly in recent decades. Current diagnostic workflows are complex and heavily reliant on neurologists’ expertise, often resulting in delays in early detection and missed opportunities for timely intervention. To address these issues, we propose an end-to-end automated diagnostic method for PD, termed PD-Diag-Net, which performs risk assessment and auxiliary diagnosis directly from raw MRI scans. This framework first introduces an MRI Pre-processing Module (MRI-Processor) to mitigate inter-subject and inter-scanner variability by flexibly integrating established medical imaging preprocessing tools. It then incorporates two forms of clinical prior knowledge: (1) Brain-Region-Relevance-Prior (Relevance-Prior), which specifies brain regions strongly associated with PD; and (2) Brain-Region-Aging-Prior (Aging-Prior), which reflects the accelerated aging typically observed in PD-associated regions. Building on these priors, we design two dedicated modules: the Relevance-Prior Guided Feature Aggregation Module (Aggregator), which guides the model to focus on PD-associated regions at the inter-subject level, and the Age-Prior Guided Diagnosis Module (Diagnoser), which leverages brain age gaps as auxiliary constraints at the intra-subject level to enhance diagnostic accuracy and clinical interpretability. Furthermore, we collected external test data from our collaborating hospital. Experimental results show that PD-Diag-Net achieves 86% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing advanced methods by more than 20%.

[501] DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion

Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu, Xulei Yang, Weisi Lin

Main category: cs.CV

TL;DR: DiffPCN is a diffusion-based coarse-to-fine framework for point cloud completion that uses depth image projection and denoising to achieve state-of-the-art results.

Details

Motivation: Latent diffusion models have strong generative capabilities but remain underexplored for point cloud completion due to the unstructured and irregular nature of point clouds.

Method: Two-stage approach: 1) Project partial point clouds to depth images, use DepthLDM to generate completed multi-view depth images for coarse point clouds; 2) Point Denoising Network removes artifacts, Association-Aware Point Upsampler refines with local association features.

Result: Achieves state-of-the-art performance in geometric accuracy and shape completeness, significantly improving robustness and consistency of point cloud completion.

Conclusion: DiffPCN successfully adapts latent diffusion models to point cloud completion through a novel coarse-to-fine framework with depth image projection and refinement stages.

Abstract: Latent diffusion models (LDMs) have demonstrated remarkable generative capabilities across various low-level vision tasks. However, their potential for point cloud completion remains underexplored due to the unstructured and irregular nature of point clouds. In this work, we propose DiffPCN, a novel diffusion-based coarse-to-fine framework for point cloud completion. Our approach comprises two stages: an initial stage for generating coarse point clouds, and a refinement stage that improves their quality through point denoising and upsampling. Specifically, we first project the unordered and irregular partial point cloud into structured depth images, which serve as conditions for a well-designed DepthLDM to synthesize completed multi-view depth images that are used to form coarse point clouds. In this way, our DiffPCN can yield high-quality and high-completeness coarse point clouds by leveraging LDM’ s powerful generation and comprehension capabilities. Then, since LDMs inevitably introduce outliers into the generated depth maps, we design a Point Denoising Network to remove artifacts from the coarse point cloud by predicting a per-point distance score. Finally, we devise an Association-Aware Point Upsampler, which guides the upsampling process by leveraging local association features between the input point cloud and the corresponding coarse points, further yielding a dense and high-fidelity output. Experimental results demonstrate that our DiffPCN achieves state-of-the-art performance in geometric accuracy and shape completeness, significantly improving the robustness and consistency of point cloud completion.

[502] Video Panels for Long Video Understanding

Lars Doorenbos, Federico Spurio, Juergen Gall

Main category: cs.CV

TL;DR: A training-free visual prompting method that combines multiple video frames into single panels to improve long-video understanding in VLMs without additional parameters or fine-tuning.

Details

Motivation: Current Video-Language Models underperform on long-video tasks compared to image/short-video tasks, and existing approaches add complexity through novel modules and training. This work aims to maximize existing model performance rather than fine-tuning with limited data.

Method: Proposes a visual prompting strategy that trades spatial details for temporal resolution by combining multiple frames as panels into one image. The approach is training-free, parameter-free, model-agnostic, and seamlessly integrates with existing VLMs.

Result: Extensive experiments on five benchmarks show consistent improvements across various model architectures, sizes, and context windows. On TimeScope (Long) dataset with the longest videos, video QA accuracy improved by up to 19.4%.

Conclusion: The method effectively raises the bar for long video understanding models by providing a simple yet effective visual prompting strategy that enhances temporal resolution without requiring additional training or parameters.

Abstract: Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. % additional training time. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. We will make our code available upon acceptance.

[503] M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation

Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li, Li Lin, Yuwang Wang

Main category: cs.CV

TL;DR: M3DLayout is a large-scale, multi-source dataset for 3D indoor layout generation that integrates real-world scans, CAD designs, and procedurally generated scenes to address limitations in existing datasets.

Details

Motivation: Current 3D indoor layout generation models are constrained by limited scale, diversity, and annotation quality of existing datasets, which hinders learning complex spatial and semantic patterns.

Method: Created M3DLayout dataset with 15,080 layouts and 258k object instances from three sources (real scans, CAD designs, procedural scenes), each paired with detailed structured text descriptions. Established benchmark using text-conditioned diffusion model.

Result: The dataset provides a solid foundation for training layout generation models, with multi-source composition enhancing diversity. The Inf3DLayout subset enables generation of more complex and detailed scenes with rich small-object information.

Conclusion: M3DLayout serves as a valuable resource for advancing research in text-driven 3D scene synthesis by providing diverse and richly annotated data that enables learning complex spatial and semantic patterns.

Abstract: In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 15,080 layouts and over 258k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using a text-conditioned diffusion model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis.

[504] FastViDAR: Real-Time Omnidirectional Depth Estimation via Alternative Hierarchical Attention

Hangtian Zhao, Xiang Chen, Yizhe Li, Qianhao Wang, Haibo Lu, Fei Gao

Main category: cs.CV

TL;DR: FastViDAR is a framework that takes four fisheye camera inputs and generates a 360° depth map with per-camera depth, fusion depth, and confidence estimates using efficient cross-view feature fusion and ERP fusion.

Details

Motivation: To create an efficient system for generating 360° depth maps from multiple fisheye cameras that can run in real-time on embedded hardware while maintaining competitive performance.

Method: Uses Alternative Hierarchical Attention (AHA) for efficient cross-view feature fusion through separate intra-frame and inter-frame windowed self-attention, and proposes ERP fusion to project multi-view depth estimates to equirectangular coordinates.

Result: Achieves competitive zero-shot performance on real datasets and runs at up to 20 FPS on NVIDIA Orin NX embedded hardware.

Conclusion: FastViDAR provides an efficient solution for real-time 360° depth estimation from multiple fisheye cameras with reduced computational overhead.

Abstract: In this paper we propose FastViDAR, a novel framework that takes four fisheye camera inputs and produces a full $360^\circ$ depth map along with per-camera depth, fusion depth, and confidence estimates. Our main contributions are: (1) We introduce Alternative Hierarchical Attention (AHA) mechanism that efficiently fuses features across views through separate intra-frame and inter-frame windowed self-attention, achieving cross-view feature mixing with reduced overhead. (2) We propose a novel ERP fusion approach that projects multi-view depth estimates to a shared equirectangular coordinate system to obtain the final fusion depth. (3) We generate ERP image-depth pairs using HM3D and 2D3D-S datasets for comprehensive evaluation, demonstrating competitive zero-shot performance on real datasets while achieving up to 20 FPS on NVIDIA Orin NX embedded hardware. Project page: \href{https://3f7dfc.github.io/FastVidar/}{https://3f7dfc.github.io/FastVidar/}

[505] HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen

Main category: cs.CV

TL;DR: HieraTok is a multi-scale Vision Transformer tokenizer that uses multi-scale downsampling and scale-causal attention to improve image reconstruction and generation, achieving significant performance gains over single-scale approaches.

Details

Motivation: To overcome the limitation of single-scale representations in Vision Transformer tokenizers by enabling multi-scale modeling that captures both global semantic features and high-resolution structural details.

Method: Uses multi-scale downsampling on token maps and a scale-causal attention mechanism that progressively flows information from low-resolution to high-resolution features.

Result: 27.2% improvement in rFID (1.47 → 1.07), 1.38× faster convergence, 18.9% boost in gFID (16.4 → 13.3), and achieves state-of-the-art rFID of 0.45 and gFID of 1.82 among ViT tokenizers.

Conclusion: HieraTok demonstrates the effectiveness of multi-scale ViT tokenizers for visual generation tasks, advancing the field with its novel architecture and achieving superior performance metrics.

Abstract: In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer’s training, we demonstrate its potential by a sota rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce multi-scale ViT-based tokenizer in image reconstruction and image generation. We hope our findings and designs advance the ViT-based tokenizers in visual generation tasks.

[506] GRS-SLAM3R: Real-Time Dense SLAM with Gated Recurrent State

Guole Shen, Tianchen Deng, Yanbo Wang, Yongtao Chen, Yilin Shen, Jiuming Liu, Jingchuan Wang

Main category: cs.CV

TL;DR: GRS-SLAM3R is an end-to-end SLAM framework that improves dense scene reconstruction by incorporating spatial memory and global consistency, achieving superior accuracy with real-time performance.

Details

Motivation: Existing DUSt3R-based methods only use image pairs for pointmap estimation, overlooking spatial memory and global consistency, leading to limitations in dense visual SLAM.

Method: The framework uses sequential input, a transformer-based gated update module for spatial memory, and submap partitioning with local alignment and global registration to maintain consistency.

Result: Experiments show superior reconstruction accuracy compared to existing methods while maintaining real-time performance.

Conclusion: GRS-SLAM3R effectively addresses the limitations of previous approaches by incorporating spatial memory and global consistency mechanisms, achieving state-of-the-art dense scene reconstruction.

Abstract: DUSt3R-based end-to-end scene reconstruction has recently shown promising results in dense visual SLAM. However, most existing methods only use image pairs to estimate pointmaps, overlooking spatial memory and global consistency.To this end, we introduce GRS-SLAM3R, an end-to-end SLAM framework for dense scene reconstruction and pose estimation from RGB images without any prior knowledge of the scene or camera parameters. Unlike existing DUSt3R-based frameworks, which operate on all image pairs and predict per-pair point maps in local coordinate frames, our method supports sequentialized input and incrementally estimates metric-scale point clouds in the global coordinate. In order to improve consistent spatial correlation, we use a latent state for spatial memory and design a transformer-based gated update module to reset and update the spatial memory that continuously aggregates and tracks relevant 3D information across frames. Furthermore, we partition the scene into submaps, apply local alignment within each submap, and register all submaps into a common world frame using relative constraints, producing a globally consistent map. Experiments on various datasets show that our framework achieves superior reconstruction accuracy while maintaining real-time performance.

[507] ResAD++: Towards Class Agnostic Anomaly Detection via Residual Feature Learning

Xincheng Yao, Chao Shi, Muming Zhao, Guangtao Zhai, Chongyang Zhang

Main category: cs.CV

TL;DR: ResAD++ is a class-agnostic anomaly detection framework that uses residual features and feature hypersphere constraining to achieve generalization across diverse new classes without retraining.

Details

Motivation: Current anomaly detection methods perform poorly on new classes because their representation learning remains class-related through feature correlation. A class-agnostic approach is needed that can generalize across different domains without target data fine-tuning.

Method: Proposes residual features by subtracting normal reference features to achieve feature decorrelation. Uses feature hypersphere constraining to address scale correlation. Enhanced with logbarrier bidirectional contraction OCC loss and vector quantization-based feature distribution matching.

Result: Comprehensive experiments on eight real-world datasets show ResAD++ achieves remarkable anomaly detection results when directly applied to new classes, outperforming state-of-the-art methods and the base ResAD version.

Conclusion: ResAD++ effectively addresses class-agnostic anomaly detection by learning residual feature distributions and constraining feature scales, enabling strong generalization to diverse new classes without retraining.

Abstract: This paper explores the problem of class-agnostic anomaly detection (AD), where the objective is to train one class-agnostic AD model that can generalize to detect anomalies in diverse new classes from different domains without any retraining or fine-tuning on the target data. When applied for new classes, the performance of current single- and multi-class AD methods is still unsatisfactory. One fundamental reason is that representation learning in existing methods is still class-related, namely, feature correlation. To address this issue, we propose residual features and construct a simple but effective framework, termed ResAD. Our core insight is to learn the residual feature distribution rather than the initial feature distribution. Residual features are formed by matching and then subtracting normal reference features. In this way, we can effectively realize feature decorrelation. Even in new classes, the distribution of normal residual features would not remarkably shift from the learned distribution. In addition, we think that residual features still have one issue: scale correlation. To this end, we propose a feature hypersphere constraining approach, which learns to constrain initial normal residual features into a spatial hypersphere for enabling the feature scales of different classes as consistent as possible. Furthermore, we propose a novel logbarrier bidirectional contraction OCC loss and vector quantization based feature distribution matching module to enhance ResAD, leading to the improved version of ResAD (ResAD++). Comprehensive experiments on eight real-world AD datasets demonstrate that our ResAD++ can achieve remarkable AD results when directly used in new classes, outperforming state-of-the-art competing methods and also surpassing ResAD. The code is available at https://github.com/xcyao00/ResAD.

[508] Poivre: Self-Refining Visual Pointing with Reinforcement Learning

Wenjie Yang, Zengfeng Huang

Main category: cs.CV

TL;DR: The paper proposes Poivre, a self-refining procedure for visual pointing that enables VLMs to iteratively refine coordinates through reinforcement learning, achieving state-of-the-art performance on Point-Bench.

Details

Motivation: Current VLMs perform poorly on visual pointing tasks compared to humans, primarily because they are required to complete pointing in a single step without the ability to refine their estimates.

Method: Proposed Point, Visualize, then Refine (Poivre) procedure using reinforcement learning with a process reward to enable iterative coordinate refinement.

Result: Poivre-7B achieves new state-of-the-art on Point-Bench, outperforming proprietary models like Gemini-2.5-Pro and large open-source models like Molmo-72B by over 3%.

Conclusion: The self-refining approach with RL training significantly improves visual pointing performance, bridging the gap between VLMs and human capabilities.

Abstract: Visual pointing, which aims to localize a target by predicting its coordinates on an image, has emerged as an important problem in the realm of vision-language models (VLMs). Despite its broad applicability, recent benchmarks show that current VLMs still fall far behind human performance on this task. A key limitation is that VLMs are typically required to complete the pointing task in a single step, akin to asking humans to point at an object without seeing their own fingers. To address this issue, we propose a simple yet effective self-refining procedure: Point, Visualize, then Refine (Poivre). This procedure enables a VLM to first mark its estimated point, then iteratively refine the coordinates if necessary. Inspired by advances of reasoning models in the natural language domain, we employ reinforcement learning (RL) to incentivize this self-refining ability. For the RL training, we design a neat process reward that is not only empirically effective but also grounded in appealing properties. Our trained model, Poivre-7B, sets a new state of the art on Point-Bench, outperforming both proprietary models such as Gemini-2.5-Pro and large open-source models such as Molmo-72B by over 3%. To support future research, we release our training and inference code, dataset, and the Poivre-7B checkpoint.

[509] PVTAdpNet: Polyp Segmentation using Pyramid vision transformer with a novel Adapter block

Arshia Yousefi Nezhad, Helia Aghaei, Hedieh Sajedi

Main category: cs.CV

TL;DR: PVTAdpNet is a novel deep learning model for polyp segmentation in colorectal cancer detection, combining U-Net architecture with Pyramid Vision Transformer and adapter-based skip connections to achieve high accuracy in real-time.

Details

Motivation: Address limitations of traditional colonoscopy including high miss rates due to polyp variability, and improve early detection of colorectal cancer.

Method: Integrates U-Net-style encoder-decoder structure with Pyramid Vision Transformer backbone, novel residual blocks, adapter-based skip connections, and squeeze-and-excitation attention for feature refinement.

Result: Achieves Dice coefficient of 0.8851 and mIoU of 0.8167 on out-of-distribution datasets, demonstrating real-time accurate polyp segmentation.

Conclusion: PVTAdpNet shows superior performance for clinical applications in colorectal cancer detection with high accuracy and real-time capabilities.

Abstract: Colorectal cancer ranks among the most common and deadly cancers, emphasizing the need for effective early detection and treatment. To address the limitations of traditional colonoscopy, including high miss rates due to polyp variability, we introduce the Pyramid Vision Transformer Adapter Residual Network (PVTAdpNet). This model integrates a U-Net-style encoder-decoder structure with a Pyramid Vision Transformer backbone, novel residual blocks, and adapter-based skip connections. The design enhances feature extraction, dense prediction, and gradient flow, supported by squeeze-and-excitation attention for improved channel-wise feature refinement. PVTAdpNet achieves real-time, accurate polyp segmentation, demonstrating superior performance on benchmark datasets with high mDice and mIoU scores, making it highly suitable for clinical applications. PVTAdpNet obtains a high Dice coefficient of 0.8851 and a mean Intersection over Union (mIoU) of 0.8167 on out-of-distribution polyp datasets. Evaluation of the PolypGen dataset demonstrates PVTAdpNet’s capability for real-time, accurate performance within familiar distributions. The source code of our network is available at https://github.com/ayousefinejad/PVTAdpNet.git

[510] UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception

Xinyang Song, Libin Wang, Weining Wang, Shaozhen Liu, Dandan Zheng, Jingdong Chen, Qi Li, Zhenan Sun

Main category: cs.CV

TL;DR: UniAlignment is a unified multimodal generation framework using a single diffusion transformer with dual-stream training for enhanced cross-modal consistency and instruction-following.

Details

Motivation: Existing approaches for multimodal tasks rely on fragmented architectures using vision-language models or modular designs, leading to computational inefficiency and limited semantic comprehension across modalities.

Method: Proposes UniAlignment with dual-stream diffusion training incorporating intrinsic-modal and cross-modal semantic alignment within a single diffusion transformer.

Result: Extensive experiments show UniAlignment outperforms existing baselines across multiple tasks and benchmarks, demonstrating superior multimodal semantic consistency.

Conclusion: The framework highlights the significant potential of diffusion models in unified multimodal generation, achieving robust instruction-following and cross-modal consistency.

Abstract: The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multi-modal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model’s cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.

[511] GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning

Xiaojie Li, Bei Wang, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang

Main category: cs.CV

TL;DR: GenView++ is a unified framework that improves contrastive learning through multi-source adaptive view generation and quality-driven contrastive learning, achieving significant performance gains in both vision and vision-language tasks.

Details

Motivation: Current contrastive learning methods face limitations in constructing diverse, semantically coherent positive pairs and lack mechanisms to assess pair quality, leading to suboptimal supervision where all pairs are treated equally.

Method: Proposes two innovations: 1) Multi-source adaptive view generation that synthesizes diverse yet semantically coherent views using image-conditioned, text-conditioned, and image-text-conditioned strategies, and 2) Quality-driven contrastive learning that assesses semantic alignment and diversity to dynamically reweight training contributions.

Result: Improves MoCov2 by +2.5% on ImageNet linear classification, raises average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and improves Flickr30k text retrieval R@5 by +3.2%.

Conclusion: GenView++ effectively addresses both construction and learning challenges in contrastive learning, demonstrating superior performance across vision and vision-language tasks through its unified framework.

Abstract: The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts by introducing two synergistic innovations. To improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair’s semantic alignment and diversity to dynamically reweight their training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned pairs. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%. The code is available at https://github.com/xiaojieli0903/GenViewPlusPlus.

[512] A Modality-Tailored Graph Modeling Framework for Urban Region Representation via Contrastive Learning

Yaya Zhao, Kaiqi Zhao, Zixuan Tang, Zhiyuan Liu, Xiaoling Lu, Yalei Du

Main category: cs.CV

TL;DR: MTGRR is a modality-tailored graph framework for urban region representation that addresses limitations in existing approaches by using specialized GNN architectures for different modality types and spatially-aware multimodal fusion.

Details

Motivation: Existing graph-based models use identical architectures across all modalities and neglect spatial heterogeneity in fusion, leading to suboptimal region representations.

Method: Categorizes modalities into aggregated-level and point-level groups, uses MoE graph architecture with expert GNNs for aggregated modalities, dual-level GNN for point-level modality, and spatially-aware multimodal fusion with dynamic region-specific weights.

Result: Experiments on two real-world datasets across six modalities and three tasks show MTGRR consistently outperforms state-of-the-art baselines.

Conclusion: MTGRR effectively addresses modality-specific characteristics and spatial heterogeneity, providing superior urban region representations for downstream tasks.

Abstract: Graph-based models have emerged as a powerful paradigm for modeling multimodal urban data and learning region representations for various downstream tasks. However, existing approaches face two major limitations. (1) They typically employ identical graph neural network architectures across all modalities, failing to capture modality-specific structures and characteristics. (2) During the fusion stage, they often neglect spatial heterogeneity by assuming that the aggregation weights of different modalities remain invariant across regions, resulting in suboptimal representations. To address these issues, we propose MTGRR, a modality-tailored graph modeling framework for urban region representation, built upon a multimodal dataset comprising point of interest (POI), taxi mobility, land use, road element, remote sensing, and street view images. (1) MTGRR categorizes modalities into two groups based on spatial density and data characteristics: aggregated-level and point-level modalities. For aggregated-level modalities, MTGRR employs a mixture-of-experts (MoE) graph architecture, where each modality is processed by a dedicated expert GNN to capture distinct modality-specific characteristics. For the point-level modality, a dual-level GNN is constructed to extract fine-grained visual semantic features. (2) To obtain effective region representations under spatial heterogeneity, a spatially-aware multimodal fusion mechanism is designed to dynamically infer region-specific modality fusion weights. Building on this graph modeling framework, MTGRR further employs a joint contrastive learning strategy that integrates region aggregated-level, point-level, and fusion-level objectives to optimize region representations. Experiments on two real-world datasets across six modalities and three tasks demonstrate that MTGRR consistently outperforms state-of-the-art baselines, validating its effectiveness.

[513] Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution

Qifan Li, Jiale Zou, Jinhua Zhang, Wei Long, Xinyu Zhou, Shuhang Gu

Main category: cs.CV

TL;DR: Proposes Texture Vector-Quantization and Reconstruction Aware Prediction strategies to address quantization errors and sub-optimal prior modeling in VQ-based super-resolution models.

Details

Motivation: Existing VQ-based methods have large quantization errors due to visual signal richness and train predictors with code-level supervision that doesn't consider final reconstruction errors, leading to sub-optimal prior modeling.

Method: Texture Vector-Quantization only models missing textures’ prior using codebook, and Reconstruction Aware Prediction uses straight-through estimator to train index predictor directly with image-level supervision.

Result: The proposed TVQ&RAP model delivers photo-realistic super-resolution results with small computational cost.

Conclusion: The proposed strategies effectively address quantization errors and sub-optimal prior modeling in VQ-based super-resolution, achieving high-quality results efficiently.

Abstract: Vector-quantized based models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with nearest codebook items and train index predictor with code-level supervision. Due to the richness of visual signal, VQ encoding often leads to large quantization error. Furthermore, training predictor with code-level supervision can not take the final reconstruction errors into consideration, result in sub-optimal prior modeling accuracy. In this paper we address the above two issues and propose a Texture Vector-Quantization and a Reconstruction Aware Prediction strategy. The texture vector-quantization strategy leverages the task character of super-resolution and only introduce codebook to model the prior of missing textures. While the reconstruction aware prediction strategy makes use of the straight-through estimator to directly train index predictor with image-level supervision. Our proposed generative SR model (TVQ&RAP) is able to deliver photo-realistic SR results with small computational cost.

[514] GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning

Nayeong Kim, Seong Joon Oh, Suha Kwak

Main category: cs.CV

TL;DR: GroupCoOp is a parameter-efficient fine-tuning method for vision-language models that uses group-specific text prompts to address spurious correlations from subgroup imbalance in training data, achieving state-of-the-art group robustness while training only 0.016% of parameters.

Details

Motivation: Parameter-efficient fine-tuned VLMs are vulnerable to spurious correlations from subgroup imbalance in fine-tuning datasets, which affects their group robustness.

Method: Proposes Group Context Optimization (GroupCoOp) that employs group-specific text prompts as group representatives serving as multiple classifiers for each target class, leveraging the semantic knowledge of VLM text encoders.

Result: Achieved best results on five benchmarks across five CLIP architectures, occasionally outperforming methods that fine-tune the entire network despite training only 0.016% of parameters.

Conclusion: GroupCoOp effectively enhances group robustness of fine-tuned VLMs by addressing subgroup imbalance issues through group-specific prompts, demonstrating superior performance with extreme parameter efficiency.

Abstract: Parameter-efficient fine-tuning (PEFT) of vision-language models (VLMs) excels in various vision tasks thanks to the rich knowledge and generalization ability of VLMs. However, recent studies revealed that such fine-tuned VLMs are vulnerable to spurious correlations stemming from the subgroup imbalance in the fine-tuning datasets. To resolve this issue, we propose Group Context Optimization (GroupCoOp), a simple and effective debiased fine-tuning algorithm that enhances the group robustness of fine-tuned VLMs. Its key idea is to employ group-specific text prompts as group representatives serving as multiple classifiers for their target class. The rich semantic knowledge of the text encoder of VLM enables the discovery of effective group prompts even for groups with a small number of training samples. Leveraging the group prompts for each class addresses the issues caused by the group-imbalanced training set, such as the neglect of minority groups and the scattered distribution of each class in the embedding space. GroupCoOp achieved the best results on five benchmarks across five CLIP architectures and occasionally outperformed prior methods that fine-tune the entire network, despite training only 0.016% of the network’s parameters.

[515] From Unstable to Playable: Stabilizing Angry Birds Levels via Object Segmentation

Mahdi Farrokhimaleki, Parsa Rahmati, Richard Zhao

Main category: cs.CV

TL;DR: Proposes a method to identify and repair unstable levels generated by PCG models using object segmentation and visual analysis, demonstrated on Angry Birds.

Details

Motivation: Ensuring consistently high-quality, industry-standard content from PCG remains challenging despite its efficiency in creating diverse environments.

Method: Uses object segmentation and visual analysis of level images to detect structural gaps and perform targeted repairs, evaluating multiple segmentation models.

Result: Experimental results show improved stability and playability of AI-generated levels.

Conclusion: The image-based approach is designed to be applicable to a wide range of 2D games with similar level structures, though evaluation was specific to Angry Birds.

Abstract: Procedural Content Generation (PCG) techniques enable automatic creation of diverse and complex environments. While PCG facilitates more efficient content creation, ensuring consistently high-quality, industry-standard content remains a significant challenge. In this research, we propose a method to identify and repair unstable levels generated by existing PCG models. We use Angry Birds as a case study, demonstrating our method on game levels produced by established PCG approaches. Our method leverages object segmentation and visual analysis of level images to detect structural gaps and perform targeted repairs. We evaluate multiple object segmentation models and select the most effective one as the basis for our repair pipeline. Experimental results show that our method improves the stability and playability of AI-generated levels. Although our evaluation is specific to Angry Birds, our image-based approach is designed to be applicable to a wide range of 2D games with similar level structures.

[516] Controllable Generation of Large-Scale 3D Urban Layouts with Semantic and Structural Guidance

Mengyuan Niu, Xinxin Zhuo, Ruizhe Wang, Yuyue Huang, Junyan Yang, Qiao Wang

Main category: cs.CV

TL;DR: A controllable framework for generating large-scale 3D vector urban layouts that fuses geometric and semantic attributes to create realistic urban models with user control.

Details

Motivation: Existing urban modeling methods have limitations - image-based approaches lack geometric continuity and scalability, while graph-based methods overlook parcel semantics. There's a need for methods that combine both geometric and semantic information for realistic urban layout generation.

Method: Fuses geometric and semantic attributes, introduces edge weights, embeds building height in the graph, and extends 2D layouts to 3D structures. Users can directly control output by modifying semantic attributes.

Result: The method produces valid, large-scale urban models that maintain geometric continuity and incorporate semantic information, demonstrating effectiveness for data-driven planning and design.

Conclusion: The proposed framework provides an effective tool for generating controllable, large-scale 3D urban layouts that combine both geometric and semantic attributes, addressing limitations of existing methods.

Abstract: Urban modeling is essential for city planning, scene synthesis, and gaming. Existing image-based methods generate diverse layouts but often lack geometric continuity and scalability, while graph-based methods capture structural relations yet overlook parcel semantics. We present a controllable framework for large-scale 3D vector urban layout generation, conditioned on both geometry and semantics. By fusing geometric and semantic attributes, introducing edge weights, and embedding building height in the graph, our method extends 2D layouts to realistic 3D structures. It also enables users to directly control the output by modifying semantic attributes. Experiments show that it produces valid, large-scale urban models, offering an effective tool for data-driven planning and design.

[517] A Multi-Camera Vision-Based Approach for Fine-Grained Assembly Quality Control

Ali Nazeri, Shashank Mishra, Achim Wagner, Martin Ruskowski, Didier Stricker, Jason Rambach

Main category: cs.CV

TL;DR: A multi-view quality control system using three cameras and image fusion outperforms single-view methods in detecting improperly fastened small assembly parts like screws, addressing occlusions and lighting issues in manufacturing.

Details

Motivation: Existing single-view imaging and manual inspection methods are prone to errors due to occlusions, restricted perspectives, and lighting inconsistencies, requiring additional inspection stations that disrupt assembly lines and increase costs.

Method: Integrates a multi-camera imaging system with advanced object detection algorithms, capturing images from three camera views and using a tailored image fusion methodology to combine results from multiple views.

Result: Significantly outperforms single-view methods, achieving high precision and recall rates in identifying improperly fastened small assembly parts such as screws.

Conclusion: Overcomes single-view limitations by providing a scalable, cost-effective, and accurate quality control mechanism that ensures reliability and safety of assembly lines, with the dataset made publicly available for further research.

Abstract: Quality control is a critical aspect of manufacturing, particularly in ensuring the proper assembly of small components in production lines. Existing solutions often rely on single-view imaging or manual inspection, which are prone to errors due to occlusions, restricted perspectives, or lighting inconsistencies. These limitations require the installation of additional inspection stations, which could disrupt the assembly line and lead to increased downtime and costs. This paper introduces a novel multi-view quality control module designed to address these challenges, integrating a multi-camera imaging system with advanced object detection algorithms. By capturing images from three camera views, the system provides comprehensive visual coverage of components of an assembly process. A tailored image fusion methodology combines results from multiple views, effectively resolving ambiguities and enhancing detection reliability. To support this system, we developed a unique dataset comprising annotated images across diverse scenarios, including varied lighting conditions, occlusions, and angles, to enhance applicability in real-world manufacturing environments. Experimental results show that our approach significantly outperforms single-view methods, achieving high precision and recall rates in the identification of improperly fastened small assembly parts such as screws. This work contributes to industrial automation by overcoming single-view limitations, and providing a scalable, cost-effective, and accurate quality control mechanism that ensures the reliability and safety of the assembly line. The dataset used in this study is publicly available to facilitate further research in this domain.

[518] Assessing Visual Privacy Risks in Multimodal AI: A Novel Taxonomy-Grounded Evaluation of Vision-Language Models

Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna, Rahul Gupta, Shrikanth Narayanan

Main category: cs.CV

TL;DR: This paper examines how AI models, particularly Vision-Language Models (VLMs), understand and handle privacy concepts, revealing significant limitations despite their advanced capabilities.

Details

Motivation: While LLMs and VLMs show impressive reasoning and pattern recognition abilities, they demonstrate critical limitations in understanding privacy principles, creating a need to assess and improve their privacy awareness.

Method: The authors introduce a comprehensive Visual Privacy Taxonomy based on legal frameworks and evaluate state-of-the-art VLMs using this taxonomy to test their understanding of contextual privacy.

Result: Evaluation revealed significant inconsistencies in VLMs’ understanding of contextual privacy, showing they lack robust privacy awareness despite their other advanced capabilities.

Conclusion: There is an urgent need for more privacy-aware AI systems, and the paper provides both a foundational taxonomy for future research and a benchmark highlighting current model limitations in privacy understanding.

Abstract: Artificial Intelligence have profoundly transformed the technological landscape in recent years. Large Language Models (LLMs) have demonstrated impressive abilities in reasoning, text comprehension, contextual pattern recognition, and integrating language with visual understanding. While these advances offer significant benefits, they also reveal critical limitations in the models’ ability to grasp the notion of privacy. There is hence substantial interest in determining if and how these models can understand and enforce privacy principles, particularly given the lack of supporting resources to test such a task. In this work, we address these challenges by examining how legal frameworks can inform the capabilities of these emerging technologies. To this end, we introduce a comprehensive, multi-level Visual Privacy Taxonomy that captures a wide range of privacy issues, designed to be scalable and adaptable to existing and future research needs. Furthermore, we evaluate the capabilities of several state-of-the-art Vision-Language Models (VLMs), revealing significant inconsistencies in their understanding of contextual privacy. Our work contributes both a foundational taxonomy for future research and a critical benchmark of current model limitations, demonstrating the urgent need for more robust, privacy-aware AI systems.

[519] Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation

Hanyu Zhou, Gim Hee Lee

Main category: cs.CV

TL;DR: Uni4D-LLM is the first unified vision-language model framework that jointly handles 4D scene understanding and generation using spatiotemporal-aware visual representations within a single Transformer architecture.

Details

Motivation: Existing 3D/4D approaches use separate models for understanding (autoregressive) and generation (diffusion), creating a paradigm gap that prevents unified handling of both tasks, especially in dynamic 4D settings requiring spatiotemporal modeling.

Method: Extracts semantic features for understanding and noisy-injected appearance features for generation, incorporates 4D geometric cues, fuses them via adaptive cross-attention into spatiotemporal-aware representations, and integrates both tasks into a single LLM with task-specific heads using instruction fine-tuning.

Result: Achieves competitive or superior results compared to state-of-the-art models on multiple benchmarks, demonstrating true unification of 4D scene understanding and generation.

Conclusion: Uni4D-LLM successfully unifies 4D scene understanding and generation through shared representations and architecture, enabling joint handling of both tasks within one Transformer-based framework with spatiotemporal awareness.

Abstract: Vision-language models (VLMs) have demonstrated strong performance in 2D scene understanding and generation, but extending this unification to the physical world remains an open challenge. Existing 3D and 4D approaches typically embed scene geometry into autoregressive model for semantic understanding and diffusion model for content generation. This paradigm gap prevents a single model from jointly handling both tasks, especially in dynamic 4D settings where spatiotemporal modeling is critical. We propose Uni4D-LLM, the first unified VLM framework with spatiotemporal awareness for 4D scene understanding and generation. Our design is guided by two key insights: 1) Unification requires a shared representation. We extract semantic features for understanding and noisy-injected appearance features for generation, incorporate 4D geometric cues, and fuse them into a spatiotemporal-aware visual representation through adaptive cross-attention. 2) Unification requires a shared architecture. Both autoregression and diffusion are built on Transformer backbones, and this enables integration into a single LLM with task-specific heads. By aligning visual and linguistic representations, our Uni4D-LLM produces predictions for both understanding and generation within one Transformer-based framework. We further apply instruction fine-tuning on diverse 4D vision-language datasets to improve generalization across tasks. Extensive experiments on multiple benchmarks demonstrate that Uni4D-LLM achieves competitive or superior results compared to state-of-the-art models and offers the first true unification of 4D scene understanding and generation.

[520] 2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC

Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang

Main category: cs.CV

TL;DR: The paper evaluates the zero-shot performance of Segment Concept (SeC) framework on MOSEv2 dataset, achieving 39.7 JFn without fine-tuning and ranking 2nd in the Complex VOS challenge.

Details

Motivation: Previous semi-supervised VOS methods rely heavily on appearance-based matching and lack robustness against visual changes, occlusions, and scene shifts due to insufficient high-level conceptual understanding of targets.

Method: Uses the Segment Concept (SeC) framework which employs a Large Vision-Language Model (LVLM) to establish deep semantic understanding of objects for more persistent segmentation.

Result: Achieved 39.7 JFn on MOSEv2 test set without any fine-tuning on training data, ranking 2nd place in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.

Conclusion: The SeC framework demonstrates strong zero-shot performance on complex video object segmentation tasks by leveraging semantic understanding through LVLMs, showing robustness against challenging scenarios where appearance-based methods fail.

Abstract: Semi-supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, initialized by a first-frame mask. Previous methods rely heavily on appearance-based pattern matching and thus exhibit limited robustness against challenges such as drastic visual changes, occlusions, and scene shifts. This failure is often attributed to a lack of high-level conceptual understanding of the target. The recently proposed Segment Concept (SeC) framework mitigated this limitation by using a Large Vision-Language Model (LVLM) to establish a deep semantic understanding of the object for more persistent segmentation. In this work, we evaluate its zero-shot performance on the challenging coMplex video Object SEgmentation v2 (MOSEv2) dataset. Without any fine-tuning on the training set, SeC achieved 39.7 \JFn on the test set and ranked 2nd place in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.

[521] Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric

Bingyang Cui, Yujie Zhang, Qi Yang, Zhu Li, Yiling Xu

Main category: cs.CV

TL;DR: The paper introduces T23D-CompBench, a comprehensive benchmark for compositional Text-to-3D quality assessment, and proposes Rank2Score, a two-stage training method that outperforms existing metrics.

Details

Motivation: Existing Text-to-3D quality assessment faces challenges: outdated/fragmented benchmarks and limitations in objective metrics that result in non-representative feature extraction and reduced robustness.

Method: 1) Created T23D-CompBench with 5 components and 12 sub-components for compositional prompts, generating 3,600 textured meshes from 10 state-of-the-art models with 129,600 human ratings. 2) Proposed Rank2Score with two-stage training: first stage uses supervised contrastive regression and curriculum learning for pairwise training, second stage refines predictions using mean opinion scores.

Result: Rank2Score consistently outperforms existing metrics across multiple dimensions and can serve as a reward function to optimize generative models.

Conclusion: The proposed benchmark and Rank2Score method effectively address limitations in Text-to-3D quality assessment, providing better alignment with human judgments and practical utility for model optimization.

Abstract: Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing challenges restrict the development of reliable T23D quality assessment (T23DQA). First, existing benchmarks are outdated, fragmented, and coarse-grained, making fine-grained metric training infeasible. Moreover, current objective metrics exhibit inherent design limitations, resulting in non-representative feature extraction and diminished metric robustness. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective evaluator with two-stage training for T23DQA. Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning in the first stage, and subsequently refines predictions using mean opinion scores to achieve closer alignment with human judgments in the second stage. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models. The project is available at https://cbysjtu.github.io/Rank2Score/.

[522] CE-FAM: Concept-Based Explanation via Fusion of Activation Maps

Michihiro Kuroki, Toshihiko Yamasaki

Main category: cs.CV

TL;DR: CE-FAM is a concept-based explanation method that identifies what concepts an image classifier learns, which regions are associated with them, and how they contribute to predictions, outperforming existing approaches.

Details

Motivation: Existing saliency maps highlight important regions but leave interpretation to users, while concept-based explanations clarify contributions but few methods can simultaneously reveal learned concepts, their regions, and prediction contributions.

Method: Uses a branched network that shares activation maps with an image classifier and learns to mimic VLM embeddings. Predicts concepts and represents regions via weighted sum of activation maps using concept prediction gradients, quantifying contributions based on impact on classification score.

Result: Outperforms existing approaches in qualitative and quantitative evaluations, excels in zero-shot inference for unseen concepts, and provides a general framework without requiring annotated datasets.

Conclusion: CE-FAM effectively bridges the gap between saliency maps and concept-based explanations by providing comprehensive insights into learned concepts, their spatial regions, and prediction contributions while leveraging VLM knowledge.

Abstract: Although saliency maps can highlight important regions to explain the reasoning behind image classification in artificial intelligence (AI), the meaning of these regions is left to the user’s interpretation. In contrast, conceptbased explanations decompose AI predictions into humanunderstandable concepts, clarifying their contributions. However, few methods can simultaneously reveal what concepts an image classifier learns, which regions are associated with them, and how they contribute to predictions. We propose a novel concept-based explanation method, Concept-based Explanation via Fusion of Activation Maps (CE-FAM). It employs a branched network that shares activation maps with an image classifier and learns to mimic the embeddings of a Vision and Language Model (VLM). The branch network predicts concepts in an image, and their corresponding regions are represented by a weighted sum of activation maps, with weights given by the gradients of the concept prediction scores. Their contributions are quantified based on their impact on the image classification score. Our method provides a general framework for identifying the concept regions and their contributions while leveraging VLM knowledge to handle arbitrary concepts without requiring an annotated dataset. Furthermore, we introduce a novel evaluation metric to assess the accuracy of the concept regions. Our qualitative and quantitative evaluations demonstrate our method outperforms existing approaches and excels in zero-shot inference for unseen concepts.

[523] FairViT-GAN: A Hybrid Vision Transformer with Adversarial Debiasing for Fair and Explainable Facial Beauty Prediction

Djamel Eddine Boukhari

Main category: cs.CV

TL;DR: FairViT-GAN is a hybrid CNN-ViT framework with adversarial debiasing that achieves state-of-the-art facial beauty prediction while significantly reducing demographic bias.

Details

Motivation: Address limitations in current FBP models including architectural constraints, demographic biases, and lack of transparency. Existing CNN methods struggle with global facial harmony while ViTs miss fine details, and models often perpetuate societal biases.

Method: Hybrid framework combining CNN branch for local feature extraction and ViT branch for global context modeling, with adversarial debiasing mechanism to produce protected attribute-invariant representations.

Result: Achieves Pearson Correlation of 0.9230 and RMSE of 0.2650 on SCUT-FBP5500 benchmark. Reduces performance gap between ethnic subgroups by 82.9%, with adversary classification accuracy dropping to 52.1% (near random chance).

Conclusion: FairViT-GAN provides a robust, transparent, and significantly fairer blueprint for responsible AI systems in subjective visual assessment.

Abstract: Facial Beauty Prediction (FBP) has made significant strides with the application of deep learning, yet state-of-the-art models often exhibit critical limitations, including architectural constraints, inherent demographic biases, and a lack of transparency. Existing methods, primarily based on Convolutional Neural Networks (CNNs), excel at capturing local texture but struggle with global facial harmony, while Vision Transformers (ViTs) effectively model long-range dependencies but can miss fine-grained details. Furthermore, models trained on benchmark datasets can inadvertently learn and perpetuate societal biases related to protected attributes like ethnicity. To address these interconnected challenges, we propose \textbf{FairViT-GAN}, a novel hybrid framework that synergistically integrates a CNN branch for local feature extraction and a ViT branch for global context modeling. More significantly, we introduce an adversarial debiasing mechanism where the feature extractor is explicitly trained to produce representations that are invariant to protected attributes, thereby actively mitigating algorithmic bias. Our framework’s transparency is enhanced by visualizing the distinct focus of each architectural branch. Extensive experiments on the SCUT-FBP5500 benchmark demonstrate that FairViT-GAN not only sets a new state-of-the-art in predictive accuracy, achieving a Pearson Correlation of \textbf{0.9230} and reducing RMSE to \textbf{0.2650}, but also excels in fairness. Our analysis reveals a remarkable \textbf{82.9% reduction in the performance gap} between ethnic subgroups, with the adversary’s classification accuracy dropping to near-random chance (52.1%). We believe FairViT-GAN provides a robust, transparent, and significantly fairer blueprint for developing responsible AI systems for subjective visual assessment.

[524] Sim-DETR: Unlock DETR for Temporal Sentence Grounding

Jiajin Tang, Zhengxuan Wei, Yuchen Zhu, Cheng Shi, Guanbin Li, Liang Lin, Sibei Yang

Main category: cs.CV

TL;DR: Sim-DETR improves temporal sentence grounding by addressing query conflicts in DETR through constrained self-attention and query-to-frame alignment.

Details

Motivation: Standard DETR enhancement strategies degrade performance in temporal sentence grounding due to query conflicts between similar target moments and internal conflicts between global semantics and local localization.

Method: Extends standard DETR with two modifications: (1) constraining self-attention between queries based on semantic and positional overlap, and (2) adding query-to-frame alignment to bridge global and local contexts.

Result: Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research.

Conclusion: The proposed Sim-DETR provides a simple yet powerful solution to address query conflicts in DETR for temporal sentence grounding, demonstrating improved performance over standard approaches.

Abstract: Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research.

[525] Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models

Ky Dan Nguyen, Hoang Lam Tran, Anh-Dung Dinh, Daochang Liu, Weidong Cai, Xiuying Wang, Chang Xu

Main category: cs.CV

TL;DR: Information-Grounding Guidance (IGG) addresses patch inconsistency in autoregressive image generation by using attention to anchor guidance to semantically important regions, improving image quality and semantic faithfulness.

Details

Motivation: Autoregressive models for image generation suffer from information inconsistencies between patches across timesteps due to progressive resolution scaling, which scatters guidance signals and causes drift from conditioning information.

Method: Proposed Information-Grounding Guidance (IGG) that adaptively reinforces informative patches during sampling through attention mechanisms to keep guidance and content aligned.

Result: IGG delivers sharper, more coherent, and semantically grounded images across class-conditioned and text-to-image generation tasks, setting new benchmarks for AR-based methods.

Conclusion: IGG successfully tackles the patch inconsistency problem in autoregressive image generation by anchoring guidance to semantically important regions, resulting in improved image quality and faithfulness to conditioning information.

Abstract: Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditioned and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.

[526] Learning Adaptive Pseudo-Label Selection for Semi-Supervised 3D Object Detection

Taehun Kong, Tae-Kyun Kim

Main category: cs.CV

TL;DR: Proposes a novel semi-supervised 3D object detection framework with learnable pseudo-labeling that automatically selects high-quality pseudo-labels using context-adaptive thresholds and score fusion, supervised by alignment with ground truth boxes.

Details

Motivation: To reduce costly 3D annotations by utilizing unlabeled data more effectively, addressing limitations of previous methods that manually set thresholds and overlook contextual information like object distances, classes, and learning states.

Method: Introduces two networks at teacher output level for reliable pseudo-label quality assessment via score fusion and context-adaptive thresholds, supervised by pseudo-label alignment with GT boxes. Uses soft supervision strategy for robust learning under pseudo-label noise.

Result: Extensive experiments on KITTI and Waymo datasets show the method selects high-precision pseudo-labels while maintaining wider context coverage and higher recall rate, significantly improving SS3DOD performance.

Conclusion: The proposed learnable pseudo-labeling framework effectively addresses pseudo-label quality assessment challenges in semi-supervised 3D object detection, outperforming existing methods through adaptive thresholding and robust supervision.

Abstract: Semi-supervised 3D object detection (SS3DOD) aims to reduce costly 3D annotations utilizing unlabeled data. Recent studies adopt pseudo-label-based teacher-student frameworks and demonstrate impressive performance. The main challenge of these frameworks is in selecting high-quality pseudo-labels from the teacher’s predictions. Most previous methods, however, select pseudo-labels by comparing confidence scores over thresholds manually set. The latest works tackle the challenge either by dynamic thresholding or refining the quality of pseudo-labels. Such methods still overlook contextual information e.g. object distances, classes, and learning states, and inadequately assess the pseudo-label quality using partial information available from the networks. In this work, we propose a novel SS3DOD framework featuring a learnable pseudo-labeling module designed to automatically and adaptively select high-quality pseudo-labels. Our approach introduces two networks at the teacher output level. These networks reliably assess the quality of pseudo-labels by the score fusion and determine context-adaptive thresholds, which are supervised by the alignment of pseudo-labels over GT bounding boxes. Additionally, we introduce a soft supervision strategy that can learn robustly under pseudo-label noises. This helps the student network prioritize cleaner labels over noisy ones in semi-supervised learning. Extensive experiments on the KITTI and Waymo datasets demonstrate the effectiveness of our method. The proposed method selects high-precision pseudo-labels while maintaining a wider coverage of contexts and a higher recall rate, significantly improving relevant SS3DOD methods.

[527] Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction

Guoquan Wei, Zekun Zhou, Liu Shi, Wenzhe Shan, Qiegen Liu

Main category: cs.CV

TL;DR: SuperDiff is a self-supervised method for low-dose CT denoising that uses contextual sub-data similarity adaptive sensing and latent diffusion models to achieve superior reconstruction and generalization without requiring paired clean data.

Details

Motivation: Current deep learning methods for low-dose CT denoising rely heavily on paired data and generalize poorly. Self-supervised methods face challenges in generalizing to different dose levels. There's a need for methods that work without clean data pairs and can handle varying dose levels.

Method: Uses contextual sub-data similarity adaptive sensing in the projection domain to provide initial prior. Combines knowledge distillation with latent diffusion models for image optimization. Employs pixel-level self-correcting fusion for fine-grained reconstruction. Can be flexibly applied to different dose levels including unseen doses.

Result: SuperDiff consistently outperforms existing state-of-the-art methods in both reconstruction and generalization performance on datasets and real data. It requires only LDCT projection domain data for training and testing.

Conclusion: The proposed SuperDiff method provides an effective solution for low-dose CT reconstruction that doesn’t require paired data, achieves superior performance, and demonstrates strong generalization capabilities across different dose levels.

Abstract: Current models based on deep learning for low-dose CT denoising rely heavily on paired data and generalize poorly. Even the more concerned diffusion models need to learn the distribution of clean data for reconstruction, which is difficult to satisfy in medical clinical applications. At the same time, self-supervised-based methods face the challenge of significant degradation of generalizability of models pre-trained for the current dose to expand to other doses. To address these issues, this paper proposes a novel method of tunable-generalization diffusion powered by self-supervised contextual sub-data for low-dose CT reconstruction, named SuperDiff. Firstly, a contextual subdata similarity adaptive sensing strategy is designed for denoising centered on the LDCT projection domain, which provides an initial prior for the subsequent progress. Subsequently, the initial prior is used to combine knowledge distillation with a deep combination of latent diffusion models for optimizing image details. The pre-trained model is used for inference reconstruction, and the pixel-level self-correcting fusion technique is proposed for fine-grained reconstruction of the image domain to enhance the image fidelity, using the initial prior and the LDCT image as a guide. In addition, the technique is flexibly applied to the generalization of upper and lower doses or even unseen doses. Dual-domain strategy cascade for self-supervised LDCT denoising, SuperDiff requires only LDCT projection domain data for training and testing. Full qualitative and quantitative evaluations on both datasets and real data show that SuperDiff consistently outperforms existing state-of-the-art methods in terms of reconstruction and generalization performance.

[528] AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities

Tatsuro Banno, Takehiko Ohkawa, Ruicong Liu, Ryosuke Furuta, Yoichi Sato

Main category: cs.CV

TL;DR: AssemblyHands-X is the first markerless 3D hand-body benchmark for bimanual activities, showing that joint modeling of hand and body cues improves action recognition over using hands or body alone.

Details

Motivation: Existing datasets lack kinematic-level annotations for both hands and body in bimanual activities, and marker-based systems introduce visual artifacts that limit generalization to natural videos.

Method: Constructed a pipeline combining multi-view triangulation with SMPL-X mesh fitting for reliable 3D hand-body pose annotation, then validated different input representations across graph convolution and spatio-temporal attention models.

Result: Pose-based action inference is more efficient and accurate than video baselines, and joint modeling of hand and body cues improves action recognition over using hands or upper body alone.

Conclusion: Modeling interdependent hand-body dynamics is crucial for holistic understanding of bimanual activities, as hand-body coordination significantly impacts action recognition performance.

Abstract: Bimanual human activities inherently involve coordinated movements of both hands and body. However, the impact of this coordination in activity understanding has not been systematically evaluated due to the lack of suitable datasets. Such evaluation demands kinematic-level annotations (e.g., 3D pose) for the hands and body, yet existing 3D activity datasets typically annotate either hand or body pose. Another line of work employs marker-based motion capture to provide full-body pose, but the physical markers introduce visual artifacts, thereby limiting models’ generalization to natural, markerless videos. To address these limitations, we present AssemblyHands-X, the first markerless 3D hand-body benchmark for bimanual activities, designed to study the effect of hand-body coordination for action recognition. We begin by constructing a pipeline for 3D pose annotation from synchronized multi-view videos. Our approach combines multi-view triangulation with SMPL-X mesh fitting, yielding reliable 3D registration of hands and upper body. We then validate different input representations (e.g., video, hand pose, body pose, or hand-body pose) across recent action recognition models based on graph convolution or spatio-temporal attention. Our extensive experiments show that pose-based action inference is more efficient and accurate than video baselines. Moreover, joint modeling of hand and body cues improves action recognition over using hands or upper body alone, highlighting the importance of modeling interdependent hand-body dynamics for a holistic understanding of bimanual activities.

[529] LifeCLEF Plant Identification Task 2015

Herve Goeau, Pierre Bonnet, Alexis Joly

Main category: cs.CV

TL;DR: The LifeCLEF 2015 plant identification challenge evaluated large-scale plant identification methods using 100K+ images of 1000 West European species collected through participatory sensing.

Details

Motivation: To evaluate plant identification methods under real-world biodiversity monitoring conditions at a very large scale, close to actual biodiversity monitoring scenarios.

Method: Used a dataset of over 100,000 images covering 1000 plant species from West Europe, built through a large-scale participatory sensing platform involving tens of thousands of contributors since 2011.

Result: The challenge provided resources and assessments for evaluating plant identification systems, with multiple research groups participating and developing various approaches.

Conclusion: The LifeCLEF 2015 challenge successfully established a large-scale evaluation framework for plant identification using crowd-sourced data, enabling analysis of different methodological approaches in realistic biodiversity monitoring conditions.

Abstract: The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2015 evaluation was actually conducted on a set of more than 100K images illustrating 1000 plant species living in West Europe. The main originality of this dataset is that it was built through a large-scale participatory sensing plateform initiated in 2011 and which now involves tens of thousands of contributors. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

Jinghan Xu Yuyang Zhang Qixuan Cai Jiancheng Chen Keqiu Li

Main category: cs.CV

TL;DR: CCU is a cross-modal contrastive unlearning framework that selectively removes visual data while preserving cross-modal knowledge and intra-class structural stability, achieving 7.12% accuracy improvement with only 7% unlearning time.

Details

Motivation: Visual modality is most vulnerable to privacy leakage in multimodal applications, and existing unlearning methods fail to preserve cross-modal knowledge and maintain intra-class structural stability during visual unlearning.

Method: Three key components: (a) selective visual unlearning using inverse contrastive learning, (b) cross-modal knowledge retention through semantic consistency, and (c) dual-set contrastive separation to isolate structural perturbations between unlearn and retain sets.

Result: Extensive experiments on three datasets show superiority, achieving 7.12% accuracy improvement with only 7% of the unlearning time compared to top-accuracy baseline.

Conclusion: CCU effectively addresses privacy leakage in multimodal applications by enabling selective visual unlearning while preserving cross-modal knowledge and model performance.

Abstract: Visual modality is the most vulnerable to privacy leakage in real-world multimodal applications like autonomous driving with visual and radar data; Machine unlearning removes specific training data from pre-trained models to address privacy leakage, however, existing methods fail to preserve cross-modal knowledge and maintain intra-class structural stability of retain data, leading to reduced overall and other modalities’ performance during visual unlearning; to address these challenges, we propose a Cross-modal Contrastive Unlearning (CCU) framework, which integrates three key components: (a) selective visual unlearning: employing inverse contrastive learning to dissociate visual representations from their original semantics, (b) cross-modal knowledge retention: preserving other modalities’ discriminability through semantic consistency, and (c) dual-set contrastive separation: preserving the model performance via isolation of structural perturbations between the unlearn set and retain set; extensive experiments on three datasets demonstrate the superiority of CCU, and our method achieves a 7.12% accuracy improvement with only 7% of the unlearning time compared to the top-accuracy baseline.

[531] Q-FSRU: Quantum-Augmented Frequency-Spectral For Medical Visual Question Answering

Rakesh Thakur, Yusra Tariq, Rakesh Chandra Joshi

Main category: cs.CV

TL;DR: Q-FSRU is a medical VQA model that combines frequency domain processing with quantum-inspired retrieval to improve accuracy and explainability in clinical image-text reasoning tasks.

Details

Motivation: Clinical questions requiring both image and text understanding remain challenging in healthcare AI, needing better methods for handling complex reasoning tasks.

Method: Uses Fast Fourier Transform to shift image and text features to frequency domain, then applies quantum-inspired retrieval to fetch relevant medical facts from external sources, merging both for enhanced reasoning.

Result: Outperforms previous models on VQA-RAD dataset, particularly on complex cases requiring image-text reasoning, with improved performance and explainability.

Conclusion: The combination of frequency processing and quantum information offers a promising approach for developing intelligent, transparent AI tools to assist doctors.

Abstract: Solving tough clinical questions that require both image and text understanding is still a major challenge in healthcare AI. In this work, we propose Q-FSRU, a new model that combines Frequency Spectrum Representation and Fusion (FSRU) with a method called Quantum Retrieval-Augmented Generation (Quantum RAG) for medical Visual Question Answering (VQA). The model takes in features from medical images and related text, then shifts them into the frequency domain using Fast Fourier Transform (FFT). This helps it focus on more meaningful data and filter out noise or less useful information. To improve accuracy and ensure that answers are based on real knowledge, we add a quantum inspired retrieval system. It fetches useful medical facts from external sources using quantum-based similarity techniques. These details are then merged with the frequency-based features for stronger reasoning. We evaluated our model using the VQA-RAD dataset, which includes real radiology images and questions. The results showed that Q-FSRU outperforms earlier models, especially on complex cases needing image text reasoning. The mix of frequency and quantum information improves both performance and explainability. Overall, this approach offers a promising way to build smart, clear, and helpful AI tools for doctors.

[532] LifeCLEF Plant Identification Task 2014

Herve Goeau, Alexis Joly, Pierre Bonnet, Souheil Selmi, Jean-Francois Molino, Daniel Barthelemy, Nozha Boujemaa

Main category: cs.CV

TL;DR: The LifeCLEFs plant identification task evaluates systems for identifying 500 plant species using 7 image types, including leaf scans and unconstrained photos of flowers, fruits, and other plant parts, collected through citizen science.

Details

Motivation: To create a realistic plant identification system testbed using citizen science data from amateur and expert botanists, making the task closer to real-world application conditions.

Method: The task uses a dataset of 500 plant species with 7 image types: leaf scans and 6 unconstrained view types (flower, fruit, stem & bark, branch, leaf, entire view) collected through a citizen science initiative by Tela Botanica.

Result: 27 submitted runs from 10 groups across 6 countries participated, employing distinct and original methods, confirming the Image & Multimedia Retrieval community’s interest in biodiversity and botany.

Conclusion: The fourth year of this task demonstrates continued community engagement in plant identification and highlights further challenging studies in this domain.

Abstract: The LifeCLEFs plant identification task provides a testbed for a system-oriented evaluation of plant identification about 500 species trees and herbaceous plants. Seven types of image content are considered: scan and scan-like pictures of leaf, and 6 kinds of detailed views with unconstrained conditions, directly photographed on the plant: flower, fruit, stem & bark, branch, leaf and entire view. The main originality of this data is that it was specifically built through a citizen sciences initiative conducted by Tela Botanica, a French social network of amateur and expert botanists. This makes the task closer to the conditions of a real-world application. This overview presents more precisely the resources and assessments of task, summarizes the retrieval approaches employed by the participating groups, and provides an analysis of the main evaluation results. With a total of ten groups from six countries and with a total of twenty seven submitted runs, involving distinct and original methods, this fourth year task confirms Image & Multimedia Retrieval community interest for biodiversity and botany, and highlights further challenging studies in plant identification.

[533] EWC-Guided Diffusion Replay for Exemplar-Free Continual Learning in Medical Imaging

Anoushka Harit, William Prew, Zhongtian Sun, Florian Markowetz

Main category: cs.CV

TL;DR: A continual learning framework for medical imaging that combines class-conditional diffusion replay with Elastic Weight Consolidation to adapt foundation models without storing patient data, achieving near-joint training performance while preserving privacy.

Details

Motivation: Medical imaging foundation models need to adapt over time, but full retraining is often prevented by privacy constraints and high costs, requiring privacy-preserving continual learning approaches.

Method: Pairs class-conditional diffusion replay (avoiding patient data storage) with Elastic Weight Consolidation, using a compact Vision Transformer backbone evaluated on MedMNIST v2 tasks and CheXpert.

Result: Achieves 0.851 AUROC on CheXpert, reduces forgetting by over 30% compared to DER++, approaches joint training performance (0.869 AUROC), while remaining efficient and privacy-preserving.

Conclusion: Provides a practical route for scalable, privacy-aware continual adaptation of clinical imaging models by combining replay diffusion and synaptic stability mechanisms.

Abstract: Medical imaging foundation models must adapt over time, yet full retraining is often blocked by privacy constraints and cost. We present a continual learning framework that avoids storing patient exemplars by pairing class conditional diffusion replay with Elastic Weight Consolidation. Using a compact Vision Transformer backbone, we evaluate across eight MedMNIST v2 tasks and CheXpert. On CheXpert our approach attains 0.851 AUROC, reduces forgetting by more than 30% relative to DER\texttt{++}, and approaches joint training at 0.869 AUROC, while remaining efficient and privacy preserving. Analyses connect forgetting to two measurable factors: fidelity of replay and Fisher weighted parameter drift, highlighting the complementary roles of replay diffusion and synaptic stability. The results indicate a practical route for scalable, privacy aware continual adaptation of clinical imaging models.

[534] Adversarial Versus Federated: An Adversarial Learning based Multi-Modality Cross-Domain Federated Medical Segmentation

You Zhou, Lijiang Chen, Shuchang Lyu, Guangxia Cui, Wenpei Bai, Zheng Zhou, Meng Li, Guangliang Cheng, Huiyu Zhou, Qi Zhao

Main category: cs.CV

TL;DR: FedDA is a federated domain adaptation framework for cross-domain medical image segmentation that uses feature-level adversarial learning to align feature maps across clients, enabling single modality clients to process cross-modality data.

Details

Motivation: Address the challenge of modality heterogeneity in federated learning for medical imaging, where different clients possess different medical image modalities due to resource imbalance or data issues.

Method: Propose feature-level adversarial learning among clients by aligning feature maps through adversarial training mechanism to enhance model generalization and reduce domain-shift impact.

Result: Comprehensive experiments on three medical image datasets show FedDA achieves cross-domain federated aggregation, enables single modality clients with cross-modality processing, and outperforms state-of-the-art federated aggregation algorithms.

Conclusion: FedDA successfully addresses modality heterogeneity in federated medical image segmentation through adversarial domain adaptation, providing robust cross-modality processing capabilities.

Abstract: Federated learning enables collaborative training of machine learning models among different clients while ensuring data privacy, emerging as the mainstream for breaking data silos in the healthcare domain. However, the imbalance of medical resources, data corruption or improper data preservation may lead to a situation where different clients possess medical images of different modality. This heterogeneity poses a significant challenge for cross-domain medical image segmentation within the federated learning framework. To address this challenge, we propose a new Federated Domain Adaptation (FedDA) segmentation training framework. Specifically, we propose a feature-level adversarial learning among clients by aligning feature maps across clients through embedding an adversarial training mechanism. This design can enhance the model’s generalization on multiple domains and alleviate the negative impact from domain-shift. Comprehensive experiments on three medical image datasets demonstrate that our proposed FedDA substantially achieves cross-domain federated aggregation, endowing single modality client with cross-modality processing capabilities, and consistently delivers robust performance compared to state-of-the-art federated aggregation algorithms in objective and subjective assessment. Our code are available at https://github.com/GGbond-study/FedDA.

[535] EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng liu

Main category: cs.CV

TL;DR: This paper introduces EditScore, a specialized reward model for instruction-guided image editing that enables effective reinforcement learning by providing high-fidelity reward signals, overcoming previous limitations in the field.

Details

Motivation: Current instruction-guided image editing models struggle with complex instructions and require multiple samples. Reinforcement learning offers promise but has been hindered by the lack of high-fidelity, efficient reward signals for evaluating editing quality.

Method: Developed EditReward-Bench benchmark for systematic reward model evaluation, then created EditScore reward models (7B-72B) through meticulous data curation and filtering. Implemented self-ensemble strategy tailored for generative nature of EditScore. Applied RL framework to OmniGen2 base model using EditScore as reward signal.

Result: EditScore matches performance of proprietary VLMs, with largest variant surpassing GPT-5 in benchmark. EditScore enables efficient and robust policy optimization where other VLMs fail. Final model shows substantial and consistent performance uplift when applied to OmniGen2.

Conclusion: A high-fidelity, domain-specialized reward model is key to unlocking RL’s full potential in image editing. This work provides the first systematic path from benchmarking to reward modeling to RL training in this domain.

Abstract: Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of learning proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 in the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.

[536] MoReact: Generating Reactive Motion from Textual Descriptions

Xiyan Xu, Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui

Main category: cs.CV

TL;DR: MoReact is a diffusion-based method for text-driven human reaction generation that disentangles global trajectory and local motion generation to create realistic, diverse reactions that semantically match described interactions.

Details

Motivation: Existing methods fail to integrate rich semantic information and lack adaptive responsiveness to diverse interaction scenarios. Current approaches either treat multiple individuals as a single entity or rely solely on one person's motion, missing the semantic underpinnings of human interactions.

Method: MoReact uses a diffusion-based approach that sequentially generates global trajectories first, then local motions. It introduces a novel interaction loss to enhance realism of close interactions and is trained on data adapted from a two-person motion dataset.

Result: The method produces realistic, diverse, and controllable reactions that closely match counterpart movements while adhering to textual guidance. Experiments demonstrate efficacy for this novel task of text-driven human reaction generation.

Conclusion: MoReact effectively addresses limitations of existing models by focusing on text-driven reaction generation with sequential trajectory-motion disentanglement, achieving better alignment with actions and text descriptions while enhancing interaction realism.

Abstract: Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person’s motion to generate the other’s reaction, failing to integrate the rich semantic information that underpins human interactions. Yet, these methods often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model specifically generates realistic motion sequences for individuals that responding to the other’s actions based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent’s movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach for this novel task, which is capable of producing realistic, diverse, and controllable reactions that not only closely match the movements of the counterpart but also adhere to the textual guidance. Please find our webpage at https://xiyan-xu.github.io/MoReactWebPage.

[537] Revisit the Imbalance Optimization in Multi-task Learning: An Experimental Analysis

Yihang Guo, Tianyuan Yu, Liang Bai, Yanming Guo, Yirun Ruan, William Li, Weishi Zheng

Main category: cs.CV

TL;DR: Multi-task learning suffers from optimization imbalance where task interference causes subpar performance. The paper shows gradient norm correlates with this imbalance, and scaling losses by gradient norms achieves performance comparable to expensive grid search.

Details

Motivation: To understand and address the persistent problem of optimization imbalance in multi-task learning, where task interference leads to worse performance than single-task models despite promising potential.

Method: Systematic experimental analysis examining factors contributing to optimization imbalance, including testing existing methods across datasets, analyzing Vision Foundation Models, and investigating gradient dynamics.

Result: Found strong correlation between optimization imbalance and task-specific gradient norms. Demonstrated that scaling task losses according to gradient norms achieves performance comparable to computationally expensive grid search.

Conclusion: Understanding and controlling gradient dynamics is a more direct path to stable multi-task learning than developing increasingly complex methods, with gradient norm-based loss scaling providing an effective solution.

Abstract: Multi-task learning (MTL) aims to build general-purpose vision systems by training a single network to perform multiple tasks jointly. While promising, its potential is often hindered by “unbalanced optimization”, where task interference leads to subpar performance compared to single-task models. To facilitate research in MTL, this paper presents a systematic experimental analysis to dissect the factors contributing to this persistent problem. Our investigation confirms that the performance of existing optimization methods varies inconsistently across datasets, and advanced architectures still rely on costly grid-searched loss weights. Furthermore, we show that while powerful Vision Foundation Models (VFMs) provide strong initialization, they do not inherently resolve the optimization imbalance, and merely increasing data quantity offers limited benefits. A crucial finding emerges from our analysis: a strong correlation exists between the optimization imbalance and the norm of task-specific gradients. We demonstrate that this insight is directly applicable, showing that a straightforward strategy of scaling task losses according to their gradient norms can achieve performance comparable to that of an extensive and computationally expensive grid search. Our comprehensive analysis suggests that understanding and controlling gradient dynamics is a more direct path to stable MTL than developing increasingly complex methods.

[538] Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives

Kuanrong Liu, Siyuan Liang, Cheng Qian, Ming Zhang, Xiaochun Cao

Main category: cs.CV

TL;DR: This paper analyzes adversarial example transfer across CLIP-based models and proposes MT-AdvCLIP, a framework that enhances cross-task attack effectiveness by leveraging fine-grained tasks.

Details

Motivation: CLIP struggles with fine-grained tasks and its robustness to adversarial perturbations remains underexplored. Understanding adversarial transfer across tasks is crucial for assessing CLIP's generalization limits and security risks.

Method: Proposed Multi-Task Adversarial CLIP (MT-AdvCLIP) with task-aware feature aggregation loss to generate perturbations with enhanced cross-task generalization capability, strengthening attack effectiveness of fine-grained task models on shared CLIP backbone.

Result: MT-AdvCLIP significantly improves adversarial transfer success rate by over 39% on average across multiple tasks against various CLIP-derived models without increasing perturbation budget.

Conclusion: The study reveals adversarial example transfer mechanisms in multi-task CLIP models and provides insights for multi-task robustness evaluation and adversarial example design.

Abstract: As a general-purpose vision-language pretraining model, CLIP demonstrates strong generalization ability in image-text alignment tasks and has been widely adopted in downstream applications such as image classification and image-text retrieval. However, it struggles with fine-grained tasks such as object detection and semantic segmentation. While many variants aim to improve CLIP on these tasks, its robustness to adversarial perturbations remains underexplored. Understanding how adversarial examples transfer across tasks is key to assessing CLIP’s generalization limits and security risks. In this work, we conduct a systematic empirical analysis of the cross-task transfer behavior of CLIP-based models on image-text retrieval, object detection, and semantic segmentation under adversarial perturbations. We find that adversarial examples generated from fine-grained tasks (e.g., object detection and semantic segmentation) often exhibit stronger transfer potential than those from coarse-grained tasks, enabling more effective attacks against the original CLIP model. Motivated by this observation, we propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability. This design strengthens the attack effectiveness of fine-grained task models on the shared CLIP backbone. Experimental results on multiple public datasets show that MT-AdvCLIP significantly improves the adversarial transfer success rate (The average attack success rate across multiple tasks is improved by over 39%.) against various CLIP-derived models, without increasing the perturbation budget. This study reveals the transfer mechanism of adversarial examples in multi-task CLIP models, offering new insights into multi-task robustness evaluation and adversarial example design.

[539] Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

Longtao Jiang, Mingfei Han, Lei Chen, Yongqiang Yu, Feng Zhao, Xiaojun Chang, Zhihui Li

Main category: cs.CV

TL;DR: Token Painter is a training-free text-guided image inpainting method using Mask AutoRegressive models that addresses background consistency and text alignment issues through dual-stream encoder fusion and adaptive attention enhancement.

Details

Motivation: Diffusion-based inpainting methods struggle with text alignment and background consistency due to modeling the entire image in latent space. MAR models offer better local controllability but need improvement for text-guided tasks.

Method: Proposes Token Painter with two key components: Dual-Stream Encoder Information Fusion (DEIF) for semantic fusion in frequency domain, and Adaptive Decoder Attention Score Enhancing (ADAE) for attention enhancement on guidance and inpainting tokens.

Result: Outperforms prior state-of-the-art methods across almost all metrics and delivers superior visual results without requiring training.

Conclusion: The training-free Token Painter method effectively solves text-guided image inpainting challenges by leveraging MAR models with novel fusion and attention mechanisms.

Abstract: Text-guided image inpainting aims to inpaint masked image regions based on a textual prompt while preserving the background. Although diffusion-based methods have become dominant, their property of modeling the entire image in latent space makes it challenging for the results to align well with prompt details and maintain a consistent background. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating latent tokens corresponding to mask regions, enabling better local controllability without altering the background. However, directly applying MAR to this task makes the inpainting content either ignore the prompts or be disharmonious with the background context. Through analysis of the attention maps from the inpainting images, we identify the impact of background tokens on text tokens during the MAR generation, and leverage this to design \textbf{Token Painter}, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content while keeping harmonious with background context. (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further enhance the alignment of prompt details and the content visual quality. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics and delivers superior visual results. Codes will be released.

[540] DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation

Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, Zaiqing Nie

Main category: cs.CV

TL;DR: A closed-loop evaluation framework for autonomous driving that integrates real-world traffic scenarios into CARLA simulator using infrastructure sensors and digital twins of intersections.

Details

Motivation: Current CARLA benchmarks use manually configured traffic scenarios that diverge from real-world conditions, limiting their ability to reflect actual driving performance.

Method: Extracted 800 dynamic traffic scenarios from 100-hour infrastructure sensor videos and created static digital twin assets for 15 real-world intersections with consistent visual appearance in CARLA.

Result: Created a challenging evaluation framework that accurately replicates real-world traffic and environmental characteristics, enabling more realistic simulations.

Conclusion: Provides a comprehensive closed-loop benchmark for evaluating end-to-end autonomous driving models that better reflects real-world driving conditions.

Abstract: Closed-loop evaluation is increasingly critical for end-to-end autonomous driving. Current closed-loop benchmarks using the CARLA simulator rely on manually configured traffic scenarios, which can diverge from real-world conditions, limiting their ability to reflect actual driving performance. To address these limitations, we introduce a simple yet challenging closed-loop evaluation framework that closely integrates real-world driving scenarios into the CARLA simulator with infrastructure cooperation. Our approach involves extracting 800 dynamic traffic scenarios selected from a comprehensive 100-hour video dataset captured by high-mounted infrastructure sensors, and creating static digital twin assets for 15 real-world intersections with consistent visual appearance. These digital twins accurately replicate the traffic and environmental characteristics of their real-world counterparts, enabling more realistic simulations in CARLA. This evaluation is challenging due to the diversity of driving behaviors, locations, weather conditions, and times of day at complex urban intersections. In addition, we provide a comprehensive closed-loop benchmark for evaluating end-to-end autonomous driving models. Project URL: \href{https://github.com/AIR-THU/DriveE2E}{https://github.com/AIR-THU/DriveE2E}.

[541] Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks

Alexandros Doumanoglou, Kurt Driessens, Dimitrios Zarpalas

Main category: cs.CV

TL;DR: The paper proposes a method to recover encoding-decoding direction pairs in deep vision networks, enabling interpretability, debugging, and model correction by identifying concept embeddings and their latent factors.

Details

Motivation: To open the black box of deep networks by understanding how they represent concepts as directions in latent space, enabling model understanding, debugging, and improvement through concept unlearning.

Method: Identifies decoding directions via directional clustering of activations, estimates encoding directions with signal vectors under a probabilistic view, and leverages network weights through Uncertainty Region Alignment to reveal interpretable directions.

Result: (a) Recovers ground-truth direction pairs on synthetic data; (b) Decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines on real data; (c) Signal vectors faithfully estimate encoding directions, validated via activation maximization.

Conclusion: The method successfully recovers interpretable concept directions, enabling applications in understanding global model behavior, explaining individual predictions, and producing counterfactuals or correcting errors through interventions.

Abstract: Empirical evidence shows that deep vision networks represent concepts as directions in latent space, vectors we call concept embeddings. Each concept has a latent factor-a scalar-indicating its presence in an input patch. For a given patch, multiple latent factors are encoded into a compact representation by linearly combining concept embeddings, with the factors as coefficients. Since these embeddings enable such encoding, we call them encoding directions. A latent factor can be recovered via the inner product with a filter, a vector we call a decoding direction. These encoding-decoding direction pairs are not directly accessible, but recovering them helps open the black box of deep networks, enabling understanding, debugging, and improving models. Decoder directions attribute meaning to latent codes, while encoding directions assess concept influence on predictions, with both enabling model correction by unlearning irrelevant concepts. Unlike prior matrix decomposition, autoencoder, or dictionary learning methods that rely on feature reconstruction, we propose a new perspective: decoding directions are identified via directional clustering of activations, and encoding directions are estimated with signal vectors under a probabilistic view. We further leverage network weights through a novel technique, Uncertainty Region Alignment, which reveals interpretable directions affecting predictions. Our analysis shows that (a) on synthetic data, our method recovers ground-truth direction pairs; (b) on real data, decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines; and (c) signal vectors faithfully estimate encoding directions, validated via activation maximization. Finally, we demonstrate applications in understanding global model behavior, explaining individual predictions, and intervening to produce counterfactuals or correct errors.

[542] SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing

Yi Yang, Xiaokun Zhang, Qingchen Fang, Ziqi Ye, Rui Li, Li Liu, Haipeng Wang

Main category: cs.CV

TL;DR: SAR-KnowLIP is the first universal SAR multimodal foundational model that addresses the gap in cross-modal AI for synthetic aperture radar imagery by incorporating geographic information, hierarchical cognitive chain-of-thought annotations, and self-consistent iterative optimization.

Details

Motivation: Existing cross-modal AI methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery, which has all-day, all-weather imaging capabilities and plays an irreplaceable role in remote sensing scene understanding.

Method: The approach includes: (1) constructing SAR-GEOVL-1M dataset with geographic projection properties; (2) generating aligned structured text through hierarchical cognitive chain-of-thought (HCoT); (3) designing Self-Consistent Iterative Optimization mechanism for cross-modal alignment; (4) establishing unified evaluation benchmark across 11 downstream tasks.

Result: SAR-KnowLIP demonstrates leading performance compared to 14 leading foundation models, particularly in object counting and land-cover classification tasks.

Conclusion: SAR-KnowLIP’s large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark are expected to significantly advance the development of SAR multimodal baseline models.

Abstract: Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. However, existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery. SAR, with its all-day, all-weather imaging capabilities, plays an irreplaceable role in remote sensing scene understanding. To address this gap, this paper proposes SAR-KnowLIP, the first universal SAR multimodal foundational model, along with reusable data and evaluation baselines. Specifically: (1) This work introduces the critical yet long-overlooked attribute of geographic information into remote sensing research, constructing SAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection properties), covering multiple satellite platforms, 120,000 images, and 135 cities. (2) Aligned structured text is generated through a hierarchical cognitive chain-of-thought (HCoT), providing more than one million multi-dimensional semantic annotations of landforms, regional functions, target attributes, and spatial relationships. (3) We design a Self-Consistent Iterative Optimization mechanism that continuously enhances cross-modal alignment through a self-supervised closed loop of contrastive, matching, and reconstruction learning on a transferable multimodal encoder. (4) A unified evaluation benchmark is established across 11 representative downstream vision and vision-language tasks, with comparisons against 14 leading foundation models, where SAR-KnowLIP demonstrates leading performance, particularly in object counting and land-cover classification. We expect that SAR-KnowLIP’s large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark will significantly advance the development of SAR multimodal baseline models.

[543] AutoPrune: Each Complexity Deserves a Pruning Policy

Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang

Main category: cs.CV

TL;DR: AutoPrune is a training-free framework that adaptively prunes visual tokens in vision-language models based on sample complexity, achieving 89% token reduction while retaining 96.7% accuracy.

Details

Motivation: Existing pruning methods use fixed schedules that don't align with the model's reasoning trajectory, failing to accommodate diverse input complexities.

Method: Quantifies mutual information between visual and textual tokens, then projects this to a budget-constrained logistic retention curve that adapts to task complexity.

Result: Prunes 89% of visual tokens, reduces FLOPs by 76.8% while retaining 96.7% accuracy on LLaVA-1.5-7B, outperforming PDrop by 9.1%.

Conclusion: Complexity-adaptive pruning effectively reduces computational demands while maintaining performance by aligning token elimination with the model’s reasoning process.

Abstract: The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies where, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model’s holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal to a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of different tasks and can guarantee adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating the effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.

[544] CrashSplat: 2D to 3D Vehicle Damage Segmentation in Gaussian Splatting

Dragoş-Andrei Chileban, Andrei-Ştefan Bulzan, Cosmin Cernǎzanu-Glǎvan

Main category: cs.CV

TL;DR: This paper introduces an automatic car damage detection pipeline using 3D Gaussian Splatting for single-view 3D segmentation, enabling damage detection from limited views where multi-view approaches fail.

Details

Motivation: Current car damage detection methods mainly use 2D image analysis, but 3D reconstruction can provide more comprehensive and geometrically accurate damage assessment. 3D Gaussian Splatting shows promise for accurate 3D reconstruction from limited views.

Method: The pipeline performs 3D damage segmentation by up-lifting 2D masks. Uses a learning-free approach for single-view 3D-GS segmentation: projects Gaussians onto image plane using SfM camera parameters, then filters them using Z-buffering with normal distribution model of depth and opacities.

Result: The method is particularly effective for challenging car damage detection scenarios where target objects (scratches, small dents) may only be clearly visible in a single view, making multi-view consistency approaches impractical.

Conclusion: The proposed single-view 3D Gaussian Splatting segmentation approach provides an effective solution for car damage detection in scenarios with limited views, overcoming limitations of multi-view methods.

Abstract: Automatic car damage detection has been a topic of significant interest for the auto insurance industry as it promises faster, accurate, and cost-effective damage assessments. However, few works have gone beyond 2D image analysis to leverage 3D reconstruction methods, which have the potential to provide a more comprehensive and geometrically accurate representation of the damage. Moreover, recent methods employing 3D representations for novel view synthesis, particularly 3D Gaussian Splatting (3D-GS), have demonstrated the ability to generate accurate and coherent 3D reconstructions from a limited number of views. In this work we introduce an automatic car damage detection pipeline that performs 3D damage segmentation by up-lifting 2D masks. Additionally, we propose a simple yet effective learning-free approach for single-view 3D-GS segmentation. Specifically, Gaussians are projected onto the image plane using camera parameters obtained via Structure from Motion (SfM). They are then filtered through an algorithm that utilizes Z-buffering along with a normal distribution model of depth and opacities. Through experiments we found that this method is particularly effective for challenging scenarios like car damage detection, where target objects (e.g., scratches, small dents) may only be clearly visible in a single view, making multi-view consistency approaches impractical or impossible. The code is publicly available at: https://github.com/DragosChileban/CrashSplat.

[545] HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fanbin Lu, Qinglin Lu, Yuyang Peng, Yuanbo Peng, Xiangwei Shen, Yixuan Shi, Jiale Tao, Yangyu Tao, Qi Tian, Pengfei Wan, Chunyu Wang, Kai Wang, Lei Wang, Linqing Wang, Lucas Wang, Qixun Wang, Weiyan Wang, Hao Wen, Bing Wu, Jianbing Wu, Yue Wu, Senhao Xie, Fang Yang, Miles Yang, Xiaofeng Yang, Xuan Yang, Zhantao Yang, Jingmiao Yu, Zheng Yuan, Chao Zhang, Jian-Wei Zhang, Peizhen Zhang, Shi-Xue Zhang, Tao Zhang, Weigang Zhang, Yepeng Zhang, Yingfang Zhang, Zihao Zhang, Zijian Zhang, Penghao Zhao, Zhiyuan Zhao, Xuefei Zhe, Jianchen Zhu, Zhao Zhong

Main category: cs.CV

TL;DR: HunyuanImage 3.0 is a native multimodal model that unifies understanding and generation in an autoregressive framework, featuring an 80B parameter Mixture-of-Experts architecture with 13B active parameters per token, making it the largest open-source image generative model.

Details

Motivation: To create a unified multimodal model that combines understanding and generation capabilities within a single autoregressive framework, and to provide the community with a state-of-the-art foundation model for multimodal research.

Method: Used meticulous data curation, advanced architecture design, native Chain-of-Thoughts schema, progressive pre-training, aggressive post-training, and efficient infrastructure. Built a Mixture-of-Experts model with 80B total parameters (13B active per token).

Result: Extensive experiments show HunyuanImage 3.0 rivals previous state-of-the-art models in both automatic and human evaluation of text-image alignment and visual quality.

Conclusion: The model represents a significant advancement in multimodal AI and has been open-sourced to enable community exploration and foster a vibrant multimodal ecosystem.

Abstract: We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0

[546] ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation

Shilan Zhang, Jirui Huang, Ruilin Yao, Cong Wang, Yaxiong Chen, Peng Xu, Shengwu Xiong

Main category: cs.CV

TL;DR: ColLab is an automated data generation engine for Referring Expression Comprehension and Generation that eliminates manual annotation through collaborative multimodal model interaction and spatial progressive augmentation.

Details

Motivation: Existing REC and REG datasets rely on labor-intensive manual annotation that is difficult to scale, creating a need for automated data generation methods.

Method: Uses Collaborative Multimodal Model Interaction (CMMI) with MLLMs and LLMs for description generation, and Spatial Progressive Augmentation (SPA) to enhance spatial expressiveness among duplicate instances.

Result: Significantly accelerates annotation process while improving quality and discriminability of generated expressions; partially adopted in ICCV 2025 MARS2 Challenge dataset generation.

Conclusion: ColLab enables fully automated REC and REG data generation without human supervision, producing diverse and challenging samples that better reflect real-world reasoning demands.

Abstract: Referring Expression Comprehension (REC) and Referring Expression Generation (REG) are fundamental tasks in multimodal understanding, supporting precise object localization through natural language. However, existing REC and REG datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. In this paper, we propose ColLab, a collaborative spatial progressive data engine that enables fully automated REC and REG data generation without human supervision. Specifically, our method introduces a Collaborative Multimodal Model Interaction (CMMI) strategy, which leverages the semantic understanding of multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions. Furthermore, we design a module termed Spatial Progressive Augmentation (SPA) to enhance spatial expressiveness among duplicate instances. Experiments demonstrate that ColLab significantly accelerates the annotation process of REC and REG while improving the quality and discriminability of the generated expressions. In addition to the core methodological contribution, our framework was partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning, enriching the dataset with diverse and challenging samples that better reflect real-world reasoning demands.

[547] Reinforcement Learning with Inverse Rewards for World Model Post-training

Yang Ye, Tianyu He, Shuo Yang, Jiang Bian

Main category: cs.CV

TL;DR: RLIR is a post-training framework that uses inverse dynamics models to derive verifiable reward signals from generated videos, improving action-following capability in video world models without requiring large-scale preference annotations.

Details

Motivation: Current video world models have improved visual quality and temporal consistency but lack accurate action-following capability. Reinforcement learning could help but is impractical due to high costs of preference annotations and infeasibility of rule-based video verifiers.

Method: Proposes RLIR framework that uses an Inverse Dynamics Model to recover input actions from generated videos, mapping high-dimensional video to low-dimensional action space for verifiable reward signals, optimized via Group Relative Policy Optimization.

Result: Experiments show 5-10% gains in action-following, up to 10% improvements in visual quality, and higher human preference scores across autoregressive and diffusion paradigms.

Conclusion: RLIR is the first post-training method specifically designed to enhance action-following in video world models, providing an effective solution without requiring expensive preference annotations.

Abstract: World models simulate dynamic environments, enabling agents to interact with diverse input modalities. Although recent advances have improved the visual quality and temporal consistency of video world models, their ability of accurately modeling human-specified actions remains under-explored. Reinforcement learning presents a promising approach for directly improving the suboptimal action-following capability of pre-trained models, assuming that an appropriate reward function can be defined. However, transferring reinforcement learning post-training methods to world model is impractical due to the prohibitive cost of large-scale preference annotations and the infeasibility of constructing rule-based video verifiers. To address this gap, we propose Reinforcement Learning with Inverse Rewards (RLIR), a post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model. By mapping high-dimensional video modality to a low-dimensional action space, RLIR provides an objective and verifiable reward for optimization via Group Relative Policy Optimization. Experiments across autoregressive and diffusion paradigms demonstrate 5-10% gains in action-following, up to 10% improvements in visual quality, and higher human preference scores, establishing RLIR as the first post-training method specifically designed to enhance action-following in video world models.

[548] A Novel Hybrid Deep Learning and Chaotic Dynamics Approach for Thyroid Cancer Classification

Nada Bouchekout, Abdelkrim Boukabou, Morad Grimes, Yassine Habchi, Yassine Himeur, Hamzah Ali Alkhazaleh, Shadi Atalla, Wathiq Mansoor

Main category: cs.CV

TL;DR: An intelligent classification method combining adaptive CNN with CDF9/7 wavelets modulated by chaotic system achieves state-of-the-art thyroid cancer diagnosis from ultrasound images with 98.17% accuracy.

Details

Motivation: Timely and accurate diagnosis is crucial for effective thyroid cancer treatment and improved patient outcomes, addressing the global rise in thyroid cancer cases.

Method: Couples Adaptive CNN with CDF9/7 wavelets whose detail coefficients are modulated by n-scroll chaotic system to enrich discriminative features. Evaluated on DDTI thyroid ultrasound dataset using 5-fold cross-validation.

Result: Achieves 98.17% accuracy, 98.76% sensitivity, 97.58% specificity, 97.55% F1-score, and AUC of 0.9912. Outperforms state-of-the-art backbones (EfficientNetV2-S, Swin-T, ViT-B/16, ConvNeXt-T) by +1.23 points in accuracy. Chaotic modulation improves accuracy by +8.79 percentage points.

Conclusion: The wavelet-chaos-CNN pipeline delivers state-of-the-art thyroid ultrasound classification with strong generalization, practical runtime characteristics (28.7 ms per image), and suitability for clinical integration.

Abstract: Timely and accurate diagnosis is crucial in addressing the global rise in thyroid cancer, ensuring effective treatment strategies and improved patient outcomes. We present an intelligent classification method that couples an Adaptive Convolutional Neural Network (CNN) with Cohen-Daubechies-Feauveau (CDF9/7) wavelets whose detail coefficients are modulated by an n-scroll chaotic system to enrich discriminative features. We evaluate on the public DDTI thyroid ultrasound dataset (n = 1,638 images; 819 malignant / 819 benign) using 5-fold cross-validation, where the proposed method attains 98.17% accuracy, 98.76% sensitivity, 97.58% specificity, 97.55% F1-score, and an AUC of 0.9912. A controlled ablation shows that adding chaotic modulation to CDF9/7 improves accuracy by +8.79 percentage points over a CDF9/7-only CNN (from 89.38% to 98.17%). To objectively position our approach, we trained state-of-the-art backbones on the same data and splits: EfficientNetV2-S (96.58% accuracy; AUC 0.987), Swin-T (96.41%; 0.986), ViT-B/16 (95.72%; 0.983), and ConvNeXt-T (96.94%; 0.987). Our method outperforms the best of these by +1.23 points in accuracy and +0.0042 in AUC, while remaining computationally efficient (28.7 ms per image; 1,125 MB peak VRAM). Robustness is further supported by cross-dataset testing on TCIA (accuracy 95.82%) and transfer to an ISIC skin-lesion subset (n = 28 unique images, augmented to 2,048; accuracy 97.31%). Explainability analyses (Grad-CAM, SHAP, LIME) highlight clinically relevant regions. Altogether, the wavelet-chaos-CNN pipeline delivers state-of-the-art thyroid ultrasound classification with strong generalization and practical runtime characteristics suitable for clinical integration.

[549] VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion

Kargi Chauhan, Leilani H. Gilpin

Main category: cs.CV

TL;DR: VFSI enforces physical constraints during diffusion sampling to fix systematic violations in traffic simulation models, reducing collisions by 67% and improving validity by 87% without model retraining.

Details

Motivation: Current diffusion models for traffic simulation systematically violate physical constraints - 50% of generated trajectories show vehicles colliding, driving off roads, or spawning inside buildings. Physical validity is treated as emergent rather than required.

Method: Proposed Validity-First Spatial Intelligence (VFSI) uses energy-based guidance during diffusion sampling to enforce constraints. Incorporates collision avoidance and kinematic constraints as energy functions to guide denoising toward physically valid trajectories.

Result: Across 200 urban scenarios from Waymo Open Motion Dataset, VFSI reduced collision rates by 67% (24.6% to 8.1%) and improved overall validity by 87% (50.3% to 94.2%). Also improved realism metrics (ADE: 1.34m to 1.21m).

Conclusion: Explicit constraint enforcement during inference is both necessary and sufficient for physically valid traffic simulation. Model-agnostic approach demonstrates that physical validity should be an architectural requirement, not emergent property.

Abstract: Modern diffusion models generate realistic traffic simulations but systematically violate physical constraints. In a large-scale evaluation of SceneDiffuser++, a state-of-the-art traffic simulator, we find that 50% of generated trajectories violate basic physical laws - vehicles collide, drive off roads, and spawn inside buildings. This reveals a fundamental limitation: current models treat physical validity as an emergent property rather than an architectural requirement. We propose Validity-First Spatial Intelligence (VFSI), which enforces constraints through energy-based guidance during diffusion sampling, without model retraining. By incorporating collision avoidance and kinematic constraints as energy functions, we guide the denoising process toward physically valid trajectories. Across 200 urban scenarios from the Waymo Open Motion Dataset, VFSI reduces collision rates by 67% (24.6% to 8.1%) and improves overall validity by 87% (50.3% to 94.2%), while simultaneously improving realism metrics (ADE: 1.34m to 1.21m). Our model-agnostic approach demonstrates that explicit constraint enforcement during inference is both necessary and sufficient for physically valid traffic simulation.

[550] Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution

Jinpei Guo, Yifei Ji, Zheng Chen, Yufei Wang, Sizhuo Ma, Yong Guo, Yulun Zhang, Jian Wang

Main category: cs.CV

TL;DR: OASIS is an efficient one-step diffusion model with attention specialization for real-world video super-resolution that reduces computational redundancy while maintaining performance.

Details

Motivation: Direct adaptation of diffusion models to video super-resolution creates redundancy since low-quality videos already contain substantial content information, leading to increased computational overhead and learning burden.

Method: Proposes attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors, and a progressive training strategy starting with temporally consistent degradations then shifting to inconsistent settings.

Result: Achieves state-of-the-art performance on both synthetic and real-world datasets with 6.2× speedup over one-step diffusion baselines like SeedVR2.

Conclusion: OASIS effectively mitigates redundancy in diffusion models for VSR while preserving pretrained knowledge, enabling efficient and high-performance video super-resolution.

Abstract: Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient $\textbf{o}$ne-step diffusion model with $\textbf{a}$ttention $\textbf{s}$pecialization for real-world v$\textbf{i}$deo $\textbf{s}$uper-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a $\textbf{6.2$\times$}$ speedup over one-step diffusion baselines such as SeedVR2. The code will be available at \href{https://github.com/jp-guo/OASIS}{https://github.com/jp-guo/OASIS}.

[551] RPG360: Robust 360 Depth Estimation with Perspective Foundation Models and Graph Optimization

Dongki Jung, Jaehoon Choi, Yonghan Lee, Dinesh Manocha

Main category: cs.CV

TL;DR: RPG360 is a training-free 360 monocular depth estimation method that uses perspective foundation models and graph optimization to achieve robust depth estimation for omnidirectional images without requiring labeled datasets.

Details

Motivation: The increasing use of 360 images across various domains has created a need for robust depth estimation techniques, but obtaining large-scale labeled datasets for 360 depth estimation remains a significant challenge.

Method: Converts 360 images into six-face cubemap representations, uses perspective foundation models to estimate depth and surface normals, and introduces a novel depth scale alignment technique using graph-based optimization with per-face scale parameters to ensure consistency across cubemap faces.

Result: Achieves superior performance across diverse datasets (Matterport3D, Stanford2D3D, 360Loc) and demonstrates benefits in downstream tasks: feature matching (3.2-5.4% improvement) and Structure from Motion (0.2-9.7% improvement in AUC@5).

Conclusion: The proposed RPG360 method provides a robust, training-free solution for 360 monocular depth estimation that leverages foundation models’ zero-shot robustness and graph optimization for depth scale consistency, showing strong performance across multiple datasets and downstream applications.

Abstract: The increasing use of 360 images across various domains has emphasized the need for robust depth estimation techniques tailored for omnidirectional images. However, obtaining large-scale labeled datasets for 360 depth estimation remains a significant challenge. In this paper, we propose RPG360, a training-free robust 360 monocular depth estimation method that leverages perspective foundation models and graph optimization. Our approach converts 360 images into six-face cubemap representations, where a perspective foundation model is employed to estimate depth and surface normals. To address depth scale inconsistencies across different faces of the cubemap, we introduce a novel depth scale alignment technique using graph-based optimization, which parameterizes the predicted depth and normal maps while incorporating an additional per-face scale parameter. This optimization ensures depth scale consistency across the six-face cubemap while preserving 3D structural integrity. Furthermore, as foundation models exhibit inherent robustness in zero-shot settings, our method achieves superior performance across diverse datasets, including Matterport3D, Stanford2D3D, and 360Loc. We also demonstrate the versatility of our depth estimation approach by validating its benefits in downstream tasks such as feature matching 3.2 ~ 5.4% and Structure from Motion 0.2 ~ 9.7% in AUC@5.

[552] Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning

Muleilan Pei, Shaoshuai Shi, Shaojie Shen

Main category: cs.CV

TL;DR: SMART-R1 is a novel R1-style reinforcement fine-tuning framework for multi-agent traffic simulation that addresses distributional shift through metric-oriented policy optimization and iterative SFT-RFT-SFT training, achieving state-of-the-art performance on Waymo benchmarks.

Details

Motivation: Existing data-driven traffic simulators rely on supervised learning but suffer from distributional shift between training and testing, which undermines model generalization in unseen environments.

Method: Proposes SMART-R1 with R1-style reinforcement fine-tuning for next-token prediction models, featuring metric-oriented policy optimization and iterative SFT-RFT-SFT training strategy alternating between Supervised Fine-Tuning and Reinforcement Fine-Tuning.

Result: Achieves state-of-the-art performance with overall realism meta score of 0.7858 on Waymo Open Sim Agents Challenge, ranking first on the leaderboard.

Conclusion: SMART-R1 demonstrates that simple yet powerful R1-style training framework effectively enhances foundation models for multi-agent traffic simulation, providing better alignment with human preferences and evaluation metrics.

Abstract: Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative “SFT-RFT-SFT” training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.

[553] TREAT-Net: Tabular-Referenced Echocardiography Analysis for Acute Coronary Syndrome Treatment Prediction

Diane Kim, Minh Nguyen Nhat To, Sherif Abdalla, Teresa S. M. Tsang, Purang Abolmaesumi, and Christina Luong

Main category: cs.CV

TL;DR: TREAT-Net is a multimodal deep learning framework that uses echocardiography videos and clinical records to predict ACS treatment, achieving 67.6% balanced accuracy and 71.1% AUROC.

Details

Motivation: Coronary angiography is invasive and resource-intensive, causing diagnostic delays and postponed treatment for ACS patients. A non-invasive alternative is needed for timely triage.

Method: Multimodal deep learning framework integrating echocardiography videos and clinical records using tabular-guided cross-attention and late fusion mechanism.

Result: Outperformed unimodal and non-fused baselines with 67.6% balanced accuracy and 71.1% AUROC. Cross-modality agreement showed 88.6% accuracy for intervention prediction.

Conclusion: TREAT-Net shows potential as a non-invasive tool for timely and accurate patient triage, especially beneficial for underserved populations with limited access to angiography.

Abstract: Coronary angiography remains the gold standard for diagnosing Acute Coronary Syndrome (ACS). However, its resource-intensive and invasive nature can expose patients to procedural risks and diagnostic delays, leading to postponed treatment initiation. In this work, we introduce TREAT-Net, a multimodal deep learning framework for ACS treatment prediction that leverages non-invasive modalities, including echocardiography videos and structured clinical records. TREAT-Net integrates tabular-guided cross-attention to enhance video interpretation, along with a late fusion mechanism to align predictions across modalities. Trained on a dataset of over 9000 ACS cases, the model outperforms unimodal and non-fused baselines, achieving a balanced accuracy of 67.6% and an AUROC of 71.1%. Cross-modality agreement analysis demonstrates 88.6% accuracy for intervention prediction. These findings highlight the potential of TREAT-Net as a non-invasive tool for timely and accurate patient triage, particularly in underserved populations with limited access to coronary angiography.

[554] Gaze Estimation for Human-Robot Interaction: Analysis Using the NICO Platform

Matej Palider, Omar Eldardeer, Viktor Kocur

Main category: cs.CV

TL;DR: Evaluation of gaze estimation methods in HRI shared workspace scenarios shows angular errors comparable to general benchmarks but practical limitations with median error of 16.48 cm in real-world distances.

Details

Motivation: To assess the practical effectiveness of current gaze estimation methods in Human-Robot Interaction (HRI) contexts, specifically in shared workspace scenarios where accurate gaze tracking is crucial for natural interaction.

Method: Introduced a new annotated dataset collected with the NICO robotic platform and evaluated four state-of-the-art gaze estimation models in a shared workspace HRI context.

Result: Angular errors were close to those reported on general-purpose benchmarks, but when converted to actual distance in the shared workspace, the best median error was 16.48 cm, revealing practical limitations of current methods.

Conclusion: Current gaze estimation methods have significant practical limitations in HRI applications despite good angular accuracy, and recommendations are provided for better integration of gaze estimation as a modality in HRI systems.

Abstract: This paper evaluates the current gaze estimation methods within an HRI context of a shared workspace scenario. We introduce a new, annotated dataset collected with the NICO robotic platform. We evaluate four state-of-the-art gaze estimation models. The evaluation shows that the angular errors are close to those reported on general-purpose benchmarks. However, when expressed in terms of distance in the shared workspace the best median error is 16.48 cm quantifying the practical limitations of current methods. We conclude by discussing these limitations and offering recommendations on how to best integrate gaze estimation as a modality in HRI systems.

[555] SIE3D: Single-image Expressive 3D Avatar generation via Semantic Embedding and Perceptual Expression Loss

Zhiqi Huang, Dulongkai Cui, Jinglu Hu

Main category: cs.CV

TL;DR: SIE3D generates expressive 3D head avatars from a single image and text description, enabling fine-grained control over facial expressions through a novel conditioning scheme and perceptual expression loss.

Details

Motivation: Current methods lack fine-grained, intuitive control over expressions via text when generating 3D head avatars from single images.

Method: Fuses identity features from images with semantic text embeddings through a novel conditioning scheme, and introduces a perceptual expression loss using a pre-trained expression classifier to regularize generation.

Result: Significantly improves controllability and realism, outperforming competitive methods in identity preservation and expression fidelity on a single consumer-grade GPU.

Conclusion: SIE3D provides an effective framework for generating high-fidelity 3D avatars with detailed text-based expression control.

Abstract: Generating high-fidelity 3D head avatars from a single image is challenging, as current methods lack fine-grained, intuitive control over expressions via text. This paper proposes SIE3D, a framework that generates expressive 3D avatars from a single image and descriptive text. SIE3D fuses identity features from the image with semantic embedding from text through a novel conditioning scheme, enabling detailed control. To ensure generated expressions accurately match the text, it introduces an innovative perceptual expression loss function. This loss uses a pre-trained expression classifier to regularize the generation process, guaranteeing expression accuracy. Extensive experiments show SIE3D significantly improves controllability and realism, outperforming competitive methods in identity preservation and expression fidelity on a single consumer-grade GPU. Project page: https://blazingcrystal1747.github.io/SIE3D/

[556] FrameMind: Frame-Interleaved Chain-of-Thought for Video Reasoning via Reinforcement Learning

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, Yujun Cai

Main category: cs.CV

TL;DR: FrameMind is an RL-trained framework that enables dynamic frame sampling for video understanding, using FiCOT to alternate between textual reasoning and active visual perception.

Details

Motivation: Current video models use fixed frame sampling, limiting their ability to adaptively gather visual evidence for tasks requiring broad temporal coverage or fine-grained spatial detail.

Method: FrameMind uses reinforcement learning with Frame-Interleaved Chain-of-Thought (FiCOT) for multi-turn reasoning, Dynamic Resolution Frame Sampling (DRFS) for training, and DRFS-GRPO for policy optimization without frame-level annotations.

Result: Extensive experiments on MLVU and VideoMME benchmarks show significant performance improvements over existing models, advancing state-of-the-art in video understanding.

Conclusion: FrameMind enables flexible and efficient video understanding through dynamic visual information gathering, outperforming traditional static sampling approaches.

Abstract: Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require either broad temporal coverage or fine-grained spatial detail. In this paper, we introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps. To train effective dynamic sampling policies, we propose Dynamic Resolution Frame Sampling (DRFS), which exposes models to diverse temporal-spatial trade-offs during learning, and DRFS-GRPO, a group-relative policy optimization algorithm that learns from outcome-based rewards without requiring frame-level annotations. Extensive experiments on challenging benchmarks like MLVU and VideoMME demonstrate that our method significantly outperforms existing models, advancing the state of the art in flexible and efficient video understanding.

[557] Generalized Category Discovery in Hyperspectral Images via Prototype Subspace Modeling

Xianlu Li, Nicolas Nadisic, Shaoguang Huang, Aleksandra Pizurica

Main category: cs.CV

TL;DR: First GCD framework for hyperspectral images using prototype subspace modeling with basis vectors instead of single prototypes, achieving state-of-the-art performance.

Details

Motivation: Existing GCD methods designed for RGB images don't generalize well to hyperspectral images due to their high-dimensionality and complex spectral structures.

Method: Proposes prototype subspace modeling using basis vectors per category with orthogonality and reconstruction constraints for better class structure representation.

Result: Significantly outperforms state-of-the-art GCD methods on real-world HSI datasets.

Conclusion: Establishes a strong foundation for generalized category discovery in hyperspectral settings with the proposed subspace modeling approach.

Abstract: Generalized category discovery~(GCD) seeks to jointly identify both known and novel categories in unlabeled data. While prior works have mainly focused on RGB images, their assumptions and modeling strategies do not generalize well to hyperspectral images~(HSI), which are inherently high-dimensional and exhibit complex spectral structures. In this paper, we propose the first GCD framework tailored for HSI, introducing a prototype subspace modeling model to better capture class structure. Instead of learning a single prototype vector for each category as in existing methods such as SimGCD, we model each category using a set of basis vectors, forming a subspace representation that enables greater expressiveness and discrimination in a high-dimensional feature space. To guide the learning of such bases, we enforce two key constraints: (1) a basis orthogonality constraint that promotes inter-class separability, and (2) a reconstruction constraint that ensures each prototype basis can effectively reconstruct its corresponding class samples. Experimental results on real-world HSI demonstrate that our method significantly outperforms state-of-the-art GCD methods, establishing a strong foundation for generalized category discovery in hyperspectral settings.

[558] Hazy Pedestrian Trajectory Prediction via Physical Priors and Graph-Mamba

Jian Chen, Zhuoran Zheng, Han Hu, Guijuan Zhang, Dianjie Lu, Liang Li, Chen Lyu

Main category: cs.CV

TL;DR: A deep learning model combining atmospheric scattering physics with pedestrian interaction modeling for trajectory prediction in hazy weather, achieving significant performance improvements in dense haze scenarios.

Details

Motivation: Address physical information degradation and ineffective pedestrian interaction modeling in pedestrian trajectory prediction under hazy weather conditions.

Method: Combines differentiable atmospheric scattering model for haze mitigation, adaptive Mamba variant for feature extraction, and heterogeneous graph attention network with spatio-temporal fusion for pedestrian relationship modeling.

Result: 37.2% and 41.5% reduction in minADE/minFDE metrics compared to SOTA in dense haze scenarios (visibility < 30m), with 78% inference speed increase over native Mamba.

Conclusion: Provides a new modeling paradigm for reliable perception in intelligent transportation systems in adverse environments.

Abstract: To address the issues of physical information degradation and ineffective pedestrian interaction modeling in pedestrian trajectory prediction under hazy weather conditions, we propose a deep learning model that combines physical priors of atmospheric scattering with topological modeling of pedestrian relationships. Specifically, we first construct a differentiable atmospheric scattering model that decouples haze concentration from light degradation through a network with physical parameter estimation, enabling the learning of haze-mitigated feature representations. Second, we design an adaptive scanning state space model for feature extraction. Our adaptive Mamba variant achieves a 78% inference speed increase over native Mamba while preserving long-range dependency modeling. Finally, to efficiently model pedestrian relationships, we develop a heterogeneous graph attention network, using graph matrices to model multi-granularity interactions between pedestrians and groups, combined with a spatio-temporal fusion module to capture the collaborative evolution patterns of pedestrian movements. Furthermore, we constructed a new pedestrian trajectory prediction dataset based on ETH/UCY to evaluate the effectiveness of the proposed method. Experiments show that our method reduces the minADE / minFDE metrics by 37.2% and 41.5%, respectively, compared to the SOTA models in dense haze scenarios (visibility < 30m), providing a new modeling paradigm for reliable perception in intelligent transportation systems in adverse environments.

[559] $\mathbf{R}^3$: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain

Nate Rothschild, Moshe Kimhi, Avi Mendelson, Chaim Baskin

Main category: cs.CV

TL;DR: Learning directly on raw Bayer mosaics yields superior image reconstruction compared to post-ISP sRGB images, as demonstrated through rain removal experiments with a new benchmark dataset and evaluation metric.

Details

Motivation: Image reconstruction from corrupted images is crucial, but most networks are trained on post-ISP sRGB images which irreversibly mix colors, clip dynamic range, and blur fine detail. The paper aims to show these losses are avoidable by working directly with raw data.

Method: The paper evaluates post-ISP and Bayer reconstruction pipelines, curates Raw-Rain (first public benchmark of real rainy scenes in both 12-bit Bayer and bit-depth-matched sRGB), and introduces Information Conservation Score (ICS) as a color-invariant metric. A raw-domain model is developed for reconstruction.

Result: On the test split, the raw-domain model improves sRGB results by up to +0.99 dB PSNR and +1.2% ICS, while running faster with half of the GFLOPs compared to traditional approaches.

Conclusion: The results advocate for an ISP-last paradigm for low-level vision and open the door to end-to-end learnable camera pipelines by demonstrating superior performance when learning directly on raw Bayer data.

Abstract: Image reconstruction from corrupted images is crucial across many domains. Most reconstruction networks are trained on post-ISP sRGB images, even though the image-signal-processing pipeline irreversibly mixes colors, clips dynamic range, and blurs fine detail. This paper uses the rain degradation problem as a use case to show that these losses are avoidable, and demonstrates that learning directly on raw Bayer mosaics yields superior reconstructions. To substantiate the claim, we (i) evaluate post-ISP and Bayer reconstruction pipelines, (ii) curate Raw-Rain, the first public benchmark of real rainy scenes captured in both 12-bit Bayer and bit-depth-matched sRGB, and (iii) introduce Information Conservation Score (ICS), a color-invariant metric that aligns more closely with human opinion than PSNR or SSIM. On the test split, our raw-domain model improves sRGB results by up to +0.99 dB PSNR and +1.2% ICS, while running faster with half of the GFLOPs. The results advocate an ISP-last paradigm for low-level vision and open the door to end-to-end learnable camera pipelines.

[560] Joint Superpixel and Self-Representation Learning for Scalable Hyperspectral Image Clustering

Xianlu Li, Nicolas Nadisic, Shaoguang Huang, Aleksandra Pizurica

Main category: cs.CV

TL;DR: A unified end-to-end framework that jointly optimizes superpixel segmentation and subspace clustering for hyperspectral images using a feedback mechanism between self-representation and differentiable superpixel modules.

Details

Motivation: Existing superpixel-based subspace clustering methods perform segmentation independently from clustering, resulting in misaligned partitions that don't support the clustering objective, while also facing computational scalability issues with hyperspectral images.

Method: Joint optimization framework with feedback mechanism: self-representation network based on unfolded ADMM guides a differentiable superpixel module, with each superpixel learning a unique compactness parameter for adaptive segmentation.

Result: Extensive experiments on benchmark HSI datasets show superior clustering accuracy compared to state-of-the-art methods, with clustering-aware partitions that preserve both spectral and spatial structure.

Conclusion: The proposed unified framework effectively addresses the misalignment between segmentation and clustering objectives, achieving better performance through joint optimization and adaptive superpixel learning.

Abstract: Subspace clustering is a powerful unsupervised approach for hyperspectral image (HSI) analysis, but its high computational and memory costs limit scalability. Superpixel segmentation can improve efficiency by reducing the number of data points to process. However, existing superpixel-based methods usually perform segmentation independently of the clustering task, often producing partitions that do not align with the subsequent clustering objective. To address this, we propose a unified end-to-end framework that jointly optimizes superpixel segmentation and subspace clustering. Its core is a feedback mechanism: a self-representation network based on unfolded Alternating Direction Method of Multipliers (ADMM) provides a model-driven signal to guide a differentiable superpixel module. This joint optimization yields clustering-aware partitions that preserve both spectral and spatial structure. Furthermore, our superpixel network learns a unique compactness parameter for each superpixel, enabling more flexible and adaptive segmentation. Extensive experiments on benchmark HSI datasets demonstrate that our method consistently achieves superior accuracy compared with state-of-the-art clustering approaches.

[561] A Second-Order Perspective on Pruning at Initialization and Knowledge Transfer

Leonardo Iurada, Beatrice Occhiena, Tatiana Tommasi

Main category: cs.CV

TL;DR: Pruning pre-trained vision models on one task maintains zero-shot performance on unseen tasks, and fine-tuning recovers performance on held-out tasks due to favorable loss landscapes from pre-training.

Details

Motivation: To address the challenge of pruning pre-trained vision models when downstream tasks are unknown, exploring how data influences pruning and its impact on task performance.

Method: Investigate pruning-at-initialization on pre-trained vision models using one task’s data, then evaluate zero-shot performance on unseen tasks and fine-tuning recovery.

Result: Pruning on one task retains zero-shot performance on unseen tasks, and fine-tuning improves performance on both seen and held-out tasks.

Conclusion: Extensive pre-training creates favorable loss landscapes that enable effective pruning without task-specific data, maintaining performance across tasks.

Abstract: The widespread availability of pre-trained vision models has enabled numerous deep learning applications through their transferable representations. However, their computational and storage costs often limit practical deployment. Pruning-at-Initialization has emerged as a promising approach to compress models before training, enabling efficient task-specific adaptation. While conventional wisdom suggests that effective pruning requires task-specific data, this creates a challenge when downstream tasks are unknown in advance. In this paper, we investigate how data influences the pruning of pre-trained vision models. Surprisingly, pruning on one task retains the model’s zero-shot performance also on unseen tasks. Furthermore, fine-tuning these pruned models not only improves performance on original seen tasks but can recover held-out tasks’ performance. We attribute this phenomenon to the favorable loss landscapes induced by extensive pre-training on large-scale datasets.

Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah

Main category: cs.CV

TL;DR: Grounding IDs are latent identifiers that emerge when LVLMs process visual structures like partitions, enabling better object binding across modalities and reducing hallucinations.

Details

Motivation: To understand why simple visual structures improve LVLM performance and uncover the internal mechanisms behind these gains.

Method: Used representation analysis to identify Grounding IDs in embedding space and performed causal interventions to confirm their role in object-symbol binding.

Result: Found that Grounding IDs create robust within-partition alignment, reduce modality gap, mediate binding between objects and cues, and strengthen attention between related components.

Conclusion: Grounding IDs are a key symbolic mechanism that explains how external visual cues enhance multimodal binding, offering both interpretability and practical robustness improvements.

Abstract: Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as robust within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism explaining how external cues enhance multimodal binding, offering both interpretability and practical improvements in robustness.

[563] Autoregressive Video Generation beyond Next Frames Prediction

Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu, Alan Yuille, Yinfei Yang, Jiasen Lu

Main category: cs.CV

TL;DR: VideoAR challenges the frame-by-frame paradigm in video generation by proposing spatiotemporal cubes as prediction units, enabling simultaneous modeling of spatial and temporal dimensions for improved quality and efficiency.

Details

Motivation: The paper questions whether frames are the appropriate atomic units for video autoregression, arguing that the frame-by-frame approach may not be optimal for capturing spatiotemporal relationships in video data.

Method: VideoAR is a unified framework that supports multiple prediction units including full frames, key-detail frames, multiscale refinements, and spatiotemporal cubes. The core innovation is using spatiotemporal cubes as prediction units for autoregressive modeling across both spatial and temporal dimensions.

Result: Cube-based prediction consistently delivers superior quality, speed, and temporal coherence compared to frame-based approaches. The method surpasses state-of-the-art baselines on VBench while achieving faster inference and scaling to minute-long sequences.

Conclusion: The work demonstrates that removing the frame-by-frame constraint enables better video generation and motivates rethinking sequence decomposition in spatiotemporal domains beyond traditional frame-based approaches.

Abstract: Autoregressive models for video generation typically operate frame-by-frame, extending next-token prediction from language to video’s temporal dimension. We question that unlike word as token is universally agreed in language if frame is a appropriate prediction unit? To address this, we present VideoAR, a unified framework that supports a spectrum of prediction units including full frames, key-detail frames, multiscale refinements, and spatiotemporal cubes. Among these designs, we find model video generation using \textit{spatiotemporal} cubes as prediction units, which allows autoregressive models to operate across both spatial and temporal dimensions simultaneously. This approach eliminates the assumption that frames are the natural atomic units for video autoregression. We evaluate VideoAR across diverse prediction strategies, finding that cube-based prediction consistently delivers superior quality, speed, and temporal coherence. By removing the frame-by-frame constraint, our video generator surpasses state-of-the-art baselines on VBench while achieving faster inference and enabling seamless scaling to minute-long sequences. We hope this work will motivate rethinking sequence decomposition in video and other spatiotemporal domains.

Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera

Main category: cs.CV

TL;DR: DualFlow is a unified framework for multi-modal two-person motion generation that uses rectified flow for efficient sampling and RAG for semantic grounding, achieving state-of-the-art results in text-to-motion, music-to-motion, and multi-modal interactive benchmarks.

Details

Motivation: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains challenging in computer graphics, animation, and human-computer interaction.

Method: Uses rectified flow for deterministic straight-line sampling paths between noise and data, employs Retrieval-Augmented Generation (RAG) with music features and LLM-based text decompositions, contrastive objective for alignment, and synchronization loss for inter-person coordination.

Result: Extensive evaluations show consistent gains in motion quality, responsiveness, and efficiency. Produces temporally coherent and rhythmically synchronized motions.

Conclusion: DualFlow sets state-of-the-art in multi-modal human motion generation, offering improved performance across text-to-motion, music-to-motion, and multi-modal interactive tasks.

Abstract: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use contrastive objective that further strengthens alignment with conditioning signals and introduce synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting state-of-the-art in multi-modal human motion generation.

[565] SVAC: Scaling Is All You Need For Referring Video Object Segmentation

Li Zhang, Haoxiang Gao, Zhihao Zhang, Luoxiao Huang, Tao Zhang

Main category: cs.CV

TL;DR: SVAC is a unified model for Referring Video Object Segmentation that scales up input frames and segmentation tokens while using compression techniques to handle computational challenges.

Details

Motivation: Current RVOS methods face challenges including insufficient use of MLLMs' prior knowledge, high computational costs for long videos, and poor handling of complex temporal dynamics.

Method: SVAC uses Anchor-Based Spatio-Temporal Compression (ASTC) to compress visual tokens while preserving structure, and Clip-Specific Allocation (CSA) strategy to handle dynamic object behaviors across video clips.

Result: SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency.

Conclusion: The proposed SVAC model effectively addresses computational challenges in RVOS while improving segmentation performance through enhanced video-language interaction and temporal dynamics handling.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insufficient exploitation of MLLMs’ prior knowledge, prohibitive computational and memory costs for long-duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video-language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio-temporal structure. Moreover, the Clip-Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.

[566] GANji: A Framework for Introductory AI Image Generation

Chandon Hamel, Mike Busch

Main category: cs.CV

TL;DR: GANji is a lightweight framework for benchmarking AI image generation models (VAE, GAN, DDPM) using Japanese Kanji characters, revealing trade-offs between image quality and computational efficiency.

Details

Motivation: To address the computational barrier in comparative studies of generative models and provide an accessible benchmarking tool for researchers and practitioners.

Method: Systematically compares VAE, GAN, and DDPM performance using a dataset of 10,314 Japanese Kanji characters, evaluating image fidelity with FID scores and sampling time.

Result: DDPM achieved highest image fidelity (FID score of 26.2) but was over 2,000 times slower in sampling time compared to VAE and GAN.

Conclusion: GANji is an effective and accessible framework that reveals fundamental trade-offs between model architecture, computational cost, and visual quality, suitable for both educational and research purposes.

Abstract: The comparative study of generative models often requires significant computational resources, creating a barrier for researchers and practitioners. This paper introduces GANji, a lightweight framework for benchmarking foundational AI image generation techniques using a dataset of 10,314 Japanese Kanji characters. It systematically compares the performance of a Variational Autoencoder (VAE), a Generative Adversarial Network (GAN), and a Denoising Diffusion Probabilistic Model (DDPM). The results demonstrate that while the DDPM achieves the highest image fidelity, with a Fr'echet Inception Distance (FID) score of 26.2, its sampling time is over 2,000 times slower than the other models. The GANji framework is an effective and accessible tool for revealing the fundamental trade-offs between model architecture, computational cost, and visual quality, making it ideal for both educational and research purposes.

[567] Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding

Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai

Main category: cs.CV

TL;DR: GMS is a coarse-to-fine framework that combines general vision-language models as ‘Scanner’ and task-specific GUI grounding models as ‘Locator’ to significantly improve GUI grounding performance.

Details

Motivation: Grounding natural language queries in GUIs is challenging due to diverse UI elements across applications and the need for precise spatial coordinate prediction.

Method: A synergistic framework with five stages using hierarchical search and cross-modal communication, where general VLMs scan for regions of interest and fine-tuned grounding models locate precise coordinates.

Result: GMS achieves 35.7% accuracy on ScreenSpot-Pro dataset, representing a 10× improvement over individual components (2.0% for Scanner, 3.7% for Locator) and outperforms other baselines.

Conclusion: The GMS framework demonstrates robust performance and potential for general-purpose GUI grounding by effectively leveraging complementary strengths of different model types.

Abstract: Grounding natural language queries in graphical user interfaces (GUIs) presents a challenging task that requires models to comprehend diverse UI elements across various applications and systems, while also accurately predicting the spatial coordinates for the intended operation. To tackle this problem, we propose GMS: Generalist Scanner Meets Specialist Locator, a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. GMS leverages the complementary strengths of general vision-language models (VLMs) and small, task-specific GUI grounding models by assigning them distinct roles within the framework. Specifically, the general VLM acts as a ‘Scanner’ to identify potential regions of interest, while the fine-tuned grounding model serves as a ‘Locator’ that outputs precise coordinates within these regions. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Our whole framework consists of five stages and incorporates hierarchical search with cross-modal communication to achieve promising prediction results. Experimental results on the ScreenSpot-Pro dataset show that while the ‘Scanner’ and ‘Locator’ models achieve only $2.0%$ and $3.7%$ accuracy respectively when used independently, their integration within GMS framework yields an overall accuracy of $35.7%$, representing a $10 \times$ improvement. Additionally, GMS significantly outperforms other strong baselines under various settings, demonstrating its robustness and potential for general-purpose GUI grounding.

[568] EYE-DEX: Eye Disease Detection and EXplanation System

Youssef Sabiri, Walid Houmaidi, Amine Abouaomar

Main category: cs.CV

TL;DR: EYE-DEX is an automated deep learning framework that achieves 92.36% accuracy in classifying 10 retinal diseases from fundus images using fine-tuned VGG16, with Grad-CAM visual explanations for transparency.

Details

Motivation: Retinal disease affects 2.2 billion people globally with $411 billion annual productivity losses. Manual diagnosis by ophthalmologists is time-consuming and subjective, requiring automated solutions.

Method: Benchmarked three pre-trained CNN models (VGG16, VGG19, ResNet50) on 21,577 retinal fundus images from Retinal Disease Dataset, with fine-tuned VGG16 achieving best performance. Integrated Grad-CAM for visual explanations.

Result: Fine-tuned VGG16 achieved state-of-the-art 92.36% test accuracy in classifying 10 retinal conditions, outperforming other models.

Conclusion: EYE-DEX provides accurate automated retinal disease diagnosis with transparent visual explanations, enhancing clinician trust in AI-assisted diagnostics.

Abstract: Retinal disease diagnosis is critical in preventing vision loss and reducing socioeconomic burdens. Globally, over 2.2 billion people are affected by some form of vision impairment, resulting in annual productivity losses estimated at $411 billion. Traditional manual grading of retinal fundus images by ophthalmologists is time-consuming and subjective. In contrast, deep learning has revolutionized medical diagnostics by automating retinal image analysis and achieving expert-level performance. In this study, we present EYE-DEX, an automated framework for classifying 10 retinal conditions using the large-scale Retinal Disease Dataset comprising 21,577 eye fundus images. We benchmark three pre-trained Convolutional Neural Network (CNN) models–VGG16, VGG19, and ResNet50–with our finetuned VGG16 achieving a state-of-the-art global benchmark test accuracy of 92.36%. To enhance transparency and explainability, we integrate the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to generate visual explanations highlighting disease-specific regions, thereby fostering clinician trust and reliability in AI-assisted diagnostics.

[569] Latent Visual Reasoning

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu

Main category: cs.CV

TL;DR: LVR enables autoregressive reasoning directly in visual embedding space, achieving substantial gains on perception-intensive VQA tasks.

Details

Motivation: Current MLLMs with CoT reasoning are constrained to language space, treating visual information as static preconditions rather than actively reasoning in visual space.

Method: Projects images into visual tokens in joint semantic space, trains language model to generate latent states that reconstruct key visual tokens, and interleaves LVR with text generation using adapted GRPO algorithm for reinforcement learning.

Result: Achieves 71.67% on MMVP benchmark compared to 66.67% with Qwen2.5-VL, showing substantial improvement in fine-grained visual understanding and perception.

Conclusion: LVR paradigm enables direct visual reasoning in embedding space, significantly enhancing perception capabilities of multimodal models beyond language-only reasoning approaches.

Abstract: Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.

[570] Analysis of Bias in Deep Learning Facial Beauty Regressors

Chandon Hamel, Mike Busch

Main category: cs.CV

TL;DR: AI facial beauty prediction models exhibit significant ethnicity-based bias, with both SCUT-FBP5500 and MEBeauty-trained models showing prediction disparities across ethnic groups even on balanced datasets, amplifying societal beauty biases rather than mitigating them.

Details

Motivation: To investigate and warn about AI's role in shaping aesthetic norms while providing pathways toward equitable beauty technologies, as bias can be introduced even from seemingly balanced sources in facial beauty prediction systems.

Method: Comparative analysis of models trained on SCUT-FBP5500 and MEBeauty datasets using rigorous statistical validation (Kruskal-Wallis H-tests, post hoc Dunn analyses) and cross-dataset validation on the balanced FairFace dataset.

Result: Both models exhibited significant prediction disparities across ethnic groups (p < 0.001), with only 4.8-9.5% of inter-group comparisons satisfying distributional parity criteria, showing algorithmic amplification of societal beauty biases rather than mitigation.

Conclusion: Current AI beauty prediction approaches are inadequate, and mitigation strategies are needed to address the significant ethnicity-based bias in facial beauty prediction systems.

Abstract: Bias can be introduced to AI systems even from seemingly balanced sources, and AI facial beauty prediction is subject to ethnicity-based bias. This work sounds warnings about AI’s role in shaping aesthetic norms while providing potential pathways toward equitable beauty technologies through comparative analysis of models trained on SCUT-FBP5500 and MEBeauty datasets. Employing rigorous statistical validation (Kruskal-Wallis H-tests, post hoc Dunn analyses). It is demonstrated that both models exhibit significant prediction disparities across ethnic groups $(p < 0.001)$, even when evaluated on the balanced FairFace dataset. Cross-dataset validation shows algorithmic amplification of societal beauty biases rather than mitigation based on prediction and error parity. The findings underscore the inadequacy of current AI beauty prediction approaches, with only 4.8-9.5% of inter-group comparisons satisfying distributional parity criteria. Mitigation strategies are proposed and discussed in detail.

[571] Asymmetric VAE for One-Step Video Super-Resolution Acceleration

Jianze Li, Yong Guo, Yulun Zhang, Xiaokang Yang

Main category: cs.CV

TL;DR: FastVSR is a one-step diffusion model for video super-resolution that achieves significant speed improvements through a high-compression VAE with spatial compression ratio of 16 and a stable training framework.

Details

Motivation: Current diffusion-based video super-resolution models have reduced sampling steps to one, but there's still significant room for optimization in inference efficiency and computational cost reduction.

Method: Proposes FastVSR with f16 VAE (16x spatial compression), uses pixel shuffle and channel replication for upsampling, and implements a lower-bound-guided training strategy for stable convergence.

Result: Achieves speedups of 111.9x compared to multi-step models and 3.92x compared to existing one-step models while maintaining performance.

Conclusion: FastVSR demonstrates substantial improvements in inference efficiency for video super-resolution through high compression VAE and stable training framework, making diffusion models more practical for real-world applications.

Abstract: Diffusion models have significant advantages in the field of real-world video super-resolution and have demonstrated strong performance in past research. In recent diffusion-based video super-resolution (VSR) models, the number of sampling steps has been reduced to just one, yet there remains significant room for further optimization in inference efficiency. In this paper, we propose FastVSR, which achieves substantial reductions in computational cost by implementing a high compression VAE (spatial compression ratio of 16, denoted as f16). We design the structure of the f16 VAE and introduce a stable training framework. We employ pixel shuffle and channel replication to achieve additional upsampling. Furthermore, we propose a lower-bound-guided training strategy, which introduces a simpler training objective as a lower bound for the VAE’s performance. It makes the training process more stable and easier to converge. Experimental results show that FastVSR achieves speedups of 111.9 times compared to multi-step models and 3.92 times compared to existing one-step models. We will release code and models at https://github.com/JianzeLi-114/FastVSR.

[572] Accelerating Cerebral Diagnostics with BrainFusion: A Comprehensive MRI Tumor Framework

Walid Houmaidi, Youssef Sabiri, Salmane El Mansour Billah, Amine Abouaomar

Main category: cs.CV

TL;DR: BrainFusion combines fine-tuned CNNs (VGG16, ResNet50, Xception) with YOLOv8 for brain tumor classification and localization from MRI, achieving 99.86% accuracy with VGG16 and enhanced interpretability through bounding boxes and explainable AI.

Details

Motivation: Early and accurate brain tumor classification is crucial for effective treatment and improved patient outcomes, requiring reliable diagnostic systems.

Method: Fine-tuned CNN models (VGG16, ResNet50, Xception) for tumor classification combined with YOLOv8 for tumor localization using bounding boxes on MRI data from Brain Tumor MRI Dataset.

Result: Fine-tuned VGG16 achieved 99.86% test accuracy, substantially exceeding previous benchmarks, with enhanced clinical interpretability through localization and explainable AI.

Conclusion: This approach demonstrates the transformative potential of deep learning for faster, more reliable brain tumor diagnoses, contributing to improved patient care and survival rates.

Abstract: The early and accurate classification of brain tumors is crucial for guiding effective treatment strategies and improving patient outcomes. This study presents BrainFusion, a significant advancement in brain tumor analysis using magnetic resonance imaging (MRI) by combining fine-tuned convolutional neural networks (CNNs) for tumor classification–including VGG16, ResNet50, and Xception–with YOLOv8 for precise tumor localization with bounding boxes. Leveraging the Brain Tumor MRI Dataset, our experiments reveal that the fine-tuned VGG16 model achieves test accuracy of 99.86%, substantially exceeding previous benchmarks. Beyond setting a new accuracy standard, the integration of bounding-box localization and explainable AI techniques further enhances both the clinical interpretability and trustworthiness of the system’s outputs. Overall, this approach underscores the transformative potential of deep learning in delivering faster, more reliable diagnoses, ultimately contributing to improved patient care and survival rates.

[573] Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA

Jianxin Liang, Tan Yue, Yuxuan Wang, Yueqian Wang, Zhihan Yin, Huishuai Zhang, Dongyan Zhao

Main category: cs.CV

TL;DR: The paper introduces a framework to enhance VideoQA by synthesizing richer supervision through Question-Based Paraphrasing (QBP) and Question-Based Captioning (QBC), moving beyond isolated factual QA pairs to capture narrative structure and visual rationales.

Details

Motivation: Current VideoQA models are limited by 'bag-of-facts' supervision that fails to capture narrative and causal event structure, leading to shallow video understanding.

Method: Propose two strategies: QBP synthesizes diverse questions into holistic narrative paragraphs reconstructing event structure; QBC generates fine-grained visual rationales grounding answers in specific evidence. Use synthetic data to train models under unified next-token prediction.

Result: Achieves state-of-the-art results: 72.5% on STAR (+4.9%) with 3B model, 80.8% on NExT-QA with 7B model. Both QBP and QBC enhance cross-dataset generalization, with QBP accelerating convergence by over 2.5x.

Conclusion: Shifting from isolated facts to narrative coherence and grounded rationales creates a more accurate, efficient, and generalizable VideoQA training paradigm.

Abstract: The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This “bag-of-facts” approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video’s existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video’s event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5% on STAR (+4.9%) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.

Moxin Zhao, Nan Meng, Jason Pui Yin Cheung, Chris Yuk Kwan Tang, Chenxi Yu, Wenting Zhong, Pengyu Lu, Chang Shi, Yipeng Zhuang, Teng Zhang

Main category: cs.CV

TL;DR: LatXGen is a generative framework that synthesizes lateral spinal radiographs from posterior RGBD images, enabling radiation-free sagittal alignment assessment for Adolescent Idiopathic Scoliosis.

Details

Motivation: Current radiation-free methods focus mainly on coronal plane assessment, leaving sagittal alignment evaluation largely unexplored without ionizing radiation, creating a critical gap in comprehensive AIS evaluation.

Method: A dual-stage architecture with attention-based FFC module for anatomical feature integration and Spatial Deformation Network for morphological variations. Uses cross-modality translation from RGBD input to radiographic domain.

Result: LatXGen produces anatomically accurate radiographs and outperforms existing GAN-based methods in both visual fidelity and quantitative metrics, using a dataset of 3,264 RGBD and lateral radiograph pairs.

Conclusion: The framework offers a promising radiation-free solution for sagittal spine assessment and advances comprehensive AIS evaluation by enabling reliable sagittal alignment estimation without ionizing radiation.

Abstract: Adolescent Idiopathic Scoliosis (AIS) is a complex three-dimensional spinal deformity, and accurate morphological assessment requires evaluating both coronal and sagittal alignment. While previous research has made significant progress in developing radiation-free methods for coronal plane assessment, reliable and accurate evaluation of sagittal alignment without ionizing radiation remains largely underexplored. To address this gap, we propose LatXGen, a novel generative framework that synthesizes realistic lateral spinal radiographs from posterior Red-Green-Blue and Depth (RGBD) images of unclothed backs. This enables accurate, radiation-free estimation of sagittal spinal alignment. LatXGen tackles two core challenges: (1) inferring sagittal spinal morphology changes from a lateral perspective based on posteroanterior surface geometry, and (2) performing cross-modality translation from RGBD input to the radiographic domain. The framework adopts a dual-stage architecture that progressively estimates lateral spinal structure and synthesizes corresponding radiographs. To enhance anatomical consistency, we introduce an attention-based Fast Fourier Convolution (FFC) module for integrating anatomical features from RGBD images and 3D landmarks, and a Spatial Deformation Network (SDN) to model morphological variations in the lateral view. Additionally, we construct the first large-scale paired dataset for this task, comprising 3,264 RGBD and lateral radiograph pairs. Experimental results demonstrate that LatXGen produces anatomically accurate radiographs and outperforms existing GAN-based methods in both visual fidelity and quantitative metrics. This study offers a promising, radiation-free solution for sagittal spine assessment and advances comprehensive AIS evaluation.

[575] Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen

Main category: cs.CV

TL;DR: The paper proposes using Euclidean geometry problem-solving as a surrogate task to enhance spatial intelligence in Multimodal Large Language Models (MLLMs), achieving significant zero-shot improvements across multiple spatial reasoning benchmarks.

Details

Motivation: Spatial intelligence remains a critical unresolved challenge for MLLMs, despite encompassing important abilities like shape visualization, mental rotation, relational positioning, and numerosity estimation.

Method: Created Euclid30K dataset with 30K geometry problems, then used Group Relative Policy Optimization (GRPO) to finetune Qwen2.5VL and RoboBrain2.0 models, enabling them to identify shapes, count entities, and perform multi-step deductive reasoning using Euclidean principles.

Result: Models achieved substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube). Mean VSI-Bench accuracy rose from 34.5% to 40.5%, with RoboBrain2.0-Euclid-7B achieving 49.6% accuracy, surpassing previous state-of-the-art.

Conclusion: This is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills, demonstrating the effectiveness of using Euclidean geometry as a surrogate task for spatial intelligence.

Abstract: Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs).To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable the model to acquire and apply Euclidean principles from these geometry problems, we employed Group Relative Policy Optimization (GRPO) to finetune the Qwen2.5VL family and RoboBrain2.0 family, inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM.To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found in https://zgca-ai4edu.github.io/Euclids_Gift.

[576] High-Order Progressive Trajectory Matching for Medical Image Dataset Distillation

Le Dong, Jinghao Bian, Jingyang Hou, Jingliang Hu, Yilei Shi, Weisheng Dong, Xiao Xiang Zhu, Lichao Mou

Main category: cs.CV

TL;DR: Proposes a novel dataset distillation method for medical images using shape-wise potential and easy-to-complex matching to capture intermediate optimization states, improving performance while preserving privacy.

Details

Motivation: Medical image analysis faces data sharing challenges due to privacy regulations. Existing trajectory matching methods focus only on terminal states, missing crucial information from intermediate optimization states.

Method: Uses shape-wise potential to capture geometric structure of parameter trajectories and easy-to-complex matching strategy that progressively addresses parameters based on complexity.

Result: Experiments on medical image classification show improved distillation performance while preserving privacy and maintaining model accuracy comparable to training on original datasets.

Conclusion: The proposed method effectively addresses limitations of existing trajectory matching approaches for medical dataset distillation, enabling better data sharing while protecting privacy.

Abstract: Medical image analysis faces significant challenges in data sharing due to privacy regulations and complex institutional protocols. Dataset distillation offers a solution to address these challenges by synthesizing compact datasets that capture essential information from real, large medical datasets. Trajectory matching has emerged as a promising methodology for dataset distillation; however, existing methods primarily focus on terminal states, overlooking crucial information in intermediate optimization states. We address this limitation by proposing a shape-wise potential that captures the geometric structure of parameter trajectories, and an easy-to-complex matching strategy that progressively addresses parameters based on their complexity. Experiments on medical image classification tasks demonstrate that our method improves distillation performance while preserving privacy and maintaining model accuracy comparable to training on the original datasets. Our code is available at https://github.com/Bian-jh/HoP-TM.

[577] Combining Discrepancy-Confusion Uncertainty and Calibration Diversity for Active Fine-Grained Image Classification

Yinghao Jin, Xi Yang

Main category: cs.CV

TL;DR: DECERN is a novel active learning method for fine-grained image classification that combines discrepancy-confusion uncertainty and calibration diversity to select the most informative samples under limited annotation budgets.

Details

Motivation: In fine-grained image classification, assessing sample informativeness is challenging due to subtle inter-class differences, making traditional active learning methods less effective.

Method: DECERN introduces a multifaceted informativeness measure combining discrepancy-confusion uncertainty (quantifying category directionality and structural stability) and calibration diversity (uncertainty-weighted clustering to diversify samples while maintaining local representativeness).

Result: Extensive experiments on 7 fine-grained image datasets across 26 experimental settings demonstrate superior performance compared to state-of-the-art methods.

Conclusion: The proposed DECERN method effectively addresses the challenges of active learning in fine-grained image classification by combining uncertainty and diversity measures to select the most valuable samples for annotation.

Abstract: Active learning (AL) aims to build high-quality labeled datasets by iteratively selecting the most informative samples from an unlabeled pool under limited annotation budgets. However, in fine-grained image classification, assessing this informativeness is especially challenging due to subtle inter-class differences. In this paper, we introduce a novel method, combining discrepancy-confusion uncertainty and calibration diversity for active fine-grained image classification (DECERN), to effectively perceive the distinctiveness between fine-grained images and evaluate the sample value. DECERN introduces a multifaceted informativeness measure that combines discrepancy-confusion uncertainty and calibration diversity. The discrepancy-confusion uncertainty quantifies the category directionality and structural stability of fine-grained unlabeled data during local feature fusion. Subsequently, uncertainty-weighted clustering is performed to diversify the uncertainty samples. Then we calibrate the diversity to maximize the global diversity of the selected sample while maintaining its local representativeness. Extensive experiments conducted on 7 fine-grained image datasets across 26 distinct experimental settings demonstrate that our method achieves superior performance compared to state-of-the-art methods.

[578] NeMo: Needle in a Montage for Video-Language Understanding

Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang

Main category: cs.CV

TL;DR: The paper introduces NeMoBench, a benchmark for evaluating VideoLLMs’ temporal reasoning capabilities through the Needle in a Montage (NeMo) task, featuring 31,378 QA pairs from 13,486 videos.

Details

Motivation: Recent advances in video large language models require new evaluation protocols for complex temporal reasoning in video-language understanding, inspired by the needle in a haystack test used for LLMs.

Method: Developed a scalable automated data generation pipeline to create high-quality video question answering data for the NeMo task, which assesses long-context recall and temporal grounding.

Result: Generated NeMoBench with 31,378 QA pairs from 13,486 videos of varying durations. Evaluated 20 state-of-the-art models, providing extensive results and insights into their capabilities and limitations.

Conclusion: The proposed pipeline reliably generates high-quality evaluation data, enabling continuous updates to NeMoBench with the latest videos, and provides comprehensive assessment of VideoLLMs’ temporal reasoning abilities.

Abstract: Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs’ critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.

[579] Tumor Synthesis conditioned on Radiomics

Jonghun Kim, Inye Na, Eun Sook Ko, Hyunjin Park

Main category: cs.CV

TL;DR: A tumor-generation model using radiomics features as conditions to generate diverse 3D medical images, enabling user-specified tumor characteristics and aiding in medical training and treatment planning.

Details

Motivation: Address privacy concerns and data scarcity in 3D medical imaging (CT/MRI) by overcoming limitations in output diversity of existing generative models.

Method: GAN-based model for tumor mask generation and diffusion-based approach for tumor texture generation, both conditioned on radiomics features (size, shape, texture).

Result: Successfully tested on tumors in four organs (kidney, lung, breast, brain) across CT and MRI; synthesized images aid downstream task training and pass expert authenticity evaluations.

Conclusion: The method enables flexible tumor manipulation and generation, showing potential for treatment planning with diverse synthesized tumors.

Abstract: Due to privacy concerns, obtaining large datasets is challenging in medical image analysis, especially with 3D modalities like Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing generative models, developed to address this issue, often face limitations in output diversity and thus cannot accurately represent 3D medical images. We propose a tumor-generation model that utilizes radiomics features as generative conditions. Radiomics features are high-dimensional handcrafted semantic features that are biologically well-grounded and thus are good candidates for conditioning. Our model employs a GAN-based model to generate tumor masks and a diffusion-based approach to generate tumor texture conditioned on radiomics features. Our method allows the user to generate tumor images according to user-specified radiomics features such as size, shape, and texture at an arbitrary location. This enables the physicians to easily visualize tumor images to better understand tumors according to changing radiomics features. Our approach allows for the removal, manipulation, and repositioning of tumors, generating various tumor types in different scenarios. The model has been tested on tumors in four different organs (kidney, lung, breast, and brain) across CT and MRI. The synthesized images are shown to effectively aid in training for downstream tasks and their authenticity was also evaluated through expert evaluations. Our method has potential usage in treatment planning with diverse synthesized tumors.

[580] Simulating Post-Neoadjuvant Chemotherapy Breast Cancer MRI via Diffusion Model with Prompt Tuning

Jonghun Kim, Hyunjin Park

Main category: cs.CV

TL;DR: A diffusion model with prompt tuning is proposed to generate post-treatment DCE-MRI images from pre-treatment images for predicting breast cancer response to neoadjuvant chemotherapy.

Details

Motivation: Accurate prediction of NAC response helps with treatment planning for breast cancer patients, and current monitoring using follow-up DCE-MRI could be improved with predictive modeling.

Method: Uses maximum intensity projection images from DCE-MRI and leverages diffusion models with prompt tuning to incorporate clinical factors affecting NAC response, generating post-treatment images from pre-treatment ones.

Result: The model outperformed other generative models in image quality metrics and better reflected tumor size changes according to pathological complete response (pCR). Ablation studies confirmed the method’s design choices.

Conclusion: The proposed approach has potential to contribute to precision medicine by improving NAC response prediction for breast cancer treatment planning.

Abstract: Neoadjuvant chemotherapy (NAC) is a common therapy option before the main surgery for breast cancer. Response to NAC is monitored using follow-up dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). Accurate prediction of NAC response helps with treatment planning. Here, we adopt maximum intensity projection images from DCE-MRI to generate post-treatment images (i.e., 3 or 12 weeks after NAC) from pre-treatment images leveraging the emerging diffusion model. We introduce prompt tuning to account for the known clinical factors affecting response to NAC. Our model performed better than other generative models in image quality metrics. Our model was better at generating images that reflected changes in tumor size according to pCR compared to other models. Ablation study confirmed the design choices of our method. Our study has the potential to help with precision medicine.

[581] Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection

Sojung An, Kwanyong Park, Yong Jae Lee, Donghyun Kim

Main category: cs.CV

TL;DR: The paper proposes TaSe framework to improve vision-language models’ ability to handle complex queries with descriptive attributes and relations by disentangling text into objects, attributes, and relations, then hierarchically aggregating them.

Details

Motivation: Current VLMs struggle with complex queries involving descriptive attributes and relational clauses due to text encoders treating sentences as bags-of-words, leading to false positives in object detection.

Method: TaSe framework with three components: hierarchical synthetic captioning dataset, Talk in Pieces module for disentangling text embeddings into objects/attributes/relations, and See in Whole module for hierarchical aggregation guided by novel loss functions.

Result: Experimental results on OmniLabel benchmark show 24% performance improvement, demonstrating effectiveness of linguistic compositionality approach.

Conclusion: Disentangling and hierarchically structuring linguistic representations significantly enhances VLMs’ ability to handle complex queries and improves multimodal perception for language-based object detection.

Abstract: While vision-language models (VLMs) have made significant progress in multimodal perception (e.g., open-vocabulary object detection) with simple language queries, state-of-the-art VLMs still show limited ability to perceive complex queries involving descriptive attributes and relational clauses. Our in-depth analysis shows that these limitations mainly stem from text encoders in VLMs. Such text encoders behave like bags-of-words and fail to separate target objects from their descriptive attributes and relations in complex queries, resulting in frequent false positives. To address this, we propose restructuring linguistic representations according to the hierarchical relations within sentences for language-based object detection. A key insight is the necessity of disentangling textual tokens into core components-objects, attributes, and relations (“talk in pieces”)-and subsequently aggregating them into hierarchically structured sentence-level representations (“see in whole”). Building on this principle, we introduce the TaSe framework with three main contributions: (1) a hierarchical synthetic captioning dataset spanning three tiers from category names to descriptive sentences; (2) Talk in Pieces, the three-component disentanglement module guided by a novel disentanglement loss function, transforms text embeddings into subspace compositions; and (3) See in Whole, which learns to aggregate disentangled components into hierarchically structured embeddings with the guide of proposed hierarchical objectives. The proposed TaSe framework strengthens the inductive bias of hierarchical linguistic structures, resulting in fine-grained multimodal representations for language-based object detection. Experimental results under the OmniLabel benchmark show a 24% performance improvement, demonstrating the importance of linguistic compositionality.

[582] MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment

Fankai Jia, Daisong Gan, Zhe Zhang, Zhaochi Wen, Chenchen Dan, Dong Liang, Haifeng Wang

Main category: cs.CV

TL;DR: MMRQA is a novel MRI quality assessment framework that combines multimodal LLMs with signal processing to bridge quantitative metrics and semantic understanding, achieving state-of-the-art performance with interpretable outputs.

Details

Motivation: Traditional MRI quality assessment methods face trade-offs: signal-based approaches lack semantic understanding while deep learning methods sacrifice interpretability. There's a need for clinically interpretable quality assessment that works across diverse protocols and data-scarce scenarios.

Method: MMRQA integrates three innovations: 1) robust metric extraction using MRQy with simulated artifacts, 2) structured transformation of metrics into question-answer pairs using Qwen, and 3) parameter-efficient fusion via LoRA adaptation of LLaVA-OneVision.

Result: The framework achieves state-of-the-art performance on MR-ART, FastMRI, and MyConnectome benchmarks, demonstrating strong zero-shot generalization capabilities as validated by comprehensive ablation studies.

Conclusion: MMRQA successfully bridges quantitative analysis with semantic reasoning, generating clinically interpretable outputs that enhance quality control in dynamic medical settings while maintaining strong generalization across protocols.

Abstract: Magnetic resonance imaging (MRI) quality assessment is crucial for clinical decision-making, yet remains challenging due to data scarcity and protocol variability. Traditional approaches face fundamental trade-offs: signal-based methods like MRIQC provide quantitative metrics but lack semantic understanding, while deep learning approaches achieve high accuracy but sacrifice interpretability. To address these limitations, we introduce the Multimodal MRI Quality Assessment (MMRQA) framework, pioneering the integration of multimodal large language models (MLLMs) with acquisition-aware signal processing. MMRQA combines three key innovations: robust metric extraction via MRQy augmented with simulated artifacts, structured transformation of metrics into question-answer pairs using Qwen, and parameter-efficient fusion through Low-Rank Adaptation (LoRA) of LLaVA-OneVision. Evaluated on MR-ART, FastMRI, and MyConnectome benchmarks, MMRQA achieves state-of-the-art performance with strong zero-shot generalization, as validated by comprehensive ablation studies. By bridging quantitative analysis with semantic reasoning, our framework generates clinically interpretable outputs that enhance quality control in dynamic medical settings.

[583] An Efficient 3D Latent Diffusion Model for T1-contrast Enhanced MRI Generation

Zach Eidex, Mojtaba Safari, Jie Ding, Richard Qiu, Justin Roper, David Yu, Hui-Kuo Shu, Zhen Tian, Hui Mao, Xiaofeng Yang

Main category: cs.CV

TL;DR: A 3D deep learning framework called T1C-RFlow generates synthetic T1-contrast enhanced MRI images from pre-contrast multiparametric MRI, eliminating the need for gadolinium-based contrast agents while achieving high-quality results faster than previous diffusion models.

Details

Motivation: Gadolinium-based contrast agents (GBCAs) used in T1w MRI have limitations including contraindications for patients at risk of nephrogenic systemic fibrosis and imaging inconsistencies due to variations in GBCA administration.

Method: The T1C-RFlow model uses a pretrained autoencoder to create latent space representations from T1w and T2-FLAIR images, then trains a rectified flow diffusion model in this latent space. The model was trained on a large curated dataset of glioma, meningioma, and metastases patients.

Result: T1C-RFlow outperformed benchmark 3D models (pix2pix, DDPM, DiT-3D) with superior quantitative metrics and significantly faster denoising times (6.9 s/volume vs 37.7s for DDPM). It achieved high SSIM scores (0.905-0.937) and low NMSE values across all tumor types.

Conclusion: The proposed method generates synthetic T1C images that closely resemble ground truth T1C in much less time than previous diffusion models, potentially enabling practical contrast-agent-free MRI for brain tumors.

Abstract: Objective: Gadolinium-based contrast agents (GBCAs) are commonly employed with T1w MRI to enhance lesion visualization but are restricted in patients at risk of nephrogenic systemic fibrosis and variations in GBCA administration can introduce imaging inconsistencies. This study develops an efficient 3D deep-learning framework to generate T1-contrast enhanced images (T1C) from pre-contrast multiparametric MRI. Approach: We propose the 3D latent rectified flow (T1C-RFlow) model for generating high-quality T1C images. First, T1w and T2-FLAIR images are input into a pretrained autoencoder to acquire an efficient latent space representation. A rectified flow diffusion model is then trained in this latent space representation. The T1C-RFlow model was trained on a curated dataset comprised of the BraTS 2024 glioma (GLI; 1480 patients), meningioma (MEN; 1141 patients), and metastases (MET; 1475 patients) datasets. Selected patients were split into train (N=2860), validation (N=612), and test (N=614) sets. Results: Both qualitative and quantitative results demonstrate that the T1C-RFlow model outperforms benchmark 3D models (pix2pix, DDPM, Diffusion Transformers (DiT-3D)) trained in the same latent space. T1C-RFlow achieved the following metrics - GLI: NMSE 0.044 +/- 0.047, SSIM 0.935 +/- 0.025; MEN: NMSE 0.046 +/- 0.029, SSIM 0.937 +/- 0.021; MET: NMSE 0.098 +/- 0.088, SSIM 0.905 +/- 0.082. T1C-RFlow had the best tumor reconstruction performance and significantly faster denoising times (6.9 s/volume, 200 steps) than conventional DDPM models in both latent space (37.7s, 1000 steps) and patch-based in image space (4.3 hr/volume). Significance: Our proposed method generates synthetic T1C images that closely resemble ground truth T1C in much less time than previous diffusion models. Further development may permit a practical method for contrast-agent-free MRI for brain tumors.

[584] BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation

Zelin Liu, Sicheng Dong, Bocheng Li, Yixuan Yang, Jiacheng Ruan, Chenxu Zhou, Suncheng Xiang

Main category: cs.CV

TL;DR: BALR-SAM is a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging by combining complementary detail enhancement, low-rank adapters, and low-rank tensor attention to achieve superior segmentation performance while updating only 1.8% of parameters.

Details

Motivation: Vision foundation models like SAM struggle with medical image segmentation due to lack of domain-specific adaptation, and there's a need for efficient fine-tuning methods that maintain strong performance with minimal resource demands in clinical practice.

Method: Three key components: (1) Complementary Detail Enhancement Network with depthwise separable convolutions and multi-scale fusion for boundary-sensitive features; (2) Low-rank adapters in Vision Transformer blocks for optimized medical feature representation; (3) Low-rank tensor attention mechanism in mask decoder to reduce memory usage by 75%.

Result: BALR-SAM outperforms several state-of-the-art methods including fully fine-tuned MedSAM on standard medical segmentation datasets, while updating only 1.8% (11.7M) of parameters and operating without requiring prompts.

Conclusion: The proposed BALR-SAM framework successfully addresses the domain adaptation challenges of vision foundation models in medical imaging, achieving superior segmentation performance with significantly reduced computational resources and parameter updates.

Abstract: Vision foundation models like the Segment Anything Model (SAM), pretrained on large-scale natural image datasets, often struggle in medical image segmentation due to a lack of domain-specific adaptation. In clinical practice, fine-tuning such models efficiently for medical downstream tasks with minimal resource demands, while maintaining strong performance, is challenging. To address these issues, we propose BALR-SAM, a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging. It combines three tailored components: (1) a Complementary Detail Enhancement Network (CDEN) using depthwise separable convolutions and multi-scale fusion to capture boundary-sensitive features essential for accurate segmentation; (2) low-rank adapters integrated into SAM’s Vision Transformer blocks to optimize feature representation and attention for medical contexts, while simultaneously significantly reducing the parameter space; and (3) a low-rank tensor attention mechanism in the mask decoder, cutting memory usage by 75% and boosting inference speed. Experiments on standard medical segmentation datasets show that BALR-SAM, without requiring prompts, outperforms several state-of-the-art (SOTA) methods, including fully fine-tuned MedSAM, while updating just 1.8% (11.7M) of its parameters.

[585] UniVid: The Open-Source Unified Video Model

Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang

Main category: cs.CV

TL;DR: UniVid is a unified video architecture that combines MLLM with diffusion decoder via lightweight adapter, enabling both video understanding and generation with improved prompt adherence and temporal reasoning.

Details

Motivation: Address challenges in unified video modeling: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and limitations of uniform cross-modal attention, and efficiently extending image-centric MLLMs to video without costly retraining.

Method: Couples MLLM with diffusion decoder through lightweight adapter, introduces Temperature Modality Alignment for better prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection.

Result: State-of-the-art performance with 2.2% improvement on VBench-Long total score vs EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA respectively vs best prior 7B baselines.

Conclusion: UniVid successfully unifies video generation and understanding capabilities through efficient architectural design and novel techniques for modality alignment and temporal reasoning.

Abstract: Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory, and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines.

[586] Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos

Yingdong Hu, Yisheng He, Jinnan Chen, Weihao Yuan, Kejie Qiu, Zehong Lin, Siyu Zhu, Zilong Dong, Jun Zhang

Main category: cs.CV

TL;DR: Forge4D is a feed-forward 4D human reconstruction model that efficiently creates temporally aligned representations from uncalibrated sparse-view videos, enabling novel view and novel time synthesis through streaming 3D Gaussian reconstruction and dense motion prediction.

Details

Motivation: Existing methods for dynamic 3D human reconstruction from sparse-view videos are either too slow or cannot generate novel-time representations, limiting downstream applications.

Method: The model uses streaming 3D Gaussian reconstruction with learnable state tokens for temporal consistency, and a motion prediction module with occlusion-aware Gaussian fusion for interpolation. It employs self-supervised retargeting loss and optical flow loss for motion supervision.

Result: Extensive experiments show effectiveness on both in-domain and out-of-domain datasets, demonstrating successful 4D reconstruction and interpolation capabilities.

Conclusion: Forge4D provides an efficient solution for instant 4D human reconstruction from sparse-view videos, enabling both novel view and novel time synthesis through joint streaming reconstruction and motion prediction.

Abstract: Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by the slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel view and novel time synthesis. Our model simplifies the 4D reconstruction and interpolation problem as a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For the task of streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. For novel time synthesis, we design a novel motion prediction module to predict dense motions for each 3D Gaussian between two adjacent frames, coupled with an occlusion-aware Gaussian fusion process to interpolate 3D Gaussians at arbitrary timestamps. To overcome the lack of the ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss is introduced to ensure motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets. Project page and code at: https://zhenliuzju.github.io/huyingdong/Forge4D.

[587] Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis

Xuecheng Wu, Junxiao Xue, Xinyi Yin, Yunyun Shi, Liangyu Fu, Danlei Huang, Yifan Wang, Jia Zhang, Jiayu Nie, Jun Wang

Main category: cs.CV

TL;DR: AVF-MAE++ is a scalable audio-visual masked autoencoder framework that addresses data limitations in affective video facial analysis through dual masking, enhanced correlation learning, and progressive training.

Details

Motivation: Affective video facial analysis suffers from limited data availability and lacks exploration of scaling properties. Previous methods struggle with capturing intra- and inter-modal correlations in audio-visual representations.

Method: Proposes AVF-MAE++ with dual masking strategy across modalities, enhanced modality encoders, Iterative Audio-Visual Correlation Learning Module, and progressive semantic injection strategy with three training stages.

Result: Achieves state-of-the-art performance across 17 datasets covering three major AVFA tasks, with comprehensive ablation studies validating each component’s importance.

Conclusion: The framework successfully demonstrates the importance of scaling in AVFA and provides effective solutions for cross-modal correlation learning, with publicly released code and models.

Abstract: Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems, yet this field continues to suffer from limited data availability. In recent years, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained momentum, with growing adaptations in its audio-visual contexts. While scaling has proven essential for breakthroughs in general multi-modal learning domains, its specific impact on AVFA remains largely unexplored. Another core challenge in this field is capturing both intra- and inter-modal correlations through scalable audio-visual representations. To tackle these issues, we propose AVF-MAE++, a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA while enhancing cross-modal correlation modeling. Our framework introduces a novel dual masking strategy across audio and visual modalities and strengthens modality encoders with a more holistic design to better support scalable pre-training. Additionally, we present the Iterative Audio-Visual Correlation Learning Module, which improves correlation learning within the SSL paradigm, bridging the limitations of previous methods. To support smooth adaptation and reduce overfitting risks, we further introduce a progressive semantic injection strategy, organizing the model training into three structured stages. Extensive experiments conducted on 17 datasets, covering three major AVFA tasks, demonstrate that AVF-MAE++ achieves consistent state-of-the-art performance across multiple benchmarks. Comprehensive ablation studies further highlight the importance of each proposed component and provide deeper insights into the design choices driving these improvements. Our code and models have been publicly released at Github.

[588] EVLF-FM: Explainable Vision Language Foundation Model for Medicine

Yang Bai, Haoran Cheng, Yang Zhou, Jun Zhou, Arun Thirunavukarasu, Yuhe Ke, Jie Yao, Kanae Fukutsu, Chrystie Wan Ning Quek, Ashley Hong, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Hiok Hong Chan, Victor Koh, Marcus Tan, Kelvin Z. Li, Leonard Yip, Ching Yu Cheng, Yih Chung Tham, Gavin Siew Wei Tan, Leopold Schmetterer, Marcus Ang, Rahat Hussain, Jod Mehta, Tin Aung, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Soon Thye Lim, Eyal Klang, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting

Main category: cs.CV

TL;DR: EVLF-FM is a multimodal vision-language foundation model that unifies broad diagnostic capabilities with fine-grained explainability across multiple medical imaging modalities, achieving state-of-the-art performance in disease diagnosis and visual grounding.

Details

Motivation: Current medical AI foundation models are modality-specific and lack transparent reasoning processes, which hinders clinical adoption. There's a need for unified systems with explainability capabilities.

Method: Developed using over 1.3 million samples from 23 datasets across 11 imaging modalities and 6 clinical specialties. Uses hybrid training combining supervised and visual reinforcement fine-tuning to enable pixel-level visual grounding and reasoning capabilities.

Result: Achieved highest average accuracy (0.858) and F1-score (0.797) in disease diagnostics, outperforming leading models. In medical visual grounding, achieved average mIOU of 0.743 and Acc@0.5 of 0.837 across nine modalities. Strong zero-shot and few-shot performance confirmed in external validation.

Conclusion: EVLF-FM is an early multi-disease VLM model with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.

Abstract: Despite the promise of foundation models in medical AI, current systems remain limited - they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grain explainability. The development and testing of EVLF-FM encompassed over 1.3 million total samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multiple disease diagnosis and visual question answering with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also achieved stellar performance across nine modalities with average mIOU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM model with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.

[589] FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation

Seungwook Kim, Seunghyeon Lee, Minsu Cho

Main category: cs.CV

TL;DR: Training-free inference-time techniques for robot video generation that actively incorporate action parameters to guide diffusion processes, improving action coherence and visual quality.

Details

Motivation: To build effective world models and robotics foundation models by generating realistic robot videos from explicit action trajectories, moving beyond passive action conditioning.

Method: Two inference-time techniques: 1) Action-scaled classifier-free guidance that modulates guidance strength based on action magnitude, and 2) Action-scaled noise truncation that adjusts initial noise distribution to match motion dynamics.

Result: Significant improvements in action coherence and visual quality across diverse robot environments, as demonstrated on real robot manipulation datasets.

Conclusion: Actively incorporating action parameters in diffusion-based video generation through inference-time techniques effectively enhances controllability and realism of generated robot videos.

Abstract: Generating realistic robot videos from explicit action trajectories is a critical step toward building effective world models and robotics foundation models. We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in diffusion-based robot video generation. Instead of treating action vectors as passive conditioning signals, our methods actively incorporate them to guide both the classifier-free guidance process and the initialization of Gaussian latents. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action-scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics. Experiments on real robot manipulation datasets demonstrate that these techniques significantly improve action coherence and visual quality across diverse robot environments.

[590] Cycle Diffusion Model for Counterfactual Image Generation

Fangrui Huang, Alan Wang, Binxu Li, Bailey Trang, Ridvan Yesiloglu, Tianyu Hua, Wei Peng, Ehsan Adeli

Main category: cs.CV

TL;DR: Cycle Diffusion Model (CDM) uses cycle training to fine-tune diffusion models for better conditioning faithfulness and image quality in medical image synthesis.

Details

Motivation: Ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains challenging in medical image synthesis using deep generative models.

Method: A cycle training framework that fine-tunes diffusion models by incorporating cycle constraints to enforce consistency between generated and original images.

Result: Experiments on 3D brain MRI datasets show improved conditioning accuracy and enhanced image quality as measured by FID and SSIM metrics.

Conclusion: The cycle strategy in CDM is effective for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual modeling, and disease progression modeling.

Abstract: Deep generative models have demonstrated remarkable success in medical image synthesis. However, ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains a challenge. In this work, we introduce a cycle training framework to fine-tune diffusion models for improved conditioning adherence and enhanced synthetic image realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency between generated and original images by incorporating cycle constraints, enabling more reliable direct and counterfactual generation. Experiments on a combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and PPMI) show that our method improves conditioning accuracy and enhances image quality as measured by FID and SSIM. The results suggest that the cycle strategy used in CDM can be an effective method for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual, and disease progression modeling.

[591] When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs

Jinming Liu, Zhaoyang Jia, Jiahao Li, Bin Li, Xin Jin, Wenjun Zeng, Yan Lu

Main category: cs.CV

TL;DR: Proposes CoTAM, an image codec tailored for Multimodal Large Language Models that adaptively protects multi-level features to achieve up to 35.99% bitrate savings while maintaining MLLM task performance.

Details

Motivation: Conventional image codecs optimized for human visual fidelity are ill-suited for MLLMs, which require preservation of diverse features for multiple downstream tasks. Compression artifacts unevenly impact different-level features, affecting MLLM performance.

Method: Uses CLIP’s shallow-layer attention to generate importance maps for adaptive bit allocation, and integrates a lightweight decoder adapter with multi-level loss function to reconstruct both low-level details and high-level semantic context.

Result: Achieves up to 35.99% bitrate saving while maintaining same performance on MLLM tasks, outperforming previous state-of-the-art neural codecs.

Conclusion: The proposed CoTAM codec effectively addresses the unique requirements of MLLMs by adaptively protecting multi-level features, demonstrating significant bandwidth efficiency gains without compromising task performance.

Abstract: The increasing deployment of powerful Multimodal Large Language Models (MLLMs), typically hosted on cloud platforms, urgently requires effective compression techniques to efficiently transmit signal inputs (e.g., images, videos) from edge devices with minimal bandwidth usage. However, conventional image codecs are optimized for fidelity to serve the Human Visual System (HVS) and ill-suited for MLLMs, in which diverse downstream tasks are jointly considered. In this paper, we first systematically analyze the impact of compression artifacts on several mainstream MLLMs. We find that: Compression distortion unevenly impacts different-level image features, leading to varying effects on MLLMs’ downstream tasks depending on their feature-level reliance. Motivated by this discovery, we propose an image Codec TAilored to MLLMs (CoTAM) designed to adaptively protect multi-level features and suit different demands of downstream tasks. The encoder leverages CLIP’s shallow-layer attention to generate an importance map for bit allocation, preserving critical semantic regions. Concurrently, the decoder integrates a lightweight adapter with a multi-level loss function to ensure the faithful reconstruction both of low-level details and high-level semantic context for robust synthesis of cross-level features. Extensive experiments validate that our method achieves up to 35.99% bitrate saving while maintaining the same performance on the MLLM tasks, outperforming previous SOTA neural codecs.

[592] TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng

Main category: cs.CV

TL;DR: TemMed-Bench is the first benchmark for analyzing patient condition changes across clinical visits using temporal medical images, revealing LVLMs’ limitations in temporal reasoning and showing multi-modal retrieval augmentation improves performance.

Details

Motivation: Existing medical reasoning benchmarks focus on single-visit analysis, which deviates from real clinical practice where doctors track patient changes over time using historical data.

Method: Created TemMed-Bench with three tasks (VQA, report generation, image-pair selection) and 17K+ knowledge corpus. Evaluated 12 LVLMs and explored multi-modal retrieval augmentation combining visual and textual modalities.

Result: Most LVLMs lack temporal reasoning ability, with many performing at random-guessing level. GPT o3, o4-mini and Claude 3.5 Sonnet performed best but still below desired level. Multi-modal retrieval improved VQA performance by 2.59% on average.

Conclusion: LVLMs have significant limitations in temporal medical image reasoning. Multi-modal retrieval augmentation shows promise for addressing this challenge and should be further explored.

Abstract: Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient’s condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient’s historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients’ conditions between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question-answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we conduct an evaluation of six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze patients’ condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. We also show that multi-modal retrieval augmentation yields notably higher performance gains than no retrieval and textual retrieval alone across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded on real-world clinical practice, and it reveals LVLMs’ limitations in temporal medical image reasoning, as well as highlighting the use of multi-modal retrieval augmentation as a potentially promising direction worth exploring to address this challenge.

[593] S$^2$NN: Sub-bit Spiking Neural Networks

Wenjie Wei, Malu Zhang, Jieyuan Zhang, Ammar Belatreche, Shuai Wang, Yimeng Shan, Hanwen Liu, Honglin Cao, Guoqing Wang, Yang Yang, Haizhou Li

Main category: cs.CV

TL;DR: Sub-bit Spiking Neural Networks (S²NNs) represent weights with less than one bit, achieving superior performance and efficiency for edge computing through outlier-aware quantization and membrane potential-based feature distillation.

Details

Motivation: To address the storage and computational demands of large-scale Spiking Neural Networks (SNNs) for resource-limited deployment, despite recent advances in binary SNNs.

Method: Proposes S²NNs with outlier-aware sub-bit weight quantization (OS-Quant) to mitigate codeword selection bias and membrane potential-based feature distillation (MPFD) for improved performance guidance.

Result: Extensive experiments on vision and non-vision tasks show S²NNs outperform existing quantized SNNs in both performance and efficiency.

Conclusion: S²NNs represent a promising approach for energy-efficient edge computing applications by achieving sub-bit weight representation with improved compression and acceleration capabilities.

Abstract: Spiking Neural Networks (SNNs) offer an energy-efficient paradigm for machine intelligence, but their continued scaling poses challenges for resource-limited deployment. Despite recent advances in binary SNNs, the storage and computational demands remain substantial for large-scale networks. To further explore the compression and acceleration potential of SNNs, we propose Sub-bit Spiking Neural Networks (S$^2$NNs) that represent weights with less than one bit. Specifically, we first establish an S$^2$NN baseline by leveraging the clustering patterns of kernels in well-trained binary SNNs. This baseline is highly efficient but suffers from \textit{outlier-induced codeword selection bias} during training. To mitigate this issue, we propose an \textit{outlier-aware sub-bit weight quantization} (OS-Quant) method, which optimizes codeword selection by identifying and adaptively scaling outliers. Furthermore, we propose a \textit{membrane potential-based feature distillation} (MPFD) method, improving the performance of highly compressed S$^2$NN via more precise guidance from a teacher model. Extensive results on vision and non-vision tasks reveal that S$^2$NN outperforms existing quantized SNNs in both performance and efficiency, making it promising for edge computing applications.

[594] Dynamic Orchestration of Multi-Agent System for Real-World Multi-Image Agricultural VQA

Yan Ke, Xin Yu, Heming Du, Scott Chapman, Helen Huang

Main category: cs.CV

TL;DR: A self-reflective multi-agent framework for agricultural visual question answering that handles multi-image inputs and integrates external context through collaborative roles.

Details

Motivation: Existing agricultural VQA approaches are limited to text-only queries or single images, failing to handle real-world scenarios with multi-image inputs across spatial scales and growth stages, and lacking systematic quality control.

Method: Proposes a four-agent framework: Retriever (gathers external info), Reflector (assesses adequacy and triggers reformulation), Answerer (drafts responses in parallel), Improver (refines answers through iterative checks and multi-image alignment).

Result: Achieves competitive performance on the AgMMU benchmark for multi-image agricultural QA.

Conclusion: The multi-agent collaborative framework effectively addresses limitations of existing approaches by enabling context enrichment, reflective reasoning, and iterative improvement for agricultural VQA.

Abstract: Agricultural visual question answering is essential for providing farmers and researchers with accurate and timely knowledge. However, many existing approaches are predominantly developed for evidence-constrained settings such as text-only queries or single-image cases. This design prevents them from coping with real-world agricultural scenarios that often require multi-image inputs with complementary views across spatial scales, and growth stages. Moreover, limited access to up-to-date external agricultural context makes these systems struggle to adapt when evidence is incomplete. In addition, rigid pipelines often lack systematic quality control. To address this gap, we propose a self-reflective and self-improving multi-agent framework that integrates four roles, the Retriever, the Reflector, the Answerer, and the Improver. They collaborate to enable context enrichment, reflective reasoning, answer drafting, and iterative improvement. A Retriever formulates queries and gathers external information, while a Reflector assesses adequacy and triggers sequential reformulation and renewed retrieval. Two Answerers draft candidate responses in parallel to reduce bias. The Improver refines them through iterative checks while ensuring that information from multiple images is effectively aligned and utilized. Experiments on the AgMMU benchmark show that our framework achieves competitive performance on multi-image agricultural QA.

[595] GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts

Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Main category: cs.CV

TL;DR: GSM8K-V is a new visual mathematical reasoning benchmark created by converting text-based GSM8K problems into multi-image visual format, revealing significant performance gaps in current vision language models.

Details

Motivation: Existing visual mathematical reasoning benchmarks are limited to geometry, lack math word problems, and rarely assess reasoning across multiple images, creating a gap in comprehensive evaluation of VLMs' mathematical reasoning capabilities.

Method: Systematically mapped GSM8K text samples into visual form using an automated image-generation pipeline combined with human annotation, creating 1,319 high-quality multi-image mathematical reasoning samples.

Result: Current VLMs show substantial performance gap between text-based GSM8K (95.22% accuracy for best model) and visual GSM8K-V (46.93% accuracy), indicating visual mathematical reasoning remains a challenging task.

Conclusion: GSM8K-V provides a new benchmark for visual mathematical reasoning that reveals current model limitations and guides development of more robust and generalizable vision language models.

Abstract: Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example. It highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Recently, numerous visual mathematical reasoning benchmarks have been proposed, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.

[596] Skeleton-based Robust Registration Framework for Corrupted 3D Point Clouds

Yongqiang Wang, Weigang Li, Wenping Liu, Zhiqiang Tian, Jinling Li

Main category: cs.CV

TL;DR: A skeleton-based robust registration framework (SRRF) is proposed to handle corrupted point clouds by integrating skeletal structures and combining transformations from both point cloud and skeleton alignments, achieving superior performance over state-of-the-art methods.

Details

Motivation: Real-world point clouds often suffer from sensor limitations, environmental noise, and preprocessing errors, causing density distortions, noise contamination, and geometric deformations that challenge existing registration methods relying on direct point matching or surface features.

Method: The framework introduces a corruption-resilient skeletal representation and integrates skeletal structures into registration. It combines transformations from corrupted point cloud alignment and skeleton alignment, using a distribution distance loss function to enforce consistency between source and target skeletons.

Result: Experimental evaluations on diverse corrupted datasets show SRRF consistently outperforms state-of-the-art registration methods across various corruption scenarios including density distortions, noise contamination, and geometric deformations.

Conclusion: SRRF demonstrates robustness in handling corrupted point clouds by considering both local geometric features and global skeleton structure stability, making it a potential approach for 3D perception tasks in real-world scenarios.

Abstract: Point cloud registration is fundamental in 3D vision applications, including autonomous driving, robotics, and medical imaging, where precise alignment of multiple point clouds is essential for accurate environment reconstruction. However, real-world point clouds are often affected by sensor limitations, environmental noise, and preprocessing errors, making registration challenging due to density distortions, noise contamination, and geometric deformations. Existing registration methods rely on direct point matching or surface feature extraction, which are highly susceptible to these corruptions and lead to reduced alignment accuracy. To address these challenges, a skeleton-based robust registration framework is presented, which introduces a corruption-resilient skeletal representation to improve registration robustness and accuracy. The framework integrates skeletal structures into the registration process and combines the transformations obtained from both the corrupted point cloud alignment and its skeleton alignment to achieve optimal registration. In addition, a distribution distance loss function is designed to enforce the consistency between the source and target skeletons, which significantly improves the registration performance. This framework ensures that the alignment considers both the original local geometric features and the global stability of the skeleton structure, resulting in robust and accurate registration results. Experimental evaluations on diverse corrupted datasets demonstrate that SRRF consistently outperforms state-of-the-art registration methods across various corruption scenarios, including density distortions, noise contamination, and geometric deformations. The results confirm the robustness of SRRF in handling corrupted point clouds, making it a potential approach for 3D perception tasks in real-world scenarios.

[597] An Enhanced Pyramid Feature Network Based on Long-Range Dependencies for Multi-Organ Medical Image Segmentation

Dayu Tan, Cheng Kong, Yansen Su, Hai Chen, Dongliang Yang, Junfeng Xia, Chunhou Zheng

Main category: cs.CV

TL;DR: LamFormer is a U-shaped network for multi-organ medical image segmentation that uses Linear Attention Mamba (LAM) to capture multi-scale long-range dependencies with lower computational cost than Transformers, while maintaining good local detail extraction.

Details

Motivation: Address the high computational cost of Transformers and their deficiencies in extracting local detailed information in multi-organ medical image segmentation tasks.

Method: Proposes LamFormer with Linear Attention Mamba (LAM) in pyramid encoder, Parallel Hierarchical Feature Aggregation (PHFA) module, and Reduced Transformer (RT) for global modeling of up-sampled features.

Result: Outperforms existing segmentation methods on seven complex and diverse datasets, achieving a balance between model performance and complexity.

Conclusion: LamFormer demonstrates exceptional performance in multi-organ medical image segmentation while addressing computational efficiency and local detail extraction limitations of Transformer-based methods.

Abstract: In the field of multi-organ medical image segmentation, recent methods frequently employ Transformers to capture long-range dependencies from image features. However, these methods overlook the high computational cost of Transformers and their deficiencies in extracting local detailed information. To address high computational costs and inadequate local detail information, we reassess the design of feature extraction modules and propose a new deep-learning network called LamFormer for fine-grained segmentation tasks across multiple organs. LamFormer is a novel U-shaped network that employs Linear Attention Mamba (LAM) in an enhanced pyramid encoder to capture multi-scale long-range dependencies. We construct the Parallel Hierarchical Feature Aggregation (PHFA) module to aggregate features from different layers of the encoder, narrowing the semantic gap among features while filtering information. Finally, we design the Reduced Transformer (RT), which utilizes a distinct computational approach to globally model up-sampled features. RRT enhances the extraction of detailed local information and improves the network’s capability to capture long-range dependencies. LamFormer outperforms existing segmentation methods on seven complex and diverse datasets, demonstrating exceptional performance. Moreover, the proposed network achieves a balance between model performance and model complexity.

[598] Robust Partial 3D Point Cloud Registration via Confidence Estimation under Global Context

Yongqiang Wang, Weigang Li, Wenping Liu, Zhe Xu, Zhiqiang Tian

Main category: cs.CV

TL;DR: CEGC is a confidence-driven framework for partial 3D point cloud registration that jointly models overlap confidence and correspondence reliability using global context, outperforming state-of-the-art methods in accuracy and robustness.

Details

Motivation: Partial point cloud registration faces challenges from structural ambiguity, partial visibility, and noise, making accurate alignment difficult in complex scenes.

Method: Uses hybrid overlap confidence estimation (semantic descriptors + geometric similarity) and context-aware matching with global attention to assign soft confidence scores, guiding a differentiable weighted SVD solver for transformation computation.

Result: Outperforms state-of-the-art methods on ModelNet40, ScanObjectNN, and 7Scenes datasets in accuracy, robustness, and generalization.

Conclusion: CEGC provides an interpretable and scalable solution for partial point cloud registration under challenging conditions.

Abstract: Partial point cloud registration is essential for autonomous perception and 3D scene understanding, yet it remains challenging owing to structural ambiguity, partial visibility, and noise. We address these issues by proposing Confidence Estimation under Global Context (CEGC), a unified, confidence-driven framework for robust partial 3D registration. CEGC enables accurate alignment in complex scenes by jointly modeling overlap confidence and correspondence reliability within a shared global context. Specifically, the hybrid overlap confidence estimation module integrates semantic descriptors and geometric similarity to detect overlapping regions and suppress outliers early. The context-aware matching strategy smitigates ambiguity by employing global attention to assign soft confidence scores to correspondences, improving robustness. These scores guide a differentiable weighted singular value decomposition solver to compute precise transformations. This tightly coupled pipeline adaptively down-weights uncertain regions and emphasizes contextually reliable matches. Experiments on ModelNet40, ScanObjectNN, and 7Scenes 3D vision datasets demonstrate that CEGC outperforms state-of-the-art methods in accuracy, robustness, and generalization. Overall, CEGC offers an interpretable and scalable solution to partial point cloud registration under challenging conditions.

[599] UI-UG: A Unified MLLM for UI Understanding and Generation

Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, Hai Rao

Main category: cs.CV

TL;DR: UI-UG is a unified MLLM that combines UI understanding and generation capabilities, achieving SOTA performance on understanding tasks and competitive generation quality with much lower computational cost.

Details

Motivation: MLLMs face challenges in domain-specific UI tasks, particularly in understanding accuracy and generation quality for complex modern user interfaces.

Method: Uses SFT with GRPO for understanding tasks, DPO for generation tasks, and proposes an industrial workflow including LLM-friendly DSL, training strategies, rendering processes, and evaluation metrics.

Result: Achieves SOTA performance on understanding tasks, outperforming larger general-purpose MLLMs and similarly-sized UI-specialized models. Matches larger MLLMs in UI generation at fraction of computational cost.

Conclusion: Integrating understanding and generation tasks improves accuracy and quality for both tasks, demonstrating the effectiveness of the unified approach.

Abstract: Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they are still facing challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on the modern complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks.

[600] ASIA: Adaptive 3D Segmentation using Few Image Annotations

Sai Raj Kishore Perla, Aditya Vora, Sauradip Nag, Ali Mahdavi-Amiri, Hao Zhang

Main category: cs.CV

TL;DR: ASIA is a novel framework for 3D segmentation that uses few image annotations to segment both semantic and non-semantic parts in 3D objects, leveraging diffusion models to transfer 2D segmentations to 3D space.

Details

Motivation: Existing 3D segmentation methods require multi-view images (hard to collect), 3D model annotations (demanding), or ambiguous text descriptions. ASIA aims to provide a more practical solution using only a few user-annotated in-the-wild images.

Method: Leverages text-to-image diffusion models (Stable Diffusion) to transfer segmentations from 2D to 3D. Optimizes text tokens for each segment, fine-tunes with cross-view part correspondence loss, segments multi-view renderings, fuses labels in UV-space via voting, refines with Noise Optimization, and maps back to mesh.

Result: Outperforms existing methods by a noticeable margin in both quantitative and qualitative evaluations. Provides practical and generalizable solution for both semantic and non-semantic 3D segmentation tasks.

Conclusion: ASIA offers a practical framework for controllable 3D segmentation using minimal image annotations, effectively handling both semantic and non-semantic parts even when annotated and target objects differ significantly in geometry or structure.

Abstract: We introduce ASIA (Adaptive 3D Segmentation using few Image Annotations), a novel framework that enables segmentation of possibly non-semantic and non-text-describable “parts” in 3D. Our segmentation is controllable through a few user-annotated in-the-wild images, which are easier to collect than multi-view images, less demanding to annotate than 3D models, and more precise than potentially ambiguous text descriptions. Our method leverages the rich priors of text-to-image diffusion models, such as Stable Diffusion (SD), to transfer segmentations from image space to 3D, even when the annotated and target objects differ significantly in geometry or structure. During training, we optimize a text token for each segment and fine-tune our model with a novel cross-view part correspondence loss. At inference, we segment multi-view renderings of the 3D mesh, fuse the labels in UV-space via voting, refine them with our novel Noise Optimization technique, and finally map the UV-labels back onto the mesh. ASIA provides a practical and generalizable solution for both semantic and non-semantic 3D segmentation tasks, outperforming existing methods by a noticeable margin in both quantitative and qualitative evaluations.

[601] Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, Jun Yu

Main category: cs.CV

TL;DR: Uni-X is a novel unified multimodal model architecture that addresses gradient conflicts between vision and text modalities by using a two-end-separated, middle-shared X-shaped design, achieving superior efficiency and performance compared to standard autoregressive transformers.

Details

Motivation: Standard autoregressive transformers for unified multimodal models suffer from severe gradient conflicts between vision and text modalities, especially in shallow and deep layers, due to fundamentally different low-level statistical properties of images and text.

Method: Proposed Uni-X architecture with modality-specific initial and final layers for low-level processing, while maintaining shared parameters in middle layers for high-level semantic fusion, creating an X-shaped design that eliminates gradient conflicts at both ends.

Result: Uni-X achieves superior training efficiency and, when scaled to 3B parameters, matches or surpasses 7B AR-based UMMs, achieving GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks.

Conclusion: Uni-X establishes a parameter-efficient and scalable foundation for future unified multimodal modeling by effectively resolving gradient conflicts through its X-shaped architecture.

Abstract: Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned. To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers. Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks. These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. Our code is available at https://github.com/CURRENTF/Uni-X

[602] SVGThinker: Instruction-Aligned and Reasoning-Driven Text-to-SVG Generation

Hanqi Chen, Zhongyin Zhao, Ye Chen, Zhujin Liang, Bingbing Ni

Main category: cs.CV

TL;DR: SVGThinker is a reasoning-driven framework for text-to-SVG generation that addresses weak generalization and poor instruction adherence by aligning SVG code production with visualization processes and supporting all SVG primitives.

Details

Motivation: To overcome limitations in text-to-SVG generation including weak generalization and poor adherence to input instructions, while leveraging advances in large language models.

Method: Uses a pipeline that renders primitives sequentially, annotates images and code with multimodal models, builds stepwise updates mirroring primitive addition, and trains LLMs with supervised fine-tuning that exposes chain-of-thought reasoning.

Result: Produces more stable, editable, and higher-quality SVGs than state-of-the-art baselines while preserving vector graphics structural advantages and enabling precise hierarchical editing.

Conclusion: SVGThinker opens new directions for design, content creation, and automated graphics generation by enabling precise and hierarchical editing unlike image-based methods.

Abstract: Scalable Vector Graphics (SVG) is a code-based representation for 2D visuals. Leveraging recent advances in large language models (LLMs), we study text-to-SVG generation and address two persistent gaps: weak generalization and poor adherence to input instructions. We present SVGThinker, a reasoning-driven framework that aligns the production of SVG code with the visualization process and supports the full set of SVG primitives. Our pipeline first renders each primitive in sequence and uses a multimodal model to annotate the image and code; we then build stepwise updates that mirror the incremental addition of primitives. On this data, we train an LLM with supervised fine-tuning that exposes its chain-of-thought as intermediate reasoning, improving robustness and reducing errors and hallucinations. Experiments against state-of-the-art baselines show that SVGThinker produces more stable, editable, and higher-quality SVGs while preserving the structural advantages of vector graphics. Unlike image-based methods, our outputs enable precise and hierarchical editing, opening new directions for design, content creation, and automated graphics generation.

[603] REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport

Soumyadeep Chandra, Kaushik Roy

Main category: cs.CV

TL;DR: REALIGN is a self-supervised framework for procedure learning that uses Regularized Fused Partial Gromov-Wasserstein Optimal Transport to handle variable step orders and irrelevant frames in instructional videos, outperforming prior methods.

Details

Motivation: Real-world instructional videos contain background segments, repeated actions, and non-monotonic step orders that violate assumptions of existing alignment methods, requiring a more robust approach.

Method: Uses Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT) to jointly model visual correspondences and temporal relations, integrated with inter-sequence contrastive learning for stable training.

Result: Achieves up to 18.9% average F1-score improvements and over 30% temporal IoU gains across benchmarks (EgoProceL, ProceL, CrossTask), with more interpretable transport maps.

Conclusion: REALIGN provides a robust framework for procedure learning that effectively handles real-world video variability while preserving key-step orderings and filtering noise.

Abstract: Learning from procedural videos remains a core challenge in self-supervised representation learning, as real-world instructional data often contains background segments, repeated actions, and steps presented out of order. Such variability violates the strong monotonicity assumptions underlying many alignment methods. Prior state-of-the-art approaches, such as OPEL, leverage Kantorovich Optimal Transport (KOT) to build frame-to-frame correspondences, but rely solely on feature similarity and fail to capture the higher-order temporal structure of a task. In this paper, we introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme, enabling robust handling of irrelevant frames, repeated actions, and non-monotonic step orders common in instructional videos. To stabilize training, we integrate FPGWOT distances with inter-sequence contrastive learning, avoiding the need for multiple regularizers and preventing collapse to degenerate solutions. Across egocentric (EgoProceL) and third-person (ProceL, CrossTask) benchmarks, REALIGN achieves up to 18.9% average F1-score improvements and over 30% temporal IoU gains, while producing more interpretable transport maps that preserve key-step orderings and filter out noise.

[604] FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng

Main category: cs.CV

TL;DR: FrameThinker is a novel framework that enables Large Vision-Language Models (LVLMs) to iteratively interrogate long video content through strategic frame selection and dynamic reasoning, achieving state-of-the-art performance with significantly reduced computational cost.

Details

Motivation: Current LVLMs for video understanding suffer from inefficient uniform frame sampling and static textual reasoning, which are inadequate for handling visually intensive long video tasks.

Method: Proposes a two-phase training strategy: 1) Supervised Fine-Tuning (SFT) to teach fundamental action capabilities, followed by 2) Reinforcement Learning (RL) with comprehensive reward design to optimize strategic decision-making for frame selection and reasoning.

Result: Achieves significant improvements (+10.4% average) over baselines while drastically reducing processed frames. The 7B model establishes new SOTA on LongVideo-Reason (76.1% accuracy) using only 20.6 frames on average - outperforming LongVILA-R1 (72.0%) with over 20x fewer frames (vs. 512).

Conclusion: FrameThinker demonstrates unparalleled efficiency and effectiveness in long video reasoning, enabling LVLMs to think strategically about video content through iterative interrogation and optimized frame selection.

Abstract: While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle to handle visually intensive video tasks. To overcome these challenges, in this paper, we introduce the concept of thinking with long videos and propose a novel framework FrameThinker. Within this framework, LVLMs are able to iteratively interrogate video content. Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g. select frame), and designing reward functions to guide LVLMs to adopt the newly introduced action. To solve these challenges, we propose a two-phase training strategy, first employing Supervised Fine-Tuning (SFT) to instill fundamental action capabilities, followed by Reinforcement Learning (RL) to optimize a strategic decision-making policy.Notably, in this RL phase, we conduct an in-depth and comprehensive exploration of the reward design for each action and format reward. Extensive experiments on reasoning benchmarks like Video-Holmes, LongVideo-Reason, and long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames. Most notably, our 7B model, FrameThinker establishes a new state-of-the-art on LongVideo-Reason, achieving 76.1% accuracy using an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness.

[605] Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy

Haijier Chen, Bo Xu, Shoujian Zhang, Haoze Liu, Jiaxuan Lin, Jingrong Wang

Main category: cs.CV

TL;DR: Vid-LLM is a video-based 3D Multimodal Large Language Model that processes video inputs without external 3D data, using geometric priors and a Cross-Task Adapter to achieve superior 3D scene understanding across multiple tasks.

Details

Motivation: Extending multimodal reasoning from 2D to 3D domains is challenging, and existing 3D-MLLMs depend on 3D data inputs which limit scalability and real-world deployment.

Method: Uses video inputs without external 3D data, integrates geometric priors via Cross-Task Adapter, employs Metric Depth Model for geometric consistency, and applies two-stage distillation optimization for training.

Result: Achieves superior performance on 3D Question Answering, 3D Dense Captioning, and 3D Visual Grounding tasks across diverse benchmarks, demonstrating strong multi-task capabilities.

Conclusion: Vid-LLM provides a practical and scalable solution for 3D scene understanding by leveraging video inputs and geometric priors without requiring external 3D data, enabling real-world deployment.

Abstract: Recent developments in Multimodal Large Language Models (MLLMs) have significantly improved Vision-Language (VL) reasoning in 2D domains. However, extending these capabilities to 3D scene understanding remains a major challenge. Existing 3D Multimodal Large Language Models (3D-MLLMs) often depend on 3D data inputs, which limits scalability and generalization. To address this limitation, we propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data, making it practical for real-world deployment. In our method, the geometric prior are directly used to improve the performance of the sceen perception. To integrate the geometric cues into the MLLM compactly, we design a Cross-Task Adapter (CTA) module to align the 3D geometric priors with the vision-language representations. To ensure geometric consistency and integrity, we introduce a Metric Depth Model that recovers real-scale geometry from the reconstruction outputs. Finally, the model is fine-tuned with a two-stage distillation optimization strategy, realizing fast convergence and stabilizes training. Extensive experiments across diverse benchmarks verified the effectiveness of our method on 3D Question Answering, 3D Dense Captioning and 3D Visual Grounding tasks, demonstrating the superior multi-task capabilities.

[606] OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction

Yuhang Cao, Haojun Yan, Danya Yao

Main category: cs.CV

TL;DR: OMeGa is an end-to-end framework that jointly optimizes triangle meshes and 2D Gaussian splats for improved indoor scene reconstruction, addressing geometry inaccuracies in texture-less regions through mesh constraints and iterative refinement.

Details

Motivation: Existing neural rendering methods suffer from inaccurate geometry in texture-less indoor regions and decouple mesh extraction from optimization, missing opportunities to leverage mesh geometry for guiding splat optimization.

Method: Joint optimization of explicit triangle mesh and 2D Gaussian splats via flexible binding strategy, integrating mesh constraints and monocular normal supervision, with iterative mesh-refinement that splits high-error faces and prunes unreliable ones.

Result: Achieves state-of-the-art performance on indoor reconstruction benchmarks, reducing Chamfer-L1 by 47.3% over 2DGS baseline while maintaining competitive novel-view rendering quality.

Conclusion: OMeGa effectively addresses prior limitations in indoor texture-less reconstruction through joint optimization and mesh-guided splat learning.

Abstract: Neural rendering with Gaussian splatting has advanced novel view synthesis, and most methods reconstruct surfaces via post-hoc mesh extraction. However, existing methods suffer from two limitations: (i) inaccurate geometry in texture-less indoor regions, and (ii) the decoupling of mesh extraction from optimization, thereby missing the opportunity to leverage mesh geometry to guide splat optimization. In this paper, we present OMeGa, an end-to-end framework that jointly optimizes an explicit triangle mesh and 2D Gaussian splats via a flexible binding strategy, where spatial attributes of Gaussian Splats are expressed in the mesh frame and texture attributes are retained on splats. To further improve reconstruction accuracy, we integrate mesh constraints and monocular normal supervision into the optimization, thereby regularizing geometry learning. In addition, we propose a heuristic, iterative mesh-refinement strategy that splits high-error faces and prunes unreliable ones to further improve the detail and accuracy of the reconstructed mesh. OMeGa achieves state-of-the-art performance on challenging indoor reconstruction benchmarks, reducing Chamfer-$L_1$ by 47.3% over the 2DGS baseline while maintaining competitive novel-view rendering quality. The experimental results demonstrate that OMeGa effectively addresses prior limitations in indoor texture-less reconstruction.

[607] Towards Foundation Models for Cryo-ET Subtomogram Analysis

Runmin Jiang, Wanyue Feng, Yuntian Yang, Shriya Pingulkar, Hong Wang, Xi Xiao, Xiaoyu Cao, Genpei Zhang, Xiao Wang, Xiaolong Wu, Tianyang Wang, Yang Liu, Xingjian Li, Min Xu

Main category: cs.CV

TL;DR: This paper introduces the first foundation model for cryo-ET subtomogram analysis, featuring a large-scale synthetic data generator (CryoEngine), an adaptive phase tokenization-enhanced vision transformer (APT-ViT), and noise-resilient contrastive learning (NRCL) to address annotation scarcity, severe noise, and poor generalization.

Details

Motivation: Cryo-ET enables in situ visualization of macromolecular structures, but effective analysis is hindered by scarce annotations, severe noise, and poor generalization in subtomogram classification, alignment, and averaging tasks.

Method: Developed CryoEngine for generating 904k synthetic subtomograms from 452 particle classes; designed APT-ViT with adaptive phase tokenization for geometric and semantic robustness; implemented NRCL strategy for noise-resilient representation learning.

Result: Achieved state-of-the-art performance on all three major subtomogram tasks across 24 synthetic and real datasets, with strong generalization to unseen datasets.

Conclusion: The proposed foundation model advances scalable and robust subtomogram analysis in cryo-ET by addressing key challenges through synthetic data generation, equivariance-enhancing architecture, and noise-resilient learning strategies.

Abstract: Cryo-electron tomography (cryo-ET) enables in situ visualization of macromolecular structures, where subtomogram analysis tasks such as classification, alignment, and averaging are critical for structural determination. However, effective analysis is hindered by scarce annotations, severe noise, and poor generalization. To address these challenges, we take the first step towards foundation models for cryo-ET subtomograms. First, we introduce CryoEngine, a large-scale synthetic data generator that produces over 904k subtomograms from 452 particle classes for pretraining. Second, we design an Adaptive Phase Tokenization-enhanced Vision Transformer (APT-ViT), which incorporates adaptive phase tokenization as an equivariance-enhancing module that improves robustness to both geometric and semantic variations. Third, we introduce a Noise-Resilient Contrastive Learning (NRCL) strategy to stabilize representation learning under severe noise conditions. Evaluations across 24 synthetic and real datasets demonstrate state-of-the-art (SOTA) performance on all three major subtomogram tasks and strong generalization to unseen datasets, advancing scalable and robust subtomogram analysis in cryo-ET.

[608] Similarity-Aware Selective State-Space Modeling for Semantic Correspondence

Seungwook Kim, Minsu Cho

Main category: cs.CV

TL;DR: MambaMatcher is a novel semantic correspondence method that uses selective state-space models to efficiently model high-dimensional correlations, achieving state-of-the-art performance while overcoming computational limitations of traditional approaches.

Details

Motivation: Traditional feature-metric methods miss complex inter-correlation relationships, while recent correlation-metric approaches suffer from high computational costs due to processing 4D correlation maps.

Method: Uses selective state-space models (SSMs) with a similarity-aware selective scan mechanism adapted from Mamba’s linear-complexity algorithm to refine 4D correlation maps without compromising feature map resolution or receptive field.

Result: Achieves state-of-the-art performance on standard semantic correspondence benchmarks.

Conclusion: MambaMatcher effectively overcomes limitations of previous methods by efficiently modeling high-dimensional correlations using SSMs.

Abstract: Establishing semantic correspondences between images is a fundamental yet challenging task in computer vision. Traditional feature-metric methods enhance visual features but may miss complex inter-correlation relationships, while recent correlation-metric approaches are hindered by high computational costs due to processing 4D correlation maps. We introduce MambaMatcher, a novel method that overcomes these limitations by efficiently modeling high-dimensional correlations using selective state-space models (SSMs). By implementing a similarity-aware selective scan mechanism adapted from Mamba’s linear-complexity algorithm, MambaMatcher refines the 4D correlation map effectively without compromising feature map resolution or receptive field. Experiments on standard semantic correspondence benchmarks demonstrate that MambaMatcher achieves state-of-the-art performance.

[609] CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers

Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang

Main category: cs.CV

TL;DR: CLQ is a cross-layer guided orthogonal-based quantization method for diffusion transformers (DiTs) that enables W4A4 quantization with minimal performance degradation, achieving 3.98x memory saving and 3.95x speedup.

Details

Motivation: DiTs face practical deployment challenges on edge devices due to large model size and complexity. While model post-training quantization (PTQ) can reduce memory and speed up inference, it causes performance degradation that needs to be mitigated.

Method: CLQ uses three key designs: cross-block calibration (CBC) for accurate calibration data, orthogonal-based smoothing (OBS) to quantify and smooth outlier channels using block Hadamard matrix, and cross-layer parameter searching (CLPS).

Result: CLQ successfully compresses DiTs into W4A4 format with negligible degradation in visual quality and metrics, achieving 3.98x memory saving and 3.95x speedup for both image and video generation models.

Conclusion: CLQ provides an effective quantization solution for DiTs that maintains high visual quality while significantly reducing memory usage and accelerating inference, making DiTs more practical for edge device deployment.

Abstract: Visual generation quality has been greatly promoted with the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, these attributions also hinder the practical deployment of DiTs on edge devices, limiting their development and application. Serve as an efficient model compression technique, model post-training quantization (PTQ) can reduce the memory consumption and speed up the inference, with inevitable performance degradation. To alleviate the degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. To be specific, CLQ consists of three key designs. First, we observe that the calibration data used by most of the PTQ methods can not honestly represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS) to search. We evaluate CLQ with both image generation and video generation models and successfully compress the model into W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at \hyperlink{https://github.com/Kai-Liu001/CLQ}{https://github.com/Kai-Liu001/CLQ}.

[610] TP-MVCC: Tri-plane Multi-view Fusion Model for Silkie Chicken Counting

Sirui Chen, Yuhong Feng, Yifeng Wang, Jianghai Liao, Qi Zhang

Main category: cs.CV

TL;DR: TP-MVCC is a multi-view chicken counting model that uses tri-plane fusion and geometric projection to integrate features from multiple cameras onto a ground plane, achieving 95.1% accuracy in dense, occluded farming scenarios.

Details

Motivation: Accurate animal counting is essential for smart farming but remains difficult in crowded scenes due to occlusions and limited camera views.

Method: Leverages geometric projection and tri-plane fusion to integrate features from multiple cameras onto a unified ground plane, extracting single-view features, aligning them via spatial transformation, and decoding a scene-level density map.

Result: Significantly outperforms single-view and conventional fusion comparisons, achieving 95.1% accuracy and strong robustness in dense, occluded scenarios.

Conclusion: Demonstrates practical potential for intelligent agriculture with superior performance in challenging farming conditions.

Abstract: Accurate animal counting is essential for smart farming but remains difficult in crowded scenes due to occlusions and limited camera views. To address this, we propose a tri-plane-based multi-view chicken counting model (TP-MVCC), which leverages geometric projection and tri-plane fusion to integrate features from multiple cameras onto a unified ground plane. The framework extracts single-view features, aligns them via spatial transformation, and decodes a scene-level density map for precise chicken counting. In addition, we construct the first multi-view dataset of silkie chickens under real farming conditions. Experiments show that TP-MVCC significantly outperforms single-view and conventional fusion comparisons, achieving 95.1% accuracy and strong robustness in dense, occluded scenarios, demonstrating its practical potential for intelligent agriculture.

[611] Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Guolin Ke, Hui Xue

Main category: cs.CV

TL;DR: SphereAR addresses variance collapse in autoregressive image generation by constraining VAE latents to a fixed-radius hypersphere, achieving state-of-the-art results for AR models on ImageNet.

Details

Motivation: Autoregressive models for image generation suffer from heterogeneous variance in VAE latents that gets amplified during decoding, especially under classifier-free guidance, leading to variance collapse and performance gaps compared to diffusion and masked-generation models.

Method: Proposes SphereAR which constrains all AR inputs and outputs to lie on a fixed-radius hypersphere using hyperspherical VAEs, removing the scale component that causes variance collapse and stabilizing AR decoding.

Result: SphereAR-H (943M) achieves FID 1.34 on ImageNet, setting new SOTA for AR models. Smaller variants SphereAR-L (479M) and SphereAR-B (208M) achieve FID 1.54 and 1.92 respectively, matching or surpassing much larger baselines like MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92).

Conclusion: SphereAR is the first pure next-token autoregressive image generator with raster order that surpasses diffusion and masked-generation models at comparable parameter scales, demonstrating the effectiveness of hyperspherical constraints for stabilizing AR image generation.

Abstract: Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs – including after CFG – to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.

[612] NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis

Yixuan Ren, Hanyu Wang, Hao Chen, Bo He, Abhinav Shrivastava

Main category: cs.CV

TL;DR: NeRV-Diffusion is an implicit latent video diffusion model that synthesizes videos by generating neural network weights, which form an implicit neural representation (INR) to decode videos from frame indices.

Details

Motivation: To enable efficient and high-quality video synthesis by compressing and generating videos holistically as unified neural networks, avoiding temporal cross-frame attentions in denoising and using dedicated decoders.

Method: Two-stage framework: 1) Hypernetwork-based tokenizer encodes videos to neural parameter space, 2) Implicit diffusion transformer denoises latent INR weights. Uses Gaussian-distributed INR weights with reused bottleneck latent, reformed weight assignment, upsampling connections, and input coordinates.

Result: Superior video generation quality over previous INR-based models and comparable performance to state-of-the-art non-implicit models on UCF-101 and Kinetics-600 benchmarks. Creates smooth INR weight space for seamless interpolations.

Conclusion: NeRV-Diffusion successfully demonstrates that generating videos via neural network weights enables efficient high-quality synthesis with smooth interpolation capabilities, bridging the gap between INR-based and traditional video generation approaches.

Abstract: We present NeRV-Diffusion, an implicit latent video diffusion model that synthesizes videos via generating neural network weights. The generated weights can be rearranged as the parameters of a convolutional neural network, which forms an implicit neural representation (INR), and decodes into videos with frame indices as the input. Our framework consists of two stages: 1) A hypernetworkbased tokenizer that encodes raw videos from pixel space to neural parameter space, where the bottleneck latent serves as INR weights to decode. 2) An implicit diffusion transformer that denoises on the latent INR weights. In contrast to traditional video tokenizers that encode videos into frame-wise feature maps, NeRV-Diffusion compresses and generates a video holistically as a unified neural network. This enables efficient and high-quality video synthesis via obviating temporal cross-frame attentions in the denoiser and decoding video latent with dedicated decoders. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the bottleneck latent across all NeRV layers, as well as reform its weight assignment, upsampling connection and input coordinates. We also introduce SNR-adaptive loss weighting and scheduled sampling for effective training of the implicit diffusion model. NeRV-Diffusion reaches superior video generation quality over previous INR-based models and comparable performance to most recent state-of-the-art non-implicit models on real-world video benchmarks including UCF-101 and Kinetics-600. It also brings a smooth INR weight space that facilitates seamless interpolations between frames or videos.

[613] LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation

Heechang Kim, Gwanghyun Kim, Se Young Chun

Main category: cs.CV

TL;DR: A zero-shot inference-time optimization method that integrates Laban Effort and Shape components into text-guided motion generation models for fine-grained expressive control of human motion.

Details

Motivation: Achieving fine-grained expressive motion control in text-to-motion synthesis is challenging due to limited motion style diversity in datasets and difficulty expressing quantitative characteristics in natural language.

Method: Proposes a zero-shot inference-time optimization method that updates text embeddings of pretrained diffusion models during sampling to guide motion generation toward desired Laban Effort and Shape components without additional motion data.

Result: The approach successfully manipulates motion attributes according to target Laban tags, yielding diverse expressive motion qualities while preserving motion identity.

Conclusion: The method enables interpretable and expressive control of human motion generation by integrating Laban movement analysis quantification methods into text-guided motion generation models.

Abstract: Diverse human motion generation is an increasingly important task, having various applications in computer vision, human-computer interaction and animation. While text-to-motion synthesis using diffusion models has shown success in generating high-quality motions, achieving fine-grained expressive motion control remains a significant challenge. This is due to the lack of motion style diversity in datasets and the difficulty of expressing quantitative characteristics in natural language. Laban movement analysis has been widely used by dance experts to express the details of motion including motion quality as consistent as possible. Inspired by that, this work aims for interpretable and expressive control of human motion generation by seamlessly integrating the quantification methods of Laban Effort and Shape components into the text-guided motion generation models. Our proposed zero-shot, inference-time optimization method guides the motion generation model to have desired Laban Effort and Shape components without any additional motion data by updating the text embedding of pretrained diffusion models during the sampling step. We demonstrate that our approach yields diverse expressive motion qualities while preserving motion identity by successfully manipulating motion attributes according to target Laban tags.

[614] DRIFT: Divergent Response in Filtered Transformations for Robust Adversarial Defense

Amira Guesmi, Muhammad Shafique

Main category: cs.CV

TL;DR: DRIFT is a stochastic ensemble defense that disrupts gradient consensus to prevent adversarial transferability, achieving strong robustness with minimal overhead.

Details

Motivation: Deep neural networks are vulnerable to adversarial examples, and most defenses fail when gradients can be reliably estimated. The key vulnerability is gradient consensus - where randomized transformations yield aligned gradients that attackers exploit.

Method: DRIFT uses a stochastic ensemble of lightweight, learnable filters trained to actively disrupt gradient consensus through gradient dissonance. It combines prediction consistency, Jacobian separation, logit-space separation, and adversarial robustness in training.

Result: DRIFT achieves substantial robustness gains on ImageNet across CNNs and Vision Transformers, outperforming state-of-the-art preprocessing, adversarial training, and diffusion-based defenses under various attack types.

Conclusion: Gradient divergence is established as a practical and generalizable principle for adversarial defense, with DRIFT delivering strong protection with negligible runtime and memory costs.

Abstract: Deep neural networks remain highly vulnerable to adversarial examples, and most defenses collapse once gradients can be reliably estimated. We identify \emph{gradient consensus} – the tendency of randomized transformations to yield aligned gradients – as a key driver of adversarial transferability. Attackers exploit this consensus to construct perturbations that remain effective across transformations. We introduce \textbf{DRIFT} (Divergent Response in Filtered Transformations), a stochastic ensemble of lightweight, learnable filters trained to actively disrupt gradient consensus. Unlike prior randomized defenses that rely on gradient masking, DRIFT enforces \emph{gradient dissonance} by maximizing divergence in Jacobian- and logit-space responses while preserving natural predictions. Our contributions are threefold: (i) we formalize gradient consensus and provide a theoretical analysis linking consensus to transferability; (ii) we propose a consensus-divergence training strategy combining prediction consistency, Jacobian separation, logit-space separation, and adversarial robustness; and (iii) we show that DRIFT achieves substantial robustness gains on ImageNet across CNNs and Vision Transformers, outperforming state-of-the-art preprocessing, adversarial training, and diffusion-based defenses under adaptive white-box, transfer-based, and gradient-free attacks. DRIFT delivers these improvements with negligible runtime and memory cost, establishing gradient divergence as a practical and generalizable principle for adversarial defense.

[615] Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs

Yuanshuai Li, Yuping Yan, Junfeng Tang, Yunxuan Li, Zeqi Zheng, Yaochu Jin

Main category: cs.CV

TL;DR: SCPO is a novel alignment framework for Multimodal Large Language Models that uses semantic curriculum learning to reduce visual hallucinations by up to 62.9% while maintaining general capabilities.

Details

Motivation: MLLMs suffer from visual hallucinations where responses contradict visual evidence, and existing DPO methods fail to capture fine-grained semantic differences and encourage shortcut learning.

Method: Proposes Semantic Curriculum Preference Optimization (SCPO) with progressive easy-to-hard curriculum using Semantic Curriculum Preference Pairs dataset, dynamic reference model, and symmetric bidirectional objective for simultaneous text and visual preference learning.

Result: SCPO reduces hallucination rate by up to 62.9% on LLaVA models across various scales and versions, improves factuality while preserving general capabilities, and maintains stable performance on general vision-language benchmarks.

Conclusion: SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLM alignment, effectively mitigating visual hallucinations while maintaining model capabilities.

Abstract: Multimodal Large Language Models (MLLMs) have significantly improved the performance of various tasks, but continue to suffer from visual hallucinations, a critical issue where generated responses contradict visual evidence. While Direct Preference Optimization(DPO) is widely used for alignment, its application to MLLMs often fails to capture fine-grained semantic differences and encourages shortcut learning. To address these challenges, we propose Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty. This curriculum is trained with a dynamic reference model and a novel symmetric, bidirectional objective to facilitate simultaneous learning from both textual and visual preferences. To our knowledge, SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLMs alignment, effectively mitigating visual hallucinations. Extensive experiments on LLaVA models across various scales and versions validate that SCPO demonstrates superior performance compared to baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%. Moreover, evaluations on generalized benchmarks show that SCPO improves factuality while preserving general capabilities, with its performance remaining stable across general vision-language benchmarks.

[616] Real-Aware Residual Model Merging for Deepfake Detection

Jinhee Park, Guisik Kim, Choongsang Cho, Junseok Kwon

Main category: cs.CV

TL;DR: R²M is a training-free model merging framework for deepfake detection that combines specialized detectors by decomposing them into shared Real components and Fake residuals, enabling effective detection without retraining when new forgery methods emerge.

Details

Motivation: Deepfake generators evolve rapidly, making exhaustive data collection and repeated retraining impractical. Model merging is suitable for deepfake detection since specialists share the same binary decision but differ in generator-specific artifacts.

Method: Real-aware Residual Model Merging (R²M) estimates a shared Real component via low-rank factorization, decomposes specialists into Real-aligned parts and Fake residuals, denoises residuals with layerwise rank truncation, and aggregates them with per-task norm matching.

Result: R²M outperforms joint training and other merging baselines across in-distribution, cross-dataset, and unseen-dataset scenarios. It maintains composability - new forgery families only require fine-tuning one specialist and re-merging.

Conclusion: R²M provides an effective training-free solution for deepfake detection that adapts to evolving generators through composable model merging, eliminating the need for exhaustive retraining while maintaining strong performance across diverse scenarios.

Abstract: Deepfake generators evolve quickly, making exhaustive data collection and repeated retraining impractical. We argue that model merging is a natural fit for deepfake detection: unlike generic multi-task settings with disjoint labels, deepfake specialists share the same binary decision and differ in generator-specific artifacts. Empirically, we show that simple weight averaging preserves Real representations while attenuating Fake-specific cues. Building upon these findings, we propose Real-aware Residual Model Merging (R$^2$M), a training-free parameter-space merging framework. R$^2$M estimates a shared Real component via a low-rank factorization of task vectors, decomposes each specialist into a Real-aligned part and a Fake residual, denoises residuals with layerwise rank truncation, and aggregates them with per-task norm matching to prevent any single generator from dominating. A concise rationale explains why a simple head suffices: the Real component induces a common separation direction in feature space, while truncated residuals contribute only minor off-axis variations. Across in-distribution, cross-dataset, and unseen-dataset, R$^2$M outperforms joint training and other merging baselines. Importantly, R$^2$M is also composable: when a new forgery family appears, we fine-tune one specialist and re-merge, eliminating the need for retraining.

[617] DINOReg: Strong Point Cloud Registration with Vision Foundation Model

Congjia Chen, Yufu Qu

Main category: cs.CV

TL;DR: DINOReg is a point cloud registration network that effectively fuses visual features from DINOv2 with geometric features at patch level, achieving state-of-the-art performance on RGB-D datasets.

Details

Motivation: Existing point cloud registration methods rely mainly on geometric information and don't fully exploit the rich texture and semantic information available in RGB-D images. Current multi-modal approaches perform feature fusion in an image-lossy manner.

Method: Uses DINOv2 to extract visual features from images and fuses them with geometric features at patch level. Introduces mixed positional embedding to encode both image space and point cloud space positional information.

Result: Achieves significant improvements on RGBD-3DMatch and RGBD-3DLoMatch datasets: 14.2% increase in patch inlier ratio and 15.7% increase in registration recall compared to state-of-the-art methods.

Conclusion: The proposed DINOReg effectively combines visual and geometric information through patch-level fusion and mixed positional embedding, demonstrating superior performance in point cloud registration tasks.

Abstract: Point cloud registration is a fundamental task in 3D computer vision. Most existing methods rely solely on geometric information for feature extraction and matching. Recently, several studies have incorporated color information from RGB-D data into feature extraction. Although these methods achieve remarkable improvements, they have not fully exploited the abundant texture and semantic information in images, and the feature fusion is performed in an image-lossy manner, which limit their performance. In this paper, we propose DINOReg, a registration network that sufficiently utilizes both visual and geometric information to solve the point cloud registration problem. Inspired by advances in vision foundation models, we employ DINOv2 to extract informative visual features from images, and fuse visual and geometric features at the patch level. This design effectively combines the rich texture and global semantic information extracted by DINOv2 with the detailed geometric structure information captured by the geometric backbone. Additionally, a mixed positional embedding is proposed to encode positional information from both image space and point cloud space, which enhances the model’s ability to perceive spatial relationships between patches. Extensive experiments on the RGBD-3DMatch and RGBD-3DLoMatch datasets demonstrate that our method achieves significant improvements over state-of-the-art geometry-only and multi-modal registration methods, with a 14.2% increase in patch inlier ratio and a 15.7% increase in registration recall. The code is publicly available at https://github.com/ccjccjccj/DINOReg.

[618] Mask Clustering-based Annotation Engine for Large-Scale Submeter Land Cover Mapping

Hao Chen, Fang Xu, Tamer Saleh, Weifeng Hao, Gui-Song Xia

Main category: cs.CV

TL;DR: MCAE is a mask clustering-based annotation engine that enables efficient large-scale land cover mapping of submeter resolution imagery by treating semantically consistent mask groups as minimal annotating units, improving annotation efficiency by 1-2 orders of magnitude.

Details

Motivation: Submeter resolution imagery offers fine-grained detail but is underutilized for large-scale land cover mapping due to lack of sufficient high-quality annotated datasets. Existing annotation methods are unreliable or prohibitively expensive given the rich visual detail and massive data volumes.

Method: Proposed Mask Clustering-based Annotation Engine (MCAE) inspired by spatial autocorrelation principle. It treats semantically consistent mask groups as minimal annotating units to enable simultaneous annotation of multiple instances.

Result: Built HiCity-LC dataset with ~14 billion labeled pixels, supporting city-scale land cover maps across five major Chinese cities with classification accuracies above 85%. First publicly available submeter resolution city-level land cover benchmark.

Conclusion: MCAE demonstrates scalability and practical utility for large-scale submeter resolution mapping, significantly improving annotation efficiency while preserving label quality, semantic diversity, and spatial representativeness.

Abstract: Recent advances in remote sensing technology have made submeter resolution imagery increasingly accessible, offering remarkable detail for fine-grained land cover analysis. However, its full potential remains underutilized - particularly for large-scale land cover mapping - due to the lack of sufficient, high-quality annotated datasets. Existing labels are typically derived from pre-existing products or manual annotation, which are often unreliable or prohibitively expensive, particularly given the rich visual detail and massive data volumes of submeter imagery. Inspired by the spatial autocorrelation principle, which suggests that objects of the same class tend to co-occur with similar visual features in local neighborhoods, we propose the Mask Clustering-based Annotation Engine (MCAE), which treats semantically consistent mask groups as the minimal annotating units to enable efficient, simultaneous annotation of multiple instances. It significantly improves annotation efficiency by one to two orders of magnitude, while preserving label quality, semantic diversity, and spatial representativeness. With MCAE, we build a high-quality annotated dataset of about 14 billion labeled pixels, referred to as HiCity-LC, which supports the generation of city-scale land cover maps across five major Chinese cities with classification accuracies above 85%. It is the first publicly available submeter resolution city-level land cover benchmark, highlighting the scalability and practical utility of MCAE for large-scale, submeter resolution mapping. The dataset is available at https://github.com/chenhaocs/MCAE

[619] CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon

Main category: cs.CV

TL;DR: Consistency Mid-Training (CMT) introduces a lightweight intermediate training stage between diffusion pre-training and flow map training that stabilizes training, reduces computational costs, and achieves state-of-the-art few-step generation performance.

Details

Motivation: Existing flow map models like Consistency Models and Mean Flow suffer from unstable training, sensitivity to hyperparameters, and high computational costs when converting diffusion models to few-step generators.

Method: CMT trains a model to map points along a solver trajectory from a pre-trained diffusion model directly to the solver-generated clean sample, creating a trajectory-consistent and stable initialization for flow map training.

Result: CMT achieves state-of-the-art two-step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time compared to Consistency Models.

Conclusion: CMT provides a principled, efficient, and general framework for training flow map models that significantly reduces training time and computational requirements while improving generation quality.

Abstract: Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state of the art two step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time, compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.

[620] PCICF: A Pedestrian Crossing Identification and Classification Framework

Junyi Gu, Beatriz Cabrero-Daniel, Ali Nouri, Lydia Armini, Christian Berger

Main category: cs.CV

TL;DR: PCICF is a framework for systematically identifying and classifying vulnerable road user (VRU) situations to support operational design domain (ODD) incident analysis for robotaxis, using space-filling curves to transform multi-dimensional scenario features into patterns matched against the MoreSMIRK dataset.

Details

Motivation: Robotaxis operate in urban ODDs and need reliable VRU detection. End-to-end AI systems require high-quality data for training and evaluation. Current datasets like SMIRK only cover single pedestrians, limiting analysis of complex multi-pedestrian crossing situations.

Method: Extended SMIRK dataset to MoreSMIRK with multi-pedestrian crossing situations. Used space-filling curves (SFCs) to transform multi-dimensional scenario features into characteristic patterns for matching with MoreSMIRK dictionary entries.

Result: Successfully evaluated PCICF on PIE dataset (150+ annotated pedestrian crossing videos). Framework can identify and classify complex pedestrian crossings, including when groups merge or split. Computationally efficient enough for potential onboard ODD detection.

Conclusion: PCICF provides effective systematic identification and classification of VRU situations for robotaxi ODD analysis. Framework shows promise for real-world deployment and is available as open-source with complete dataset and algorithms.

Abstract: We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestrians, bicyclists, or e-scooter riders. To better handle such varied traffic situations, end-to-end AI, which directly compute vehicle control actions from multi-modal sensor data instead of only for perception, is on the rise. High quality data is needed for systematically training and evaluating such systems within their OOD. In this work, we propose PCICF, a framework to systematically identify and classify VRU situations to support ODD’s incident analysis. We base our work on the existing synthetic dataset SMIRK, and enhance it by extending its single-pedestrian-only design into the MoreSMIRK dataset, a structured dictionary of multi-pedestrian crossing situations constructed systematically. We then use space-filling curves (SFCs) to transform multi-dimensional features of scenarios into characteristic patterns, which we match with corresponding entries in MoreSMIRK. We evaluate PCICF with the large real-world dataset PIE, which contains more than 150 manually annotated pedestrian crossing videos. We show that PCICF can successfully identify and classify complex pedestrian crossings, even when groups of pedestrians merge or split. By leveraging computationally efficient components like SFCs, PCICF has even potential to be used onboard of robotaxis for OOD detection for example. We share an open-source replication package for PCICF containing its algorithms, the complete MoreSMIRK dataset and dictionary, as well as our experiment results presented in: https://github.com/Claud1234/PCICF

[621] CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D

Mohamad Amin Mirzaei, Pantea Amoie, Ali Ekhterachian, Matin Mirzababaei

Main category: cs.CV

TL;DR: The paper proposes an improved method for 3D semantic mapping using SemanticSAM for better mask generation and context-aware CLIP encoding for richer semantic context, achieving superior performance in 3D scene understanding tasks.

Details

Motivation: Existing zero-shot 3D semantic mapping methods produce fragmented masks and inaccurate semantic assignments due to direct use of raw masks from vision-language models, limiting effectiveness in complex environments.

Method: Leverages SemanticSAM with progressive granularity refinement for more accurate object-level masks, and employs context-aware CLIP encoding with multiple contextual views and empirical weighting for richer semantic context.

Result: Experimental evaluation on multiple 3D scene understanding tasks shows significant improvements over existing methods in 3D semantic segmentation and object retrieval from language queries across several benchmark datasets.

Conclusion: The approach effectively addresses limitations of previous methods by improving mask generation quality and semantic context integration, demonstrating superior performance in 3D scene understanding applications.

Abstract: 3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, mitigating the over-segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context-aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.

[622] RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis

Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang

Main category: cs.CV

TL;DR: RapidMV is a fast text-to-multi-view generative model that produces 32 multi-view images in ~5 seconds using a novel spatio-angular latent space for improved efficiency and consistency.

Details

Motivation: To bridge the gap between text prompts and 3D asset generation by efficiently creating synthetic multi-view images, addressing the need for faster and more consistent multi-view generation.

Method: Proposes a novel spatio-angular latent space that encodes both spatial appearance and angular viewpoint deviations into a single latent, with a strategically decomposed multi-step training process.

Result: Outperforms existing methods in consistency and latency, with competitive quality and text-image alignment, generating 32 multi-view images in approximately 5 seconds.

Conclusion: RapidMV provides an efficient and consistent solution for text-to-multi-view generation, serving as a crucial bridge for synthetic 3D asset creation with superior speed and multi-view consistency.

Abstract: Generating synthetic multi-view images from a text prompt is an essential bridge to generating synthetic 3D assets. In this work, we introduce RapidMV, a novel text-to-multi-view generative model that can produce 32 multi-view synthetic images in just around 5 seconds. In essence, we propose a novel spatio-angular latent space, encoding both the spatial appearance and angular viewpoint deviations into a single latent for improved efficiency and multi-view consistency. We achieve effective training of RapidMV by strategically decomposing our training process into multiple steps. We demonstrate that RapidMV outperforms existing methods in terms of consistency and latency, with competitive quality and text-image alignment.

[623] Proxy-GS: Efficient 3D Gaussian Splatting via Proxy Mesh

Yuanyuan Gao, Yuning Gong, Yifei Liu, Li Jingfeng, Zhihang Zhong, Dingwen Zhang, Yanci Zhang, Dan Xu, Xiao Sun

Main category: cs.CV

TL;DR: Proxy-GS introduces occlusion awareness to 3D Gaussian Splatting using a fast proxy system that produces occlusion depth maps in under 1ms, enabling efficient Gaussian culling and improved rendering quality in occluded scenes.

Details

Motivation: Current 3D Gaussian Splatting methods and their MLP-based variants suffer from significant redundancy due to lack of occlusion awareness, leading to computation overhead and reduced rendering efficiency in large-scale scenes.

Method: A fast proxy system generates precise occlusion depth maps (1000x1000 resolution in <1ms) that guides: 1) culling of anchors and Gaussians for rendering acceleration, and 2) densification towards surfaces during training to avoid inconsistencies in occluded regions.

Result: In heavily occluded scenarios like MatrixCity Streets, Proxy-GS achieves over 2.5x speedup compared to Octree-GS while delivering substantially higher rendering quality, especially for MLP-based Gaussian splatting variants.

Conclusion: Proxy-GS effectively addresses occlusion redundancy in 3D Gaussian Splatting through proxy-guided culling and training, achieving both faster rendering speeds and improved visual fidelity in complex scenes.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view. At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at a resolution of 1000x1000 under 1ms. This proxy serves two roles: first, it guides the culling of anchors and Gaussians to accelerate rendering speed. Second, it guides the densification towards surfaces during training, avoiding inconsistencies in occluded regions, and improving the rendering quality. In heavily occluded scenarios, such as the MatrixCity Streets dataset, Proxy-GS not only equips MLP-based Gaussian splatting with stronger rendering capability but also achieves faster rendering speed. Specifically, it achieves more than 2.5x speedup over Octree-GS, and consistently delivers substantially higher rendering quality. Code will be public upon acceptance.

Runmin Zhang, Jialiang Wang, Si-Yuan Cao, Zhu Yu, Junchen Yu, Guangyi Zhang, Hui-Liang Shen

Main category: cs.CV

TL;DR: DCFlow is an unsupervised cross-modal flow estimation framework that uses decoupled optimization and cross-modal consistency constraints to address modality discrepancies and geometric misalignment without ground-truth flow supervision.

Details

Motivation: Previous approaches implicitly learn flow estimation from appearance similarity alone, which fails to address modality discrepancy and geometric misalignment between different modalities.

Method: Uses collaborative training of modality transfer and flow estimation networks with a decoupled optimization strategy, geometry-aware data synthesis pipeline, outlier-robust loss, and cross-modal consistency constraint.

Result: Achieves state-of-the-art performance among unsupervised approaches and can be integrated with various flow estimation networks, as demonstrated on a comprehensive cross-modal flow benchmark.

Conclusion: DCFlow effectively addresses cross-modal flow estimation challenges through its novel decoupled optimization and consistency constraints, outperforming previous unsupervised methods.

Abstract: This work presents DCFlow, a novel unsupervised cross-modal flow estimation framework that integrates a decoupled optimization strategy and a cross-modal consistency constraint. Unlike previous approaches that implicitly learn flow estimation solely from appearance similarity, we introduce a decoupled optimization strategy with task-specific supervision to address modality discrepancy and geometric misalignment distinctly. This is achieved by collaboratively training a modality transfer network and a flow estimation network. To enable reliable motion supervision without ground-truth flow, we propose a geometry-aware data synthesis pipeline combined with an outlier-robust loss. Additionally, we introduce a cross-modal consistency constraint to jointly optimize both networks, significantly improving flow prediction accuracy. For evaluation, we construct a comprehensive cross-modal flow benchmark by repurposing public datasets. Experimental results demonstrate that DCFlow can be integrated with various flow estimation networks and achieves state-of-the-art performance among unsupervised approaches.

[625] UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu, Fenglong Song, Chun-Le Guo, Chang Liu, Fan Li, Jie Chen

Main category: cs.CV

TL;DR: UI2V-Bench is a new benchmark for evaluating Image-to-Video (I2V) models, focusing on semantic understanding and reasoning rather than just video quality and temporal consistency.

Details

Motivation: Existing I2V evaluation benchmarks overlook models' ability to understand specific subject semantics and ensure generated videos align with physical laws and human commonsense.

Method: Proposes UI2V-Bench with four evaluation dimensions (spatial understanding, attribute binding, category understanding, reasoning) using MLLM-based evaluation methods including instance-level pipeline and feedback-based reasoning pipeline, plus human evaluation.

Result: Evaluated various open-source and closed-source I2V models using ~500 text-image pairs, showing strong alignment between MLLM-based metrics and human evaluations.

Conclusion: UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, providing a robust framework for future research and model development.

Abstract: Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model’s ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

[626] Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs

Mohamad Ballout, Okajevo Wilfred, Seyedalireza Yaghoubi, Nohayr Muhammad Abdelmoneim, Julius Mayer, Elia Bruni

Main category: cs.CV

TL;DR: SPLICE is a human-curated benchmark from COIN dataset for evaluating event-based reasoning across temporal, causal, spatial, contextual, and knowledge dimensions. It shows VLMs significantly underperform humans in rearranging event sequences, relying more on language priors than visual understanding.

Details

Motivation: To probe event-based reasoning capabilities across multiple dimensions and assess the gap between human and machine visual reasoning, particularly in understanding complex event sequences.

Method: Created SPLICE benchmark with 3,381 human-filtered videos from COIN dataset, segmented into 11,423 event clips. Evaluated humans and state-of-the-art VLMs on rearranging clips into coherent sequences. Used human-annotated textual descriptions to test reliance on language vs visual understanding.

Result: Significant performance gap between humans and VLMs. Human-annotated text improved model accuracy but not human performance, indicating VLMs rely more on language priors. VLMs performed better on temporal/causal reasoning tasks than contextual/spatial ones, and better on everyday tasks than specialized ones.

Conclusion: VLMs still fall short of human-level visual reasoning capabilities, with persistent challenges in understanding complex event sequences. Models show stronger performance on temporal/causal reasoning and everyday tasks, but struggle with contextual/spatial reasoning and specialized domains.

Abstract: In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.

[627] NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding

Yanpeng Zhao, Shanyan Guan, Yunbo Wang, Yanhao Ge, Wei Li, Xiaokang Yang

Main category: cs.CV

TL;DR: NeoWorld is a deep learning framework that generates interactive 3D virtual worlds from single images using a hybrid approach that renders key foreground objects in full 3D while synthesizing backgrounds in 2D for efficiency.

Details

Motivation: Inspired by on-demand worldbuilding from Simulacron-3, the goal is to create expansive virtual environments where only actively explored regions are rendered with high realism, overcoming limitations of global world generation and 2D hallucination methods.

Method: Uses object-centric 3D representations with hybrid scene structure - key foreground objects modeled in full 3D while backgrounds and non-interacted regions synthesized in 2D. Implements cutting-edge representation learning and object-to-3D techniques for flexible viewpoint manipulation and scene animation controlled by natural language commands.

Result: Significantly outperforms existing 2D and depth-layered 2.5D methods on the WorldScore benchmark. Enables progressive unfolding of virtual worlds with increasing 3D detail as users interact, delivering dynamic, immersive, and visually coherent exploration experiences.

Conclusion: NeoWorld successfully demonstrates a novel approach to interactive 3D world generation that balances visual realism with computational efficiency through its hybrid 3D/2D rendering strategy, enabling physically plausible scene animation and natural language control.

Abstract: We introduce NeoWorld, a deep learning framework for generating interactive 3D virtual worlds from a single input image. Inspired by the on-demand worldbuilding concept in the science fiction novel Simulacron-3 (1964), our system constructs expansive environments where only the regions actively explored by the user are rendered with high visual realism through object-centric 3D representations. Unlike previous approaches that rely on global world generation or 2D hallucination, NeoWorld models key foreground objects in full 3D, while synthesizing backgrounds and non-interacted regions in 2D to ensure efficiency. This hybrid scene structure, implemented with cutting-edge representation learning and object-to-3D techniques, enables flexible viewpoint manipulation and physically plausible scene animation, allowing users to control object appearance and dynamics using natural language commands. As users interact with the environment, the virtual world progressively unfolds with increasing 3D detail, delivering a dynamic, immersive, and visually coherent exploration experience. NeoWorld significantly outperforms existing 2D and depth-layered 2.5D methods on the WorldScore benchmark.

[628] VNODE: A Piecewise Continuous Volterra Neural Network

Siddharth Roheda, Aniruddha Bala, Rohit Chowdhury, Rohan Jaiswal

Main category: cs.CV

TL;DR: VNODE combines Volterra filtering with neural ODEs for image classification, using discrete feature extraction and continuous state evolution to achieve better performance with fewer parameters.

Details

Motivation: Inspired by the visual cortex's alternating discrete and continuous processing, the authors aim to create a more efficient neural network architecture that captures complex patterns while reducing parameter count.

Method: Hybrid approach alternating between discrete Volterra feature extraction and continuous ODE-driven state evolution, integrating nonlinear Volterra filtering with neural ordinary differential equations.

Result: Consistently outperforms state-of-the-art models on benchmark datasets (CIFAR10, Imagenet1K) with improved computational complexity and substantially fewer parameters than conventional deep architectures.

Conclusion: VNODE successfully demonstrates that combining discrete Volterra filtering with continuous neural ODEs creates an efficient and effective architecture for image classification tasks.

Abstract: This paper introduces Volterra Neural Ordinary Differential Equations (VNODE), a piecewise continuous Volterra Neural Network that integrates nonlinear Volterra filtering with continuous time neural ordinary differential equations for image classification. Drawing inspiration from the visual cortex, where discrete event processing is interleaved with continuous integration, VNODE alternates between discrete Volterra feature extraction and ODE driven state evolution. This hybrid formulation captures complex patterns while requiring substantially fewer parameters than conventional deep architectures. VNODE consistently outperforms state of the art models with improved computational complexity as exemplified on benchmark datasets like CIFAR10 and Imagenet1K.

[629] Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks

Hangil Park, Yongmin Seo, Tae-Kyun Kim

Main category: cs.CV

TL;DR: A dual-model ensemble approach using knowledge distillation for general anomaly detection, achieving state-of-the-art performance across both industrial and semantic anomaly detection tasks.

Details

Motivation: Current anomaly detection methods are biased towards industrial inspection and struggle with generalization across different domains like semantic anomaly detection. Existing general AD methods are sensitive to dataset-specific settings and single-class tasks.

Method: Proposes a dual-model ensemble with teacher-student knowledge distillation: an Encoder-Decoder model for patch-level industrial defects and an Encoder-Encoder model for semantic AD, both using shared DINOv2 encoder. Uses Noisy-OR objective and joint probability scoring.

Result: Achieved state-of-the-art performance on 8 benchmarks: 99.7% AUROC on MVTec-AD (industrial) and 97.8% on CIFAR-10 (semantic), outperforming prior general AD models and even specialist models in multi-class settings.

Conclusion: The proposed method successfully bridges the gap between industrial and semantic anomaly detection, demonstrating strong generalization across multiple domains in both single-class and multi-class settings.

Abstract: Anomaly detection (AD) plays an important role in various real-world applications. Recent advancements in AD, however, are often biased towards industrial inspection, struggle to generalize to broader tasks like semantic anomaly detection and vice versa. Although recent methods have attempted to address general anomaly detection, their performance remains sensitive to dataset-specific settings and single-class tasks. In this paper, we propose a novel dual-model ensemble approach based on knowledge distillation (KD) to bridge this gap. Our framework consists of a teacher and two student models: an Encoder-Decoder model, specialized in detecting patch-level minor defects for industrial AD and an Encoder-Encoder model, optimized for semantic AD. Both models leverage a shared pre-trained encoder (DINOv2) to extract high-quality feature representations. The dual models are jointly learned using the Noisy-OR objective, and the final anomaly score is obtained using the joint probability via local and semantic anomaly scores derived from the respective models. We evaluate our method on eight public benchmarks under both single-class and multi-class settings: MVTec-AD, MVTec-LOCO, VisA and Real-IAD for industrial inspection and CIFAR-10/100, FMNIST and View for semantic anomaly detection. The proposed method achieved state-of-the-art accuracies in both domains, in multi-class as well as single-class settings, demonstrating generalization across multiple domains of anomaly detection. Our model achieved an image-level AUROC of 99.7% on MVTec-AD and 97.8% on CIFAR-10, which is significantly better than the prior general AD models in multi-class settings and even higher than the best specialist models on individual benchmarks.

[630] Performance-Efficiency Trade-off for Fashion Image Retrieval

Julio Hurtado, Haoran Ni, Duygu Sap, Connor Mattinson, Martin Lotz

Main category: cs.CV

TL;DR: A selective representation framework for second-hand fashion image retrieval that shrinks databases to 10% of original size while maintaining accuracy through clustering, coreset selection, and outlier removal.

Details

Motivation: The fashion industry's environmental impact drives need for scalable second-hand marketplaces, requiring efficient large-scale valuation of used garments through image retrieval.

Method: Combines clustering and coreset selection to identify representative samples, plus neighbor-homogeneity consistency score for outlier removal to filter uncharacteristic samples before selection.

Result: Maintains near-optimal retrieval accuracy while reducing computational costs by 90% (database size reduced to 10%), with outlier removal further improving performance by eliminating non-discriminative samples.

Conclusion: Strategic pruning and selective representation enables efficient second-hand fashion retrieval with minimal accuracy loss, supporting scalable marketplace operations.

Abstract: The fashion industry has been identified as a major contributor to waste and emissions, leading to an increased interest in promoting the second-hand market. Machine learning methods play an important role in facilitating the creation and expansion of second-hand marketplaces by enabling the large-scale valuation of used garments. We contribute to this line of work by addressing the scalability of second-hand image retrieval from databases. By introducing a selective representation framework, we can shrink databases to 10% of their original size without sacrificing retrieval accuracy. We first explore clustering and coreset selection methods to identify representative samples that capture the key features of each garment and its internal variability. Then, we introduce an efficient outlier removal method, based on a neighbour-homogeneity consistency score measure, that filters out uncharacteristic samples prior to selection. We evaluate our approach on three public datasets: DeepFashion Attribute, DeepFashion Con2Shop, and DeepFashion2. The results demonstrate a clear performance-efficiency trade-off by strategically pruning and selecting representative vectors of images. The retrieval system maintains near-optimal accuracy, while greatly reducing computational costs by reducing the images added to the vector database. Furthermore, applying our outlier removal method to clustering techniques yields even higher retrieval performance by removing non-discriminative samples before the selection.

[631] Robust Multimodal Semantic Segmentation with Balanced Modality Contributions

Jiaqi Tan, Xu Zheng, Fangyu Li, Yang Liu

Main category: cs.CV

TL;DR: EQUISeg is a multimodal segmentation framework that addresses modality imbalance through equal encoding and a self-guided module with mutual guidance mechanism.

Details

Motivation: Existing multimodal segmentation methods suffer from imbalanced modal dependencies where performance degrades significantly when a dominant modality deteriorates in real-world scenarios.

Method: Proposes EQUISeg with four-stage Cross-modal Transformer Block (CMTB) for efficient multimodal fusion and hierarchical selection, plus Self-guided Module (SGM) with mutual guidance mechanism to adaptively adjust modality contributions.

Result: Extensive experiments on multiple datasets demonstrate significant performance gains and effective alleviation of modality imbalance adverse effects.

Conclusion: EQUISeg successfully balances modality contributions and enhances robustness under degraded conditions in multimodal segmentation tasks.

Abstract: Multimodal semantic segmentation enhances model robustness by exploiting cross-modal complementarities. However, existing methods often suffer from imbalanced modal dependencies, where overall performance degrades significantly once a dominant modality deteriorates in real-world scenarios. Thus, modality balance has become acritical challenge for practical multimodal segmentation. To address this issue, we propose EQUISeg, a multimodal segmentation framework that balances modality contributions through equal encoding of modalities. Built upon a four-stage Cross-modal Transformer Block(CMTB), EQUISeg enables efficient multimodal fusion and hierarchical selection. Furthermore, we design a Self-guided Module(SGM) that mitigates modality imbalance by introducing a mutual guidance mechanism, enabling each modality to adaptively adjust its contribution and enhance robustness under degraded conditions. Extensive experiments on multiple datasets demonstrate that EQUISeg achieves significant performance gains and effectively alleviates the adverse effects of modality imbalance in segmentation tasks.

[632] Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency

Jiaqi Tan, Fangyu Li, Yang Liu

Main category: cs.CV

TL;DR: QL-Adapter is a framework for multiple object editing that addresses challenges in enforcing object counts, spatial layouts, and handling diverse categories through two core modules: Image-Layout Fusion Module and Cross-Modal Augmentation Module.

Details

Motivation: Standard CLIP text encoders often fail in complex scenes with many objects during instruction-driven image editing, particularly in maintaining object counts and spatial layouts across diverse categories.

Method: QL-Adapter uses two modules: ILFM fuses layout priors with ViT patch tokens from CLIP image encoder to enhance spatial understanding, and CMAM injects image features into the text branch to enrich textual embeddings and improve instruction following.

Result: QL-Adapter achieves state-of-the-art performance on the QL-Edit task and significantly outperforms existing models, as demonstrated through extensive experiments on the QL-Dataset benchmark.

Conclusion: The proposed QL-Adapter framework effectively addresses the challenges of multiple object editing by combining layout fusion and cross-modal augmentation, enabling better instruction following and spatial structure understanding in complex scenes.

Abstract: Instruction driven image editing with standard CLIP text encoders often fails in complex scenes with many objects. We present QL-Adapter, a framework for multiple object editing that tackles two challenges: enforcing object counts and spatial layouts, and accommodating diverse categories. QL-Adapter consists of two core modules: the Image-Layout Fusion Module (ILFM) and the Cross-Modal Augmentation Module (CMAM). ILFM fuses layout priors with ViT patch tokens from the CLIP image encoder to strengthen spatial structure understanding. CMAM injects image features into the text branch to enrich textual embeddings and improve instruction following. We further build QL-Dataset, a benchmark that spans broad category, layout, and count variations, and define the task of quantity and layout consistent image editing (QL-Edit). Extensive experiments show that QL-Adapter achieves state of the art performance on QL-Edit and significantly outperforms existing models.

[633] SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie

Main category: cs.CV

TL;DR: SANA-Video is an efficient diffusion model for generating high-resolution (720x1280), minute-long videos with fast inference speed, deployable on consumer GPUs like RTX 5090.

Details

Motivation: To enable efficient, high-quality video generation with low computational cost and fast inference speed, making video generation more accessible.

Method: Uses Linear DiT with linear attention for efficiency and constant-memory KV cache for block linear attention to enable long video generation with fixed memory cost.

Result: Achieves competitive performance with state-of-the-art models while being 16x faster in latency, with 2.4x speedup in inference (29s vs 71s for 5-second 720p video), trained in only 12 days on 64 H100 GPUs (1% of MovieGen’s cost).

Conclusion: SANA-Video enables low-cost, high-quality video generation with efficient architecture designs that make high-resolution, long-duration video synthesis practical on consumer hardware.

Abstract: We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.

[634] Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis

Kaizhen Zhu, Mokai Pan, Zhechuan Yu, Jingya Wang, Jingyi Yu, Ye Shi

Main category: cs.CV

TL;DR: This paper provides the first unified theoretical and experimental comparison between Diffusion Bridge and Flow Matching models, showing that Diffusion Bridge has lower cost functions and more stable trajectories, while Flow Matching becomes less effective with reduced training data.

Details

Motivation: There is confusion about which approach (Diffusion Bridge vs Flow Matching) is generally preferable, and substantial discrepancies in their modeling assumptions have hindered unified theoretical understanding of their relative merits.

Method: Recast both frameworks through Stochastic Optimal Control and Optimal Transport perspectives, propose a novel Diffusion Bridge architecture using latent Transformer, and implement Flow Matching with same structure for fair comparison across multiple tasks.

Result: Theoretical analysis shows Diffusion Bridge has lower cost function and more stable trajectories. Flow Matching’s interpolation coefficients become ineffective with reduced training data. Experiments across 6 tasks confirm theoretical predictions.

Conclusion: The paper systematically delineates the respective advantages and disadvantages of Diffusion Bridge and Flow Matching models, providing guidance for practitioners on when to use each approach based on theoretical and empirical evidence.

Abstract: Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients $t$ and $1-t$ of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Inpainting, Super-Resolution, Deblurring, Denoising, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models. Our code is available at https://anonymous.4open.science/r/DBFM-3E8E/.

[635] Foggy Crowd Counting: Combining Physical Priors and KAN-Graph

Yuhao Wang, Zhuoran Zheng, Han Hu, Dianjie Lu, Guijuan Zhang, Chen Lyu

Main category: cs.CV

TL;DR: A crowd counting method for foggy environments that combines atmospheric scattering physics with deep learning, using differentiable scattering models and novel network architectures to handle haze degradation.

Details

Motivation: Address key challenges in crowd counting under foggy conditions: long-range target blurring, local feature degradation, and image contrast attenuation caused by atmospheric scattering.

Method: 1) Differentiable atmospheric scattering model with transmittance dynamic estimation and scattering parameter calibration; 2) MSA-KAN network based on Kolmogorov-Arnold Theorem for learnable edge activation; 3) Weather-aware GCN that dynamically constructs spatial adjacency matrices.

Result: Achieves 12.2%-27.5% reduction in MAE metrics compared to mainstream algorithms on four public datasets in dense fog scenarios.

Conclusion: The synergistic optimization of physical mechanisms and data-driven approaches significantly improves crowd counting accuracy under complex meteorological conditions.

Abstract: Aiming at the key challenges of crowd counting in foggy environments, such as long-range target blurring, local feature degradation, and image contrast attenuation, this paper proposes a crowd-counting method with a physical a priori of atmospheric scattering, which improves crowd counting accuracy under complex meteorological conditions through the synergistic optimization of the physical mechanism and data-driven.Specifically, first, the method introduces a differentiable atmospheric scattering model and employs transmittance dynamic estimation and scattering parameter adaptive calibration techniques to accurately quantify the nonlinear attenuation laws of haze on targets with different depths of field.Secondly, the MSA-KAN was designed based on the Kolmogorov-Arnold Representation Theorem to construct a learnable edge activation function. By integrating a multi-layer progressive architecture with adaptive skip connections, it significantly enhances the model’s nonlinear representation capability in feature-degraded regions, effectively suppressing feature confusion under fog interference.Finally, we further propose a weather-aware GCN that dynamically constructs spatial adjacency matrices using deep features extracted by MSA-KAN. Experiments on four public datasets demonstrate that our method achieves a 12.2%-27.5% reduction in MAE metrics compared to mainstream algorithms in dense fog scenarios.

[636] VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding

Yizhuo Ding, Mingkang Chen, Zhibang Feng, Tong Xiao, Wanying Qu, Wenqi Shao, Yanwei Fu

Main category: cs.CV

TL;DR: VTPerception-R1 is a two-stage framework that improves multimodal reasoning by decoupling perception from reasoning, using explicit perception strategies with textual cues to enhance accuracy and robustness.

Details

Motivation: Multimodal large language models often fail to ground reasoning in perceptual evidence, leading to poor performance on tasks requiring visual understanding.

Method: Two-stage framework: Stage 1 uses perception-augmented fine-tuning, Stage 2 applies perception-aware reinforcement learning with visual, textual, and consistency rewards.

Result: Significantly improves reasoning accuracy and robustness across diverse multimodal tasks, with best improvements seen in smaller models.

Conclusion: VTPerception-R1 provides a scalable and auditable solution for perception-grounded multimodal reasoning, demonstrating the effectiveness of explicit perception strategies.

Abstract: Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies-explicit, implicit, visual, and textual-across four multimodal benchmarks and two MLLMs. Our findings show that explicit perception, especially when paired with textual cues, consistently yields the best improvements, particularly for smaller models. Based on this insight, we propose VTPerception-R1, a unified two-stage framework that decouples perception from reasoning. Stage 1 introduces perception-augmented fine-tuning, and Stage 2 applies perception-aware reinforcement learning with novel visual, textual, and consistency rewards. Experiments demonstrate that VTPerception-R1 significantly improves reasoning accuracy and robustness across diverse tasks, offering a scalable and auditable solution for perception-grounded multimodal reasoning. Our code is available at: https://github.com/yizhuoDi/VTPerceprion-R1.

[637] TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

Zhifang Zhang, Qiqi Tao, Jiaqi Lv, Na Zhao, Lei Feng, Joey Tianyi Zhou

Main category: cs.CV

TL;DR: TokenSwap is a stealthy backdoor attack on large vision-language models that subtly disrupts object relationship understanding rather than forcing fixed target patterns, making it harder to detect.

Details

Motivation: Existing backdoor attacks on LVLMs use fixed target patterns that are easy to detect due to model overconfidence. TokenSwap aims to create more evasive attacks by focusing on disrupting compositional understanding.

Method: TokenSwap injects visual triggers into training samples while swapping grammatical roles of key tokens in textual answers. It uses an adaptive token-weighted loss to emphasize learning of swapped tokens, associating visual triggers with bags-of-words behavior.

Result: TokenSwap achieves high attack success rates while maintaining superior evasiveness and stealthiness across multiple benchmarks and various LVLM architectures.

Conclusion: TokenSwap demonstrates that backdoor attacks can be made more stealthy by targeting compositional understanding rather than using fixed patterns, highlighting new security vulnerabilities in LVLMs.

Abstract: Large vision-language models (LVLMs) have achieved impressive performance across a wide range of vision-language tasks, while they remain vulnerable to backdoor attacks. Existing backdoor attacks on LVLMs aim to force the victim model to generate a predefined target pattern, which is either inserted into or replaces the original content. We find that these fixed-pattern attacks are relatively easy to detect, because the attacked LVLM tends to memorize such frequent patterns in the training dataset, thereby exhibiting overconfidence on these targets given poisoned inputs. To address these limitations, we introduce TokenSwap, a more evasive and stealthy backdoor attack that focuses on the compositional understanding capabilities of LVLMs. Instead of enforcing a fixed targeted content, TokenSwap subtly disrupts the understanding of object relationships in text. Specifically, it causes the backdoored model to generate outputs that mention the correct objects in the image but misrepresent their relationships (i.e., bags-of-words behavior). During training, TokenSwap injects a visual trigger into selected samples and simultaneously swaps the grammatical roles of key tokens in the corresponding textual answers. However, the poisoned samples exhibit only subtle differences from the original ones, making it challenging for the model to learn the backdoor behavior. To address this, TokenSwap employs an adaptive token-weighted loss that explicitly emphasizes the learning of swapped tokens, such that the visual triggers and bags-of-words behavior are associated. Extensive experiments demonstrate that TokenSwap achieves high attack success rates while maintaining superior evasiveness and stealthiness across multiple benchmarks and various LVLM architectures.

[638] SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics

Peter Hönig, Stefan Thalhammer, Jean-Baptiste Weibel, Matthias Hirschmanner, Markus Vincze

Main category: cs.CV

TL;DR: SCOPE is a diffusion-based category-level object pose estimation model that uses DINOv2 features as continuous semantic priors to eliminate the need for discrete category labels, enabling generalization to both known and unknown object categories.

Details

Motivation: Robots in open environments encounter unknown objects requiring semantic understanding for manipulation. Current methods struggle with generalization beyond known categories and face Sim2Real gaps in pose estimation.

Method: Combines DINOv2 features with photorealistic training data and a noise model for point normals. Uses cross-attention to inject continuous semantic priors for learning canonicalized object coordinate systems across instances.

Result: Outperforms state-of-the-art in synthetically trained category-level object pose estimation with 31.9% relative improvement on 5°5cm metric. Achieves up to 100% success rate in grasping unseen objects from unknown categories.

Conclusion: SCOPE enables effective object pose estimation and manipulation for both known and unknown object categories through continuous semantic priors, bridging the Sim2Real gap and supporting generalization beyond training distribution.

Abstract: Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects, which requires semantic understanding in order to generalize both to known categories and beyond. To resolve this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9% on the 5$^\circ$5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100%. Code available: https://github.com/hoenigpeter/scope.

[639] Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

Main category: cs.CV

TL;DR: Causal-Adapter is a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation using causal interventions and attribute regularization strategies.

Details

Motivation: To enable precise counterfactual image generation by propagating causal effects to dependent attributes while preserving core image identity, overcoming limitations of prompt engineering approaches that lack explicit causal structure.

Method: Leverages structural causal modeling with two attribute regularization strategies: prompt-aligned injection for semantic control and conditioned token contrastive loss for disentangling attribute factors and reducing spurious correlations.

Result: Achieves state-of-the-art performance with 91% MAE reduction on Pendulum dataset for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation.

Conclusion: The approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation in image generation.

Abstract: We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.

[640] BFSM: 3D Bidirectional Face-Skull Morphable Model

Zidu Wang, Meng Xu, Miao Xu, Hengyuan Ma, Jiankuo Zhao, Xutao Li, Xiangyu Zhu, Zhen Lei

Main category: cs.CV

TL;DR: A joint face-skull morphable model (BFSM) that enables bidirectional shape inference between face and skull through shared coefficients, with applications in medical diagnostics and surgical planning.

Details

Motivation: Address the scarcity of paired face-skull data, improve registration accuracy, and include underrepresented craniofacial deformity cases for more inclusive medical applications.

Method: Constructed dataset with 200+ samples including normal and craniofacial conditions, proposed dense ray matching registration for topological consistency, and developed 3D bidirectional morphable model with tissue thickness modeling.

Result: Extensive experiments confirm robustness and accuracy; model enables 3D face-skull reconstruction from single images and surgical planning prediction.

Conclusion: BFSM successfully addresses key challenges in face-skull modeling and demonstrates strong potential for medical applications including diagnostics and surgical planning.

Abstract: Building a joint face-skull morphable model holds great potential for applications such as remote diagnostics, surgical planning, medical education, and physically based facial simulation. However, realizing this vision is constrained by the scarcity of paired face-skull data, insufficient registration accuracy, and limited exploration of reconstruction and clinical applications. Moreover, individuals with craniofacial deformities are often overlooked, resulting in underrepresentation and limited inclusivity. To address these challenges, we first construct a dataset comprising over 200 samples, including both normal cases and rare craniofacial conditions. Each case contains a CT-based skull, a CT-based face, and a high-fidelity textured face scan. Secondly, we propose a novel dense ray matching registration method that ensures topological consistency across face, skull, and their tissue correspondences. Based on this, we introduce the 3D Bidirectional Face-Skull Morphable Model (BFSM), which enables shape inference between the face and skull through a shared coefficient space, while also modeling tissue thickness variation to support one-to-many facial reconstructions from the same skull, reflecting individual changes such as fat over time. Finally, we demonstrate the potential of BFSM in medical applications, including 3D face-skull reconstruction from a single image and surgical planning prediction. Extensive experiments confirm the robustness and accuracy of our method. BFSM is available at https://github.com/wang-zidu/BFSM

[641] Comprehensive Benchmarking of YOLOv11 Architectures for Scalable and Granular Peripheral Blood Cell Detection

Mohamad Abou Ali, Mariam Abdulfattah, Baraah Al Hussein, Fadi Dornaika, Ali Cherry, Mohamad Hajj-Hassan, Lara Hamawy

Main category: cs.CV

TL;DR: Systematic evaluation of YOLOv11 variants for peripheral blood smear detection, showing Medium variant achieves best trade-off between accuracy and efficiency with mAP@0.5 of 0.934.

Details

Motivation: Manual peripheral blood smear analysis is labor intensive and subjective, while deep learning offers promising alternatives but lacks systematic evaluation of state-of-the-art models like YOLOv11 for fine-grained detection.

Method: Curated large-scale annotated dataset (16,891 images, 298,850 cells across 12 PBC classes plus RBC class) and conducted comprehensive evaluation of five YOLOv11 variants (Nano to XLarge) under two data splitting strategies (70:20:10 and 80:10:10).

Result: YOLOv11 Medium variant achieves best trade-off with mAP@0.5 of 0.934 under 8:1:1 split. Larger models provide only marginal accuracy gains at substantially higher computational cost. 8:1:1 split consistently outperforms 7:2:1 split across all models.

Conclusion: YOLOv11, particularly the Medium variant, is highly effective for automated fine-grained PBS detection. The publicly released dataset provides valuable resource for advancing blood cell detection research in hematology.

Abstract: Manual peripheral blood smear (PBS) analysis is labor intensive and subjective. While deep learning offers a promising alternative, a systematic evaluation of state of the art models such as YOLOv11 for fine grained PBS detection is still lacking. In this work, we make two key contributions. First, we curate a large scale annotated dataset for blood cell detection and classification, comprising 16,891 images across 12 peripheral blood cell (PBC) classes, along with the red blood cell class, all carefully re annotated for object detection tasks. In total, the dataset contains 298,850 annotated cells. Second, we leverage this dataset to conduct a comprehensive evaluation of five YOLOv11 variants (ranging from Nano to XLarge). These models are rigorously benchmarked under two data splitting strategies (70:20:10 and 80:10:10) and systematically assessed using multiple performance criteria, including mean Average Precision (mAP), precision, recall, F1 score, and computational efficiency. Our experiments show that the YOLOv11 Medium variant achieves the best trade off, reaching a mAP@0.5 of 0.934 under the 8:1:1 split. Larger models (Large and XLarge) provide only marginal accuracy gains at substantially higher computational cost. Moreover, the 8:1:1 split consistently outperforms the 7:2:1 split across all models. These findings highlight YOLOv11, particularly the Medium variant, as a highly effective framework for automated, fine grained PBS detection. Beyond benchmarking, our publicly released dataset (github.com/Mohamad-AbouAli/OI-PBC-Dataset) offers a valuable resource to advance research on blood cell detection and classification in hematology.

[642] Biomechanical-phase based Temporal Segmentation in Sports Videos: a Demonstration on Javelin-Throw

Bikash Kumar Badatya, Vipul Baghel, Jyotirmoy Amin, Ravi Hegde

Main category: cs.CV

TL;DR: Novel unsupervised framework using structured optimal transport with ASTGCN for automatic temporal segmentation of javelin-throw motion phases, outperforming state-of-the-art methods with 71.02% mAP and 74.61% F1-score.

Details

Motivation: Traditional sports analytics methods are manual, time-consuming, costly, and lack scalability. Automatic extraction of kinetic variables requires robust temporal segmentation without expensive manual labeling.

Method: Unsupervised framework combining structured optimal transport (SOT) with Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) for contextually aware motion phase segmentation.

Result: Achieved 71.02% mean average precision (mAP) and 74.61% F1-score on test data, substantially outperforming competing unsupervised baselines. Released dataset of 211 annotated professional javelin-throw videos.

Conclusion: The proposed unsupervised framework successfully enables automatic identification of motion phase transitions in javelin-throw without manual labeling, providing a scalable solution for sports analytics.

Abstract: Precise analysis of athletic motion is central to sports analytics, particularly in disciplines where nuanced biomechanical phases directly impact performance outcomes. Traditional analytics techniques rely on manual annotation or laboratory-based instrumentation, which are time-consuming, costly, and lack scalability. Automatic extraction of relevant kinetic variables requires a robust and contextually appropriate temporal segmentation. Considering the specific case of elite javelin-throw, we present a novel unsupervised framework for such a contextually aware segmentation, which applies the structured optimal transport (SOT) concept to augment the well-known Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN). This enables the identification of motion phase transitions without requiring expensive manual labeling. Extensive experiments demonstrate that our approach outperforms state-of-the-art unsupervised methods, achieving 71.02% mean average precision (mAP) and 74.61% F1-score on test data, substantially higher than competing baselines. We also release a new dataset of 211 manually annotated professional javelin-throw videos with frame-level annotations, covering key biomechanical phases: approach steps, drive, throw, and recovery.

[643] FreeRet: MLLMs as Training-Free Retrievers

Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang

Main category: cs.CV

TL;DR: FreeRet is a plug-and-play framework that converts any multimodal large language model (MLLM) into a two-stage retriever without additional training, using semantic embeddings for fast search and reasoning for precise reranking.

Details

Motivation: MLLMs require heavy post-hoc training to become contrastive encoders for retrieval. This work explores whether off-the-shelf MLLMs can serve as powerful retrievers without additional training.

Method: FreeRet uses a two-stage approach: 1) derives semantically grounded embeddings directly from MLLMs for fast candidate search, bypassing lexical alignment layers; 2) exploits MLLM’s reasoning ability for precise reranking, using explicit priors and neutral choice framing to mitigate framing effects.

Result: On MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. It is model-agnostic, scales across MLLM families and sizes, preserves generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation in end-to-end RAG.

Conclusion: Pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.

Abstract: Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.

[644] RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement

Zhu, Libo, Zhou, Zihan, Liu, Xiaoyang, Zhang, Weihang, Shi, Keyu, Fu, Yifan, Zhang, Yulun

Main category: cs.CV

TL;DR: RIFLE is a diffusion-based framework that removes flicker-banding artifacts from screen photos while preserving details, using a flicker-banding prior estimator and masked loss for targeted restoration.

Details

Motivation: Flicker-banding artifacts in photos of emissive displays degrade readability and quality, but this problem remains underexplored compared to moire degradation.

Method: Proposes RIFLE framework with flicker-banding prior estimator to predict banding attributes, masked loss for focused supervision, and a simulation pipeline to generate realistic training data with stochastic jitter and noise.

Result: RIFLE outperforms recent image reconstruction baselines across quantitative metrics and visual comparisons on real-world datasets, from mild to severe flicker-banding.

Conclusion: This is the first work to research simulation and removal of flicker-banding, establishing foundations for future research in dataset construction and removal model design.

Abstract: Capturing screens is now routine in our everyday lives. But the photographs of emissive displays are often influenced by the flicker-banding (FB), which is alternating bright%u2013dark stripes that arise from temporal aliasing between a camera’s rolling-shutter readout and the display’s brightness modulation. Unlike moire degradation, which has been extensively studied, the FB remains underexplored despite its frequent and severe impact on readability and perceived quality. We formulate FB removal as a dedicated restoration task and introduce Removal of Image Flicker-Banding via Latent Diffusion Enhancement, RIFLE, a diffusion-based framework designed to remove FB while preserving fine details. We propose the flicker-banding prior estimator (FPE) that predicts key banding attributes and injects it into the restoration network. Additionally, Masked Loss (ML) is proposed to concentrate supervision on banded regions without sacrificing global fidelity. To overcome data scarcity, we provide a simulation pipeline that synthesizes FB in the luminance domain with stochastic jitter in banding angle, banding spacing, and banding width. Feathered boundaries and sensor noise are also applied for a more realistic simulation. For evaluation, we collect a paired real-world FB dataset with pixel-aligned banding-free references captured via long exposure. Across quantitative metrics and visual comparisons on our real-world dataset, RIFLE consistently outperforms recent image reconstruction baselines from mild to severe flicker-banding. To the best of our knowledge, it is the first work to research the simulation and removal of FB. Our work establishes a great foundation for subsequent research in both the dataset construction and the removal model design. Our dataset and code will be released soon.

[645] Learning Object-Centric Representations Based on Slots in Real World Scenarios

Adil Kaan Akan

Main category: cs.CV

TL;DR: A framework that adapts pretrained diffusion models for object-centric synthesis by integrating slot-based conditioning, enabling fine-grained object control while maintaining scene coherence for both images and videos.

Details

Motivation: To bridge the gap between holistic diffusion models and object-level editing needs, enabling fine-grained, controllable image and video generation that aligns with human object-based perception.

Method: Integrates lightweight slot-based conditioning into pretrained diffusion models using register tokens for background/style and slot-conditioned modules for objects. For video, uses Invariant Slot Attention (ISA) to separate object identity from pose and Transformer-based temporal aggregation.

Result: Achieves state-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation. For video, establishes new benchmarks in unsupervised video object segmentation and reconstruction, supporting advanced editing tasks without explicit supervision.

Conclusion: Establishes a general and scalable approach to object-centric generative modeling that bridges human perception and machine learning, expanding possibilities for interactive and structured generative tools across various domains.

Abstract: A central goal in AI is to represent scenes as compositions of discrete objects, enabling fine-grained, controllable image and video generation. Yet leading diffusion models treat images holistically and rely on text conditioning, creating a mismatch for object-level editing. This thesis introduces a framework that adapts powerful pretrained diffusion models for object-centric synthesis while retaining their generative capacity. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot-based conditioning into pretrained models, preserving their visual priors while providing object-specific manipulation. For images, SlotAdapt augments diffusion models with a register token for background/style and slot-conditioned modules for objects, reducing text-conditioning bias and achieving state-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation. We further extend the framework to video. Using Invariant Slot Attention (ISA) to separate object identity from pose and a Transformer-based temporal aggregator, our approach maintains consistent object representations and dynamics across frames. This yields new benchmarks in unsupervised video object segmentation and reconstruction, and supports advanced editing tasks such as object removal, replacement, and insertion without explicit supervision. Overall, this work establishes a general and scalable approach to object-centric generative modeling for images and videos. By bridging human object-based perception and machine learning, it expands the design space for interactive, structured, and user-driven generative tools in creative, scientific, and practical domains.

[646] Vehicle Classification under Extreme Imbalance: A Comparative Study of Ensemble Learning and CNNs

Abu Hanif Muhammad Syarubany

Main category: cs.CV

TL;DR: Vehicle type recognition using ensembles and CNNs on imbalanced dataset, with best CNN achieving 79.19% accuracy but struggling with rare classes like Barges.

Details

Motivation: Class imbalance in public datasets suppresses performance on rare vehicle categories, limiting intelligent transportation and logistics applications.

Method: Created 16-class corpus (~47k images) from multiple sources, used SMOTE oversampling and undersampling, benchmarked lightweight ensembles (Random Forest, AdaBoost, soft-voting) against configurable ResNet-style CNN with strong augmentation and label smoothing.

Result: Best ensemble (SMOTE-combined) achieved 74.8% test accuracy, CNN achieved 79.19% on full test set and 81.25% on unseen inference batch. Barge class remained a failure mode despite rebalancing.

Conclusion: Deep models have advantage but rebalancing alone has limits; need additional minority-class collection, cost-sensitive objectives like focal loss, and hybrid ensemble-CNN pipelines to combine interpretability with representational power.

Abstract: Accurate vehicle type recognition underpins intelligent transportation and logistics, but severe class imbalance in public datasets suppresses performance on rare categories. We curate a 16-class corpus (~47k images) by merging Kaggle, ImageNet, and web-crawled data, and create six balanced variants via SMOTE oversampling and targeted undersampling. Lightweight ensembles, such as Random Forest, AdaBoost, and a soft-voting combiner built on MobileNet-V2 features are benchmarked against a configurable ResNet-style CNN trained with strong augmentation and label smoothing. The best ensemble (SMOTE-combined) attains 74.8% test accuracy, while the CNN achieves 79.19% on the full test set and 81.25% on an unseen inference batch, confirming the advantage of deep models. Nonetheless, the most under-represented class (Barge) remains a failure mode, highlighting the limits of rebalancing alone. Results suggest prioritizing additional minority-class collection and cost-sensitive objectives (e.g., focal loss) and exploring hybrid ensemble or CNN pipelines to combine interpretability with representational power.

[647] Classifier-Centric Adaptive Framework for Open-Vocabulary Camouflaged Object Segmentation

Hanyu Zhang, Yiming Zhou, Jinxia Zhang

Main category: cs.CV

TL;DR: A classifier-centric adaptive framework with layered asymmetric initialization improves open-vocabulary camouflaged object segmentation by enhancing the classification component.

Details

Motivation: Open-vocabulary camouflaged object segmentation requires high generalization for unseen categories, and existing methods show that the classification component significantly impacts segmentation performance.

Method: Proposes a lightweight text adapter with novel layered asymmetric initialization to improve the classification component in a classifier-centric adaptive framework.

Result: Achieves substantial improvements: cIoU from 0.443 to 0.493, cSm from 0.579 to 0.658, and cMAE from 0.336 to 0.239 on OVCamo benchmark compared to OVCoser baseline.

Conclusion: Targeted classification enhancement effectively advances camouflaged object segmentation performance.

Abstract: Open-vocabulary camouflaged object segmentation requires models to segment camouflaged objects of arbitrary categories unseen during training, placing extremely high demands on generalization capabilities. Through analysis of existing methods, it is observed that the classification component significantly affects overall segmentation performance. Accordingly, a classifier-centric adaptive framework is proposed to enhance segmentation performance by improving the classification component via a lightweight text adapter with a novel layered asymmetric initialization. Through the classification enhancement, the proposed method achieves substantial improvements in segmentation metrics compared to the OVCoser baseline on the OVCamo benchmark: cIoU increases from 0.443 to 0.493, cSm from 0.579 to 0.658, and cMAE reduces from 0.336 to 0.239. These results demonstrate that targeted classification enhancement provides an effective approach for advancing camouflaged object segmentation performance.

[648] Traumatic Brain Injury Segmentation using an Ensemble of Encoder-decoder Models

Ghanshyam Dhamat, Vaanathi Sundaresan

Main category: cs.CV

TL;DR: Developed an automated segmentation pipeline for traumatic brain injury lesions using nnUNet framework with post-processing, achieving competitive results in the AIMS-TBI 2025 challenge.

Details

Motivation: Traumatic brain injury lesions are extremely heterogeneous in size, number, and laterality, complicating image registration and brain parcellation, which reduces analytical accuracy in neuroimaging.

Method: Leveraged various architectures within the nnUNet framework for initial segmentation, complemented by post-processing strategies to enhance evaluation metrics.

Result: Achieved accuracy of 0.8451, Dice scores of 0.4711 (images with visible lesions) and 0.8514 (images without visible lesions), with overall Dice score of 0.5973, ranking among top-6 methods in AIMS-TBI 2025 challenge.

Conclusion: The developed automated segmentation pipeline effectively addresses the challenge of TBI lesion segmentation and is publicly available for use.

Abstract: The identification and segmentation of moderate-severe traumatic brain injury (TBI) lesions pose a significant challenge in neuroimaging. This difficulty arises from the extreme heterogeneity of these lesions, which vary in size, number, and laterality, thereby complicating downstream image processing tasks such as image registration and brain parcellation, reducing the analytical accuracy. Thus, developing methods for highly accurate segmentation of TBI lesions is essential for reliable neuroimaging analysis. This study aims to develop an effective automated segmentation pipeline to automatically detect and segment TBI lesions in T1-weighted MRI scans. We evaluate multiple approaches to achieve accurate segmentation of the TBI lesions. The core of our pipeline leverages various architectures within the nnUNet framework for initial segmentation, complemented by post-processing strategies to enhance evaluation metrics. Our final submission to the challenge achieved an accuracy of 0.8451, Dice score values of 0.4711 and 0.8514 for images with and without visible lesions, respectively, with an overall Dice score of 0.5973, ranking among the top-6 methods in the AIMS-TBI 2025 challenge. The Python implementation of our pipeline is publicly available.

[649] OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang

Main category: cs.CV

TL;DR: OpenGPT-4o-Image is a large-scale dataset with 80k instruction-image pairs covering 11 domains and 51 subtasks, designed to address limitations in existing multimodal training data through systematic hierarchical taxonomy and automated generation.

Details

Motivation: Existing multimodal datasets lack systematic structure and challenging scenarios needed for real-world applications, limiting the performance of unified models for image generation and editing.

Method: Combines hierarchical task taxonomy with automated data generation using structured resource pools and GPT-4o, covering fundamental capabilities and challenging categories like scientific imagery and complex instruction editing.

Result: Fine-tuning leading models on this dataset achieved significant performance gains: up to 18% improvement on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval).

Conclusion: Systematic data construction is key to advancing multimodal AI capabilities, as demonstrated by the substantial performance improvements achieved through the OpenGPT-4o-Image dataset.

Abstract: The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.

[650] Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, Daochang Liu

Main category: cs.CV

TL;DR: A training-free framework that improves physical plausibility in video generation by using physics-aware reasoning and synchronized decoupled guidance to suppress implausible motions.

Details

Motivation: Existing diffusion models for video generation implicitly learn physics from large datasets, which is costly, difficult to scale, and still produces physically implausible motions that violate fundamental laws.

Method: Uses lightweight physics-aware reasoning to create counterfactual prompts encoding physics-violating behaviors, and proposes Synchronized Decoupled Guidance (SDG) with synchronized directional normalization and trajectory-decoupled denoising to suppress implausible content throughout denoising.

Result: Experiments show substantial enhancement of physical fidelity while maintaining photorealism across different physical domains, with no additional training required. Ablation studies confirm the effectiveness of both components.

Conclusion: Establishes a new plug-and-play physics-aware paradigm for video generation that explicitly reasons about physical plausibility at inference time.

Abstract: Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the aforementioned two designs of SDG are also individually validated to contribute critically to the suppression of implausible content and the overall gains in physical plausibility. This establishes a new and plug-and-play physics-aware paradigm for video generation.

[651] Segmentor-Guided Counterfactual Fine-Tuning for Image Synthesis

Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker

Main category: cs.CV

TL;DR: Seg-CFT is a method for generating counterfactual medical images that enables structure-specific interventions without requiring pixel-level label maps, producing locally coherent counterfactuals for applications like chest radiograph generation and coronary artery disease modeling.

Details

Motivation: Current counterfactual image generation approaches rely on external classifiers/regressors for subject-level interventions, which is insufficient for structure-specific interventions and can cause undesirable global effects. Previous methods required tedious pixel-level label maps.

Method: Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent counterfactuals without requiring hypothetical segmentations.

Result: The method demonstrates capability of generating realistic chest radiographs and shows promising results for modeling coronary artery disease, producing effective counterfactuals with local coherence.

Conclusion: Seg-CFT provides a practical approach for structure-specific counterfactual image generation that avoids the need for tedious pixel-level annotations while maintaining intervention simplicity and producing realistic medical images.

Abstract: Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient’s age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: https://github.com/biomedia-mira/seg-cft.

[652] IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?

Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi

Main category: cs.CV

TL;DR: IWR-Bench is a new benchmark for evaluating Large Vision-Language Models’ ability to reconstruct interactive webpages from videos, addressing limitations of static screenshot-to-code tasks by focusing on dynamic interactions.

Details

Motivation: Existing benchmarks focus on static screenshot-to-code tasks and overlook dynamic interactions that are fundamental to real-world web applications.

Method: Created IWR-Bench with 113 tasks from 100 real-world websites, featuring 1,001 actions and diverse interaction complexities. Includes user interaction videos and crawled static assets. Uses agent-as-a-judge framework with comprehensive metrics to assess functional correctness and visual fidelity.

Result: Extensive experiments on 28 LVLMs show the best model achieves only 36.35% overall score, with functional correctness (24.39%) lagging significantly behind visual fidelity (64.25%).

Conclusion: Current models have critical limitations in reasoning about temporal dynamics and synthesizing event-driven logic. IWR-Bench establishes a challenging frontier for vision-language research.

Abstract: The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models’ ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available. Code is available at https://github.com/L-O-I/IWR-Bench.

[653] Scalable GANs with Transformers

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

Main category: cs.CV

TL;DR: The paper introduces GAT, a scalable GAN architecture using transformer-based generators/discriminators trained in VAE latent space, achieving state-of-the-art FID of 2.96 on ImageNet-256 in just 40 epochs.

Details

Motivation: To investigate scalability principles for GANs, which have been underexplored compared to other generative models, by leveraging effective design choices from other generative approaches.

Method: Uses two key design choices: training in compact VAE latent space for efficiency, and adopting purely transformer-based generators/discriminators. Addresses scaling issues with lightweight intermediate supervision and width-aware learning-rate adjustment.

Result: GAT can be reliably trained across various capacities (S through XL). GAT-XL/2 achieves state-of-the-art FID of 2.96 on ImageNet-256 in only 40 epochs, 6x faster than strong baselines.

Conclusion: The proposed GAT architecture demonstrates that GANs can be effectively scaled using transformer-based designs and latent-space training, achieving competitive performance with significantly improved training efficiency.

Abstract: Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based and latent-space GANs, can be easily trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines.

[654] Evaluation of Polarimetric Fusion for Semantic Segmentation in Aquatic Environments

Luis F. W. Batista, Tom Bourbon, Cedric Pradalier

Main category: cs.CV

TL;DR: Polarimetric imaging improves floating debris segmentation by mitigating water-surface glare, but increases computational load and introduces new false positives.

Details

Motivation: Accurate segmentation of floating debris on water is compromised by surface glare and changing outdoor illumination, which polarimetric imaging can help mitigate.

Method: Benchmark state-of-the-art fusion networks on PoTATO dataset of polarimetric images, comparing with single-image baselines using traditional models.

Result: Polarimetric cues help recover low-contrast objects and suppress reflection-induced false positives, improving mean IoU and lowering contour error compared to RGB inputs.

Conclusion: Polarized cameras provide sharper masks but increase computational load and risk of new false positives; benchmark helps researchers decide if suitable for their applications.

Abstract: Accurate segmentation of floating debris on water is often compromised by surface glare and changing outdoor illumination. Polarimetric imaging offers a single-sensor route to mitigate water-surface glare that disrupts semantic segmentation of floating objects. We benchmark state-of-the-art fusion networks on PoTATO, a public dataset of polarimetric images of plastic bottles in inland waterways, and compare their performance with single-image baselines using traditional models. Our results indicate that polarimetric cues help recover low-contrast objects and suppress reflection-induced false positives, raising mean IoU and lowering contour error relative to RGB inputs. These sharper masks come at a cost: the additional channels enlarge the models increasing the computational load and introducing the risk of new false positives. By providing a reproducible, diagnostic benchmark and publicly available code, we hope to help researchers choose if polarized cameras are suitable for their applications and to accelerate related research.

[655] Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen

Main category: cs.CV

TL;DR: This paper introduces a novel Vietnamese-language multimodal medical dataset with 1.5M+ CT-PET images and corresponding clinical reports, addressing gaps in PET/CT imaging data and low-resource language representation in medical VLMs.

Details

Motivation: Address limitations in medical VLMs due to limited availability of diverse imaging modalities (especially PET/CT) and underrepresentation of low-resource languages like Vietnamese in medical vision-language research.

Method: Created a Vietnamese multimodal medical dataset with 1,567,062 CT-PET image pairs and 2,757 clinical reports, plus a training framework with data augmentation and expert-validated test sets. Benchmarked state-of-the-art VLMs on medical report generation and visual question answering tasks.

Result: Experimental results show that incorporating the dataset significantly improves performance of existing VLMs on downstream medical tasks.

Conclusion: The dataset and benchmark represent a pivotal step in advancing robust VLMs for medical imaging, particularly for low-resource languages, and improving clinical relevance in Vietnamese healthcare.

Abstract: Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and corresponding 2,757 full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs’ learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.

Xue-Feng Zhu, Tianyang Xu, Yifan Pan, Jinjie Gu, Xi Li, Jiwen Lu, Xiao-Jun Wu, Josef Kittler

Main category: cs.CV

TL;DR: This paper introduces a novel multi-modal tracking task using RGB, Depth, and Thermal Infrared modalities, proposes a new RGBDT500 dataset, and develops RDTTrack - a tri-modal tracker that integrates complementary information through prompt learning.

Details

Motivation: Existing multi-modal tracking approaches focus on dual-modal paradigms (RGB-Depth or RGB-Thermal) but struggle in complex scenarios due to limited input modalities. The authors aim to enhance robustness by leveraging three complementary modalities.

Method: Proposed RDTTrack integrates tri-modal information by fusing thermal infrared and depth under an orthogonal projection constraint, then integrates them with RGB signals as prompts for a pre-trained foundation tracking model using prompt learning techniques.

Result: Experimental results show significant improvements over existing dual-modal approaches in tracking accuracy and robustness in complex scenarios.

Conclusion: The proposed tri-modal tracking approach with RGB, Depth, and Thermal Infrared modalities effectively enhances tracking performance in complex scenarios, demonstrating the value of complementary multi-modal fusion.

Abstract: Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. In specific, RDTTrack fuses thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios.

[657] ExGS: Extreme 3D Gaussian Compression with Diffusion Priors

Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun

Main category: cs.CV

TL;DR: ExGS is a feed-forward framework for extreme 3D Gaussian Splatting compression that combines Universal Gaussian Compression (UGC) for aggressive pruning and GaussPainter with diffusion priors for quality restoration, achieving over 100× compression while preserving rendering quality.

Details

Motivation: Neural scene representations like 3DGS have high storage and transmission costs that hinder deployment in resource-constrained environments. Existing compression methods either require costly optimization or degrade quality under high compression ratios.

Method: ExGS unifies Universal Gaussian Compression (UGC) for re-optimization-free pruning and GaussPainter that leverages diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned scenes. It uses a lightweight VAE and one-step diffusion for real-time restoration.

Result: The framework achieves over 100× compression (reducing 354.77 MB to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions.

Conclusion: Diffusion priors play a central role in bridging the gap between extreme compression and high-quality neural rendering, enabling practical deployment of compressed 3DGS models.

Abstract: Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce \textbf{ExGS}, a novel feed-forward framework that unifies \textbf{Universal Gaussian Compression} (UGC) with \textbf{GaussPainter} for \textbf{Ex}treme 3D\textbf{GS} compression. \textbf{UGC} performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas \textbf{GaussPainter} leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over $100\times$ compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at \href{https://github.com/chenttt2001/ExGS}{here}.

[658] LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng

Main category: cs.CV

TL;DR: LOVE-R1 is a Large Video-Language Model that adaptively zooms in on video clips using a slow-fast sampling mechanism to balance temporal understanding and spatial detail perception in long video understanding.

Details

Motivation: Current LVLMs with uniform frame sampling sacrifice either temporal clues or spatial details, creating a conflict between long-form temporal understanding and detailed spatial perception in long videos.

Method: Proposes adaptive zoom-in mechanism: starts with densely sampled small-resolution frames, then zooms in on clips of interest with high resolution through multi-step reasoning. Uses 38k CoT data finetuning and decoupled reinforcement finetuning to optimize zoom-in ability.

Result: Achieves better trade-off between sampling density and frame resolutions, outperforming baseline Qwen2.5-VL by average 3.1% across 4 long video understanding benchmarks.

Conclusion: The slow-fast adaptive frame sampling mechanism effectively resolves the conflict between temporal understanding and spatial perception in long video understanding tasks.

Abstract: Long video understanding is still challenging for recent Large Video-Language Models (LVLMs) due to the conflict between long-form temporal understanding and detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames with an equal frame size and fixed sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames but in a small resolution. If some spatial details are needed, the model can zoom in on a clip of interest with a large frame resolution based on its reasoning until key visual information is obtained. The whole process is implemented as a multi-step reasoning process. To train the reasoning ability, we first finetune the model on our collected 38k high-quality CoT data and enhance it with decoupled reinforcement finetuning. As outcome rewards can not provide fine-grained process supervision, we decouple multi-step reasoning into multiple single-step reasoning and optimize the internal zoom-in ability explicitly. Experiments on long video understanding benchmarks show that our model with the slow-fast adaptive frame sampling mechanism achieves a great trade-off between sampling density and frame resolutions, and LOVE-R1 outperforms our baseline Qwen2.5-VL by an average of 3.1% points across 4 common long video understanding benchmarks.

[659] Vision Function Layer in Multimodal LLMs

Cheng Shi, Yizhou Yu, Sibei Yang

Main category: cs.CV

TL;DR: This study reveals that visual functions in MLLMs are concentrated in specific layers (Vision Function Layers), follows a consistent pattern across models, and enables more efficient training and data selection methods.

Details

Motivation: To understand how visual-related functions are distributed across decoder layers in Multimodal Large Language Models and leverage this understanding for more efficient model training and data selection.

Method: Developed Visual Token Swapping framework to analyze layer-specific functions, identified Vision Function Layers (VFLs), and created VFL-LoRA for targeted training and VFL-select for automated data classification.

Result: Found that visual functions narrow to 2-3 specific layers per function, with consistent patterns across MLLMs. VFL-LoRA outperforms full-LoRA and prevents function forgetting. VFL-select achieves 98% performance with only 20% of data.

Conclusion: The study provides deeper understanding of MLLM visual processing, enabling creation of more efficient, interpretable, and robust models through targeted layer training and automated data selection.

Abstract: This study identifies that visual-related functional decoding is distributed across different decoder layers in Multimodal Large Language Models (MLLMs). Typically, each function, such as counting, grounding, or OCR recognition, narrows down to two or three layers, which we define as Vision Function Layers (VFL). Additionally, the depth and its order of different VFLs exhibits a consistent pattern across different MLLMs, which is well-aligned with human behaviors (e.g., recognition occurs first, followed by counting, and then grounding). These findings are derived from Visual Token Swapping, our novel analytical framework that modifies targeted KV cache entries to precisely elucidate layer-specific functions during decoding. Furthermore, these insights offer substantial utility in tailoring MLLMs for real-world downstream applications. For instance, when LoRA training is selectively applied to VFLs whose functions align with the training data, VFL-LoRA not only outperform full-LoRA but also prevent out-of-domain function forgetting. Moreover, by analyzing the performance differential on training data when particular VFLs are ablated, VFL-select automatically classifies data by function, enabling highly efficient data selection to directly bolster corresponding capabilities. Consequently, VFL-select surpasses human experts in data selection, and achieves 98% of full-data performance with only 20% of the original dataset. This study delivers deeper comprehension of MLLM visual processing, fostering the creation of more efficient, interpretable, and robust models.

[660] CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation

Max Curie, Paulo da Costa

Main category: cs.CV

TL;DR: CLASP is a lightweight unsupervised image segmentation framework that uses self-supervised ViT features, spectral clustering with automatic segment count selection, and DenseCRF for boundary refinement, achieving competitive results without any training.

Details

Motivation: To create a training-free, easily reproducible baseline for unsupervised image segmentation that can handle large unannotated datasets commonly found in digital advertising and marketing workflows.

Method: Extracts patch features using self-supervised DINO ViT encoder, builds affinity matrix, applies spectral clustering with automatic segment count selection via eigengap silhouette search, and refines boundaries with fully connected DenseCRF.

Result: Achieves competitive mIoU and pixel accuracy on COCO Stuff and ADE20K datasets, matching recent unsupervised baselines despite requiring no training.

Conclusion: CLASP provides a strong, simple, and training-free baseline for unsupervised image segmentation that is particularly useful for applications in digital advertising, brand safety screening, creative asset curation, and social media content moderation.

Abstract: We introduce CLASP (Clustering via Adaptive Spectral Processing), a lightweight framework for unsupervised image segmentation that operates without any labeled data or finetuning. CLASP first extracts per patch features using a self supervised ViT encoder (DINO); then, it builds an affinity matrix and applies spectral clustering. To avoid manual tuning, we select the segment count automatically with a eigengap silhouette search, and we sharpen the boundaries with a fully connected DenseCRF. Despite its simplicity and training free nature, CLASP attains competitive mIoU and pixel accuracy on COCO Stuff and ADE20K, matching recent unsupervised baselines. The zero training design makes CLASP a strong, easily reproducible baseline for large unannotated corpora especially common in digital advertising and marketing workflows such as brand safety screening, creative asset curation, and social media content moderation

[661] TACO-Net: Topological Signatures Triumph in 3D Object Classification

Anirban Ghosh, Ayan Dutta

Main category: cs.CV

TL;DR: TACO-Net is a novel 3D object classification method that combines topological data analysis with image filtration techniques to achieve state-of-the-art accuracy on synthetic benchmarks and demonstrates strong robustness on real-world datasets and corrupted inputs.

Details

Motivation: 3D object classification is crucial for applications in computer vision, robotics, and autonomous driving, but remains challenging due to unordered point clouds, irregularity, and noise in real-world data.

Method: Transform point clouds into voxelized binary 3D images, extract topological features using topological data analysis and image filtration techniques, and train a lightweight 1D CNN on the extracted features.

Result: Achieved 99.05% accuracy on ModelNet40 and 99.52% on ModelNet10 benchmarks, and demonstrated strong performance on the real-world OmniObject3D dataset with robust performance on corrupted inputs.

Conclusion: TACO-Net sets a new state-of-the-art for 3D object classification by effectively combining topological features with deep learning, showing high accuracy and strong resilience to noise and corruption.

Abstract: 3D object classification is a crucial problem due to its significant practical relevance in many fields, including computer vision, robotics, and autonomous driving. Although deep learning methods applied to point clouds sampled on CAD models of the objects and/or captured by LiDAR or RGBD cameras have achieved remarkable success in recent years, achieving high classification accuracy remains a challenging problem due to the unordered point clouds and their irregularity and noise. To this end, we propose a novel state-of-the-art (SOTA) 3D object classification technique that combines topological data analysis with various image filtration techniques to classify objects when they are represented using point clouds. We transform every point cloud into a voxelized binary 3D image to extract distinguishing topological features. Next, we train a lightweight one-dimensional Convolutional Neural Network (1D CNN) using the extracted feature set from the training dataset. Our framework, TACO-Net, sets a new state-of-the-art by achieving $99.05%$ and $99.52%$ accuracy on the widely used synthetic benchmarks ModelNet40 and ModelNet10, and further demonstrates its robustness on the large-scale real-world OmniObject3D dataset. When tested with ten different kinds of corrupted ModelNet40 inputs, the proposed TACO-Net demonstrates strong resiliency overall.

[662] UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections

Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, Yuliang Xiu

Main category: cs.CV

TL;DR: UP2You is a tuning-free solution for reconstructing high-fidelity 3D clothed portraits from unconstrained 2D photos, using a data rectifier paradigm and pose-correlated feature aggregation to efficiently process raw images into clean multi-view inputs for 3D reconstruction.

Details

Motivation: Previous approaches require clean inputs with minimal occlusions or well-calibrated cross-view captures, which limits practical applications. UP2You addresses the need to process raw, unstructured photographs that vary significantly in pose, viewpoint, cropping, and occlusion.

Method: Introduces a data rectifier paradigm that converts unconstrained inputs into clean orthogonal multi-view images in seconds. Uses pose-correlated feature aggregation (PCFA) to fuse information from multiple reference images relative to target poses, and a perceiver-based multi-reference shape predictor that eliminates need for pre-captured body templates.

Result: Surpasses previous methods in geometric accuracy (Chamfer-15%, P2S-18% on PuzzleIOI) and texture fidelity (PSNR-21%, LPIPS-46% on 4D-Dress). Efficient (1.5 minutes per person) and versatile, supporting arbitrary pose control and training-free multi-garment 3D virtual try-on.

Conclusion: UP2You provides a practical solution for real-world scenarios where humans are casually captured, making high-fidelity 3D reconstruction from unconstrained photos feasible and efficient.

Abstract: We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require “clean” inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs, which may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module (PCFA), that selectively fuses information from multiple reference images w.r.t. target poses, enabling better identity preservation and nearly constant memory footprint, with more observations. We also introduce a perceiver-based multi-reference shape predictor, removing the need for pre-captured body templates. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer-15%, P2S-18% on PuzzleIOI) and texture fidelity (PSNR-21%, LPIPS-46% on 4D-Dress). UP2You is efficient (1.5 minutes per person), and versatile (supports arbitrary pose control, and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task. Project Page: https://zcai0612.github.io/UP2You

[663] Fast Real-Time Pipeline for Robust Arm Gesture Recognition

Milán Zsolt Bagladi, László Gulyás, Gergő Szalay

Main category: cs.CV

TL;DR: Real-time dynamic arm gesture recognition pipeline using OpenPose keypoint estimation, normalization, and RNN classifier with robustness to camera angle variations.

Details

Motivation: To develop a robust real-time system for dynamic arm gesture recognition that can handle varying viewing angles and speeds, particularly for applications like traffic control.

Method: Uses OpenPose for keypoint estimation, implements 1x1 normalization and two feature representations (coordinate- and angle-based), employs RNN classifier, and enhances robustness through artificially rotated training data.

Result: Achieves high accuracy across varying viewing angles and speeds on a custom traffic-control gesture dataset, with additional capability to calculate arm signal speed.

Conclusion: The proposed pipeline provides effective real-time dynamic arm gesture recognition with robustness to camera angle variations and varying speeds.

Abstract: This paper presents a real-time pipeline for dynamic arm gesture recognition based on OpenPose keypoint estimation, keypoint normalization, and a recurrent neural network classifier. The 1 x 1 normalization scheme and two feature representations (coordinate- and angle-based) are presented for the pipeline. In addition, an efficient method to improve robustness against camera angle variations is also introduced by using artificially rotated training data. Experiments on a custom traffic-control gesture dataset demonstrate high accuracy across varying viewing angles and speeds. Finally, an approach to calculate the speed of the arm signal (if necessary) is also presented.

[664] Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong

Main category: cs.CV

TL;DR: A training-free token pruning framework for VLMs that uses zeroth-order perturbations to identify sensitive tokens with complementary visual cues, achieving up to 94.4% token reduction while maintaining accuracy.

Details

Motivation: Existing token pruning methods have limitations: attention-based approaches are unstable and lead to redundant selections, while diversity-based methods risk dropping important regions needed for accurate predictions.

Method: Proposes a framework that estimates token sensitivity using zeroth-order perturbations at the projection layer, measuring how small random perturbations affect projection outputs to identify influential tokens without backpropagation.

Result: Outperforms prior methods across multiple VLMs and benchmarks, pruning up to 94.4% of tokens while maintaining accuracy, achieving up to 2.30x faster end-to-end inference.

Conclusion: The proposed training-free framework effectively identifies and prunes redundant visual tokens in VLMs while preserving accuracy, significantly improving inference efficiency through lightweight sensitivity estimation.

Abstract: Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space but risk dropping regions needed for accurate prediction. We propose \ours, a training-free framework built on a simple intuition: tokens with higher sensitivity are more likely to influence the model’s output, and they should also capture complementary visual cues rather than overlapping information. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the projection layer, a shallow and computationally light component of the model. This approach measures how small random perturbations affect the projection outputs, allowing us to approximate each token’s influence through lightweight forward passes without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that \ours consistently outperforms prior methods, pruning up to 94.4% of tokens while maintaining accuracy and significantly improving efficiency, achieving up to 2.30x faster end-to-end inference over the baseline.

[665] PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement

Bo Zhao, Dan Guo, Junzhe Cao, Yong Xu, Tao Tan, Yue Sun, Bochao Zou, Jie Zhang, Zitong Yu

Main category: cs.CV

TL;DR: PHASE-Net is a physics-informed rPPG model that uses a Temporal Convolutional Network derived from hemodynamic equations, featuring axial swapping, adaptive spatial filtering, and gated temporal modeling for robust non-contact heart rate monitoring.

Details

Motivation: Existing deep learning methods for remote photoplethysmography (rPPG) lack theoretical grounding and suffer from accuracy degradation under head motion and illumination changes, limiting robustness and interpretability.

Method: Derived from Navier-Stokes hemodynamic equations, the method shows pulse signals follow a second-order dynamical system leading to causal convolution. PHASE-Net implements: (1) Zero-FLOPs Axial Swapper for cross-region feature interaction, (2) Adaptive Spatial Filter for noise suppression, and (3) Gated TCN for long-range temporal modeling.

Result: Extensive experiments demonstrate state-of-the-art performance with strong efficiency, offering a theoretically grounded and deployment-ready rPPG solution.

Conclusion: PHASE-Net provides a physics-informed paradigm for rPPG that combines theoretical justification with practical efficiency, addressing robustness issues in non-contact physiological monitoring.

Abstract: Remote photoplethysmography (rPPG) measurement enables non-contact physiological monitoring but suffers from accuracy degradation under head motion and illumination changes. Existing deep learning methods are mostly heuristic and lack theoretical grounding, which limits robustness and interpretability. In this work, we propose a physics-informed rPPG paradigm derived from the Navier-Stokes equations of hemodynamics, showing that the pulse signal follows a second-order dynamical system whose discrete solution naturally leads to a causal convolution. This provides a theoretical justification for using a Temporal Convolutional Network (TCN). Based on this principle, we design PHASE-Net, a lightweight model with three key components: (1) Zero-FLOPs Axial Swapper module, which swaps or transposes a few spatial channels to mix distant facial regions and enhance cross-region feature interaction without breaking temporal order; (2) Adaptive Spatial Filter, which learns a soft spatial mask per frame to highlight signal-rich areas and suppress noise; and (3) Gated TCN, a causal dilated TCN with gating that models long-range temporal dynamics for accurate pulse recovery. Extensive experiments demonstrate that PHASE-Net achieves state-of-the-art performance with strong efficiency, offering a theoretically grounded and deployment-ready rPPG solution.

[666] ELPG-DTFS: Prior-Guided Adaptive Time-Frequency Graph Neural Network for EEG Depression Diagnosis

Jingru Qiu, Jiale Liang, Xuanhan Fan, Mingda Zhang, Zhenli He

Main category: cs.CV

TL;DR: ELPG-DTFS is a prior-guided adaptive time-frequency graph neural network for EEG-based major depressive disorder diagnosis that achieves 97.63% accuracy on MODMA dataset.

Details

Motivation: Current MDD diagnosis relies on subjective scales, while existing EEG deep learning models treat spectra as static images, fix inter-channel graphs, and ignore neuroscience prior knowledge, limiting accuracy and interpretability.

Method: Proposes ELPG-DTFS with: (1) channel-band attention with cross-band mutual information, (2) learnable adjacency matrix for dynamic functional links, and (3) residual knowledge-graph pathway injecting neuroscience priors.

Result: Achieves 97.63% accuracy and 97.33% F1 score on 128-channel MODMA dataset (53 subjects), surpassing state-of-the-art ACM-GNN. Ablation shows removing any module lowers F1 by up to 4.35.

Conclusion: ELPG-DTFS offers a robust and interpretable framework for next-generation EEG-based MDD diagnostics, with complementary value from all proposed modules.

Abstract: Timely and objective screening of major depressive disorder (MDD) is vital, yet diagnosis still relies on subjective scales. Electroencephalography (EEG) provides a low-cost biomarker, but existing deep models treat spectra as static images, fix inter-channel graphs, and ignore prior knowledge, limiting accuracy and interpretability. We propose ELPG-DTFS, a prior-guided adaptive time-frequency graph neural network that introduces: (1) channel-band attention with cross-band mutual information, (2) a learnable adjacency matrix for dynamic functional links, and (3) a residual knowledge-graph pathway injecting neuroscience priors. On the 128-channel MODMA dataset (53 subjects), ELPG-DTFS achieves 97.63% accuracy and 97.33% F1, surpassing the 2025 state-of-the-art ACM-GNN. Ablation shows that removing any module lowers F1 by up to 4.35, confirming their complementary value. ELPG-DTFS thus offers a robust and interpretable framework for next-generation EEG-based MDD diagnostics.

[667] BRIDGE – Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation

Dingning Liu, Haoyu Guo, Jingyi Zhou, Tong He

Main category: cs.CV

TL;DR: BRIDGE is an RL-optimized depth-to-image generation framework that synthesizes 20M+ realistic RGB images with ground truth depth pairs, enabling superior monocular depth estimation through hybrid supervision training.

Details

Motivation: Traditional monocular depth estimation methods are limited by data scarcity and quality issues, which hinder their robustness and performance.

Method: Propose BRIDGE framework that uses RL-optimized depth-to-image generation to create 20M+ realistic RGB-depth pairs from diverse source depth maps, then trains depth estimation model with hybrid supervision combining teacher pseudo-labels and ground truth depth.

Result: BRIDGE achieves breakthroughs in scale and domain diversity, consistently outperforming state-of-the-art approaches quantitatively and in complex scene detail capture.

Conclusion: The innovative data generation and training paradigm fosters general and robust depth features, advancing monocular depth estimation capabilities.

Abstract: Monocular Depth Estimation (MDE) is a foundational task for computer vision. Traditional methods are limited by data scarcity and quality, hindering their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images, each intrinsically paired with its ground truth depth, from diverse source depth maps. Then we train our depth estimation model on this dataset, employing a hybrid supervision strategy that integrates teacher pseudo-labels with ground truth depth for comprehensive and robust training. This innovative data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches quantitatively and in complex scene detail capture, thereby fostering general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.

[668] Vision At Night: Exploring Biologically Inspired Preprocessing For Improved Robustness Via Color And Contrast Transformations

Lorena Stracke, Lia Nimmermann, Shashank Agnihotri, Margret Keuper, Volker Blanz

Main category: cs.CV

TL;DR: Biologically inspired preprocessing using Difference-of-Gaussians filtering improves semantic segmentation robustness to adverse conditions without changing model architecture.

Details

Motivation: Inspired by human visual system's contrast enhancement and color-opponency mechanisms to improve robustness in semantic segmentation.

Method: Apply Difference-of-Gaussians filtering to RGB, grayscale, and opponent-color channels for contrast enhancement as input preprocessing.

Result: Maintains in-distribution performance while improving robustness to adverse conditions (night, fog, snow) on Cityscapes, ACDC, and Dark Zurich datasets.

Conclusion: Model-agnostic, lightweight preprocessing enables robust inputs for downstream vision models in safety-critical environments.

Abstract: Inspired by the human visual system’s mechanisms for contrast enhancement and color-opponency, we explore biologically motivated input preprocessing for robust semantic segmentation. By applying Difference-of-Gaussians (DoG) filtering to RGB, grayscale, and opponent-color channels, we enhance local contrast without modifying model architecture or training. Evaluations on Cityscapes, ACDC, and Dark Zurich show that such preprocessing maintains in-distribution performance while improving robustness to adverse conditions like night, fog, and snow. As this processing is model-agnostic and lightweight, it holds potential for integration into imaging pipelines, enabling imaging systems to deliver task-ready, robust inputs for downstream vision models in safety-critical environments.

[669] UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation

Guanjun Wu, Jiemin Fang, Chen Yang, Sikuang Li, Taoran Yi, Jia Lu, Zanwei Zhou, Jiazhong Cen, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Xinggang Wang, Qi Tian

Main category: cs.CV

TL;DR: UniLat3D is a unified framework that generates 3D assets in a single stage by encoding geometry and appearance in a single latent space, eliminating the need for separate geometry and texture generation stages.

Details

Motivation: Existing 3D generation methods use decoupled pipelines that first generate geometry then synthesize appearance, leading to geometry-texture misalignment and high computational costs.

Method: Proposes a geometry-appearance Unified VAE that compresses high-resolution sparse features into a compact latent representation (UniLat), then trains a single flow-matching model to generate UniLat directly from Gaussian noise.

Result: UniLat3D produces high-quality 3D assets in seconds from a single image, achieving superior appearance fidelity and geometric quality compared to existing methods.

Conclusion: The unified approach enables efficient single-stage 3D generation with better alignment between geometry and appearance, significantly reducing computational costs while maintaining high quality.

Abstract: High-fidelity 3D asset generation is crucial for various industries. While recent 3D pretrained models show strong capability in producing realistic content, most are built upon diffusion models and follow a two-stage pipeline that first generates geometry and then synthesizes appearance. Such a decoupled design tends to produce geometry-texture misalignment and non-negligible cost. In this paper, we propose UniLat3D, a unified framework that encodes geometry and appearance in a single latent space, enabling direct single-stage generation. Our key contribution is a geometry-appearance Unified VAE, which compresses high-resolution sparse features into a compact latent representation – UniLat. UniLat integrates structural and visual information into a dense low-resolution latent, which can be efficiently decoded into diverse 3D formats, e.g., 3D Gaussians and meshes. Based on this unified representation, we train a single flow-matching model to map Gaussian noise directly into UniLat, eliminating redundant stages. Trained solely on public datasets, UniLat3D produces high-quality 3D assets in seconds from a single image, achieving superior appearance fidelity and geometric quality. More demos & code are available at https://unilat3d.github.io/

[670] StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, Limin Wang

Main category: cs.CV

TL;DR: StreamForest is a novel architecture for streaming video understanding that uses Persistent Event Memory Forest for efficient long-term memory and Fine-grained Spatiotemporal Window for real-time perception, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: Current MLLMs have limitations in real-time streaming scenarios due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning.

Method: Proposes Persistent Event Memory Forest that organizes video frames into event-level tree structures using penalty functions, and Fine-grained Spatiotemporal Window for detailed short-term visual cues. Also introduces OnlineIT instruction-tuning dataset and ODV-Bench evaluation benchmark.

Result: Achieves 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. Maintains 96.8% accuracy under extreme visual token compression (1024 tokens) compared to default settings.

Conclusion: StreamForest demonstrates robustness, efficiency, and generalizability for streaming video understanding, particularly in autonomous driving scenarios.

Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves the state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy in eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.

[671] Score Distillation of Flow Matching Models

Mingyuan Zhou, Yi Gu, Huangjie Zheng, Liangchen Song, Guande He, Yizhe Zhang, Wenze Hu, Yinfei Yang

Main category: cs.CV

TL;DR: Score identity Distillation (SiD) extends to text-to-image flow-matching models, enabling efficient few-step generation without architectural changes or teacher finetuning.

Details

Motivation: Diffusion models produce high-quality images but suffer from slow iterative sampling. While distillation methods help, it was unclear if they transfer to flow-matching models, which are theoretically equivalent to diffusion under Gaussian assumptions.

Method: Extends Score identity Distillation (SiD) to pretrained text-to-image flow-matching models using a simple derivation based on Bayes’ rule and conditional expectations, with modest flow-matching- and DiT-specific adjustments.

Result: SiD works out of the box across various models (SANA, SD3-Medium, SD3.5-Medium/Large, FLUX.1-dev) in both data-free and data-aided settings, providing systematic evidence that score distillation applies broadly to flow matching models.

Conclusion: Score distillation techniques unify acceleration across diffusion- and flow-based generators, resolving prior concerns about stability and soundness in flow matching models.

Abstract: Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation – based on Bayes’ rule and conditional expectations – that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. We will make the PyTorch implementation publicly available.

[672] Environment-Aware Satellite Image Generation with Diffusion Models

Nikos Kostagiolas, Pantelis Georgiades, Yannis Panagakis, Mihalis A. Nicolaou

Main category: cs.CV

TL;DR: A novel diffusion model for satellite image generation that conditions on environmental context using three control signals: text, metadata, and visual data, addressing limitations of existing methods.

Details

Motivation: Existing diffusion models for remote sensing have limitations including limited environmental context, poor handling of missing/corrupted data, and unreliable reflection of user intentions in generated outputs.

Method: Proposes a diffusion model conditioned on environmental context using three control signals (text, metadata, visual data) with metadata fusion strategy to handle partial/corrupt observations and dynamic environmental conditions.

Result: Outperforms previous methods both qualitatively (robustness to missing metadata, higher responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, quality measured by 6 metrics) in single-image and temporal generation trials.

Conclusion: Conditioning on environmental context improves foundation model performance for satellite imagery, making the model promising for downstream tasks; also created first publicly-available 3-modal dataset combining these data mediums.

Abstract: Diffusion-based foundation models have recently garnered much attention in the field of generative modeling due to their ability to generate images of high quality and fidelity. Although not straightforward, their recent application to the field of remote sensing signaled the first successful trials towards harnessing the large volume of publicly available datasets containing multimodal information. Despite their success, existing methods face considerable limitations: they rely on limited environmental context, struggle with missing or corrupted data, and often fail to reliably reflect user intentions in generated outputs. In this work, we propose a novel diffusion model conditioned on environmental context, that is able to generate satellite images by conditioning from any combination of three different control signals: a) text, b) metadata, and c) visual data. In contrast to previous works, the proposed method is i) to our knowledge, the first of its kind to condition satellite image generation on dynamic environmental conditions as part of its control signals, and ii) incorporating a metadata fusion strategy that models attribute embedding interactions to account for partially corrupt and/or missing observations. Our method outperforms previous methods both qualitatively (robustness to missing metadata, higher responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, and quality of generations measured using 6 different metrics) in the trials of single-image and temporal generation. The reported results support our hypothesis that conditioning on environmental context can improve the performance of foundation models for satellite imagery, and render our model a promising candidate for usage in downstream tasks. The collected 3-modal dataset is to our knowledge, the first publicly-available dataset to combine data from these three different mediums.

[673] Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events

Richeek Das, Kostas Daniilidis, Pratik Chaudhari

Main category: cs.CV

TL;DR: Fast Feature Field (F³) is a novel representation for event-based camera data that learns by predicting future events from past events, enabling efficient downstream tasks like optical flow, semantic segmentation, and depth estimation.

Details

Motivation: To develop efficient representations for event-based camera data that preserve scene structure and motion information while being robust to noise and variations in event rates.

Method: F³ uses multi-resolution hash encoding and deep sets to efficiently compute representations from sparse event data, representing events within a contiguous spatiotemporal volume as multi-channel images.

Result: Achieves 120 Hz at HD and 440 Hz at VGA resolutions for representation computation, and 25-75 Hz for downstream task prediction. State-of-the-art performance on optical flow, semantic segmentation, and depth estimation across various robotic platforms and conditions.

Conclusion: F³ provides an efficient, robust representation for event-based vision that enables high-performance downstream applications across diverse real-world scenarios.

Abstract: This paper develops a mathematical argument and algorithms for building representations of data from event-based cameras, that we call Fast Feature Field ($\text{F}^3$). We learn this representation by predicting future events from past events and show that it preserves scene structure and motion information. $\text{F}^3$ exploits the sparsity of event data and is robust to noise and variations in event rates. It can be computed efficiently using ideas from multi-resolution hash encoding and deep sets - achieving 120 Hz at HD and 440 Hz at VGA resolutions. $\text{F}^3$ represents events within a contiguous spatiotemporal volume as a multi-channel image, enabling a range of downstream tasks. We obtain state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation, on data from three robotic platforms (a car, a quadruped robot and a flying platform), across different lighting conditions (daytime, nighttime), environments (indoors, outdoors, urban, as well as off-road) and dynamic vision sensors (resolutions and event rates). Our implementations can predict these tasks at 25-75 Hz at HD resolution.

[674] ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation

Jiuhong Xiao, Roshan Nayak, Ning Zhang, Daniel Tortei, Giuseppe Loianno

Main category: cs.CV

TL;DR: ThermalGen is an adaptive flow-based generative model for RGB-to-thermal image translation that addresses the scarcity of synchronized RGB-thermal data through synthetic thermal image generation from abundant RGB datasets.

Details

Motivation: The scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle for visual-thermal sensor fusion and cross-modality tasks, necessitating RGB-to-thermal image translation solutions.

Method: Proposed ThermalGen with RGB image conditioning architecture and style-disentangled mechanism, trained on eight public RGB-T datasets plus three new large-scale satellite-aerial datasets captured across diverse conditions.

Result: Extensive evaluations show ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods across multiple RGB-T benchmarks.

Conclusion: ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions.

Abstract: Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets–DJI-day, Bosonplus-day, and Bosonplus-night–captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions. Project page: http://xjh19971.github.io/ThermalGen

[675] VAGUEGAN: Stealthy Poisoning and Backdoor Attacks on Image Generative Pipelines

Mostafa Mohaimen Akand Faisal, Rabeya Amin Jhuma

Main category: cs.CV

TL;DR: VagueGAN is an attack pipeline that uses stealthy perturbations to cause targeted changes in generative model outputs, showing that poisoning can actually improve visual quality rather than reduce fidelity.

Details

Motivation: To explore adversarial attacks on generative pipelines where small perturbations lead to controlled output changes, which is less studied compared to attacks on discriminative models.

Method: Combines a modular perturbation network (PoisonerNet) with a Generator-Discriminator pair to craft stealthy triggers, evaluated using custom metrics and perceptual/frequency analysis. Also tested transferability to diffusion models via ControlNet.

Result: Poisoned outputs can display higher visual quality than clean counterparts, and latent space poisoning can retain or enhance aesthetics. Optimized perturbations produce consistent, stealthy effects while remaining visually inconspicuous.

Conclusion: The method exposes a blind spot in pixel-level defenses and raises concerns for the integrity of image generation pipelines, as carefully crafted perturbations can manipulate outputs without reducing quality.

Abstract: Generative models such as GANs and diffusion models are widely used to synthesize photorealistic images and to support downstream creative and editing tasks. While adversarial attacks on discriminative models are well studied, attacks targeting generative pipelines where small, stealthy perturbations in inputs lead to controlled changes in outputs are less explored. This study introduces VagueGAN, an attack pipeline combining a modular perturbation network PoisonerNet with a Generator Discriminator pair to craft stealthy triggers that cause targeted changes in generated images. Attack efficacy is evaluated using a custom proxy metric, while stealth is analyzed through perceptual and frequency domain measures. The transferability of the method to a modern diffusion based pipeline is further examined through ControlNet guided editing. Interestingly, the experiments show that poisoned outputs can display higher visual quality compared to clean counterparts, challenging the assumption that poisoning necessarily reduces fidelity. Unlike conventional pixel level perturbations, latent space poisoning in GANs and diffusion pipelines can retain or even enhance output aesthetics, exposing a blind spot in pixel level defenses. Moreover, carefully optimized perturbations can produce consistent, stealthy effects on generator outputs while remaining visually inconspicuous, raising concerns for the integrity of image generation pipelines.

[676] DWGS: Enhancing Sparse-View Gaussian Splatting with Hybrid-Loss Depth Estimation and Bidirectional Warping

Yu Ma, Guoliang Wei, Yue Cheng

Main category: cs.CV

TL;DR: DWGS enhances 3D Gaussian Splatting for sparse-view synthesis by integrating structural cues, virtual view constraints, and occluded region completion to address overfitting and geometric distortion issues.

Details

Motivation: Novel View Synthesis from sparse views suffers from overfitting, geometric distortion, and incomplete scene recovery due to limited multi-view constraints. 3D Gaussian Splatting has real-time rendering but shows floating artifacts and structural inconsistencies in sparse-input settings.

Method: Proposes DWGS with three main components: Hybrid-Loss Depth Estimation using dense matching priors with reprojection, point propagation, and smoothness constraints; Bidirectional Warping Virtual View Synthesis to generate virtual training views; and Occlusion-Aware Reconstruction using depth-difference mask and learning-based inpainting.

Result: Achieves state-of-the-art performance on LLFF, Blender, and DTU benchmarks with up to 21.13 dB PSNR and 0.189 LPIPS, while maintaining real-time inference capabilities.

Conclusion: DWGS effectively addresses the limitations of 3D Gaussian Splatting in sparse-view settings through integrated structural constraints, virtual view generation, and occlusion handling, achieving superior performance in novel view synthesis.

Abstract: Novel View Synthesis (NVS) from sparse views remains a core challenge in 3D reconstruction, typically suffering from overfitting, geometric distortion, and incomplete scene recovery due to limited multi-view constraints. Although 3D Gaussian Splatting (3DGS) enables real-time, high-fidelity rendering, it suffers from floating artifacts and structural inconsistencies under sparse-input settings. To address these issues, we propose DWGS, a novel unified framework that enhances 3DGS for sparse-view synthesis by integrating robust structural cues, virtual view constraints, and occluded region completion. Our approach introduces three principal contributions: a Hybrid-Loss Depth Estimation module that leverages dense matching priors with reprojection, point propagation, and smoothness constraints to enforce multi-view consistency; a Bidirectional Warping Virtual View Synthesis method generates virtual training views to impose stronger geometric and photometric constraints; and an Occlusion-Aware Reconstruction component that utilizes depth-difference mask and a learning-based inpainting model to recover obscured regions. Extensive experiments on standard benchmarks (LLFF, Blender, and DTU) show that DWGS achieves a new state-of-the-art, achieving up to 21.13 dB PSNR and 0.189 LPIPS, while retaining real-time inference capabilities.

[677] DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation

Xi Chen, Hongxun Yao, Zhaopan Xu, Kui Jiang

Main category: cs.CV

TL;DR: DAM is a novel SFADA framework that integrates multimodal supervision from Vision-and-Language models with sparse human annotations through bidirectional distillation, achieving state-of-the-art performance.

Details

Motivation: Existing SFADA methods treat ViL-based and data supervision as separate sources without effective fusion, limiting their potential for knowledge transfer from source models to unlabeled target domains.

Method: DAM initializes stable ViL-guided targets and employs bidirectional distillation to enable mutual knowledge exchange between the target model and dual supervisions (ViL model + human annotations) during iterative adaptation.

Result: Extensive experiments show DAM consistently outperforms existing methods and sets new state-of-the-art across multiple SFADA benchmarks and active learning strategies.

Conclusion: The proposed DAM framework effectively integrates multimodal supervision with human annotations through bidirectional distillation, demonstrating superior performance in source-free active domain adaptation tasks.

Abstract: Source-free active domain adaptation (SFADA) enhances knowledge transfer from a source model to an unlabeled target domain using limited manual labels selected via active learning. While recent domain adaptation studies have introduced Vision-and-Language (ViL) models to improve pseudo-label quality or feature alignment, they often treat ViL-based and data supervision as separate sources, lacking effective fusion. To overcome this limitation, we propose Dual Active learning with Multimodal (DAM) foundation model, a novel framework that integrates multimodal supervision from a ViL model to complement sparse human annotations, thereby forming a dual supervisory signal. DAM initializes stable ViL-guided targets and employs a bidirectional distillation mechanism to foster mutual knowledge exchange between the target model and the dual supervisions during iterative adaptation. Extensive experiments demonstrate that DAM consistently outperforms existing methods and sets a new state-of-the-art across multiple SFADA benchmarks and active learning strategies.

[678] GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh, Arshia Soltani Moakhar, Basim Azam, Soheil Feizi, Naveed Akhtar

Main category: cs.CV

TL;DR: GHOST is an automatic method that generates natural-looking images to induce object hallucinations in MLLMs, achieving high success rates and uncovering transferable vulnerabilities across models.

Details

Motivation: Current benchmarks for studying object hallucination in MLLMs use static scenarios, limiting the discovery of model-specific or unexpected vulnerabilities.

Method: GHOST optimizes in image embedding space to create misleading cues while keeping target objects absent, then uses a diffusion model to generate natural-looking images that cause hallucinations.

Result: Achieved 28% hallucination success rate (vs 1% in prior methods), generated high-quality object-free images, and uncovered transferable vulnerabilities (66.5% transfer rate from Qwen2.5-VL to GPT-4o).

Conclusion: GHOST serves as both a diagnostic tool for uncovering hallucinations and a corrective tool for improving model reliability through fine-tuning.

Abstract: Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which preempts the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. Also, GHOST uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.

[679] Accurate Cobb Angle Estimation via SVD-Based Curve Detection and Vertebral Wedging Quantification

Chang Shi, Nan Meng, Yipeng Zhuang, Moxin Zhao, Jason Pui Yin Cheung, Hua Huang, Xiuyuan Chen, Cong Nie, Wenting Zhong, Guiqiang Jiang, Yuxin Wei, Jacob Hong Man Yu, Si Chen, Xiaowen Ou, Teng Zhang

Main category: cs.CV

TL;DR: A deep learning framework for automated assessment of adolescent idiopathic scoliosis (AIS) that predicts vertebral endplate angles and introduces a novel Vertebral Wedging Index (VWI) for improved diagnosis and progression monitoring.

Details

Motivation: Traditional manual Cobb angle measurements for AIS assessment suffer from significant observer variability, and existing automated methods use simplified spinal models that fail to address clinical complexity.

Method: Combines HRNet backbone with Swin-Transformer modules and biomechanically informed constraints for feature extraction. Uses Singular Value Decomposition (SVD) to analyze angle predictions directly from vertebral morphology without predefined curve assumptions.

Result: Achieved 83.45% diagnostic accuracy and 2.55° mean absolute error on 630 full-spine radiographs. Demonstrated exceptional generalization on out-of-distribution cases. VWI showed significant prognostic correlation with curve progression while traditional Cobb angles showed no correlation.

Conclusion: The framework provides robust support for early AIS detection, personalized treatment planning, and progression monitoring through the novel VWI metric that better captures vertebral deformation compared to traditional Cobb angles.

Abstract: Adolescent idiopathic scoliosis (AIS) is a common spinal deformity affecting approximately 2.2% of boys and 4.8% of girls worldwide. The Cobb angle serves as the gold standard for AIS severity assessment, yet traditional manual measurements suffer from significant observer variability, compromising diagnostic accuracy. Despite prior automation attempts, existing methods use simplified spinal models and predetermined curve patterns that fail to address clinical complexity. We present a novel deep learning framework for AIS assessment that simultaneously predicts both superior and inferior endplate angles with corresponding midpoint coordinates for each vertebra, preserving the anatomical reality of vertebral wedging in progressive AIS. Our approach combines an HRNet backbone with Swin-Transformer modules and biomechanically informed constraints for enhanced feature extraction. We employ Singular Value Decomposition (SVD) to analyze angle predictions directly from vertebral morphology, enabling flexible detection of diverse scoliosis patterns without predefined curve assumptions. Using 630 full-spine anteroposterior radiographs from patients aged 10-18 years with rigorous dual-rater annotation, our method achieved 83.45% diagnostic accuracy and 2.55{\deg} mean absolute error. The framework demonstrates exceptional generalization capability on out-of-distribution cases. Additionally, we introduce the Vertebral Wedging Index (VWI), a novel metric quantifying vertebral deformation. Longitudinal analysis revealed VWI’s significant prognostic correlation with curve progression while traditional Cobb angles showed no correlation, providing robust support for early AIS detection, personalized treatment planning, and progression monitoring.

[680] DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space

Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, Junsong Chen, Enze Xie, Song Han, Han Cai

Main category: cs.CV

TL;DR: DC-Gen accelerates text-to-image diffusion models by using a deeply compressed latent space, achieving 53x speedup for 4K image generation while maintaining quality comparable to base models.

Details

Motivation: Existing text-to-image diffusion models face efficiency challenges at high resolutions like 4K, and previous research seldom addresses the inherent redundancy within the latent space.

Method: DC-Gen uses an efficient post-training pipeline with lightweight embedding alignment training to bridge the representation gap between base model’s latent space and deeply compressed latent space, followed by minimal LoRA fine-tuning.

Result: DC-Gen-SANA and DC-Gen-FLUX achieve quality comparable to base models with significant speedup: 53x latency reduction for 4K generation on H100 GPU, and 138x total latency reduction when combined with NVFP4 SVDQuant on 5090 GPU.

Conclusion: DC-Gen provides an effective framework for accelerating text-to-image diffusion models through latent space compression while preserving generation quality, enabling practical high-resolution image generation.

Abstract: Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model’s latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model’s inherent generation quality. We verify DC-Gen’s effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: https://github.com/dc-ai-projects/DC-Gen.

[681] Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian

Main category: cs.CV

TL;DR: Attention Surgery is an efficient framework that linearizes or hybridizes attention in pretrained video diffusion models without full retraining, reducing attention cost by up to 40% while maintaining generation quality.

Details

Motivation: Transformer-based video diffusion models suffer from quadratic computational cost of self-attention, making long sequences and high resolutions expensive. Linear attention offers sub-quadratic complexity but prior methods fail to match softmax attention expressiveness without costly retraining.

Method: Combines hybrid attention mechanism (mixing softmax and linear tokens) with lightweight distillation and fine-tuning pipeline, plus cost-aware block-rate strategy to balance expressiveness and efficiency across layers.

Result: Applied to Wan2.1 1.3B VDM, achieves first competitive sub-quadratic attention video diffusion models, reducing attention FLOPs by up to 40% while maintaining quality on VBench and VBench-2.0 benchmarks.

Conclusion: Attention Surgery enables efficient video diffusion models with sub-quadratic attention complexity while preserving generation quality, requiring only a few GPU-days of training.

Abstract: Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce \textit{Attention Surgery}, an efficient framework for \textit{linearizing} or \textit{hybridizing} attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism-mixing softmax and linear tokens-with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic attention video diffusion models, reducing attention cost by up to 40% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.

[682] DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai

Main category: cs.CV

TL;DR: DC-VideoGen is a post-training acceleration framework that improves video generation efficiency by compressing pre-trained video diffusion models into a deep compression latent space with lightweight fine-tuning.

Details

Motivation: To address the computational inefficiency and high latency of existing video diffusion models, enabling faster inference and higher-resolution video generation on limited hardware.

Method: Uses a Deep Compression Video Autoencoder with chunk-causal temporal design for 32x/64x spatial and 4x temporal compression, combined with AE-Adapt-V adaptation strategy for stable transfer to compressed latent space.

Result: Achieves up to 14.8x lower inference latency without quality loss, enables 2160x3840 video generation on single GPU, and requires only 10 GPU days for adaptation of Wan-2.1-14B model.

Conclusion: DC-VideoGen provides an efficient post-training acceleration solution that significantly improves video generation performance while maintaining quality, making high-resolution video generation more accessible.

Abstract: We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.

Songze Li, Zun Wang, Gengze Zhou, Jialu Li, Xiangyu Zeng, Limin Wang, Yu Qiao, Qi Wu, Mohit Bansal, Yi Wang

Main category: cs.CV

TL;DR: SID is a goal-oriented language-guided navigation approach that uses self-improving demonstrations to enhance agent exploration capabilities, achieving state-of-the-art performance on navigation tasks.

Details

Motivation: Existing navigation methods rely too heavily on shortest-path trajectories and lack effective exploration priors, limiting their ability to navigate to goals in unknown environments without step-by-step instructions.

Method: SID first trains an initial agent on shortest-path data, then uses this agent to generate novel exploration trajectories. These trajectories provide better demonstrations for training an improved agent, creating an iterative self-improving pipeline.

Result: SID significantly boosts exploration capabilities and generalization, achieving 50.9% success rate on SOON unseen validation splits - 13.9% higher than prior leading approaches, and sets new state-of-the-art on REVERIE and SOON tasks.

Conclusion: The iterative self-improving demonstration pipeline effectively scales to new environments and transfers across language-guided navigation tasks, elevating performance in diverse goal-oriented navigation scenarios.

Abstract: Goal-oriented language-guided navigation requires robust exploration capabilities for agents to navigate to specified goals in unknown environments without step-by-step instructions. Existing methods tend to exclusively utilize shortest-path trajectories, lacking effective exploration priors for training navigation agents. To address the above challenges, we present SID, a goal-oriented language-guided navigation learning approach with Self-Improving Demonstrations. Specifically, SID learns an initial agent on the shortest-path data sampled from environments and then leverages this agent to generate novel exploration trajectories. The novel rollouts provide demonstrations with stronger exploration strategies to train a better agent, which in turn produces higher-quality agent demonstrations for the next round of training. We show that this iterative self-improving pipeline readily scales to new environments, and the resulting demonstrations can be transferred across a variety of language-guided navigation tasks, elevating the performance ceiling in diverse goal-oriented navigation tasks. Extensive experiments demonstrate that SID significantly boosts the exploration capabilities and generalization of navigation agents. The resulting agent achieves new state-of-the-art performance on goal-oriented language-guided navigation tasks, including REVERIE, SOON, notably achieving a 50.9% success rate on the unseen validation splits of SOON, surpassing the prior leading approaches by a margin of 13.9%.

[684] Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents

Jiahua Li, Kun Wei, Zhe Xu, Zibo Su, Xu Yang, Cheng Deng

Main category: cs.CV

TL;DR: CogniGPT is a framework that uses an interactive loop between Multi-Granular Perception Agent and Verification-Enhanced Reflection Agent for efficient and reliable long video understanding, achieving state-of-the-art performance with minimal frame usage.

Details

Motivation: Long videos present challenges due to temporal complexity and sparse task-relevant information. Existing LLM-based approaches struggle to balance completeness and efficiency in capturing task-critical information from long videos.

Method: The framework mimics human progressive visual cognition through an interactive loop between two agents: MGPA (Multi-Granular Perception Agent) that captures task-related information using divergent and focused attention, and VERA (Verification-Enhanced Reflection Agent) that verifies perceived clues to reduce hallucination and optimize perception strategies.

Result: Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat datasets show CogniGPT’s superiority in both accuracy and efficiency. On EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.

Conclusion: CogniGPT effectively addresses long video understanding challenges by exploring minimal sets of informative and reliable task-related clues through interactive perception and verification, demonstrating both high accuracy and computational efficiency.

Abstract: Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although various Large Language Model (LLM)-based approaches have advanced long video understanding, they still struggle to achieve both completeness and efficiency in capturing task-critical information. Inspired by human progressive visual cognition, we propose CogniGPT, a framework that leverages an interactive loop between Multi-Granular Perception Agent (MGPA) and Verification-Enhanced Reflection Agent (VERA) for efficient and reliable long video understanding. Specifically, MGPA mimics human visual divergent and focused attention to capture task-related information, while VERA verifies perceived key clues to mitigate hallucination and optimize subsequent perception strategies. Through this interactive process, CogniGPT explores a minimal set of informative and reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat datasets demonstrate CogniGPT’s superiority in both accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.

[685] Evaluating Temperature Scaling Calibration Effectiveness for CNNs under Varying Noise Levels in Brain Tumour Detection

Ankur Chanda, Kushan Choudhury, Shubhrodeep Roy, Shubhajit Biswas, Somenath Kuiry

Main category: cs.CV

TL;DR: Temperature Scaling significantly improves calibration of CNN-based brain tumor classifiers under various noise conditions without affecting accuracy.

Details

Motivation: Need for reliable confidence estimation in medical AI systems to prevent overconfident misclassifications that could have serious consequences in clinical settings.

Method: Developed custom CNN trained on merged brain MRI dataset, introduced five types of image noise (Gaussian, Poisson, Salt & Pepper, Speckle, Uniform), applied Temperature Scaling as post-hoc calibration technique.

Result: TS significantly reduced Expected Calibration Error (ECE) and Negative Log-Likelihood (NLL) under all noise conditions without degrading classification accuracy.

Conclusion: Temperature Scaling is an effective and computationally efficient approach to enhance decision confidence of medical AI systems, making model outputs more reliable in noisy or uncertain settings.

Abstract: Precise confidence estimation in deep learning is vital for high-stakes fields like medical imaging, where overconfident misclassifications can have serious consequences. This work evaluates the effectiveness of Temperature Scaling (TS), a post-hoc calibration technique, in improving the reliability of convolutional neural networks (CNNs) for brain tumor classification. We develop a custom CNN and train it on a merged brain MRI dataset. To simulate real-world uncertainty, five types of image noise are introduced: Gaussian, Poisson, Salt & Pepper, Speckle, and Uniform. Model performance is evaluated using precision, recall, F1-score, accuracy, negative log-likelihood (NLL), and expected calibration error (ECE), both before and after calibration. Results demonstrate that TS significantly reduces ECE and NLL under all noise conditions without degrading classification accuracy. This underscores TS as an effective and computationally efficient approach to enhance decision confidence of medical AI systems, hence making model outputs more reliable in noisy or uncertain settings.

Ermanno Bartoli, Dennis Rotondi, Buwei He, Patric Jensfelt, Kai O. Arras, Iolanda Leite

Main category: cs.CV

TL;DR: The paper introduces Social 3D Scene Graphs, an enhanced 3D scene representation that captures humans, their attributes, activities, and relationships with the environment using open-vocabulary framework, along with a new benchmark for evaluating social scene understanding.

Details

Motivation: Existing 3D Scene Graph approaches largely ignore humans in scenes and lack human-environment relationship annotations. They also only capture relations from single image frames, limiting their ability to model long-range interactions.

Method: Proposed Social 3D Scene Graphs that augment traditional 3D Scene Graphs to include humans, their attributes, activities, and both local and remote relationships with the environment using open-vocabulary framework. Also created a new benchmark with synthetic environments containing comprehensive human-scene relationship annotations.

Result: The experiments show that the proposed representation improves human activity prediction and reasoning about human-environment relations.

Conclusion: Social 3D Scene Graphs pave the way toward socially intelligent robots by enabling better understanding of human interactions with their surroundings.

Abstract: Understanding how people interact with their surroundings and each other is essential for enabling robots to act in socially compliant and context-aware ways. While 3D Scene Graphs have emerged as a powerful semantic representation for scene understanding, existing approaches largely ignore humans in the scene, also due to the lack of annotated human-environment relationships. Moreover, existing methods typically capture only open-vocabulary relations from single image frames, which limits their ability to model long-range interactions beyond the observed content. We introduce Social 3D Scene Graphs, an augmented 3D Scene Graph representation that captures humans, their attributes, activities and relationships in the environment, both local and remote, using an open-vocabulary framework. Furthermore, we introduce a new benchmark consisting of synthetic environments with comprehensive human-scene relationship annotations and diverse types of queries for evaluating social scene understanding in 3D. The experiments demonstrate that our representation improves human activity prediction and reasoning about human-environment relations, paving the way toward socially intelligent robots.

Donghwa Kang, Junho Kim, Dongwoo Kang

Main category: cs.CV

TL;DR: A novel framework using cross-modal fusion attention and self-supervised multi-event representation learning for event-based facial keypoint alignment, addressing limitations of event data through RGB guidance and unsupervised learning.

Details

Motivation: Event cameras have advantages for facial keypoint alignment in challenging conditions, but existing RGB methods perform poorly on event data, and training solely on event data leads to suboptimal performance due to limited spatial information and lack of labeled datasets.

Method: Proposes cross-modal fusion attention (CMFA) to integrate RGB data for guiding robust facial feature extraction from event inputs, and self-supervised multi-event representation learning (SSMER) for effective feature learning from unlabeled event data.

Result: Extensive experiments on real-event E-SIE dataset and synthetic-event WFLW-V benchmark show the approach consistently surpasses state-of-the-art methods across multiple evaluation metrics.

Conclusion: The proposed framework effectively addresses the limitations of event data for facial keypoint alignment through cross-modal guidance and self-supervised learning, achieving superior performance compared to existing methods.

Abstract: Event cameras offer unique advantages for facial keypoint alignment under challenging conditions, such as low light and rapid motion, due to their high temporal resolution and robustness to varying illumination. However, existing RGB facial keypoint alignment methods do not perform well on event data, and training solely on event data often leads to suboptimal performance because of its limited spatial information. Moreover, the lack of comprehensive labeled event datasets further hinders progress in this area. To address these issues, we propose a novel framework based on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER) for event-based facial keypoint alignment. Our framework employs CMFA to integrate corresponding RGB data, guiding the model to extract robust facial features from event input images. In parallel, SSMER enables effective feature learning from unlabeled event data, overcoming spatial limitations. Extensive experiments on our real-event E-SIE dataset and a synthetic-event version of the public WFLW-V benchmark show that our approach consistently surpasses state-of-the-art methods across multiple evaluation metrics.

[688] On-the-Fly Data Augmentation for Brain Tumor Segmentation

Ishika Jain, Siri Willems, Steven Latre, Tom De Schepper

Main category: cs.CV

TL;DR: Proposed on-the-fly augmentation using pretrained GliGANs to insert synthetic tumors during training, achieving top performance in BraTS 2025 Task 1 for glioma segmentation across treatment timelines.

Details

Motivation: Robust glioma segmentation across pre- and post-treatment scans is crucial for consistent tumor monitoring and treatment planning, but limited annotated data and computational costs of storing augmented 3D data pose challenges.

Method: Developed on-the-fly augmentation strategy that dynamically inserts synthetic tumors using pretrained generative adversarial networks (GliGANs) during training, evaluated three nnU-Net-based models and their ensembles with different augmentation approaches.

Result: Ensemble model achieved lesion-wise Dice scores of 0.79 (ET), 0.749 (NETC), 0.872 (RC), 0.825 (SNFH), 0.79 (TC), and 0.88 (WT) on BraTS 2025 validation platform, ranking first in BraTS Lighthouse Challenge 2025 Task 1.

Conclusion: The proposed on-the-fly augmentation with pretrained GliGANs effectively addresses data limitations and computational costs, enabling robust glioma segmentation across treatment timelines and achieving state-of-the-art performance.

Abstract: Robust segmentation across both pre-treatment and post-treatment glioma scans can be helpful for consistent tumor monitoring and treatment planning. BraTS 2025 Task 1 addresses this by challenging models to generalize across varying tumor appearances throughout the treatment timeline. However, training such generalized models requires access to diverse, high-quality annotated data, which is often limited. While data augmentation can alleviate this, storing large volumes of augmented 3D data is computationally expensive. To address these challenges, we propose an on-the-fly augmentation strategy that dynamically inserts synthetic tumors using pretrained generative adversarial networks (GliGANs) during training. We evaluate three nnU-Net-based models and their ensembles: (1) a baseline without external augmentation, (2) a regular on-the-fly augmented model, and (3) a model with customized on-the-fly augmentation. Built upon the nnU-Net framework, our pipeline leverages pretrained GliGAN weights and tumor insertion methods from prior challenge-winning solutions. An ensemble of the three models achieves lesion-wise Dice scores of 0.79 (ET), 0.749 (NETC), 0.872 (RC), 0.825 (SNFH), 0.79 (TC), and 0.88 (WT) on the online BraTS 2025 validation platform. This work ranked first in the BraTS Lighthouse Challenge 2025 Task 1- Adult Glioma Segmentation.

[689] Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

Haotian Dong, Wenjing Wang, Chen Li, Di Lin

Main category: cs.CV

TL;DR: Wan-Alpha is a new framework for generating transparent RGBA videos with superior visual quality, motion realism, and transparency rendering using a joint RGB-alpha learning approach.

Details

Motivation: Existing RGBA video generation methods often neglect visual quality, limiting their practical usability in applications requiring transparent video content.

Method: Proposes a framework that generates transparent videos by learning both RGB and alpha channels jointly, using an effective VAE that encodes alpha channel into RGB latent space and training a diffusion transformer on a high-quality RGBA video dataset.

Result: Demonstrates superior performance compared to state-of-the-art methods in visual quality, motion realism, and transparency rendering. Can generate semi-transparent objects, glowing effects, and fine-grained details like hair strands.

Conclusion: Wan-Alpha provides an effective solution for high-quality RGBA video generation with practical applications, and the model is publicly available.

Abstract: RGBA video generation, which includes an alpha channel to represent transparency, is gaining increasing attention across a wide range of applications. However, existing methods often neglect visual quality, limiting their practical usability. In this paper, we propose \textit{Wan-Alpha}, a new framework that generates transparent videos by learning both RGB and alpha channels jointly. We design an effective variational autoencoder (VAE) that encodes the alpha channel into the RGB latent space. Then, to support the training of our diffusion transformer, we construct a high-quality and diverse RGBA video dataset. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering. Notably, our model can generate a wide variety of semi-transparent objects, glowing effects, and fine-grained details such as hair strands. The released model is available on our website: \href{https://donghaotian123.github.io/Wan-Alpha/}{https://donghaotian123.github.io/Wan-Alpha/}.

[690] SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

Shuang Liang, Jing He, Chuanmeizhi Wang, Lejun Liao, Guo Zhang, Yingcong Chen, Yuan Yuan

Main category: cs.CV

TL;DR: SDPose is a fine-tuning framework that adapts Stable Diffusion for human pose estimation by predicting keypoint heatmaps in the U-Net’s latent space, using a lightweight pose head and auxiliary RGB reconstruction to preserve generative priors and enhance robustness.

Details

Motivation: To explore the potential of diffusion priors for structured outputs like human pose estimation, which remains underexplored despite diffusion models' strong cross-domain generalization capabilities.

Method: Directly predict keypoint heatmaps in SD U-Net’s latent space, use lightweight convolutional pose head to map features, and incorporate auxiliary RGB reconstruction branch to prevent overfitting and enhance robustness.

Result: Achieves parity with Sapiens-1B/2B on COCO validation set with only 1/5 training schedule, sets new SOTA on cross-domain benchmarks HumanArt and COCO-OOD, and works as zero-shot pose annotator for controllable generation tasks.

Conclusion: SDPose effectively exploits pre-trained diffusion priors for human pose estimation, demonstrating strong performance and cross-domain robustness while enabling downstream applications in controllable generation.

Abstract: Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold~\citep{ke2024repurposing} and Lotus~\citep{he2024lotus} adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs (e.g., human pose estimation) remains underexplored. In this paper, we propose \textbf{SDPose}, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net’s image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct \textbf{COCO-OOD}, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Furthermore, we showcase SDPose as a zero-shot pose annotator for downstream controllable generation tasks, including ControlNet-based image synthesis and video generation, where it delivers qualitatively superior pose guidance.

[691] PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion

Yuyang Yin, HaoXiang Guo, Fangfu Liu, Mengyu Wang, Hanwen Liang, Eric Li, Yikai Wang, Xiaojie Jin, Yao Zhao, Yunchao Wei

Main category: cs.CV

TL;DR: PanoWorld-X is a framework for high-fidelity, controllable panoramic video generation with diverse camera trajectories, addressing limitations of narrow field-of-view and insufficient camera controllability in prior works.

Details

Motivation: To enable complete and explorable 360-degree visual worlds for downstream applications, overcoming constraints of narrow field-of-view limitations and insufficient camera controllability that restrict free exploration.

Method: Constructed large-scale dataset of panoramic video-exploration route pairs using Unreal Engine simulation, and introduced Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto spherical surface to model geometric adjacency in latent space.

Result: Extensive experiments demonstrate superior performance in motion range, control precision, and visual quality compared to prior methods.

Conclusion: PanoWorld-X shows strong potential for real-world applications by achieving high-fidelity and controllable panoramic video generation with enhanced visual fidelity and spatiotemporal continuity.

Abstract: Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.

[692] LVT: Large-Scale Scene Reconstruction via Local View Transformers

Tooba Imtiaz, Lucy Chai, Kathryn Heal, Xuan Luo, Jungyeon Park, Jennifer Dy, John Flynn

Main category: cs.CV

TL;DR: LVT is a transformer-based architecture for large-scale 3D scene reconstruction and novel view synthesis that avoids quadratic attention complexity by processing only local view neighborhoods with geometric-aware positional encoding.

Details

Motivation: Standard transformers have quadratic complexity that limits scalability to large scenes. Spatially nearby views provide more useful local scene information than distant views.

Method: Processes information in local neighborhoods around each view using geometric-aware positional encoding based on relative transformations between query and nearby views. Outputs are decoded into 3D Gaussian Splat representation with color and opacity view-dependence.

Result: Enables reconstruction of arbitrarily large, high-resolution scenes in a single forward pass.

Conclusion: LVT successfully addresses scalability limitations of standard transformers for 3D vision tasks while maintaining reconstruction quality for large scenes.

Abstract: Large transformer models are proving to be a powerful tool for 3D vision and novel view synthesis. However, the standard Transformer’s well-known quadratic complexity makes it difficult to scale these methods to large scenes. To address this challenge, we propose the Local View Transformer (LVT), a large-scale scene reconstruction and novel view synthesis architecture that circumvents the need for the quadratic attention operation. Motivated by the insight that spatially nearby views provide more useful signal about the local scene composition than distant views, our model processes all information in a local neighborhood around each view. To attend to tokens in nearby views, we leverage a novel positional encoding that conditions on the relative geometric transformation between the query and nearby views. We decode the output of our model into a 3D Gaussian Splat scene representation that includes both color and opacity view-dependence. Taken together, the Local View Transformer enables reconstruction of arbitrarily large, high-resolution scenes in a single forward pass. See our project page for results and interactive demos https://toobaimt.github.io/lvt/.

[693] GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning

Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan

Main category: cs.CV

TL;DR: A novel post-training framework that incorporates task-aware rewards to adapt reinforcement learning models for Earth Observation tasks, improving reasoning capabilities and robustness.

Details

Motivation: Reinforcement learning has shown strong reasoning in natural images but remains unexplored for Earth Observation tasks, which present unique challenges like object detection, captioning, change detection, and temporal analysis requiring task-aware reasoning.

Method: Proposed a post-training framework with task-aware rewards to adapt reasoning-based RL models to diverse EO tasks, enhancing reasoning for remote sensing images while stabilizing optimization.

Result: Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision language models.

Conclusion: The framework successfully enhances reasoning capabilities for remote sensing images and improves robustness, with code and models to be publicly released.

Abstract: Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task aware reasoning. We propose a novel post training framework that incorporates task aware rewards to enable effective adaptation of reasoning based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state of the art generic and specialized vision language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/ .

[694] STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, Feng Zhao

Main category: cs.CV

TL;DR: STAGE is a stable reinforcement learning framework for autoregressive image generation that addresses gradient conflicts and entropy instability in GRPO algorithms through advantage/KL reweighting and entropy rewards.

Details

Motivation: Existing GRPO algorithms struggle with autoregressive image models due to training instability that disrupts pretrained model capabilities, leading to marginal gains, degraded quality, and poor generalization.

Method: STAGE introduces two key solutions: 1) Advantage/KL reweighting with similarity-aware reweighting to alleviate conflicting gradient updates, and 2) Entropy reward based on reference model to stabilize learning dynamics.

Result: Experiments show STAGE consistently improves visual quality, training stability, and cross-task generalization compared to baseline GRPO across multiple benchmarks.

Conclusion: STAGE effectively reduces disruption of pretrained distributions and mitigates reward hacking, leading to better generalization and transfer performance in autoregressive image generation.

Abstract: Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) Advantage/KL reweighting. Similarity-aware reweighting to alleviate conflicting updates; and 2) Entropy reward. An entropy-based reward corresponding to reference model to stabilize learning. With the help of alleviating conflicts between tokens and an entropy reward for stabilizing training, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfer better to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.

[695] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin

Main category: cs.CV

TL;DR: VT-FSL is a novel few-shot learning framework that bridges vision and text using LLMs, achieving state-of-the-art performance across diverse benchmarks through cross-modal prompting and geometric alignment.

Details

Motivation: Existing FSL methods suffer from hallucinating semantics that contradict visual evidence due to lack of grounding in actual instances, resulting in noisy guidance and costly corrections.

Method: Proposes Cross-modal Iterative Prompting (CIP) that conditions LLMs on class names and support images to generate precise class descriptions, and Cross-modal Geometric Alignment (CGA) that aligns textual, support, and synthetic visual representations by minimizing kernelized volume of 3D parallelotope.

Result: Establishes new state-of-the-art performance across ten diverse benchmarks including standard, cross-domain, and fine-grained few-shot learning scenarios.

Conclusion: VT-FSL effectively addresses semantic hallucination in FSL by leveraging LLMs for precise cross-modal prompting and geometric alignment, providing comprehensive multimodal integration for improved few-shot learning.

Abstract: Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.

[696] A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

Rohit Jena, Vedant Zope, Pratik Chaudhari, James C. Gee

Main category: cs.CV

TL;DR: FFDP is a distributed framework with IO-aware non-GEMM fused kernels for large-scale image registration, enabling unprecedented problem sizes and significant performance improvements.

Details

Motivation: Image registration algorithms have not scaled with image acquisition capabilities, creating bottlenecks in biomedical and life sciences applications.

Method: Uses IO-aware non-GEMM fused kernels, distributed framework with convolution-aware tensor sharding, and complements existing model parallelism techniques.

Result: Achieved 570x larger registration problems than standard clinical data in about a minute using 8 GPUs, with 6-7x speedup and 20-59% memory reduction compared to SOTA.

Conclusion: FFDP enables unprecedented scale in image registration, fitting 64x larger problems on single GPU while providing substantial performance and efficiency gains over existing methods.

Abstract: In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100 micron ex-vivo human brain MRI volume at native resolution - an inverse problem more than 570x larger than a standard clinical datum in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by upto 6 - 7x while reducing peak memory consumption by 20 - 59%. Comparative analysis on a 250 micron dataset shows that FFDP can fit upto 64x larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.

[697] GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction

Huaizhi Qu, Xiao Wang, Gengwei Zhang, Jie Peng, Tianlong Chen

Main category: cs.CV

TL;DR: GEM introduces a cryo-EM reconstruction framework using 3D Gaussian Splatting that achieves faster training, lower memory usage, and improved resolution compared to existing methods.

Details

Motivation: Traditional cryo-EM reconstruction methods are computationally expensive and memory intensive for large datasets, while neural radiance field approaches improve accuracy but have cubic memory/computation overhead.

Method: GEM represents proteins with compact 3D Gaussians (11 parameters each) and uses a novel gradient computation design to reduce memory footprint and training cost.

Result: GEM achieves up to 48% faster training, 12% lower memory usage, and 38.8% improved local resolution compared to state-of-the-art methods on standard benchmarks.

Conclusion: GEM establishes a practical and scalable paradigm for cryo-EM reconstruction that unifies speed, efficiency, and high-resolution accuracy.

Abstract: Cryo-electron microscopy (cryo-EM) has become a central tool for high-resolution structural biology, yet the massive scale of datasets (often exceeding 100k particle images) renders 3D reconstruction both computationally expensive and memory intensive. Traditional Fourier-space methods are efficient but lose fidelity due to repeated transforms, while recent real-space approaches based on neural radiance fields (NeRFs) improve accuracy but incur cubic memory and computation overhead. Therefore, we introduce GEM, a novel cryo-EM reconstruction framework built on 3D Gaussian Splatting (3DGS) that operates directly in real-space while maintaining high efficiency. Instead of modeling the entire density volume, GEM represents proteins with compact 3D Gaussians, each parameterized by only 11 values. To further improve the training efficiency, we designed a novel gradient computation to 3D Gaussians that contribute to each voxel. This design substantially reduced both memory footprint and training cost. On standard cryo-EM benchmarks, GEM achieves up to 48% faster training and 12% lower memory usage compared to state-of-the-art methods, while improving local resolution by as much as 38.8%. These results establish GEM as a practical and scalable paradigm for cryo-EM reconstruction, unifying speed, efficiency, and high-resolution accuracy. Our code is available at https://github.com/UNITES-Lab/GEM.

[698] MANI-Pure: Magnitude-Adaptive Noise Injection for Adversarial Purification

Xiaoyi Huang, Junwei Wu, Kejia Zhang, Carl Yang, Zhiming Luo

Main category: cs.CV

TL;DR: MANI-Pure is a magnitude-adaptive adversarial purification framework that uses frequency-targeted noise injection instead of uniform noise, achieving state-of-the-art robust accuracy while preserving clean accuracy.

Details

Motivation: Existing adversarial purification methods use uniform noise injection that corrupts semantic structures, while adversarial perturbations are actually concentrated in high-frequency regions with heterogeneous patterns.

Method: MANI-Pure leverages the magnitude spectrum of inputs to guide purification, adaptively applying heterogeneous frequency-targeted noise that suppresses adversarial perturbations in fragile high-frequency bands while preserving low-frequency content.

Result: On CIFAR-10 and ImageNet-1K, MANI-Pure narrows clean accuracy gap to within 0.59 of original classifier, boosts robust accuracy by 2.15, and achieves top-1 robust accuracy on RobustBench leaderboard.

Conclusion: The proposed magnitude-adaptive purification framework effectively addresses the limitations of uniform noise injection and demonstrates superior performance in adversarial defense.

Abstract: Adversarial purification with diffusion models has emerged as a promising defense strategy, but existing methods typically rely on uniform noise injection, which indiscriminately perturbs all frequencies, corrupting semantic structures and undermining robustness. Our empirical study reveals that adversarial perturbations are not uniformly distributed: they are predominantly concentrated in high-frequency regions, with heterogeneous magnitude intensity patterns that vary across frequencies and attack types. Motivated by this observation, we introduce MANI-Pure, a magnitude-adaptive purification framework that leverages the magnitude spectrum of inputs to guide the purification process. Instead of injecting homogeneous noise, MANI-Pure adaptively applies heterogeneous, frequency-targeted noise, effectively suppressing adversarial perturbations in fragile high-frequency, low-magnitude bands while preserving semantically critical low-frequency content. Extensive experiments on CIFAR-10 and ImageNet-1K validate the effectiveness of MANI-Pure. It narrows the clean accuracy gap to within 0.59 of the original classifier, while boosting robust accuracy by 2.15, and achieves the top-1 robust accuracy on the RobustBench leaderboard, surpassing the previous state-of-the-art method.

[699] Triangle Splatting+: Differentiable Rendering with Opaque Triangles

Jan Held, Renaud Vandeghen, Sanghyun Son, Daniel Rebain, Matheus Gadelha, Yi Zhou, Ming C. Lin, Marc Van Droogenbroeck, Andrea Tagliasacchi

Main category: cs.CV

TL;DR: Triangle Splatting+ directly optimizes triangles within a differentiable splatting framework to create mesh-compatible representations for real-time graphics applications, achieving state-of-the-art performance in mesh-based novel view synthesis.

Details

Motivation: Existing 3D Gaussian Splatting methods are incompatible with mesh-based pipelines used in VR and graphics applications. Conversion solutions increase complexity and degrade quality, so a direct triangle-based approach is needed.

Method: Directly optimizes triangles (fundamental computer graphics primitive) in a differentiable splatting framework with triangle parametrization for connectivity through shared vertices and a training strategy that enforces opaque triangles.

Result: Achieves state-of-the-art performance on Mip-NeRF360 and Tanks & Temples datasets, surpassing prior splatting approaches in visual fidelity while remaining efficient and fast to train. Output is immediately usable in standard graphics engines.

Conclusion: Triangle Splatting+ provides mesh-compatible representations that support downstream applications like physics-based simulation and interactive walkthroughs, bridging the gap between neural rendering and traditional graphics pipelines.

Abstract: Reconstructing 3D scenes and synthesizing novel views has seen rapid progress in recent years. Neural Radiance Fields demonstrated that continuous volumetric radiance fields can achieve high-quality image synthesis, but their long training and rendering times limit practicality. 3D Gaussian Splatting (3DGS) addressed these issues by representing scenes with millions of Gaussians, enabling real-time rendering and fast optimization. However, Gaussian primitives are not natively compatible with the mesh-based pipelines used in VR headsets, and real-time graphics applications. Existing solutions attempt to convert Gaussians into meshes through post-processing or two-stage pipelines, which increases complexity and degrades visual quality. In this work, we introduce Triangle Splatting+, which directly optimizes triangles, the fundamental primitive of computer graphics, within a differentiable splatting framework. We formulate triangle parametrization to enable connectivity through shared vertices, and we design a training strategy that enforces opaque triangles. The final output is immediately usable in standard graphics engines without post-processing. Experiments on the Mip-NeRF360 and Tanks & Temples datasets show that Triangle Splatting+achieves state-of-the-art performance in mesh-based novel view synthesis. Our method surpasses prior splatting approaches in visual fidelity while remaining efficient and fast to training. Moreover, the resulting semi-connected meshes support downstream applications such as physics-based simulation or interactive walkthroughs. The project page is https://trianglesplatting2.github.io/trianglesplatting2/.

[700] VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning

Zhaozhi Wang, Tong Zhang, Mingyue Guo, Yaowei Wang, Qixiang Ye

Main category: cs.CV

TL;DR: VideoAnchor is a plug-and-play module that improves visual-spatial reasoning in MLLMs by leveraging subspace affinities to reinforce visual cues across frames without retraining.

Details

Motivation: MLLMs have limitations in visual-spatial reasoning because visual tokens are overshadowed by language tokens in attention mechanisms, preventing consistent recognition of visual cues across frames.

Method: Draws connection between self-expressiveness in sparse subspace clustering and attention in Transformers, using subspace affinities to anchor attention to shared visual structures across frames.

Result: Achieves 3.2% and 4.6% improvements on VSI-Bench and Video-MME spatial-related tasks with InternVL2-8B and Qwen2.5VL-72B models, with qualitative improvements in subspace partitions and visual grounding.

Conclusion: VideoAnchor effectively addresses visual-spatial reasoning limitations in MLLMs through subspace-based attention anchoring, providing consistent performance gains across benchmarks and backbone models.

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language alignment, yet they remain limited in visual-spatial reasoning. We first identify that this limitation arises from the attention mechanism: visual tokens are overshadowed by language tokens, preventing the model from consistently recognizing the same visual cues across frames. To address this challenge, we draw a novel connection between the self-expressiveness property in sparse subspace clustering and the attention mechanism in Transformers. Building on this insight, we propose VideoAnchor, a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining, effectively anchoring attention to shared visual structures. Extensive experiments across benchmarks and backbone models show consistent performance gains – $e.g.$, 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks) with InternVL2-8B and Qwen2.5VL-72B – while qualitative analyses demonstrate more coherent subspace partitions and stronger visual grounding. Our codes will be made public available at https://github.com/feufhd/VideoAnchor.

[701] Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu

Main category: cs.CV

TL;DR: Rolling Forcing is a novel video generation technique that enables streaming long videos with minimal error accumulation through joint denoising, attention sink mechanism, and efficient training algorithm.

Details

Motivation: Existing streaming video generation methods suffer from severe error accumulation that degrades video quality over long horizons, limiting their practical use in interactive world models and neural game engines.

Method: Three key designs: 1) Joint denoising scheme that simultaneously denoises multiple frames with progressive noise levels to suppress error growth; 2) Attention sink mechanism that maintains key value states of initial frames as global context anchor; 3) Efficient training algorithm for few-step distillation over extended denoising windows.

Result: Enables real-time streaming generation of multi-minute videos on a single GPU with substantially reduced error accumulation compared to existing methods.

Conclusion: Rolling Forcing effectively addresses the error accumulation problem in streaming video generation, making it suitable for practical applications requiring long, coherent video streams.

Abstract: Streaming video generation, as one fundamental component in interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated stream videos over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.

[702] Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, Kai Zhang

Main category: cs.CV

TL;DR: Aligning pretrained visual encoders as tokenizers for latent diffusion models improves semantic richness and accelerates convergence in image generation.

Details

Motivation: To leverage the rich semantic structure of foundation encoders for image tokenization, avoiding the low-level focus of training VAEs from scratch.

Method: Three-stage alignment: (1) freeze encoder and train adapter/decoder, (2) joint optimization with semantic preservation loss, (3) decoder refinement for better reconstruction.

Result: Achieved gFID of 1.90 in 64 epochs on ImageNet 256×256, outperformed FLUX VAE in 2B-parameter text-to-image model on LAION.

Conclusion: The method provides a simple, scalable, and semantically grounded paradigm for continuous tokenizer design in diffusion models.

Abstract: In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256$\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.

[703] YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection

Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee

Main category: cs.CV

TL;DR: YOLO26 introduces key architectural enhancements for real-time edge object detection, featuring NMS-free inference, removal of DFL, new ProgLoss and STAL techniques, and MuSGD optimizer, achieving superior efficiency and accuracy on edge devices.

Details

Motivation: To push the boundaries of efficiency and accuracy for real-time object detection on edge and low-power devices, addressing the need for deployment-ready models in resource-constrained environments.

Method: Architectural innovations including end-to-end NMS-free inference, removal of Distribution Focal Loss (DFL), introduction of ProgLoss and Small-Target-Aware Label Assignment (STAL) for stability and small-object detection, and adoption of MuSGD optimizer inspired by large language model training.

Result: Performance benchmarks show YOLO26 achieves superior efficiency, accuracy, and deployment versatility compared to previous YOLO versions (YOLOv8, YOLO11, YOLOv12, YOLOv13) across edge devices, particularly NVIDIA Orin Jetson platforms.

Conclusion: YOLO26 represents a pivotal milestone in YOLO evolution, establishing itself as the most cutting-edge member with enhanced architectural features that significantly improve real-time edge object detection capabilities.

Abstract: This study presents Key Architectural Enhancements and Performance Benchmarking of Ultralytics YOLO26 for real-time edge object detection, providing a comprehensive overview of the design principles of YOLO26, technological advances, and deployment readiness. YOLO26, released in September 2025 by Ultralytics, represents the newest and most cutting-edge member of the You Only Look Once (YOLO) family, engineered to push the boundaries of efficiency and accuracy on edge and low-power devices. This paper highlights architectural innovations in YOLO26, including end-to-end NMS-free inference, removal of Distribution Focal Loss (DFL) for streamlined exports, introduction of ProgLoss and Small-Target-Aware Label Assignment (STAL) for improved stability and small-object detection, and the adoption of the MuSGD optimizer inspired by large language model training. In addition, we report performance benchmarks for YOLO26 across edge devices, specifically NVIDIA Orin Jetson platforms, and compare results against YOLOv8 and YOLO11 (previous Ultralytics releases) as well as YOLOv12 and YOLOv13, which bridged the lineage between YOLO11 and YOLO26. Our comparative analysis highlights superior efficiency of YOLO26, accuracy, and deployment versatility, establishing it as a pivotal milestone in the YOLO evolution.

[704] Personalized Vision via Visual In-Context Learning

Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, Mike Zheng Shou

Main category: cs.CV

TL;DR: PICO is a visual in-context learning framework that uses diffusion transformers to perform personalized vision tasks from a single annotated example without retraining, outperforming fine-tuning and synthetic-data approaches.

Details

Motivation: Modern vision models struggle with personalized tasks defined by users at test time, while existing personalization methods require costly fine-tuning or synthetic data pipelines that are inflexible.

Method: PICO uses a four-panel framework with diffusion transformers as visual in-context learners, constructs VisRel dataset for task diversity, and employs attention-guided seed scorer for reliable inference.

Result: PICO surpasses fine-tuning and synthetic-data baselines, flexibly adapts to novel user-defined tasks, and generalizes across both recognition and generation tasks.

Conclusion: Task diversity rather than scale drives robust generalization in visual in-context learning, enabling flexible personalization without retraining.

Abstract: Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision – tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods confine to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.

[705] Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding

Bingkui Tong, Jiaer Xia, Kaiyang Zhou

Main category: cs.CV

TL;DR: LayerCD is a simple method that reduces hallucinations in Multimodal Large Language Models by contrasting output distributions from shallow vs deep visual features to filter out biased low-level information.

Details

Motivation: MLLMs suffer from hallucinations - generating outputs that are linguistically coherent but inconsistent with image context, particularly inaccuracies in objects, attributes, and relations.

Method: Layer Contrastive Decoding (LayerCD) contrasts output distributions generated from shallow vs deep visual features of the vision encoder, filtering out hallucinations caused by biased low-level information from shallow layers.

Result: Extensive experiments on two hallucination benchmarks show LayerCD significantly outperforms current state-of-the-art methods.

Conclusion: LayerCD effectively reduces hallucinations in MLLMs by leveraging the observation that shallow visual features are more prone to cause hallucinations than deep visual features.

Abstract: Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations – generating outputs that are linguistically coherent but inconsistent with the context of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate as they only capture biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder, respectively. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms current state-of-the-art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD .

[706] PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos

Ting-Hsuan Liao, Haowen Liu, Yiran Xu, Songwei Ge, Gengshan Yang, Jia-Bin Huang

Main category: cs.CV

TL;DR: PAD3R reconstructs deformable 3D objects from unposed monocular videos using a personalized pose estimator guided by pre-trained 3D models and regularized by long-term 2D point tracking.

Details

Motivation: Existing methods struggle with long video sequences containing substantial object deformation, large camera movement, and limited view coverage that challenge conventional 3D reconstruction systems.

Method: Trains object-centric pose estimator supervised by pre-trained image-to-3D model, guides optimization of deformable 3D Gaussian representation, and regularizes with long-term 2D point tracking over entire video.

Result: Reconstructs high-fidelity, articulated 3D representations in category-agnostic way, showing robustness and good generalization across challenging scenarios.

Conclusion: PAD3R demonstrates potential for dynamic scene understanding and 3D content creation by combining generative priors with differentiable rendering for deformable object reconstruction.

Abstract: We present PAD3R, a method for reconstructing deformable 3D objects from casually captured, unposed monocular videos. Unlike existing approaches, PAD3R handles long video sequences featuring substantial object deformation, large-scale camera movement, and limited view coverage that typically challenge conventional systems. At its core, our approach trains a personalized, object-centric pose estimator, supervised by a pre-trained image-to-3D model. This guides the optimization of deformable 3D Gaussian representation. The optimization is further regularized by long-term 2D point tracking over the entire input video. By combining generative priors and differentiable rendering, PAD3R reconstructs high-fidelity, articulated 3D representations of objects in a category-agnostic way. Extensive qualitative and quantitative results show that PAD3R is robust and generalizes well across challenging scenarios, highlighting its potential for dynamic scene understanding and 3D content creation.

[707] PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Shuoshuo Zhang, Zijian Li, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Yujiu Yang, Rui Wang

Main category: cs.CV

TL;DR: PixelCraft is a multi-agent system that enhances visual reasoning on structured images (charts, diagrams) through high-fidelity image processing and flexible reasoning with dynamic workflow and image memory.

Details

Motivation: Existing MLLMs struggle with structured images due to perceptual errors and rigid reasoning patterns, while current cue-based methods have limitations in image fidelity and reasoning flexibility.

Method: Multi-agent system with dispatcher, planner, reasoner, critics, and visual tool agents; combines fine-tuned MLLM for pixel-level localization with traditional CV algorithms; uses dynamic three-stage workflow (tool selection, agent discussion, self-criticism) with adaptive image memory.

Result: Significantly improves visual reasoning performance on challenging chart and geometry benchmarks, setting new standards for structured image reasoning.

Conclusion: PixelCraft demonstrates effective high-fidelity processing and flexible reasoning for structured images, overcoming limitations of existing methods through its multi-agent architecture and dynamic workflow.

Abstract: Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained with low-fidelity image processing and linear, rigid reasoning patterns, limiting their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.

[708] FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation

Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, Li Yuan

Main category: cs.CV

TL;DR: FlashI2V introduces Fourier-Guided Latent Shifting to prevent conditional image leakage in Image-to-Video generation, achieving better generalization and performance on out-of-domain data with fewer parameters.

Details

Motivation: Existing I2V methods suffer from conditional image leakage where denoisers shortcut the conditional image, causing slow motion, color inconsistency, and poor performance on out-of-domain data.

Method: FlashI2V uses: (1) Latent Shifting - modifies source/target distributions by subtracting conditional image information from noisy latents, (2) Fourier Guidance - uses high-frequency magnitude features from Fourier Transform to accelerate convergence and adjust detail levels.

Result: Achieves dynamic degree score of 53.01 on Vbench-I2V with only 1.3B parameters, outperforming larger models like CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P.

Conclusion: The method effectively overcomes conditional image leakage and achieves best generalization and performance on out-of-domain data among various I2V paradigms.

Abstract: In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Github page: https://pku-yuangroup.github.io/FlashI2V/

[709] Visual Jigsaw Post-Training Improves MLLMs

Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu

Main category: cs.CV

TL;DR: Visual Jigsaw is a self-supervised post-training framework that enhances multimodal large language models’ visual understanding through a jigsaw puzzle task where models reconstruct shuffled visual inputs by producing correct permutations in natural language.

Details

Motivation: Current post-training paradigms for MLLMs are predominantly text-centric, using visual inputs only to extract sparse cues for text-based reasoning. There's a need for vision-centric approaches that strengthen intrinsic visual understanding without relying on text as an intermediate mediator or additional visual generative components.

Method: Visual Jigsaw formulates visual understanding as a general ordering task: visual inputs are partitioned and shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This aligns with reinforcement learning from verifiable rewards, requires no additional visual generative components, and derives supervisory signals automatically without annotations.

Result: Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding across three visual modalities (images, videos, and 3D data).

Conclusion: Visual Jigsaw highlights the potential of self-supervised vision-centric tasks in post-training MLLMs and aims to inspire further research on vision-centric pretext designs for enhancing visual understanding capabilities.

Abstract: Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs’ intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. There exist a few approaches in this direction, however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/

[710] VGGT-X: When VGGT Meets Dense Novel View Synthesis

Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: VGGT-X addresses VRAM and output quality barriers when scaling 3D Foundation Models to dense novel view synthesis, achieving state-of-the-art COLMAP-free results.

Details

Motivation: Current NVS approaches rely on slow and fragile SfM pipelines. 3DFMs offer speed but struggle with dense views due to VRAM limitations and imperfect outputs that degrade 3D training.

Method: Proposes VGGT-X with memory-efficient VGGT scaling to 1000+ images, adaptive global alignment for output enhancement, and robust 3DGS training practices.

Result: Substantially closes fidelity gap with COLMAP-initialized pipelines, achieving SOTA in dense COLMAP-free NVS and pose estimation.

Conclusion: The work provides insights for future development of 3D foundation models and dense NVS, analyzing remaining gaps with COLMAP-initialized rendering.

Abstract: We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/

[711] Learning to Infer Unseen Single-/Multi-Attribute-Object Compositions with Graph Networks

Hui Chen, Jingjing Jiang, Nanning Zheng

Main category: cs.CV

TL;DR: Proposes an attribute-object semantic association graph model for unseen composition recognition, introduces a balance loss to address domain bias, and builds a large-scale MultiAttribute Dataset (MAD) with novel evaluation metrics.

Details

Motivation: Existing methods are limited to single-attribute-object composition recognition and struggle to learn relations between attributes and objects, making it difficult to infer unseen attribute-object compositions.

Method: Uses an attribute-object semantic association graph with nodes representing attributes and objects, contrastive loss to pull anchor features closer to correct labels and push away from negatives, and a novel balance loss to alleviate domain bias towards seen compositions.

Result: Experiments on MAD and two single-attribute-object benchmarks (MIT-States and UT-Zappos50K) demonstrate the effectiveness of the approach in recognizing both single- and multi-attribute-object compositions.

Conclusion: The proposed graph model successfully learns complex relations between attributes and objects, enabling knowledge transfer and effective recognition of unseen multi-attribute-object compositions, with the balance loss effectively addressing domain bias issues.

Abstract: Inferring the unseen attribute-object composition is critical to make machines learn to decompose and compose complex concepts like people. Most existing methods are limited to the composition recognition of single-attribute-object, and can hardly learn relations between the attributes and objects. In this paper, we propose an attribute-object semantic association graph model to learn the complex relations and enable knowledge transfer between primitives. With nodes representing attributes and objects, the graph can be constructed flexibly, which realizes both single- and multi-attribute-object composition recognition. In order to reduce mis-classifications of similar compositions (e.g., scratched screen and broken screen), driven by the contrastive loss, the anchor image feature is pulled closer to the corresponding label feature and pushed away from other negative label features. Specifically, a novel balance loss is proposed to alleviate the domain bias, where a model prefers to predict seen compositions. In addition, we build a large-scale MultiAttribute Dataset (MAD) with 116,099 images and 8,030 label categories for inferring unseen multi-attribute-object compositions. Along with MAD, we propose two novel metrics Hard and Soft to give a comprehensive evaluation in the multi-attribute setting. Experiments on MAD and two other single-attribute-object benchmarks (MIT-States and UT-Zappos50K) demonstrate the effectiveness of our approach.

[712] TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Chaoya Jiang, Haiyang Xu, Chenliang Li, Miang Yan, Wei Ye, Shikun Zhang, Bin Bi, Songfang Huang

Main category: cs.CV

TL;DR: TRIPS is an efficient VLP model that reduces visual sequence length using text-guided patch selection, achieving 40% speedup without performance loss.

Details

Motivation: Vision Transformers in VLP models suffer from computational inefficiency due to long visual sequences, which TRIPS aims to solve.

Method: Uses text-guided patch-selection layer to dynamically identify and fuse attentive image tokens based on text relevance, without adding extra parameters.

Result: 40% speedup over previous VLP models while maintaining competitive or better performance on various benchmark datasets.

Conclusion: TRIPS provides an effective solution for efficient VLP training and inference through text-relevant image patch selection.

Abstract: Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from computational efficiency brought by the long visual sequence. To tackle this problem, in this paper, we propose an efficient vision-and-language pre-training model with \textbf{T}ext-\textbf{R}elevant \textbf{I}mage \textbf{P}atch \textbf{S}election, namely TRIPS, which reduces the visual sequence progressively with a text-guided patch-selection layer in the visual backbone for efficient training and inference. The patch-selection layer can dynamically compute text-dependent visual attention to identify the attentive image tokens with text guidance and fuse inattentive ones in an end-to-end manner. Meanwhile, TRIPS does not introduce extra parameters to ViTs. Experimental results on a variety of popular benchmark datasets demonstrate that TRIPS gain a speedup of 40% over previous similar VLP models, yet with competitive or better downstream task performance.

[713] Self-Supervised Geometry-Guided Initialization for Robust Monocular Visual Odometry

Takayuki Kanai, Igor Vasiljevic, Vitor Guizilini, Kazuhiro Shintani

Main category: cs.CV

TL;DR: The paper proposes using self-supervised priors from a pre-trained monocular depth estimator to improve learning-based dense SLAM, addressing failures in large motion and dynamic scenarios.

Details

Motivation: Traditional feature-based visual odometry fails in poor lighting, low texture, and large motions. Learning-based dense SLAM methods are more robust but still struggle with large motion and object dynamics.

Method: Diagnose weaknesses in DROID-SLAM, then use self-supervised priors from a frozen pre-trained monocular depth estimator to initialize dense bundle adjustment without fine-tuning the SLAM backbone.

Result: Significant improvements on KITTI odometry and the challenging DDAD benchmark, demonstrating robust visual odometry.

Conclusion: Simple integration of self-supervised depth priors effectively enhances learning-based dense SLAM performance in challenging scenarios.

Abstract: Monocular visual odometry is a key technology in various autonomous systems. Traditional feature-based methods suffer from failures due to poor lighting, insufficient texture, and large motions. In contrast, recent learning-based dense SLAM methods exploit iterative dense bundle adjustment to address such failure cases, and achieve robust and accurate localization in a wide variety of real environments, without depending on domain-specific supervision. However, despite its potential, the methods still struggle with scenarios involving large motion and object dynamics. In this study, we diagnose key weaknesses in a popular learning-based dense SLAM model (DROID-SLAM) by analyzing major failure cases on outdoor benchmarks and exposing various shortcomings of its optimization process. We then propose the use of self-supervised priors leveraging a frozen large-scale pre-trained monocular depth estimator to initialize the dense bundle adjustment process, leading to robust visual odometry without the need to fine-tune the SLAM backbone. Despite its simplicity, the proposed method demonstrates significant improvements on KITTI odometry, as well as the challenging DDAD benchmark.

[714] fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence

Francis Williams, Jiahui Huang, Jonathan Swartz, Gergely Klár, Vijay Thakkar, Matthew Cong, Xuanchi Ren, Ruilong Li, Clement Fuji-Tsang, Sanja Fidler, Eftychios Sifakis, Ken Museth

Main category: cs.CV

TL;DR: fVDB is a GPU-optimized framework for deep learning on large-scale 3D data, providing differentiable primitives for tasks like convolution, pooling, attention, ray-tracing, and meshing with superior performance and versatility.

Details

Motivation: To address the limitations of existing 3D deep learning frameworks that lack comprehensive feature sets while maintaining efficiency, and to enable processing of large-scale 3D datasets with high spatial resolution.

Method: Uses a novel VDB index grid acceleration structure with GPU-accelerated sparse grid construction, convolution using tensorcores, fast ray tracing with HDDA algorithm, and jagged tensors, fully integrated with PyTorch.

Result: fVDB matches or exceeds performance of other frameworks while providing more features, processes larger datasets with higher resolution, and maintains competitive memory footprint on small inputs.

Conclusion: fVDB successfully combines versatility and performance for large-scale 3D deep learning, demonstrating effectiveness across various tasks including point-cloud segmentation, 3D generative modeling, Neural Radiance Fields, and point cloud reconstruction.

Abstract: We present fVDB, a novel GPU-optimized framework for deep learning on large-scale 3D data. fVDB provides a complete set of differentiable primitives to build deep learning architectures for common tasks in 3D learning such as convolution, pooling, attention, ray-tracing, meshing, etc. fVDB simultaneously provides a much larger feature set (primitives and operators) than established frameworks with no loss in efficiency: our operators match or exceed the performance of other frameworks with narrower scope. Furthermore, fVDB can process datasets with much larger footprint and spatial resolution than prior works, while providing a competitive memory footprint on small inputs. To achieve this combination of versatility and performance, fVDB relies on a single novel VDB index grid acceleration structure paired with several key innovations including GPU accelerated sparse grid construction, convolution using tensorcores, fast ray tracing kernels using a Hierarchical Digital Differential Analyzer algorithm (HDDA), and jagged tensors. Our framework is fully integrated with PyTorch enabling interoperability with existing pipelines, and we demonstrate its effectiveness on a number of representative tasks such as large-scale point-cloud segmentation, high resolution 3D generative modeling, unbounded scale Neural Radiance Fields, and large-scale point cloud reconstruction.

[715] MCPDepth: Omnidirectional Depth Estimation via Stereo Matching from Multi-Cylindrical Panoramas

Feng Qiao, Zhexiao Xiong, Xinge Zhu, Yuexin Ma, Qiumeng He, Nathan Jacobs

Main category: cs.CV

TL;DR: MCPDepth is a two-stage framework for omnidirectional depth estimation using stereo matching across multiple cylindrical panoramas, achieving significant performance improvements over existing methods.

Details

Motivation: Address the challenge of omnidirectional depth estimation by exploring the impact of projection methods and developing a solution that works well with standard network components for easy deployment.

Method: Two-stage framework: 1) Stereo matching using cylindrical panoramas, 2) Robust fusion of depth maps from different views. Uses circular attention module to handle vertical distortions and standard network components for deployment.

Result: Improves MAE by 18.8% on Deep360 dataset and 19.9% on 3D60 dataset compared to existing methods. Demonstrates superior efficacy of cylindrical projection over spherical and cubic projections.

Conclusion: MCPDepth establishes a new paradigm in omnidirectional depth estimation with practical insights for real-world applications, offering better performance while using standard network components for easy deployment.

Abstract: Omnidirectional depth estimation presents a significant challenge due to the inherent distortions in panoramic images. Despite notable advancements, the impact of projection methods remains underexplored. We introduce Multi-Cylindrical Panoramic Depth Estimation (MCPDepth), a novel two-stage framework designed to enhance omnidirectional depth estimation through stereo matching across multiple cylindrical panoramas. MCPDepth initially performs stereo matching using cylindrical panoramas, followed by a robust fusion of the resulting depth maps from different views. Unlike existing methods that rely on customized kernels to address distortions, MCPDepth utilizes standard network components, facilitating seamless deployment on embedded devices while delivering exceptional performance. To effectively address vertical distortions in cylindrical panoramas, MCPDepth incorporates a circular attention module, significantly expanding the receptive field beyond traditional convolutions. We provide a comprehensive theoretical and experimental analysis of common panoramic projections-spherical, cylindrical, and cubic-demonstrating the superior efficacy of cylindrical projection. Our method improves the mean absolute error (MAE) by 18.8% on the outdoor dataset Deep360 and by 19.9% on the real dataset 3D60. This work offers practical insights for other tasks and real-world applications, establishing a new paradigm in omnidirectional depth estimation. The code is available at https://github.com/Qjizhi/MCPDepth.

[716] Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis

Junyan Ye, Jun He, Weijia Li, Zhutao Lv, Yi Lin, Jinhua Yu, Haote Yang, Conghui He

Main category: cs.CV

TL;DR: SkyDiffusion is a novel method for generating realistic aerial images from ground street view images using diffusion models and Bird’s-Eye View transformation to bridge the viewpoint gap and handle occlusion in dense urban scenes.

Details

Motivation: The significant viewpoint difference between ground and aerial images creates domain gaps, and dense urban scenes limit street view visibility, making cross-view generation challenging.

Method: Uses Curved-BEV to convert street-view images to BEV perspective, employs multi-to-one mapping for occlusion handling, and implements a BEV-guided diffusion model for aerial image generation.

Result: Outperforms state-of-the-art methods on multiple datasets (CVUSA, CVACT, VIGOR-Chicago, G2A-3) across natural, suburban, urban, and various application scenarios.

Conclusion: SkyDiffusion achieves realistic and content-consistent aerial image generation from street views and introduces a new dataset (Ground2Aerial-3) for diverse ground-to-aerial synthesis applications.

Abstract: Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street view images while maintaining consistent content layout, simulating a top-down view. The significant viewpoint difference leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm. The Curved-BEV method in SkyDiffusion converts street-view images into a BEV perspective, effectively bridging the domain gap, and employs a “multi-to-one” mapping strategy to address occlusion issues in dense urban scenes. Next, SkyDiffusion designed a BEV-guided diffusion model to generate content-consistent and realistic aerial images. Additionally, we introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster scene aerial synthesis, low-altitude UAV image synthesis, and historical high-resolution satellite image synthesis tasks. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on cross-view datasets across natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and various application scenarios (G2A-3), achieving realistic and content-consistent aerial image generation. The code, datasets and more information of this work can be found at https://opendatalab.github.io/skydiffusion/ .

[717] Disentangling Regional Primitives for Image Generation

Zhengting Chen, Lei Cheng, Lianghui Ding, Liang Lin, Quanshi Zhang

Main category: cs.CV

TL;DR: The paper proposes a neural network for image generation that explains representation structures by disentangling primitive feature components from intermediate layers, where each component generates a regional pattern and the entire image is a superposition of these components.

Details

Motivation: To provide a new perspective on image generation by explaining representation structures and defining desirable properties for neural networks in this context.

Method: Proposes properties (feature completeness, spatial boundedness, consistency) to define representation structure, then disentangles primitive feature components from intermediate-layer features using OR Harsanyi interaction computation.

Result: Experiments verified the faithfulness of the disentangled primitive regional patterns in image generation.

Conclusion: The method successfully explains image generation as a superposition of primitive feature components that satisfy the proposed properties, with experimental validation of the approach’s faithfulness.

Abstract: This paper explains a neural network for image generation from a new perspective, i.e., explaining representation structures for image generation. We propose a set of desirable properties to define the representation structure of a neural network for image generation, including feature completeness, spatial boundedness and consistency. These properties enable us to propose a method for disentangling primitive feature components from the intermediate-layer features, where each feature component generates a primitive regional pattern covering multiple image patches. In this way, the generation of the entire image can be explained as a superposition of these feature components. We prove that these feature components, which satisfy the feature completeness property and the linear additivity property (derived from the feature completeness, spatial boundedness, and consistency properties), can be computed as OR Harsanyi interaction. Experiments have verified the faithfulness of the disentangled primitive regional patterns.

[718] DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing

June Suk Choi, Kyungmin Lee, Jongheon Jeong, Saining Xie, Jinwoo Shin, Kimin Lee

Main category: cs.CV

TL;DR: DiffusionGuard is a robust defense method against unauthorized edits by diffusion-based image editing models, using adversarial noise targeting early diffusion stages and mask-augmentation for better protection.

Details

Motivation: Concerns about misuse of text-guided image manipulation methods for creating misleading/harmful content, and limitations of existing defense strategies against sophisticated manipulations like masked editing.

Method: Generates adversarial noise targeting early diffusion process stages, uses mask-augmentation technique to enhance robustness against various masks, and introduces comprehensive benchmark for evaluation.

Result: Achieves stronger protection and improved mask robustness with lower computational costs compared to baselines, exhibits superior transferability and better resilience to noise removal techniques.

Conclusion: DiffusionGuard provides an effective and efficient defense solution against unauthorized image edits by diffusion models, addressing limitations of previous methods.

Abstract: Recent advances in diffusion models have introduced a new era of text-guided image manipulation, enabling users to create realistic edited images with simple textual prompts. However, there is significant concern about the potential misuse of these methods, especially in creating misleading or harmful content. Although recent defense strategies, which introduce imperceptible adversarial noise to induce model failure, have shown promise, they remain ineffective against more sophisticated manipulations, such as editing with a mask. In this work, we propose DiffusionGuard, a robust and effective defense method against unauthorized edits by diffusion-based image editing models, even in challenging setups. Through a detailed analysis of these models, we introduce a novel objective that generates adversarial noise targeting the early stage of the diffusion process. This approach significantly improves the efficiency and effectiveness of adversarial noises. We also introduce a mask-augmentation technique to enhance robustness against various masks during test time. Finally, we introduce a comprehensive benchmark designed to evaluate the effectiveness and robustness of methods in protecting against privacy threats in realistic scenarios. Through extensive experiments, we show that our method achieves stronger protection and improved mask robustness with lower computational costs compared to the strongest baseline. Additionally, our method exhibits superior transferability and better resilience to noise removal techniques compared to all baseline methods. Our source code is publicly available at https://github.com/choi403/DiffusionGuard.

[719] CART: Compositional Auto-Regressive Transformer for Image Generation

Siddharth Roheda, Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal

Main category: cs.CV

TL;DR: CART proposes an auto-regressive image generation method that models images as hierarchical compositions of interpretable visual layers, outperforming traditional approaches through iterative detail addition via semantic decompositions.

Details

Motivation: To address the challenges of applying auto-regressive models to vision tasks due to spatial dependencies in images, and to improve controllability and interpretability in image generation.

Method: Uses hierarchical composition of interpretable visual layers with three decomposition strategies: Base-Detail Decomposition (Mumford-Shah smoothness), Intrinsic Decomposition (albedo/shading), and Specularity Decomposition (diffuse/specular).

Result: CART outperforms traditional ’next-token’ and ’next-scale’ approaches, improving controllability, semantic interpretability, and resolution scalability while generating visually compelling results.

Conclusion: The method enables structured image manipulation and opens new directions for controllable generative modeling through physically or perceptually motivated image factorization.

Abstract: We propose a novel Auto-Regressive (AR) image generation approach that models images as hierarchical compositions of interpretable visual layers. While AR models have achieved transformative success in language modeling, replicating this success in vision tasks has presented unique challenges due to inherent spatial dependencies in images. Addressing the unique challenges of vision tasks, our method (CART) adds image details iteratively via semantically meaningful decompositions. We demonstrate the flexibility and generality of CART by applying it across three distinct decomposition strategies: (i) Base-Detail Decomposition (Mumford-Shah smoothness), (ii) Intrinsic Decomposition (albedo/shading), and (iii) Specularity Decomposition (diffuse/specular). This “next-detail” strategy outperforms traditional “next-token” and “next-scale” approaches, improving controllability, semantic interpretability, and resolution scalability. Experiments show CART generates visually compelling results while enabling structured image manipulation, opening new directions for controllable generative modeling via physically or perceptually motivated image factorization.

[720] Continuous Speculative Decoding for Autoregressive Image Generation

Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang

Main category: cs.CV

TL;DR: This paper introduces continuous speculative decoding to accelerate continuous autoregressive models, addressing challenges of low acceptance rates and complex distributions through denoising trajectory alignment and acceptance-rejection sampling.

Details

Motivation: Continuous autoregressive models suffer from heavy inference burden, and while speculative decoding works well for discrete LLMs, there's no analogous theory for continuous distributions to accelerate continuous AR models.

Method: Proposes continuous speculative decoding with denoising trajectory alignment and token pre-filling to improve acceptance rates, and acceptance-rejection sampling with appropriate upper bounds to handle complex integral distributions without explicit calculation.

Result: Achieves over 2× speedup on off-the-shelf models while maintaining original generation quality.

Conclusion: Continuous speculative decoding effectively accelerates continuous autoregressive models, bridging the gap between discrete and continuous speculative decoding approaches.

Abstract: Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation. However, the heavy autoregressive inference burden imposes significant overhead. In Large Language Models (LLMs), speculative decoding has effectively accelerated discrete autoregressive inference. However, the absence of an analogous theory for continuous distributions precludes its use in accelerating continuous AR models. To fill this gap, this work presents continuous speculative decoding, and addresses challenges from: 1) low acceptance rate, caused by inconsistent output distribution between target and draft models, and 2) modified distribution without analytic expression, caused by complex integral. To address challenge 1), we propose denoising trajectory alignment and token pre-filling strategies. To address challenge 2), we introduce acceptance-rejection sampling algorithm with an appropriate upper bound, thereby avoiding explicitly calculating the integral. Furthermore, our denoising trajectory alignment is also reused in acceptance-rejection sampling, effectively avoiding repetitive diffusion model inference. Extensive experiments demonstrate that our proposed continuous speculative decoding achieves over $2\times$ speedup on off-the-shelf models, while maintaining the original generation quality. Codes is available at: https://github.com/MarkXCloud/CSpD

[721] Open-Vocabulary Online Semantic Mapping for SLAM

Tomas Berriel Martins, Martin R. Oswald, Javier Civera

Main category: cs.CV

TL;DR: OVO is an open-vocabulary online 3D semantic mapping pipeline that uses CLIP vectors to describe 3D segments, featuring low computational footprint and superior segmentation performance compared to offline and online baselines.

Details

Motivation: To create an efficient online 3D semantic mapping system that can handle open-vocabulary concepts with lower computational and memory requirements than existing offline methods.

Method: Given posed RGB-D frames, detect and track 3D segments, describe them using CLIP vectors computed from viewpoints using a novel CLIP merging method, and integrate with SLAM backends like Gaussian-SLAM and ORB-SLAM2.

Result: OVO achieves significantly lower computational and memory footprint than offline baselines while showing better segmentation metrics than both offline and online methods. Successfully demonstrates end-to-end open-vocabulary online 3D mapping with loop closure.

Conclusion: OVO presents the first open-vocabulary online 3D semantic mapping pipeline with neural network-based CLIP descriptor merging, offering efficient real-time performance with superior segmentation quality.

Abstract: This paper presents an Open-Vocabulary Online 3D semantic mapping pipeline, that we denote by its acronym OVO. Given a sequence of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors. These are computed from the viewpoints where they are observed by a novel CLIP merging method. Notably, our OVO has a significantly lower computational and memory footprint than offline baselines, while also showing better segmentation metrics than offline and online ones. Along with superior segmentation performance, we also show experimental results of our mapping contributions integrated with two different full SLAM backbones (Gaussian-SLAM and ORB-SLAM2), being the first ones using a neural network to merge CLIP descriptors and demonstrating end-to-end open-vocabulary online 3D mapping with loop closure.

[722] GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing

Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood, Karthik Nandakumar, Naveed Akhtar

Main category: cs.CV

TL;DR: GenMix is a prompt-guided generative data augmentation method that uses image editing and fractal patterns to enhance both in-domain and cross-domain image classification performance and adversarial robustness.

Details

Motivation: Traditional data augmentation methods struggle with domain gaps in domain adaptation scenarios, limiting their effectiveness when source and target domains differ.

Method: Leverages image editing with custom conditional prompts to generate augmented images, blends portions of input images with edited generative counterparts, and incorporates fractal patterns to mitigate unrealistic images and label ambiguity.

Result: Achieves stronger performance across eight public datasets for general and fine-grained classification in both in-domain and cross-domain settings, with improvements in self-supervised learning, data scarcity scenarios, and adversarial robustness.

Conclusion: GenMix outperforms existing state-of-the-art methods and demonstrates broad applicability across various learning scenarios including domain adaptation, self-supervised learning, and adversarial robustness.

Abstract: Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps. This paper introduces GenMix, a generalizable prompt-guided generative data augmentation approach that enhances both in-domain and cross-domain image classification. Our technique leverages image editing to generate augmented images based on custom conditional prompts, designed specifically for each problem type. By blending portions of the input image with its edited generative counterpart and incorporating fractal patterns, our approach mitigates unrealistic images and label ambiguity, improving the performance and adversarial robustness of the resulting models. Efficacy of our method is established with extensive experiments on eight public datasets for general and fine-grained classification, in both in-domain and cross-domain settings. Additionally, we demonstrate performance improvements for self-supervised learning, learning with data scarcity, and adversarial robustness. As compared to the existing state-of-the-art methods, our technique achieves stronger performance across the board.

[723] LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

Enis Simsar, Thomas Hofmann, Federico Tombari, Pinar Yanardag

Main category: cs.CV

TL;DR: LoRACLR is a novel method for multi-concept image generation that merges multiple LoRA models into a single unified model using contrastive learning, enabling high-quality synthesis of multiple personalized concepts without attribute entanglement.

Details

Motivation: Current text-to-image customization methods struggle with combining multiple personalized models, often leading to attribute entanglement or requiring separate training to preserve concept distinctiveness.

Method: LoRACLR merges multiple LoRA models (each fine-tuned for a distinct concept) into a single unified model using a contrastive objective to align and merge weight spaces, ensuring compatibility while minimizing interference.

Result: The approach enables efficient, scalable model composition for high-quality multi-concept image synthesis, accurately merging multiple concepts while preserving their distinctiveness.

Conclusion: LoRACLR advances the capabilities of personalized image generation by providing an effective solution for multi-concept composition without additional fine-tuning.

Abstract: Recent advances in text-to-image customization have enabled high-fidelity, context-rich generation of personalized images, allowing specific concepts to appear in a variety of scenarios. However, current methods struggle with combining multiple personalized models, often leading to attribute entanglement or requiring separate training to preserve concept distinctiveness. We present LoRACLR, a novel approach for multi-concept image generation that merges multiple LoRA models, each fine-tuned for a distinct concept, into a single, unified model without additional individual fine-tuning. LoRACLR uses a contrastive objective to align and merge the weight spaces of these models, ensuring compatibility while minimizing interference. By enforcing distinct yet cohesive representations for each concept, LoRACLR enables efficient, scalable model composition for high-quality, multi-concept image synthesis. Our results highlight the effectiveness of LoRACLR in accurately merging multiple concepts, advancing the capabilities of personalized image generation.

[724] Measurement of Medial Elbow Joint Space using Landmark Detection

Shizuka Akahori, Shotaro Teruya, Pragyan Shrestha, Yuichi Yoshii, Ryuhei Michinobu, Satoshi Iizuka, Itaru Kitahara

Main category: cs.CV

TL;DR: This paper introduces the first public ultrasound medial elbow dataset for automated joint space measurement to diagnose Ulnar Collateral Ligament injuries, and proposes a Shape Subspace landmark refinement method to improve detection accuracy.

Details

Motivation: There is no publicly available dataset for automated measurement of elbow joint space in ultrasound images, which is crucial for early diagnosis of UCL injuries through valgus instability assessment.

Method: Created a dataset of 4,201 medial elbow ultrasound images from 22 subjects with expert landmark annotations, evaluated heatmap-based, regression-based, and token-based landmark detection methods, and proposed Shape Subspace landmark refinement using geometrical similarities.

Result: Achieved mean joint space measurement error of 0.116 mm using HRNet, and SS refinement reduced mean absolute error by 0.010 mm with HRNet and 0.103 mm with ViTPose on average.

Conclusion: The proposed dataset and SS refinement enable high-precision, real-time diagnosis of UCL injuries through accurate joint space measurement, with potential for point-based segmentation of humerus and ulna.

Abstract: Ultrasound imaging of the medial elbow is crucial for the early diagnosis of Ulnar Collateral Ligament (UCL) injuries. Specifically, measuring the elbow joint space in ultrasound images is used to assess the valgus instability of the elbow caused by UCL injuries. To automate this measurement, a model trained on a precisely annotated dataset is necessary; however, no publicly available dataset exists to date. This study introduces a novel ultrasound medial elbow dataset to measure the joint space. The dataset comprises 4,201 medial elbow ultrasound images from 22 subjects, with landmark annotations on the humerus and ulna, based on the expertise of three orthopedic surgeons. We evaluated joint space measurement methods on our proposed dataset using heatmap-based, regression-based, and token-based landmark detection methods. While heatmap-based landmark detection methods generally achieve high accuracy, they sometimes produce multiple peaks on a heatmap, leading to incorrect detection. To mitigate this issue and enhance landmark localization, we propose Shape Subspace (SS) landmark refinement by measuring geometrical similarities between the detected and reference landmark positions. The results show that the mean joint space measurement error is 0.116 mm when using HRNet. Furthermore, SS landmark refinement can reduce the mean absolute error of landmark positions by 0.010 mm with HRNet and by 0.103 mm with ViTPose on average. These highlight the potential for high-precision, real-time diagnosis of UCL injuries by accurately measuring joint space. Lastly, we demonstrate point-based segmentation for the humerus and ulna using the detected landmarks as inputs. Our dataset will be publicly available at https://github.com/Akahori000/Ultrasound-Medial-Elbow-Dataset

Yang Du, Yuqi Liu, Qin Jin

Main category: cs.CV

TL;DR: RTime is a new video-text retrieval dataset that emphasizes temporal understanding by using reversed videos as hard negative samples, challenging existing models that perform well on current benchmarks but lack temporal reasoning capabilities.

Details

Motivation: Current video-text retrieval benchmarks don't adequately assess temporal understanding, allowing image-text pre-trained models to achieve comparable performance to video-text models. There's a need for datasets that specifically test temporal reasoning in videos.

Method: Created RTime dataset by collecting videos with significant temporality, reversing them to create hard negative samples, using human annotators to judge significance and reversibility, writing captions, and expanding captions using GPT-4. The dataset has 21k videos with 10 captions each.

Result: RTime poses new challenges to video-text retrieval models. Three benchmark tasks (RTime-Origin, RTime-Hard, RTime-Binary) were created and various models were benchmarked, showing that the dataset effectively tests temporal understanding capabilities.

Conclusion: RTime successfully addresses the limitations of existing benchmarks by focusing on temporal understanding, providing a more challenging evaluation framework for video-text retrieval models and advancing multimodal understanding research.

Abstract: Cross-modal (e.g. image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding field. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that the widely used video-text benchmarks have shortcomings in comprehensively assessing abilities of models, especially in temporal understanding, causing large-scale image-text pre-trained models can already achieve comparable zero-shot performance with video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We then recruit annotators to judge the significance and reversibility of candidate videos, and write captions for qualified videos. We further adopt GPT-4 to extend more captions based on human-written captions. Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours. Based on RTime, we propose three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We further enhance the use of harder-negatives in model training, and benchmark a variety of video-text models on RTime. Extensive experiment analysis proves that RTime indeed poses new and higher challenges to video-text retrieval. We release our RTime dataset https://github.com/qyr0403/Reversed-in-Time to further advance video-text retrieval and multimodal understanding research.

[726] PERSE: Personalized 3D Generative Avatars from A Single Portrait

Hyunsoo Cha, Inhee Lee, Hanbyul Joo

Main category: cs.CV

TL;DR: PERSE is a method for creating personalized 3D generative avatars from portrait references, enabling continuous and disentangled facial attribute editing while preserving identity.

Details

Motivation: To enable intuitive facial attribute manipulation in 3D avatars while maintaining individual identity, addressing limitations in previous approaches.

Method: Synthesizes large-scale 2D video datasets with consistent facial changes, then uses 3D Gaussian Splatting with latent space regularization using interpolated 2D faces as supervision.

Result: Generates high-quality avatars with interpolated attributes while preserving reference identity, outperforming previous approaches.

Conclusion: PERSE successfully creates personalized 3D avatars with continuous, disentangled facial attribute control and identity preservation.

Abstract: We present PERSE, a method for building a personalized 3D generative avatar from a reference portrait. Our avatar enables facial attribute editing in a continuous and disentangled latent space to control each facial attribute, while preserving the individual’s identity. To achieve this, our method begins by synthesizing large-scale synthetic 2D video datasets, where each video contains consistent changes in facial expression and viewpoint, along with variations in a specific facial attribute from the original input. We propose a novel pipeline to produce high-quality, photorealistic 2D videos with facial attribute editing. Leveraging this synthetic attribute dataset, we present a personalized avatar creation method based on 3D Gaussian Splatting, learning a continuous and disentangled latent space for intuitive facial attribute manipulation. To enforce smooth transitions in this latent space, we introduce a latent space regularization technique by using interpolated 2D faces as supervision. Compared to previous approaches, we demonstrate that PERSE generates high-quality avatars with interpolated attributes while preserving the identity of the reference individual.

[727] MIAFEx: An Attention-based Feature Extraction Method for Medical Image Classification

Oscar Ramos-Soto, Jorge Ramos-Frutos, Ezequiel Perez-Zarate, Diego Oliva, Sandra E. Balderas-Mata

Main category: cs.CV

TL;DR: MIAFEx is a novel medical image feature extractor using a learnable refinement mechanism in Transformer architecture to enhance classification token, outperforming classical feature extractors and modern CNN/ViT models in accuracy and robustness, especially with limited training data.

Details

Motivation: Classical feature extractors and traditional machine learning classifiers have limitations in providing sufficient discriminative information for complex medical image sets. CNNs and ViTs are prone to overfitting due to medical imaging data characteristics like small sample sizes and high intra-class variance.

Method: Proposes Medical Image Attention-based Feature Extractor (MIAFEx) that employs a learnable refinement mechanism to enhance the classification token within Transformer encoder architecture, adjusting the token based on learned weights to improve salient feature extraction.

Result: MIAFEx demonstrates superiority in accuracy and robustness across multiple complex medical imaging datasets compared to classical feature extractors, traditional/hybrid classifiers, and modern CNN/ViT models, particularly in scenarios with limited training data.

Conclusion: The proposed MIAFEx method effectively addresses the challenges of medical imaging data by enhancing feature extraction through learnable refinement, showing significant advantages over existing approaches especially when dealing with limited training samples.

Abstract: Feature extraction techniques are crucial in medical image classification; however, classical feature extractors, in addition to traditional machine learning classifiers, often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model’s adaptability to the challenges presented by medical imaging data. The MIAFEx output feature quality is compared against classical feature extractors using traditional and hybrid classifiers. Also, the performance of these features is compared against modern CNN and ViT models in classification tasks, demonstrating their superiority in accuracy and robustness across multiple complex medical imaging datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at https://github.com/Oscar-RamosS/Medical-Image-Attention-based-Feature-Extractor-MIAFEx

[728] CGI: Identifying Conditional Generative Models with Example Images

Zhi Zhou, Hao-Zhe Tan, Peng-Xiao Song, Lan-Zhe Guo

Main category: cs.CV

TL;DR: The paper proposes PromptBased Model Identification (PMI) to help users find the most suitable generative models from model hubs using example images instead of manual review.

Details

Motivation: Existing model hubs rely on basic text matching, but users struggle to choose models due to different abstractions and large model quantities. Manual review of descriptions and examples is inefficient.

Method: Proposes Conditional Generative Model Identification (CGI) framework with PMI approach that uses user-provided example images to match requirements with model specifications.

Result: PMI achieves 92% correct model identification with significantly better FID scores when 4 example images are provided. Benchmark includes 65 models and 9100 identification tasks.

Conclusion: PMI provides an effective solution for model identification in model hubs, enabling efficient model selection without manual review of large model collections.

Abstract: Generative models have achieved remarkable performance recently, and thus model hubs have emerged. Existing model hubs typically assume basic text matching is sufficient to search for models. However, in reality, due to different abstractions and the large number of models in model hubs, it is not easy for users to review model descriptions and example images, choosing which model best meets their needs. Therefore, it is necessary to describe model functionality wisely so that future users can efficiently search for the most suitable model for their needs. Efforts to address this issue remain limited. In this paper, we propose Conditional Generative Model Identification (CGI), which aims to provide an effective way to identify the most suitable model using user-provided example images rather than requiring users to manually review a large number of models with example images. To address this problem, we propose the PromptBased Model Identification (PMI) , which can adequately describe model functionality and precisely match requirements with specifications. To evaluate PMI approach and promote related research, we provide a benchmark comprising 65 models and 9100 identification tasks. Extensive experimental and human evaluation results demonstrate that PMI is effective. For instance, 92% of models are correctly identified with significantly better FID scores when four example images are provided.

[729] Med-PU: Point Cloud Upsampling for High-Fidelity 3D Medical Shape Reconstruction

Tongxu Zhang, Bei Wang

Main category: cs.CV

TL;DR: Med-PU is a knowledge-driven framework that integrates medical image segmentation with point cloud upsampling for high-fidelity 3D pelvic reconstruction, outperforming existing methods in surface quality and anatomical fidelity.

Details

Motivation: High-fidelity 3D anatomical reconstruction is essential for clinical applications like preoperative planning, radiotherapy target delineation, and orthopedic implant design.

Method: Combines volumetric medical image segmentation (using SAM-Med3D) with point cloud upsampling to learn implicit anatomical priors from large-scale 3D shape data, enabling dense completion from sparse segmentation-derived point sets.

Result: Med-PU consistently improves surface quality and anatomical fidelity while reducing artifacts, demonstrating robustness across input densities on pelvic CT datasets (MedShapePelvic for training and Pelvic1k for validation).

Conclusion: Med-PU serves as a practical, generalizable tool to bridge segmentation outputs and clinically usable 3D models, with potential applications to other skeletal regions and organs beyond the pelvis.

Abstract: High-fidelity 3D anatomical reconstruction is a prerequisite for downstream clinical tasks such as preoperative planning, radiotherapy target delineation, and orthopedic implant design. We present Med-PU, a knowledge-driven framework that integrates volumetric medical image segmentation with point cloud upsampling for accurate pelvic shape reconstruction. Unlike landmark- or PCA-based statistical shape models, Med-PU learns an implicit anatomical prior directly from large-scale 3D shape data, enabling dense completion and refinement from sparse segmentation-derived point sets. The pipeline couples SAM-Med3D-based voxel segmentation, point extraction, deep upsampling, and surface reconstruction, yielding smooth and topologically consistent meshes. We evaluate Med-PU on pelvic CT datasets (MedShapePelvic for training and Pelvic1k for validation), benchmarking against state-of-the-art upsampling methods using comprehensive geometry and surface metrics. Med-PU consistently improves surface quality and anatomical fidelity while reducing artifacts, demonstrating robustness across input densities. Although validated on the pelvis, the approach is anatomy-agnostic and applicable to other skeletal regions and organs. These results suggest Med-PU as a practical, generalizable tool to bridge segmentation outputs and clinically usable 3D models.

[730] DeepFRC: An End-to-End Deep Learning Model for Functional Registration and Classification

Siyuan Jiang, Yihan Hu, Wenjie Li, Pengcheng Zeng

Main category: cs.CV

TL;DR: DeepFRC is an end-to-end deep learning framework that jointly learns diffeomorphic warping functions and classification for functional data, addressing phase variability through unified elastic alignment and classification.

Details

Motivation: Functional data often suffers from phase variability (temporal misalignments) that obscures patterns and degrades model performance. Current methods treat registration and classification as separate sequential tasks, which is suboptimal.

Method: DeepFRC combines: neural deformation operator for elastic alignment, spectral representation using Fourier basis for smooth functional embedding, and class-aware contrastive loss for intra-class coherence and inter-class separation.

Result: Extensive experiments show DeepFRC consistently outperforms state-of-the-art methods in both alignment quality and classification accuracy, with notable robustness to noise, missing data, and varying dataset scales.

Conclusion: DeepFRC provides the first theoretical guarantees for joint registration-classification models, demonstrating synergy between components and superior performance across diverse datasets.

Abstract: Functional data, representing curves or trajectories, are ubiquitous in fields like biomedicine and motion analysis. A fundamental challenge is phase variability – temporal misalignments that obscure underlying patterns and degrade model performance. Current methods often address registration (alignment) and classification as separate, sequential tasks. This paper introduces DeepFRC, an end-to-end deep learning framework that jointly learns diffeomorphic warping functions and a classifier within a unified architecture. DeepFRC combines a neural deformation operator for elastic alignment, a spectral representation using Fourier basis for smooth functional embedding, and a class-aware contrastive loss that promotes both intra-class coherence and inter-class separation. We provide the first theoretical guarantees for such a joint model, proving its ability to approximate optimal warpings and establishing a data-dependent generalization bound that formally links registration fidelity to classification performance. Extensive experiments on synthetic and real-world datasets demonstrate that DeepFRC consistently outperforms state-of-the-art methods in both alignment quality and classification accuracy, while ablation studies validate the synergy of its components. DeepFRC also shows notable robustness to noise, missing data, and varying dataset scales. Code is available at https://github.com/Drivergo-93589/DeepFRC.

[731] Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

Tao Ren, Zishi Zhang, Jingyang Jiang, Zehao Li, Shentao Qin, Yi Zheng, Guanghao Li, Qianyou Sun, Yan Li, Jiafeng Liang, Xinping Li, Yijie Peng

Main category: cs.CV

TL;DR: The paper proposes RLR optimizer, a Half-Order fine-tuning paradigm for diffusion models that provides unbiased gradient estimation with lower variance compared to RL and truncated BP methods, enabling efficient alignment of foundation diffusion models.

Details

Motivation: Existing methods for aligning foundation diffusion models (RL and truncated BP) suffer from low sample efficiency and biased gradient estimation, leading to limited improvement or training failure.

Method: Proposed Recursive Likelihood Ratio (RLR) optimizer with Half-Order gradient estimator that enables computational graph rearrangement within the recursive diffusive chain for unbiased gradient estimation with lower variance.

Result: Theoretical analysis shows RLR has unbiased gradients with lower variance. Extensive experiments on image and video generation validate superiority. A novel prompt technique synergizes with RLR.

Conclusion: RLR optimizer provides an effective solution for efficient alignment of foundation diffusion models with superior performance compared to existing methods.

Abstract: The probabilistic diffusion model (DM), generating content by inferencing through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous data, the model needs to be properly aligned to meet requirements for downstream applications. How to efficiently align the foundation DM is a crucial task. Contemporary methods are either based on Reinforcement Learning (RL) or truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation, respectively, resulting in limited improvement or, even worse, complete training failure. To overcome the challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a Half-Order (HO) fine-tuning paradigm for DM. The HO gradient estimator enables the computation graph rearrangement within the recursive diffusive chain, making the RLR’s gradient estimator an unbiased one with lower variance than other methods. We theoretically investigate the bias, variance, and convergence of our method. Extensive experiments are conducted on image and video generation to validate the superiority of the RLR. Furthermore, we propose a novel prompt technique that is natural for the RLR to achieve a synergistic effect.

[732] 3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography

Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O’Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian

Main category: cs.CV

TL;DR: FM-CT is a 3D foundation model for head CT imaging that uses self-supervised learning on 361,663 scans to improve disease detection, especially with limited labeled data.

Details

Motivation: Address the scarcity of high-quality labels and annotations for head CT imaging, particularly for less common conditions, which hinders development of powerful deep learning models.

Method: Self-supervised learning with discrimination and masked image modeling on 361,663 non-contrast 3D head CT scans, using 3D architecture instead of 2D slice-level processing.

Result: Significantly improves downstream classification performance compared to models trained from scratch and previous 3D CT foundation models, validated on internal and external datasets including out-of-distribution data.

Conclusion: Self-supervised foundation models are effective for medical imaging and set a new benchmark for head CT analysis, enabling broader AI use in head CT-based diagnosis.

Abstract: Head computed tomography (CT) imaging is a widely-used imaging modality with multitudes of medical indications, particularly in assessing pathology of the brain, skull, and cerebrovascular system. It is commonly the first-line imaging in neurologic emergencies given its rapidity of image acquisition, safety, cost, and ubiquity. Deep learning models may facilitate detection of a wide range of diseases. However, the scarcity of high-quality labels and annotations, particularly among less common conditions, significantly hinders the development of powerful models. To address this challenge, we introduce FM-CT: a Foundation Model for Head CT for generalizable disease detection, trained using self-supervised learning. Our approach pre-trains a deep learning model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans without the need for manual annotations, enabling the model to learn robust, generalizable features. To investigate the potential of self-supervised learning in head CT, we employed both discrimination with self-distillation and masked image modeling, and we construct our model in 3D rather than at the slice level (2D) to exploit the structure of head CT scans more comprehensively and efficiently. The model’s downstream classification performance is evaluated using internal and three external datasets, encompassing both in-distribution (ID) and out-of-distribution (OOD) data. Our results demonstrate that the self-supervised foundation model significantly improves performance on downstream diagnostic tasks compared to models trained from scratch and previous 3D CT foundation models on scarce annotated datasets. This work highlights the effectiveness of self-supervised learning in medical imaging and sets a new benchmark for head CT image analysis in 3D, enabling broader use of artificial intelligence for head CT-based diagnosis.

[733] PoI: A Filter to Extract Pixel of Interest from Novel View Synthesis for Scene Coordinate Regression

Feifei Li, Qi Song, Chi Zhang, Hui Shuai, Rui Huang

Main category: cs.CV

TL;DR: A dual-criteria filtering method is proposed to improve camera pose estimation by filtering out blurry pixels from Neural View Synthesis (NVS) rendered images during training of Scene Coordinate Regression methods.

Details

Motivation: NVS techniques like NeRF and 3DGS can augment training data for camera pose estimation, but their rendered images suffer from blurring artifacts that undermine reliability, especially for pixel-level 3D coordinate estimation in SCR methods.

Method: Proposes a dual-criteria filtering mechanism that dynamically identifies and discards suboptimal pixels using two metrics: real-time SCR reprojection error and gradient threshold. Also introduces a coarse-to-fine PoI variant for sparse input scenarios using NVS-generated data.

Result: The method achieves state-of-the-art localization accuracy across indoor and outdoor benchmarks while maintaining computational efficiency.

Conclusion: The proposed dual-criteria filtering effectively addresses the blurring issues in NVS-rendered images, enabling reliable use of augmented data for camera pose estimation in both dense and sparse input scenarios.

Abstract: Novel View synthesis (NVS) techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), can augment camera pose estimation by extending training data with rendered images. However, the images rendered by these methods are often plagued by blurring, undermining their reliability as training data for camera pose estimation. This limitation is particularly critical for Scene Coordinate Regression (SCR) methods, which aim at pixel-level 3D coordinate estimation, because rendering artifacts directly lead to estimation inaccuracies. To address this challenge, we propose a dual-criteria filtering mechanism that dynamically identifies and discards suboptimal pixels during training. The dual-criteria filter evaluates two concurrent metrics: (1) real-time SCR reprojection error, and (2) gradient threshold, across the coordinate regression domain. In addition, for visual localization problems in sparse input scenarios, it will be even more necessary to use data generated by NVS to assist the localization task. We design a coarse-to-fine PoI variant using sparse input NVS to solve this problem. Experiments across indoor and outdoor benchmarks confirm our method’s efficacy. It achieves state-of-the-art localization accuracy while maintaining computational efficiency.

[734] Bidirectional Uncertainty-Aware Region Learning for Semi-Supervised Medical Image Segmentation

Shiwei Zhou, Xin Liu, Haifeng Zhao, Bin Luo, Dengdi Sun

Main category: cs.CV

TL;DR: Proposes a bidirectional uncertainty-aware region learning strategy for semi-supervised medical image segmentation that focuses on high-uncertainty regions in labeled data and low-uncertainty regions in unlabeled data to address erroneous pseudo-label accumulation.

Details

Motivation: To address the problem of erroneous pseudo-labels accumulating during semi-supervised medical image segmentation training, particularly in high-uncertainty regions, without discarding potentially valuable training data like traditional methods.

Method: Bidirectional uncertainty-aware region learning strategy: uses precise label information to guide learning in high-uncertainty regions for labeled data, and focuses on low-uncertainty regions for unlabeled data to reduce interference from erroneous pseudo-labels.

Result: Significant performance improvement achieved across different medical image segmentation tasks through extensive experiments.

Conclusion: The proposed bidirectional learning strategy effectively utilizes precise supervision from labeled data while stabilizing unlabeled data training, leading to substantial performance gains in semi-supervised medical image segmentation.

Abstract: In semi-supervised medical image segmentation, the poor quality of unlabeled data and the uncertainty in the model’s predictions lead to models that inevitably produce erroneous pseudo-labels. These errors accumulate throughout model training, thereby weakening the model’s performance. We found that these erroneous pseudo-labels are typically concentrated in high-uncertainty regions. Traditional methods improve performance by directly discarding pseudo-labels in these regions, which can also result in neglecting potentially valuable training data. To alleviate this problem, we propose a bidirectional uncertainty-aware region learning strategy to fully utilize the precise supervision provided by labeled data and stabilize the training of unlabeled data. Specifically, in the training labeled data, we focus on high-uncertainty regions, using precise label information to guide the model’s learning in potentially uncontrollable areas. Meanwhile, in the training of unlabeled data, we concentrate on low-uncertainty regions to reduce the interference of erroneous pseudo-labels on the model. Through this bidirectional learning strategy, the model’s overall performance has significantly improved. Extensive experiments show that our proposed method achieves significant performance improvement on different medical image segmentation tasks.

[735] IM360: Large-scale Indoor Mapping with 360 Cameras

Dongki Jung, Jaehoon Choi, Yonghan Lee, Dinesh Manocha

Main category: cs.CV

TL;DR: IM360 is a novel 3D mapping pipeline for large-scale indoor environments that uses omnidirectional images and spherical camera models to overcome occlusion and textureless challenges, achieving state-of-the-art performance in camera localization and 3D reconstruction.

Details

Motivation: Large-scale indoor environments present significant challenges including prevalent occlusions and textureless regions that make traditional 3D mapping approaches ineffective.

Method: The approach integrates spherical camera models into Structure-from-Motion pipeline using dense matching features for 360 images, and employs mesh-based neural rendering with texture optimization that combines diffuse and specular components.

Result: IM360 achieves 3.5 PSNR increase in textured mesh reconstruction and state-of-the-art performance in camera localization and registration on Matterport3D and Stanford2D3D datasets.

Conclusion: The proposed IM360 pipeline effectively addresses large-scale indoor mapping challenges and demonstrates superior performance in real-world scenarios through its integration of omnidirectional imaging and advanced texture optimization techniques.

Abstract: We present a novel 3D mapping pipeline for large-scale indoor environments. To address the significant challenges in large-scale indoor scenes, such as prevalent occlusions and textureless regions, we propose IM360, a novel approach that leverages the wide field of view of omnidirectional images and integrates the spherical camera model into the Structure-from-Motion (SfM) pipeline. Our SfM utilizes dense matching features specifically designed for 360 images, demonstrating superior capability in image registration. Furthermore, with the aid of mesh-based neural rendering techniques, we introduce a texture optimization method that refines texture maps and accurately captures view-dependent properties by combining diffuse and specular components. We evaluate our pipeline on large-scale indoor scenes, demonstrating its effectiveness in real-world scenarios. In practice, IM360 demonstrates superior performance, achieving a 3.5 PSNR increase in textured mesh reconstruction. We attain state-of-the-art performance in terms of camera localization and registration on Matterport3D and Stanford2D3D.

[736] VPNeXt – Rethinking Dense Decoding for Plain Vision Transformer

Xikai Tang, Ye Huang, Guangqiang Yin, Lixin Duan

Main category: cs.CV

TL;DR: VPNeXt is a simple Vision Transformer model that introduces Visual Context Replay (VCR) and ViTUp modules to address limitations in existing dense representation approaches, achieving state-of-the-art semantic segmentation performance.

Details

Motivation: To address two key concerns: (1) whether complex Transformer Mask Decoder architectures are necessary for good representations, and (2) whether Plain ViT really needs mock pyramid features for upsampling.

Method: Introduced Visual Context Replay (VCR) to efficiently achieve Transformer Decoder effects, and ViTUp module that utilizes ViT’s real pyramid features for better upsampling instead of mock pyramid features.

Result: Achieved state-of-the-art performance in semantic segmentation, significantly exceeding the long-established mIoU barrier on VOC2012 dataset with the largest improvement since 2015.

Conclusion: VPNeXt demonstrates that simple and effective designs can achieve superior performance in semantic segmentation for Plain ViT, setting new state-of-the-art benchmarks.

Abstract: We present VPNeXt, a new and simple model for the Plain Vision Transformer (ViT). Unlike the many related studies that share the same homogeneous paradigms, VPNeXt offers a fresh perspective on dense representation based on ViT. In more detail, the proposed VPNeXt addressed two concerns about the existing paradigm: (1) Is it necessary to use a complex Transformer Mask Decoder architecture to obtain good representations? (2) Does the Plain ViT really need to depend on the mock pyramid feature for upsampling? For (1), we investigated the potential underlying reasons that contributed to the effectiveness of the Transformer Decoder and introduced the Visual Context Replay (VCR) to achieve similar effects efficiently. For (2), we introduced the ViTUp module. This module fully utilizes the previously overlooked ViT real pyramid feature to achieve better upsampling results compared to the earlier mock pyramid feature. This represents the first instance of such functionality in the field of semantic segmentation for Plain ViT. We performed ablation studies on related modules to verify their effectiveness gradually. We conducted relevant comparative experiments and visualizations to show that VPNeXt achieved state-of-the-art performance with a simple and effective design. Moreover, the proposed VPNeXt significantly exceeded the long-established mIoU wall/barrier of the VOC2012 dataset, setting a new state-of-the-art by a large margin, which also stands as the largest improvement since 2015.

[737] Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks

Yi Xiao, Qiangqiang Yuan, Kui Jiang, Wenke Huang, Qiang Zhang, Tingting Zheng, Chia-Wen Lin, Liangpei Zhang

Main category: cs.CV

TL;DR: SpikeSR is a spiking neural network-based super-resolution method for remote sensing images that uses spiking attention blocks to achieve state-of-the-art performance with high computational efficiency.

Details

Motivation: SNNs offer biological plausibility and energy efficiency but have limited capacity and representation power, remaining underexplored in remote sensing super-resolution tasks. The observation that spiking signals show drastic intensity variations across textures indicates active learning states, motivating their use for efficient SR.

Method: Proposed spiking attention block (SAB) that optimizes membrane potentials through inferred attention weights to regulate spiking activity for better feature representation. Bridges temporal and channel dimension modulation and accesses global self-similar patterns in large-scale RS imagery to infer spatial attention weights.

Result: SpikeSR achieves state-of-the-art performance across various remote sensing benchmarks (AID, DOTA, DIOR) while maintaining high computational efficiency.

Conclusion: The proposed SpikeSR method successfully applies SNNs to remote sensing super-resolution tasks, demonstrating superior performance and efficiency through innovative spiking attention mechanisms.

Abstract: Spiking neural networks (SNNs) are emerging as a promising alternative to traditional artificial neural networks (ANNs), offering biological plausibility and energy efficiency. Despite these merits, SNNs are frequently hampered by limited capacity and insufficient representation power, yet remain underexplored in remote sensing super-resolution (SR) tasks. In this paper, we first observe that spiking signals exhibit drastic intensity variations across diverse textures, highlighting an active learning state of the neurons. This observation motivates us to apply SNNs for efficient SR of RSIs. Inspired by the success of attention mechanisms in representing salient information, we devise the spiking attention block (SAB), a concise yet effective component that optimizes membrane potentials through inferred attention weights, which, in turn, regulates spiking activity for superior feature representation. Our key contributions include: 1) we bridge the independent modulation between temporal and channel dimensions, facilitating joint feature correlation learning, and 2) we access the global self-similar patterns in large-scale remote sensing imagery to infer spatial attention weights, incorporating effective priors for realistic and faithful reconstruction. Building upon SAB, we proposed SpikeSR, which achieves state-of-the-art performance across various remote sensing benchmarks such as AID, DOTA, and DIOR, while maintaining high computational efficiency. The code of SpikeSR will be available upon paper acceptance.

[738] High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy

Xianjie Liu, Keren Fu, Qijun Zhao

Main category: cs.CV

TL;DR: PDFNet uses pseudo depth information to improve dichotomous image segmentation by leveraging depth integrity-prior, achieving state-of-the-art performance with much fewer parameters than diffusion methods.

Details

Motivation: Existing DIS methods face a trade-off: non-diffusion methods are efficient but inaccurate due to weak semantics, while diffusion methods are accurate but computationally expensive. Pseudo depth information can provide essential semantic understanding to bridge this gap.

Method: Proposed PDFNet with multimodal interactive modeling to fuse RGB and pseudo depth features, depth integrity-prior loss to enforce depth consistency, and fine-grained perception enhancement module with adaptive patch selection for boundary refinement.

Result: PDFNet achieves state-of-the-art performance with only 94M parameters (<11% of diffusion-based models), outperforming all non-diffusion methods and surpassing some diffusion methods.

Conclusion: The depth integrity-prior from pseudo depth maps provides effective spatial understanding for DIS, enabling high accuracy with computational efficiency through the proposed PDFNet framework.

Abstract: High-precision dichotomous image segmentation (DIS) is a task of extracting fine-grained objects from high-resolution images. Existing methods face a dilemma: non-diffusion methods work efficiently but suffer from false or missed detections due to weak semantics and less robust spatial priors; diffusion methods, using strong generative priors, have high accuracy but encounter high computational burdens. As a solution, we find pseudo depth information from monocular depth estimation models can provide essential semantic understanding that quickly reveals spatial differences across target objects and backgrounds. Inspired by this phenomenon, we discover a novel insight we term the depth integrity-prior: in pseudo depth maps, foreground objects consistently convey stable depth values with much lower variances than chaotic background patterns. To exploit such a prior, we propose a Prior of Depth Fusion Network (PDFNet). Specifically, our network establishes multimodal interactive modeling to achieve depth-guided structural perception by deeply fusing RGB and pseudo depth features. We further introduce a novel depth integrity-prior loss to explicitly enforce depth consistency in segmentation results. Additionally, we design a fine-grained perception enhancement module with adaptive patch selection to perform boundary-sensitive detail refinement. Notably, PDFNet achieves state-of-the-art performance with only 94M parameters (<11% of those diffusion-based models), outperforming all non-diffusion methods and surpassing some diffusion methods. Code is provided in the supplementary materials.

[739] Text2Story: Advancing Video Storytelling with Text Guidance

Taewon Kang, Divya Kothandaraman, Ming C. Lin

Main category: cs.CV

TL;DR: A novel framework for generating coherent long-form videos from text by integrating scene and action prompts through dynamics-inspired prompt mixing, bidirectional time-weighted latent blending, and semantic action representation.

Details

Motivation: Addressing the challenge of long-form video synthesis from text, which remains largely unexplored due to difficulties in temporal coherency, semantic preservation, and maintaining scene context and action continuity across extended sequences.

Method: Uses bidirectional time-weighted latent blending for temporal consistency, dynamics-informed prompt weighting (DIPW) to balance scene and action prompts, and semantic action representation for motion continuity. Latent space blending maintains spatial coherence while time-weighted blending enforces bidirectional temporal constraints.

Result: Significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without additional training. The system prevents abrupt transitions while ensuring fluid storytelling that faithfully reflects both scene and action cues.

Conclusion: Bridges the gap between short clips and extended video, establishing a new paradigm in GenAI-driven video synthesis from text through an integrative approach that maintains coherence across long-form sequences.

Abstract: Generating coherent long-form video sequences from discrete input using only text prompts is a critical task in content creation. While diffusion-based models excel at short video synthesis, long-form storytelling from text remains largely unexplored and a challenge due to difficulties in temporal coherency, preserving semantic meaning, and maintaining both scene context and action continuity across the video. We introduce a novel storytelling framework that achieves this by integrating scene and action prompts through dynamics-inspired prompt mixing. Specifically, we first present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video being generated. We then propose a dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances the influence of scene and action prompts at each diffusion timestep by jointly considering CLIP-based alignment, narrative continuity, and temporal smoothness. To further enhance motion continuity, we incorporate a semantic action representation to encode high-level action semantics into the blending process, dynamically adjusting transitions based on action similarity and ensuring smooth yet adaptable motion changes. Latent space blending maintains spatial coherence between objects in a scene, while time-weighted blending enforces bidirectional constraints for temporal consistency. The resulting integrative system prevents abrupt transitions while ensuring fluid storytelling that faithfully reflects both scene and action cues. Extensive experiments demonstrate significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without any additional training. This approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.

[740] Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization

Michael Green, Matan Levy, Issar Tzachor, Dvir Samuel, Nir Darshan, Rami Ben-Ari

Main category: cs.CV

TL;DR: A novel framework called Multi-object Attention Optimization (MaO) for Small Object Image Retrieval that uses multi-object pre-training and attention-based feature extraction with object masks to create unified image descriptors.

Details

Motivation: Address the challenge of retrieving images containing specific small objects in cluttered scenes, where existing methods struggle to construct effective single image descriptors that represent all objects.

Method: Multi-object pre-training phase followed by attention-based feature extraction with object masks, integrating features into a single unified image descriptor.

Result: Significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both zero-shot and lightweight multi-object fine-tuning.

Conclusion: The MaO approach provides a strong foundation for enhancing retrieval performance in practical small object image retrieval tasks.

Abstract: We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object, in a cluttered scene. The key challenge in this setting is constructing a single image descriptor, for scalable and efficient search, that effectively represents all objects in the image. In this paper, we first analyze the limitations of existing methods on this challenging task and then introduce new benchmarks to support SoIR evaluation. Next, we introduce Multi-object Attention Optimization (MaO), a novel retrieval framework which incorporates a dedicated multi-object pre-training phase. This is followed by a refinement process that leverages attention-based feature extraction with object masks, integrating them into a single unified image descriptor. Our MaO approach significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both zero-shot and lightweight multi-object fine-tuning. We hope this work will lay the groundwork and inspire further research to enhance retrieval performance for this highly practical task. Code and Data are available on our project page: $\href{https://pihash2k.github.io/findyourneedle.github.io}{https://pihash2k.github.io/findyourneedle.github.io}$.

[741] Exploring Reprensentation Invariance in Finetuning

Wenqiang Zu, Shenghao Xie, Hao Chen, Zhiqiang Chen, Liwen Hu, Yuanhao Xi, Yiming Liang, Junliang Ye, Bo Lei, Tiejun Huang, Guoqi Li, Lei Ma

Main category: cs.CV

TL;DR: RIFT is a regularization method that preserves pretrained representations during finetuning by maximizing similarity between pretrained and finetuned models using orthogonal invariance.

Details

Motivation: Pretrained foundation models lose their generalizable representations during finetuning, degrading model performance on downstream tasks.

Method: Representation Invariance FineTuning (RIFT) - a regularization that maximizes representation similarity between pretrained and finetuned models using orthogonal invariance of manifolds.

Result: RIFT is compatible with mainstream finetuning methods, offers competitive or enhanced performance, and better preserves model generalizability.

Conclusion: Downstream tasks can be effectively adapted without sacrificing the benefits of pretrained representations using RIFT regularization.

Abstract: Foundation models pretrained on large-scale natural images are widely adapted to various cross-domain low-resource downstream tasks, benefiting from generalizable and transferable patterns captured by their representations. However, these representations are later found to gradually vanish during finetuning, accompanied by a degradation of model’s original generalizability. In this paper, we argue that such tasks can be effectively adapted without sacrificing the benefits of pretrained representations. We approach this by introducing \textit{Representation Invariance FineTuning (RIFT)}, a regularization that maximizes the representation similarity between pretrained and finetuned models by leveraging orthogonal invariance of manifolds in a computationally efficient way. Experiments demonstrate that our method is compatible with mainstream finetuning methods, offering competitive or even enhanced performance and better preservation of the generalizability.

[742] Controllable Adversarial Makeup for Privacy via Text-Guided Diffusion

Youngjin Kwon, Xiao Zhang

Main category: cs.CV

TL;DR: MASQUE is a diffusion-based framework that generates localized adversarial makeups guided by user-defined text prompts to protect privacy from facial recognition systems, achieving robust dodging performance with high perceptual fidelity.

Details

Motivation: To address privacy concerns from widespread facial recognition usage by developing effective anti-facial recognition techniques that overcome limitations of existing generative makeup-based approaches, which have weak dodging success rates and introduce visual artifacts.

Method: Built on precise null-text inversion, customized cross-attention fusion with masking, and pairwise adversarial guidance mechanism using images of the same individual to generate localized adversarial makeups guided by text prompts.

Result: MASQUE significantly improves dodging success rates over all baselines, with higher perceptual fidelity preservation, stronger adaptability to various makeup prompts, and robustness to image transformations, as demonstrated in comprehensive evaluations on open-source models and commercial APIs.

Conclusion: MASQUE provides an effective privacy protection solution against facial recognition systems through localized adversarial makeup generation that balances dodging performance with visual quality and user customization.

Abstract: As face recognition becomes more widespread in government and commercial services, its potential misuse raises serious concerns about privacy and civil rights. To counteract this threat, various anti-facial recognition techniques have been proposed, which protect privacy by adversarially perturbing face images. Among these, generative makeup-based approaches are the most widely studied. However, these methods, designed primarily to impersonate specific target identities, can only achieve weak dodging success rates while increasing the risk of targeted abuse. In addition, they often introduce global visual artifacts or a lack of adaptability to accommodate diverse makeup prompts, compromising user satisfaction. To address the above limitations, we develop MASQUE, a novel diffusion-based framework that generates localized adversarial makeups guided by user-defined text prompts. Built upon precise null-text inversion, customized cross-attention fusion with masking, and a pairwise adversarial guidance mechanism using images of the same individual, MASQUE achieves robust dodging performance without requiring any external identity. Comprehensive evaluations on open-source facial recognition models and commercial APIs demonstrate that MASQUE significantly improves dodging success rates over all baselines, along with higher perceptual fidelity preservation, stronger adaptability to various makeup prompts, and robustness to image transformations.

[743] A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis

Asifullah Khan, Laiba Asmatullah, Anza Malik, Shahzaib Khan, Hamna Asif

Main category: cs.CV

TL;DR: This paper provides a comprehensive survey of contrastive learning in text-image models, covering terminology, recent developments, categorization by model structures, technical advances, and state-of-the-art applications.

Details

Motivation: Self-supervised learning enables learning from unlabeled data by extracting discriminative features. Contrastive learning has shown significant improvements in image understanding and text-image analysis without heavy reliance on labeled data.

Method: The paper categorizes approaches based on different model structures, discusses pretext tasks for both images and text, analyzes architectural structures, and identifies key trends in contrastive learning for text-image models.

Result: The survey provides an overview of recent developments in contrastive learning approaches for text-image models, including the latest advances in techniques and architectural structures.

Conclusion: The paper comprehensively discusses the state-of-the-art applications of self-supervised contrastive learning in text-image based models, highlighting recent advances and trends in this rapidly evolving field.

Abstract: Self-supervised learning is a machine learning approach that generates implicit labels by learning underlined patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of “positive” and “negative” samples, where positive pairs (e.g., variation of the same image/object) are brought together in the embedding space, and negative pairs (e.g., views from different images/objects) are pushed farther away. This methodology has shown significant improvements in image understanding and image text analysis without much reliance on labeled data. In this paper, we comprehensively discuss the terminologies, recent developments and applications of contrastive learning with respect to text-image models. Specifically, we provide an overview of the approaches of contrastive learning in text-image models in recent years. Secondly, we categorize the approaches based on different model structures. Thirdly, we further introduce and discuss the latest advances of the techniques used in the process such as pretext tasks for both images and text, architectural structures, and key trends. Lastly, we discuss the recent state-of-art applications of self-supervised contrastive learning Text-Image based models.

[744] Interpretable 3D Neural Object Volumes for Robust Conceptual Reasoning

Nhi Pham, Artur Jesslen, Bernt Schiele, Adam Kortylewski, Jonas Fischer

Main category: cs.CV

TL;DR: CAVE is a concept-aware volume-based classifier that unifies interpretability and robustness by learning sparse concepts from 3D object representations, achieving competitive performance with meaningful concepts across out-of-distribution settings.

Details

Motivation: To address the gap in interpretability for 3D-aware classifiers and the lack of OOD robustness in current concept-based XAI methods, aiming to unify both interpretability and robustness in image classification.

Method: Design CAVE as a robust and inherently interpretable classifier that learns sparse concepts from 3D object representation, and propose 3D Consistency (3D-C) metric using ground-truth object meshes to measure spatial consistency of concepts.

Result: CAVE achieves competitive classification performance while discovering consistent and meaningful concepts across images in various OOD settings.

Conclusion: CAVE successfully bridges the gap between interpretability and robustness in 3D-aware classifiers, providing a unified framework that maintains competitive performance while offering meaningful concept explanations across out-of-distribution scenarios.

Abstract: With the rise of deep neural networks, especially in safety-critical applications, robustness and interpretability are crucial to ensure their trustworthiness. Recent advances in 3D-aware classifiers that map image features to volumetric representation of objects, rather than relying solely on 2D appearance, have greatly improved robustness on out-of-distribution (OOD) data. Such classifiers have not yet been studied from the perspective of interpretability. Meanwhile, current concept-based XAI methods often neglect OOD robustness. We aim to address both aspects with CAVE - Concept Aware Volumes for Explanations - a new direction that unifies interpretability and robustness in image classification. We design CAVE as a robust and inherently interpretable classifier that learns sparse concepts from 3D object representation. We further propose 3D Consistency (3D-C), a metric to measure spatial consistency of concepts. Unlike existing metrics that rely on human-annotated parts on images, 3D-C leverages ground-truth object meshes as a common surface to project and compare explanations across concept-based methods. CAVE achieves competitive classification performance while discovering consistent and meaningful concepts across images in various OOD settings. Code available at https://github.com/phamleyennhi/CAVE.

[745] DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework

Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar Jr., Xiangyang Ji, Xu-Cheng Yin

Main category: cs.CV

TL;DR: DPFlow is a new adaptive optical flow architecture that generalizes to 8K resolution inputs while trained only on low-resolution samples, plus a new Kubric-NK benchmark for high-resolution optical flow evaluation.

Details

Motivation: Current optical flow methods are designed for low resolution and don't generalize well to high-resolution inputs (up to 8K), requiring downscaling or tiling that loses details and global information. There's also a lack of proper benchmarks for high-resolution optical flow evaluation.

Method: Proposed DPFlow - an adaptive optical flow architecture capable of handling up to 8K resolution inputs while being trained with only low-resolution samples. Also introduced Kubric-NK benchmark for evaluating optical flow methods across 1K to 8K resolutions.

Result: DPFlow achieves state-of-the-art results on MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks. The high-resolution evaluation reveals new insights about existing methods’ generalization capabilities.

Conclusion: DPFlow successfully addresses the high-resolution optical flow estimation problem by providing an adaptive architecture that generalizes well to 8K inputs, and the Kubric-NK benchmark enables proper evaluation of optical flow methods at high resolutions.

Abstract: Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.

[746] Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

Keda Tao, Haoxuan You, Yang Sui, Can Qin, Huan Wang

Main category: cs.CV

TL;DR: VidKV is a plug-and-play KV cache quantization method that compresses VideoLLM KV cache to lower than 2 bits (1.5-bit for key, 1.58-bit for value) with minimal performance loss.

Details

Motivation: VideoLLMs process thousands of visual tokens from video frames, causing KV cache to significantly increase memory requirements and become a bottleneck for inference speed and memory usage. Existing 2-bit KV quantization works well but the limits of lower-bit quantization haven't been explored.

Method: For key: mixed-precision quantization with 2-bit for anomalous channels and 1-bit + FFT for normal channels. For value: 1.58-bit quantization with selective filtering of semantically salient visual tokens. Uses per-channel quantization for value cache instead of per-token approach.

Result: Extensive experiments with LLaVA-OV-7B and Qwen2.5-VL-7B on six benchmarks show VidKV effectively compresses KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop compared to FP16 baselines.

Conclusion: VidKV enables efficient VideoLLM inference by compressing KV cache to ultra-low bits (below 2 bits) while maintaining model performance, addressing the memory bottleneck issue in video processing.

Abstract: Video large language models (VideoLLMs) have demonstrated the capability to process longer video inputs and enable complex reasoning and analysis. However, due to the thousands of visual tokens from the video frames, the key-value (KV) cache can significantly increase memory requirements, becoming a bottleneck for inference speed and memory usage. KV cache quantization is a widely used approach to address this problem. In this paper, we find that 2-bit KV quantization of VideoLLMs can hardly hurt the model performance, while the limit of KV cache quantization in even lower bits has not been investigated. To bridge this gap, we introduce VidKV, a plug-and-play KV cache quantization method to compress the KV cache to lower than 2 bits. Specifically, (1) for key, we propose a mixed-precision quantization strategy in the channel dimension, where we perform 2-bit quantization for anomalous channels and 1-bit quantization combined with FFT for normal channels; (2) for value, we implement 1.58-bit quantization while selectively filtering semantically salient visual tokens for targeted preservation, for a better trade-off between precision and model performance. Importantly, our findings suggest that the value cache of VideoLLMs should be quantized in a per-channel fashion instead of the per-token fashion proposed by prior KV cache quantization works for LLMs. Empirically, extensive results with LLaVA-OV-7B and Qwen2.5-VL-7B on six benchmarks show that VidKV effectively compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop compared to the FP16 counterparts.

[747] Efficient Self-Supervised Adaptation for Medical Image Analysis

Moein Sorkhei, Emir Konuk, Jingyu Guo, Chanjuan Meng, Christos Matsoukas, Kevin Smith

Main category: cs.CV

TL;DR: ESSA framework applies parameter-efficient fine-tuning to self-supervised adaptation, achieving better performance than full-parameter SSA and supervised fine-tuning while reducing computational costs.

Details

Motivation: Self-supervised adaptation improves foundation model transfer to medical domains but is computationally expensive. Parameter-efficient methods like LoRA work for supervised adaptation but their effectiveness for SSA was unknown.

Method: Proposed efficient self-supervised adaptation (ESSA) framework that applies parameter-efficient fine-tuning techniques to SSA. Tested various methods including Attention Projection Layer Adaptation (APLA).

Result: APLA achieved state-of-the-art performance, consistently surpassing full-parameter SSA and supervised fine-tuning across diverse medical tasks. Reduced GPU memory by up to 40.1% and increased training throughput by 25.2% while maintaining inference efficiency.

Conclusion: Parameter-efficient fine-tuning can be effectively applied to self-supervised adaptation, achieving better performance with significantly reduced computational costs in medical domain applications.

Abstract: Self-supervised adaptation (SSA) improves foundation model transfer to medical domains but is computationally prohibitive. Although parameter efficient fine-tuning methods such as LoRA have been explored for supervised adaptation, their effectiveness for SSA remains unknown. In this work, we introduce efficient self-supervised adaptation (ESSA), a framework that applies parameter-efficient fine-tuning techniques to SSA with the aim of reducing computational cost and improving adaptation performance. Among the methods tested, Attention Projection Layer Adaptation (APLA) sets a new state-of-the-art, consistently surpassing full-parameter SSA and supervised fine-tuning across diverse medical tasks, while reducing GPU memory by up to 40.1% and increasing training throughput by 25.2%, all while maintaining inference efficiency.

[748] Audio-centric Video Understanding Benchmark without Text Shortcut

Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang

Main category: cs.CV

TL;DR: AVUT is an audio-centric video understanding benchmark that evaluates multimodal LLMs’ video comprehension with focus on auditory information, addressing text shortcut problems through answer permutation filtering.

Details

Motivation: Current audio-visual LLMs treat audio as auxiliary modality, but thorough video understanding depends on auditory information which provides critical context, emotional cues, and semantic meaning that visual data alone lacks.

Method: Proposes AVUT benchmark with carefully designed audio-centric tasks testing audio content understanding and audio-visual interactions, plus an answer permutation-based filtering mechanism to address text shortcut problems.

Result: Comprehensive evaluation across diverse open-source and proprietary multimodal LLMs reveals deficiencies in audio-visual LLMs’ capabilities.

Conclusion: AVUT provides a robust benchmark for evaluating audio-centric video understanding in multimodal LLMs, highlighting the importance of auditory information and addressing existing benchmark limitations.

Abstract: Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. AVUT introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. Moreover, this work points out the text shortcut problem that largely exists in other benchmarks where the correct answer can be found from question text alone without needing videos. AVUT addresses this problem by proposing a answer permutation-based filtering mechanism. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by the analyses of deficiencies in audio-visual LLMs. Demos and data are available at https://github.com/lark-png/AVUT.

[749] Beyond Synthetic Replays: Turning Diffusion Features into Few-Shot Class-Incremental Learning Knowledge

Junsu Kim, Yunhoe Ku, Dongyoon Han, Seungryul Baek

Main category: cs.CV

TL;DR: Diffusion-FSCIL uses Stable Diffusion as a unified backbone for few-shot class-incremental learning, extracting four synergistic feature types from SD to address data scarcity and catastrophic forgetting without requiring synthetic buffers or separate classification backbones.

Details

Motivation: Existing approaches use Stable Diffusion mainly as a replay generator, but the authors demonstrate that SD's rich multi-scale representations can serve as a unified backbone for FSCIL, offering a more integrated solution.

Method: Extracts four feature types from SD: real image characteristics through inversion, semantic diversity via class-conditioned synthesis, enhanced generalization through controlled noise injection, and replay without image storage through generative features. Operates entirely in latent space with lightweight networks (~6M parameters).

Result: State-of-the-art performance on CUB-200, miniImageNet, and CIFAR-100. Comprehensive ablations confirm the necessity of each feature type. Streamlined variant maintains competitive accuracy while substantially improving efficiency.

Conclusion: Establishes the viability of generative models as practical and effective backbones for FSCIL, offering a unified framework that outperforms conventional approaches requiring synthetic buffers and separate classification backbones.

Abstract: Few-shot class-incremental learning (FSCIL) is challenging due to extremely limited training data while requiring models to acquire new knowledge without catastrophic forgetting. Recent works have explored generative models, particularly Stable Diffusion (SD), to address these challenges. However, existing approaches use SD mainly as a replay generator, whereas we demonstrate that SD’s rich multi-scale representations can serve as a unified backbone. Motivated by this observation, we introduce Diffusion-FSCIL, which extracts four synergistic feature types from SD by capturing real image characteristics through inversion, providing semantic diversity via class-conditioned synthesis, enhancing generalization through controlled noise injection, and enabling replay without image storage through generative features. Unlike conventional approaches requiring synthetic buffers and separate classification backbones, our unified framework operates entirely in the latent space with only lightweight networks ($\approx$6M parameters). Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 demonstrate state-of-the-art performance, with comprehensive ablations confirming the necessity of each feature type. Furthermore, we confirm that our streamlined variant maintains competitive accuracy while substantially improving efficiency, establishing the viability of generative models as practical and effective backbones for FSCIL.

[750] SCRAMBLe : Enhancing Multimodal LLM Compositionality with Synthetic Preference Data

Samarth Mishra, Kate Saenko, Venkatesh Saligrama

Main category: cs.CV

TL;DR: SCRAMBLe improves compositional reasoning in MLLMs through synthetic preference tuning on binary caption choices, achieving state-of-the-art performance on Winoground and general VQA tasks.

Details

Motivation: Current MLLMs struggle with compositional reasoning (e.g., distinguishing 'dog chasing cat' vs 'cat chasing dog') and perform significantly worse than humans on benchmarks like Winoground.

Method: Synthetic Compositional Reasoning Augmentation with Binary preference Learning (SCRAMBLe) - preference tuning open-weight MLLMs on automatically generated synthetic preference data from existing image-caption pairs.

Result: SCRAMBLe-tuned Molmo-7B improves Winoground performance from 49.5% to 54.8% (best reported), with ~1% improvement on general visual question answering tasks.

Conclusion: Compositional reasoning in MLLMs can be effectively improved through data-driven preference tuning on synthetic compositional examples, with benefits extending to general vision-language tasks.

Abstract: Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state of the art MLLMs such as GPT-4o can make mistakes in distinguishing compositions like “dog chasing cat” vs “cat chasing dog”. While on Winoground, a benchmark for measuring such reasoning, MLLMs have made significant progress, they are still far from a human’s performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, where a model is trained to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference tuning open-weight MLLMs on synthetic preference data generated in a fully automated manner from existing image-caption data. SCRAMBLe holistically improves these MLLMs’ compositional reasoning capabilities which we can see through significant improvements across multiple vision language compositionality benchmarks, as well as smaller but significant improvements on general question answering tasks. As a sneak peek, SCRAMBLe tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (best reported to date), while improving by ~1% on more general visual question answering tasks. Code for SCRAMBLe along with tuned models and our synthetic training dataset is available at https://github.com/samarth4149/SCRAMBLe.

[751] From Specificity to Generality: Revisiting Generalizable Artifacts in Detecting Face Deepfakes

Long Ma, Zhiyuan Yan, Jin Xu, Yize Chen, Qinglang Guo, Zhen Bi, Yong Liao, Hui Lin

Main category: cs.CV

TL;DR: The paper proposes a universal deepfake detection framework that categorizes artifacts into Face Inconsistency Artifacts (FIA) and Up-Sampling Artifacts (USA), and creates pseudo-fake data with only these general artifacts to train detectors that generalize well to unseen deepfakes.

Details

Motivation: To build a universal detection framework effective for most facial deepfakes, addressing the challenge of diverse forgery artifacts from various generators by focusing on common general artifacts rather than learning all specific artifacts separately.

Method: Categorizes artifacts into FIA (inconsistencies between facial features and surroundings) and USA (traces from generator’s up-sampling). Uses data-level pseudo-fake creation: super-resolution for USA and Blender module with image-level self-blending for FIA.

Result: A standard image classifier trained only with pseudo-fake data non-trivially generalizes well to unseen deepfakes, showing effectiveness of focusing on general artifacts.

Conclusion: The proposed framework successfully identifies and leverages two fundamental artifact types (FIA and USA) to create universal deepfake detectors that generalize across different generation methods without needing to learn all specific artifacts.

Abstract: Detecting deepfakes has been an increasingly important topic, especially given the rapid development of AI generation techniques. In this paper, we ask: How can we build a universal detection framework that is effective for most facial deepfakes? One significant challenge is the wide variety of deepfake generators available, resulting in varying forgery artifacts (e.g., lighting inconsistency, color mismatch, etc). But should we ``teach" the detector to learn all these artifacts separately? It is impossible and impractical to elaborate on them all. So the core idea is to pinpoint the more common and general artifacts across different deepfakes. Accordingly, we categorize deepfake artifacts into two distinct yet complementary types: Face Inconsistency Artifacts (FIA) and Up-Sampling Artifacts (USA). FIA arise from the challenge of generating all intricate details, inevitably causing inconsistencies between the complex facial features and relatively uniform surrounding areas. USA, on the other hand, are the inevitable traces left by the generator’s decoder during the up-sampling process. This categorization stems from the observation that all existing deepfakes typically exhibit one or both of these artifacts. To achieve this, we propose a new data-level pseudo-fake creation framework that constructs fake samples with only the FIA and USA, without introducing extra less-general artifacts. Specifically, we employ a super-resolution to simulate the USA, while design a Blender module that uses image-level self-blending on diverse facial regions to create the FIA. We surprisingly found that, with this intuitive design, a standard image classifier trained only with our pseudo-fake data can non-trivially generalize well to unseen deepfakes.

[752] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang

Main category: cs.CV

TL;DR: This paper explores Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, developing VideoChat-R1 which achieves state-of-the-art performance on spatio-temporal perception tasks while maintaining general capabilities.

Details

Motivation: While reinforcement learning approaches like GRPO show promise in text and image domains, their application to video understanding remains limited. The paper aims to enhance spatio-temporal perception in video MLLMs while maintaining general capabilities.

Method: The paper uses Reinforcement Fine-Tuning (RFT) with Group Relative Policy Optimization (GRPO) for video MLLMs, conducting multi-task RFT on spatio-temporal perception objectives with limited samples.

Result: VideoChat-R1 achieves state-of-the-art performance on spatio-temporal perception tasks, with significant improvements over Qwen2.5-VL-7B in temporal grounding (+31.8) and object tracking (+31.2), while also improving on general QA benchmarks like VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9).

Conclusion: RFT is highly data-efficient for task-specific improvements in video MLLMs, and the work demonstrates the potential of RFT for specialized task enhancement while offering valuable insights for future RL research in video MLLMs.

Abstract: Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.

[753] Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization

Pritam Sarkar, Ali Etemad

Main category: cs.CV

TL;DR: A self-alignment framework for Large Video Language Models that enables learning from errors through Refined Regularized Preference Optimization (RRPO), improving temporal understanding and reducing hallucinations.

Details

Motivation: LVLMs struggle with fine-grained temporal understanding, hallucinate, and make simple mistakes, posing challenges for real-world deployment.

Method: Self-alignment framework with preferred/non-preferred response pairs and RRPO method using sub-sequence-level refined rewards and token-wise KL regularization.

Result: RRPO achieves more precise alignment and stable training compared to DPO, with effectiveness validated across diverse video tasks.

Conclusion: The proposed self-alignment framework with RRPO effectively addresses LVLM limitations in temporal understanding and reduces hallucinations.

Abstract: Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make simple mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.

[754] HSACNet: Hierarchical Scale-Aware Consistency Regularized Semi-Supervised Change Detection

Qi’ao Xu, Pengfei Wang, Yanjun Li, Tianwen Qian, Xiaoling Wang

Main category: cs.CV

TL;DR: HSACNet is a hierarchical scale-aware consistency regularized network for semi-supervised change detection that integrates SAM2 backbone, scale-aware attention, and dual-augmentation consistency to handle complex scenarios with limited labeled data.

Details

Motivation: Existing semi-supervised change detection methods struggle in complex scenarios with noisy data, neglect intra-layer multi-scale features while over-emphasizing inter-layer fusion, which harms the integrity of change objects at different scales.

Method: Integrates Segment Anything Model 2 (SAM2) with Hiera backbone as encoder for inter-layer multi-scale features, uses adapters for parameter-efficient fine-tuning, designs Scale-Aware Differential Attention Module (SADAM) for intra-layer multi-scale change features and noise suppression, and employs dual-augmentation consistency regularization for unlabeled data utilization.

Result: Extensive experiments across four change detection benchmarks demonstrate state-of-the-art performance with reduced parameters and computational cost.

Conclusion: HSACNet effectively addresses limitations of existing methods by combining hierarchical scale-aware feature extraction, attention mechanisms for noise suppression, and consistency regularization, achieving superior performance in semi-supervised change detection.

Abstract: Semi-supervised change detection (SSCD) aims to detect changes between bi-temporal remote sensing images by utilizing limited labeled data and abundant unlabeled data. Existing methods struggle in complex scenarios, exhibiting poor performance when confronted with noisy data. They typically neglect intra-layer multi-scale features while emphasizing inter-layer fusion, harming the integrity of change objects with different scales. In this paper, we propose HSACNet, a Hierarchical Scale-Aware Consistency regularized Network for SSCD. Specifically, we integrate Segment Anything Model 2 (SAM2), using its Hiera backbone as the encoder to extract inter-layer multi-scale features and applying adapters for parameter-efficient fine-tuning. Moreover, we design a Scale-Aware Differential Attention Module (SADAM) that can precisely capture intra-layer multi-scale change features and suppress noise. Additionally, a dual-augmentation consistency regularization strategy is adopted to effectively utilize the unlabeled data. Extensive experiments across four CD benchmarks demonstrate that our HSACNet achieves state-of-the-art performance, with reduced parameters and computational cost.

[755] WMKA-Net: A Weighted Multi-Kernel Attention Network for Retinal Vessel Segmentation

Xinran Xu, Yuliang Ma, Sifu Cai, Ming Meng, Qiang Lv, Ruoyan Shi

Main category: cs.CV

TL;DR: WMKA-Net proposes a dual-stage approach for retinal vessel segmentation using reversible multi-scale fusion and vascular-oriented attention to address feature fusion, contextual continuity, and noise challenges.

Details

Motivation: Address three major challenges in retinal vessel segmentation: insufficient multi-scale feature fusion, disruption of contextual continuity, and noise interference for intelligent ophthalmic diagnosis.

Method: Two-stage approach: 1) Reversible Multi-Scale Fusion Module (RMS) with hierarchical adaptive convolution for cross-scale feature merging and bias calibration; 2) Vascular-Oriented Attention Mechanism with axial pathway for long-distance continuity and bifurcation attention pathway for topological key nodes.

Result: Achieves accuracy of 0.9909, sensitivity of 0.9198, and specificity of 0.9953 on DRIVE, STARE, and CHASE-DB1 datasets, significantly outperforming existing methods.

Conclusion: Provides an efficient, precise, and robust intelligent solution for early screening of diabetic retinopathy by effectively restoring vascular continuity and improving segmentation accuracy.

Abstract: Retinal vessel segmentation is crucial for intelligent ophthalmic diagnosis, yet it faces three major challenges: insufficient multi-scale feature fusion, disruption of contextual continuity, and noise interference. This study proposes a dual-stage solution to address these issues. The first stage employs a Reversible Multi-Scale Fusion Module (RMS) that uses hierarchical adaptive convolution to dynamically merge cross-scale features from capillaries to main vessels, self-adaptively calibrating feature biases. The second stage introduces a Vascular-Oriented Attention Mechanism, which models long-distance vascular continuity through an axial pathway and enhances the capture of topological key nodes, such as bifurcation points, via a dedicated bifurcation attention pathway. The synergistic operation of these two pathways effectively restores the continuity of vascular structures and improves the segmentation accuracy of complex vascular networks. Systematic experiments on the DRIVE, STARE, and CHASE-DB1 datasets demonstrate that WMKA-Net achieves an accuracy of 0.9909, sensitivity of 0.9198, and specificity of 0.9953, significantly outperforming existing methods. This model provides an efficient, precise, and robust intelligent solution for the early screening of diabetic retinopathy.

[756] Model-based Metric 3D Shape and Motion Reconstruction of Wild Bottlenose Dolphins in Drone-Shot Videos

Daniele Baieri, Riccardo Cicciarella, Michael Krützen, Emanuele Rodolà, Silvia Zuffi

Main category: cs.CV

TL;DR: A model-based approach for estimating 3D shape and motion of wild dolphins from monocular video to assess body condition, incorporating water occlusion modeling.

Details

Motivation: Aquatic animals remain unexplored for 3D reconstruction due to underwater observation difficulties, while terrestrial animals have seen considerable progress.

Method: Model-based approach with transmission model to account for water-induced occlusion, applied to video captured under different sea conditions.

Result: Estimated mass and volume, compared with manual 2D measurements. While manual approach was often more accurate, the proposed method showed advantage for larger specimens.

Conclusion: The method demonstrates potential as a scalable and automated alternative for mass and volume estimation of dolphins from monocular video.

Abstract: We address the problem of estimating the metric 3D shape and motion of wild dolphins from monocular video, with the aim of assessing their body condition. While considerable progress has been made in reconstructing 3D models of terrestrial quadrupeds, aquatic animals remain unexplored due to the difficulty of observing them in their natural underwater environment. To address this, we propose a model-based approach that incorporates a transmission model to account for water-induced occlusion. We apply our method to video captured under different sea conditions. We estimate mass and volume, and compare our results to a manual 2D measurements-based method. Additionally, we apply our method to video of captive animals with known ground truth mass. While in our experiments the manual approach is often more accurate, our method demonstrates a distinct advantage when applied to larger specimen. These findings highlight the potential of our method as a scalable and automated alternative for mass and volume estimation of dolphins from monocular video.

[757] CLIP-IT: CLIP-based Pairing for Histology Images Classification

Banafsheh Karimian, Giulia Avanzato, Soufian Belharbi, Alexis Guichemerre, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger

Main category: cs.CV

TL;DR: CLIP-IT is a framework that leverages unpaired text reports to enhance medical image classification by retrieving semantically relevant reports using CLIP, creating pseudo-pairs, and distilling knowledge into vision models via LoRA adaptation, enabling multimodal benefits without paired data requirements.

Details

Motivation: Address limitations of traditional VLMs that require large paired datasets and complex inference, by utilizing freely available unpaired text reports to provide complementary diagnostic cues while reducing annotation costs, privacy concerns, and computational demands.

Method: Uses CLIP pre-trained on separate histology image-text pairs to retrieve most relevant unpaired text reports for each image, creates pseudo-pairs based on shared clinical semantics, distills knowledge into vision model during training with LoRA adaptation to bridge semantic gaps, and uses only vision model at inference.

Result: Experiments on histology image datasets show CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without requiring paired data or complex inference.

Conclusion: CLIP-IT effectively leverages unpaired text reports to enhance medical image classification, providing multimodal benefits while avoiding the practical limitations of paired data requirements and inference complexity.

Abstract: Multimodal learning has shown promise in medical imaging, combining complementary modalities like images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, limiting their practicality due to annotation cost, privacy, and compute demands. Crucially, available free unpaired external text, like pathology reports, can still provide complementary diagnostic cues if semantically relevant content is retrievable per image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the downstream unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference, only the vision model is used, keeping overhead low while still benefiting from multimodal training without requiring paired data in the downstream dataset. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of per-dataset paired annotation or inference-time complexity.

[758] DreamO: A Unified Framework for Image Customization

Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu

Main category: cs.CV

TL;DR: DreamO is a unified image customization framework using diffusion transformers that supports multiple customization tasks and conditions through feature routing constraints, placeholder strategies, and progressive training.

Details

Motivation: Most image customization approaches are task-specific and lack generalizability to combine different types of conditions, creating a need for a unified framework.

Method: Uses diffusion transformer (DiT) framework with feature routing constraints for precise reference image querying, placeholder strategies for condition placement control, and three-stage progressive training (baseline consistency, full-scale training, quality alignment).

Result: DreamO effectively performs various image customization tasks with high quality and flexibly integrates different types of control conditions.

Conclusion: The proposed DreamO framework successfully addresses the challenge of unified image customization by supporting multiple tasks and conditions through its innovative architecture and training strategy.

Abstract: Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) demonstrates strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their generalizability to combine different types of condition. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process input of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.

[759] S2S-Net: Addressing the Domain Gap of Heterogeneous Sensor Systems in LiDAR-Based Collective Perception

Sven Teufel, Jörg Gamerdinger, Oliver Bringmann

Main category: cs.CV

TL;DR: This paper addresses the Sensor2Sensor domain gap in collective perception for autonomous driving, proposing S2S-Net which maintains high performance across different sensor domains and outperforms state-of-the-art methods by up to 44 percentage points.

Details

Motivation: Collective perception faces limitations due to Sensor2Sensor domain gaps from different sensor systems in CAVs, which remains mostly unaddressed due to lack of datasets with heterogeneous sensor setups.

Method: Proposed S2S-Net architecture and conducted in-depth analysis of Sensor2Sensor domain adaptation capabilities using the SCOPE dataset with three different LiDAR sensors.

Result: S2S-Net maintains very high performance in unseen sensor domains and outperforms state-of-the-art methods by up to 44 percentage points, while all evaluated state-of-the-art methods highly suffer from the Sensor2Sensor domain gap.

Conclusion: S2S-Net effectively addresses the Sensor2Sensor domain gap in V2V collective perception, demonstrating superior domain adaptation capabilities compared to existing methods.

Abstract: Collective Perception (CP) has emerged as a promising approach to overcome the limitations of individual perception in the context of autonomous driving. Various approaches have been proposed to realize collective perception; however, the Sensor2Sensor domain gap that arises from the utilization of different sensor systems in Connected and Automated Vehicles (CAVs) remains mostly unaddressed. This is primarily due to the paucity of datasets containing heterogeneous sensor setups among the CAVs. The recently released SCOPE datasets address this issue by providing data from three different LiDAR sensors for each CAV. This study is the first to address the Sensor2Sensor domain gap in vehicle-to-vehicle (V2V) collective perception. First, we present our sensor-domain robust architecture S2S-Net. Then an in-depth analysis of the Sensor2Sensor domain adaptation capabilities of state-of-the-art CP methods and S2S-Net is conducted on the SCOPE dataset. This study shows that, all evaluated state-of-the-art mehtods for collective perception highly suffer from the Sensor2Sensor domain gap, while S2S-Net demonstrates the capability to maintain very high performance in unseen sensor domains and outperforms the evaluated state-of-the-art methods by up to 44 percentage points.

[760] ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation

Jingzhong Lin, Xinru Li, Yuanyuan Qi, Bohao Zhang, Wenxiang Liu, Kecheng Tang, Wenxuan Huang, Xiangfeng Xu, Bangyan Li, Changbo Wang, Gaoqi He

Main category: cs.CV

TL;DR: ReactDance is a diffusion framework for reactive dance generation that uses hierarchical latent spaces to address spatiotemporal challenges, achieving superior motion quality and efficiency.

Details

Motivation: To overcome limitations in generating fine-grained spatial interactions and ensuring long-term temporal coherence in reactive dance generation, which is important for human-robot interaction and digital entertainment.

Method: Uses a diffusion framework with Hierarchical Finite Scalar Quantization (HFSQ) for multi-scale motion representation and Blockwise Local Context (BLC) for non-autoregressive parallel generation of sequence blocks.

Result: Substantially outperforms state-of-the-art methods in motion quality, long-term coherence, and sampling efficiency.

Conclusion: ReactDance effectively addresses the key spatiotemporal challenges in reactive dance generation through its hierarchical latent space approach and efficient sampling strategy.

Abstract: Reactive dance generation (RDG), the task of generating a dance conditioned on a lead dancer’s motion, holds significant promise for enhancing human-robot interaction and immersive digital entertainment. Despite progress in duet synchronization and motion-music alignment, two key challenges remain: generating fine-grained spatial interactions and ensuring long-term temporal coherence. In this work, we introduce \textbf{ReactDance}, a diffusion framework that operates on a novel hierarchical latent space to address these spatiotemporal challenges in RDG. First, for high-fidelity spatial expression and fine-grained control, we propose Hierarchical Finite Scalar Quantization (\textbf{HFSQ}). This multi-scale motion representation effectively disentangles coarse body posture from subtle limb dynamics, enabling independent and detailed control over both aspects through a layered guidance mechanism. Second, to efficiently generate long sequences with high temporal coherence, we propose Blockwise Local Context (\textbf{BLC}), a non-autoregressive sampling strategy. Departing from slow, frame-by-frame generation, BLC partitions the sequence into blocks and synthesizes them in parallel via periodic causal masking and positional encodings. Coherence across these blocks is ensured by a dense sliding-window training approach that enriches the representation with local temporal context. Extensive experiments show that ReactDance substantially outperforms state-of-the-art methods in motion quality, long-term coherence, and sampling efficiency.

[761] Dynamic Uncertainty Learning with Noisy Correspondence for Text-Based Person Search

Zequn Xie, Haoming Ji, Chengxuan Li, Lingwei Meng

Main category: cs.CV

TL;DR: DURA framework improves text-to-image person search by handling noisy data with Key Feature Selector for uncertainty modeling and Dynamic Softmax Hinge Loss for adaptive negative sample difficulty.

Details

Motivation: To address performance degradation in text-to-image person search caused by noisy data from online co-occurrence pairs, particularly mismatched text-image pairs that existing methods amplify through negative sampling.

Method: Proposes Dynamic Uncertainty and Relational Alignment (DURA) framework with Key Feature Selector (KFS) to model noise uncertainty and Dynamic Softmax Hinge Loss (DSH-Loss) that adapts negative sample difficulty. Uses bidirectional evidence from cross-modal similarity modeled as Dirichlet distribution.

Result: Experiments on three datasets demonstrate strong noise resistance and improved retrieval performance in both low- and high-noise scenarios.

Conclusion: DURA framework effectively handles noisy data in text-to-image person search through uncertainty modeling and adaptive loss functions, providing robust performance across varying noise levels.

Abstract: Text-to-image person search aims to identify an individual based on a text description. To reduce data collection costs, large-scale text-image datasets are created from co-occurrence pairs found online. However, this can introduce noise, particularly mismatched pairs, which degrade retrieval performance. Existing methods often focus on negative samples, which amplify this noise. To address these issues, we propose the Dynamic Uncertainty and Relational Alignment (DURA) framework, which includes the Key Feature Selector (KFS) and a new loss function, Dynamic Softmax Hinge Loss (DSH-Loss). KFS captures and models noise uncertainty, improving retrieval reliability. The bidirectional evidence from cross-modal similarity is modeled as a Dirichlet distribution, enhancing adaptability to noisy data. DSH adjusts the difficulty of negative samples to improve robustness in noisy environments. Our experiments on three datasets show that the method offers strong noise resistance and improves retrieval performance in both low- and high-noise scenarios.

[762] QVGen: Pushing the Limit of Quantized Video Generative Models

Yushi Huang, Ruihao Gong, Jing Liu, Yifu Ding, Chengtao Lv, Haotong Qin, Jun Zhang

Main category: cs.CV

TL;DR: QVGen is a quantization-aware training framework that enables video diffusion models to maintain high performance under extremely low-bit quantization (4-bit or below) while eliminating inference overhead through auxiliary modules and rank-decay strategy.

Details

Motivation: Video diffusion models have high computational and memory demands that limit real-world deployment. While quantization works well for image diffusion models, direct application to video diffusion models remains ineffective.

Method: Uses auxiliary modules to mitigate large quantization errors and improve convergence, then employs a rank-decay strategy using SVD and rank-based regularization to progressively eliminate these modules while maintaining performance.

Result: QVGen is the first method to achieve full-precision comparable quality under 4-bit settings across 4 state-of-the-art video DMs (1.3B-14B parameters), significantly outperforming existing methods with improvements of +25.28 in Dynamic Degree and +8.43 in Scene Consistency on VBench.

Conclusion: QVGen successfully enables high-performance and inference-efficient video diffusion models under extremely low-bit quantization, making them more practical for real-world deployment.

Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has proven notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a rank-decay strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out additional inference overhead. Extensive experiments across $4$ state-of-the-art (SOTA) video DMs, with parameter sizes ranging from $1.3\text{B}\sim14\text{B}$, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench.

[763] VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

Tianxiong Zhong, Xingye Tian, Boyuan Jiang, Xuebo Wang, Xin Tao, Pengfei Wan, Zhiwei Zhang

Main category: cs.CV

TL;DR: VFRTok is a video tokenizer that enables variable frame rate encoding/decoding, reducing computational costs by using only 1/8 tokens while maintaining competitive reconstruction quality and state-of-the-art generation fidelity.

Details

Motivation: Current video generation frameworks based on Latent Diffusion Models are inefficient due to the Frame-Proportional Information Assumption, which causes computational costs to scale linearly with frame rate.

Method: Proposes Duration-Proportional Information Assumption and introduces VFRTok, a Transformer-based video tokenizer with asymmetric frame rate training between encoder and decoder. Uses Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling.

Result: Achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 tokens compared to existing tokenizers.

Conclusion: VFRTok provides a compact and continuous spatio-temporal representation that significantly reduces computational overhead while maintaining high-quality video generation performance.

Abstract: Modern video generation frameworks based on Latent Diffusion Models suffer from inefficiencies in tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate. The paper proposes the Duration-Proportional Information Assumption: the upper bound on the information capacity of a video is proportional to the duration rather than the number of frames. Based on this insight, the paper introduces VFRTok, a Transformer-based video tokenizer, that enables variable frame rate encoding and decoding through asymmetric frame rate training between the encoder and decoder. Furthermore, the paper proposes Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling, which groups correlated patches into unified tokens. The Partial RoPE effectively improves content-awareness, enhancing the video generation capability. Benefiting from the compact and continuous spatio-temporal representation, VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 tokens compared to existing tokenizers. The code and weights are released at: https://github.com/KwaiVGI/VFRTok.

[764] VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning

Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, Jiaya Jia

Main category: cs.CV

TL;DR: VisionReasoner is a unified vision-language model that can handle multiple visual perception tasks through structured reasoning, achieving superior performance on detection, segmentation, and counting tasks.

Details

Motivation: To create a unified framework capable of reasoning and solving multiple visual perception tasks within a single model, addressing the need for comprehensive visual understanding systems.

Method: Uses a unified reward mechanism and multi-object cognitive learning strategies to enhance reasoning capabilities, generating structured reasoning processes before delivering outputs.

Result: Achieves significant performance improvements over baseline Qwen2.5VL: 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting). Human evaluation confirms faithful reasoning without annotated training data.

Conclusion: VisionReasoner demonstrates effective unified visual perception capabilities through structured reasoning, showing strong performance across diverse tasks without requiring annotated reasoning data.

Abstract: Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing a unified reward mechanism and multi-object cognitive learning strategies, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks within a unified model. VisionReasoner generates a structured reasoning process before delivering the desired outputs responding to user queries. Human evaluation reveals the reasoning process of VisionReasoner is faithful and reliable even without annotated reasoning train data. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming the baseline Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).

[765] Vid2World: Crafting Video Diffusion Models to Interactive World Models

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long

Main category: cs.CV

TL;DR: Vid2World converts pre-trained video diffusion models into interactive world models through causalization and action guidance, enabling high-quality autoregressive prediction across diverse domains.

Details

Motivation: Existing world models require extensive domain-specific training and produce low-fidelity predictions, while video diffusion models trained on large-scale data can generate high-quality videos capturing real-world dynamics.

Method: Systematically explores video diffusion causalization to reshape architecture and training objectives for autoregressive generation, and incorporates causal action guidance for enhanced action controllability.

Result: Extensive experiments across robot manipulation, 3D game simulation, and open-world navigation demonstrate scalable and effective repurposing of video diffusion models into interactive world models.

Conclusion: Vid2World provides a general approach for leveraging pre-trained video diffusion models as interactive world models, offering improved data efficiency and high-fidelity predictions in complex environments.

Abstract: World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores video diffusion causalization, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a causal action guidance mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.

[766] Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Dasol Choi, Seunghyun Lee, Youngsook Song

Main category: cs.CV

TL;DR: VERI benchmark reveals VLMs suffer from “overreaction problem” - high recall but low precision in safety-critical scenarios, misclassifying safe situations as dangerous due to contextual overinterpretation.

Details

Motivation: To assess reliability of Vision-Language Models in safety-critical scenarios where their performance remains insufficiently explored.

Method: Created VERI diagnostic benchmark with 200 synthetic images (100 contrastive pairs) and 50 real-world images (25 pairs), using two-stage evaluation protocol (risk identification and emergency response) across 17 VLMs.

Result: Models show high recall (70-100%) but low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were universally misclassified by all models. 88-98% of errors stem from contextual overinterpretation.

Conclusion: VLMs exhibit systematic “better-safe-than-sorry” bias, challenging their reliability in safety-critical applications. Addressing this requires enhanced contextual reasoning in ambiguous visual situations.

Abstract: Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical scenarios remains insufficiently explored. We introduce VERI, a diagnostic benchmark comprising 200 synthetic images (100 contrastive pairs) and an additional 50 real-world images (25 pairs) for validation. Each emergency scene is paired with a visually similar but safe counterpart through human verification. Using a two-stage evaluation protocol (risk identification and emergency response), we assess 17 VLMs across medical emergencies, accidents, and natural disasters. Our analysis reveals an “overreaction problem”: models achieve high recall (70-100%) but suffer from low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were universally misclassified by all models. This “better-safe-than-sorry” bias stems from contextual overinterpretation (88-98% of errors). Both synthetic and real-world datasets confirm these systematic patterns, challenging VLM reliability in safety-critical applications. Addressing this requires enhanced contextual reasoning in ambiguous visual situations.

[767] Advancing Marine Research: UWSAM Framework and UIIS10K Dataset for Precise Underwater Instance Segmentation

Hua Li, Shijie Lian, Zhiyuan Li, Runmin Cong, Chongyi Li, Laurence T. Yang, Weidong Zhang, Sam Kwong

Main category: cs.CV

TL;DR: The paper introduces UWSAM, an efficient underwater instance segmentation model that addresses SAM’s limitations in underwater scenarios through knowledge distillation and automatic prompt generation.

Details

Motivation: SAM and its variants face performance limitations in underwater instance segmentation due to lack of domain expertise and high computational requirements, hindering their application in underwater scenarios.

Method: Proposed UWSAM with Mask GAT-based Underwater Knowledge Distillation (MG-UKD) to distill knowledge from SAM ViT-Huge to ViT-Small encoder, and End-to-end Underwater Prompt Generator (EUPG) for automatic underwater prompt generation.

Result: Achieved significant performance improvements over state-of-the-art methods on multiple underwater instance datasets, demonstrating effectiveness in underwater instance segmentation.

Conclusion: UWSAM provides an efficient solution for underwater instance segmentation by addressing computational limitations and domain adaptation challenges through knowledge distillation and automatic prompt generation.

Abstract: With recent breakthroughs in large-scale modeling, the Segment Anything Model (SAM) has demonstrated significant potential in a variety of visual applications. However, due to the lack of underwater domain expertise, SAM and its variants face performance limitations in end-to-end underwater instance segmentation tasks, while their higher computational requirements further hinder their application in underwater scenarios. To address this challenge, we propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. Then, we introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. UWSAM efficiently distills knowledge from the SAM ViT-Huge image encoder into the smaller ViT-Small image encoder via the Mask GAT-based Underwater Knowledge Distillation (MG-UKD) method for effective visual representation learning. Furthermore, we design an End-to-end Underwater Prompt Generator (EUPG) for UWSAM, which automatically generates underwater prompts instead of explicitly providing foreground points or boxes as prompts, thus enabling the network to locate underwater instances accurately for efficient segmentation. Comprehensive experimental results show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets. Datasets and codes are available at https://github.com/LiamLian0727/UIIS10K.

[768] OViP: Online Vision-Language Preference Learning for VLM Hallucination

Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei

Main category: cs.CV

TL;DR: OViP is an online preference learning framework that reduces hallucinations in large vision-language models by dynamically generating contrastive training data from the model’s own errors and using diffusion models to synthesize negative images.

Details

Motivation: Current training-based approaches for mitigating hallucinations in LVLMs rely on predefined or randomly edited negative samples that don't reflect actual model errors, limiting training effectiveness.

Method: Dynamically constructs contrastive training data based on the model’s hallucinated outputs, identifies semantic differences between response pairs, and synthesizes negative images using diffusion models to generate relevant supervision signals in real time.

Result: OViP reduces hallucinations while preserving core multi-modal capabilities and substantially improves training efficiency, as demonstrated on hallucination and general benchmarks.

Conclusion: The failure-driven training approach enables adaptive alignment of both textual and visual preferences, providing more effective hallucination mitigation for vision-language models.

Abstract: Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. Although recent training-based approaches aim to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that do not reflect actual model errors, thus limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model’s own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP not only reduces hallucinations while preserving core multi-modal capabilities, but also substantially improves training efficiency. Code is available at https://github.com/lsjlsj35/Online-Vision-Language-Preference-Learning-for-VLM-Hallucination.

[769] Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts

Taewon Kang, Ming C. Lin

Main category: cs.CV

TL;DR: A modular pipeline that transforms action-level prompts into multimodal video narratives with character-driven dialogue and speech, using visual context and recursive memory for consistent storytelling.

Details

Motivation: Current scene-based video generation systems lack character-driven dialogue and speech, which are crucial for storytelling. The paper aims to enrich visual storytelling with natural voice and character expression.

Method: Uses a modular pipeline with: 1) Vision-language encoder for visual context, 2) Large language model for dialogue synthesis, 3) Recursive Narrative Bank for character memory and consistency, 4) Character-conditioned speech rendering.

Result: The framework generates fully-voiced, multimodal video narratives with expressive, character-consistent dialogue grounded in both prompts and visual scenes, generalizing across diverse story settings.

Conclusion: The training-free framework offers a scalable solution for coherent, character-grounded audiovisual storytelling, successfully integrating dialogue and speech into visual narrative generation.

Abstract: Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling – character-driven dialogue and speech – remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character’s behavior. While a story generation model such as Text2Story produces the corresponding visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and the scene image. A pretrained vision-language encoder extracts high-level semantic features from a representative frame, capturing salient visual context. These features are then integrated with structured prompts to guide a large language model in synthesizing natural dialogue. To ensure contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank – a speaker-aware, temporally structured memory that recursively accumulates each character’s dialogue history. Inspired by Script Theory in cognitive psychology, this design enables characters to speak in ways that reflect their evolving goals, social context, and narrative roles throughout the story. Finally, we render each utterance as expressive, character-conditioned speech, resulting in fully-voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings – from fantasy adventures to slice-of-life episodes – offering a scalable solution for coherent, character-grounded audiovisual storytelling.

[770] Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Michal Golovanevsky, William Rudman, Michael Lepori, Amir Bar, Ritambhara Singh, Carsten Eickhoff

Main category: cs.CV

TL;DR: This paper introduces Visual CounterFact dataset to study whether MLLMs rely more on memorized world knowledge or visual input, finding that models initially use priors but shift to visual evidence in later layers, and proposes PvP steering vectors to control this behavior.

Details

Motivation: To investigate whether Multimodal Large Language Models (MLLMs) rely more on memorized world knowledge or visual information in tasks like visual question answering, by creating counterfactuals that put these two sources in conflict.

Method: Created Visual CounterFact dataset with visually-realistic counterfactuals that conflict world knowledge priors with visual input, analyzed layer-wise model behavior, and developed PvP steering vectors for activation-level interventions to control model outputs.

Result: Model predictions initially reflect memorized priors but shift toward visual evidence in mid-to-late layers, with visual input ultimately overriding priors. PvP steering successfully shifted 99.3% of color and 80.8% of size predictions from priors to counterfactuals.

Conclusion: The findings provide new tools for interpreting and controlling factual behavior in multimodal models, revealing the dynamic competition between world knowledge and visual input in MLLMs.

Abstract: Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually-realistic counterfactuals that put world knowledge priors (e.g, red strawberry) into direct conflict with visual input (e.g, blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 99.3% of color and 80.8% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.

[771] InfoDet: A Dataset for Infographic Element Detection

Jiangning Zhu, Yuxing Zhou, Zheng Wang, Juntao Yao, Yima Gu, Yuhui Yuan, Shixia Liu

Main category: cs.CV

TL;DR: InfoDet is a dataset for training object detection models on charts and human-recognizable objects in infographics, containing 11,264 real and 90,000 synthetic infographics with over 14 million bounding box annotations.

Details

Motivation: Vision-language models have limitations in accurately grounding infographic elements like charts and icons, which is crucial for chart understanding that requires identifying and reasoning over relevant elements.

Method: Created InfoDet dataset combining model-in-the-loop and programmatic methods to generate bounding box annotations, then demonstrated its usefulness through three applications including Thinking-with-Boxes scheme for VLMs.

Result: Developed a comprehensive dataset with 11,264 real and 90,000 synthetic infographics containing over 14 million bounding box annotations for charts and human-recognizable objects.

Conclusion: InfoDet enables improved object detection for infographics and can boost chart understanding performance of vision-language models through applications like the Thinking-with-Boxes scheme.

Abstract: Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce InfoDet, a dataset designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 11,264 real and 90,000 synthetic infographics, with over 14 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of InfoDet through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.

[772] T2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models

Xiaoyu Ye, Songjie Cheng, Yongtao Wang, Yajiao Xiong, Yishen Li

Main category: cs.CV

TL;DR: Proposes an unlearning-based method to erase harmful concepts from text-to-video diffusion models while preserving generation capability for other concepts.

Details

Motivation: Address the emerging threat of text-to-video models generating explicit or harmful content that could lead to misuse and rights violations.

Method: Uses negatively-guided velocity prediction fine-tuning with prompt augmentation, mask-based localization regularization, and concept preservation regularization for precise unlearning.

Result: Effectively erases specific harmful concepts while maintaining the model’s ability to generate non-target concepts, outperforming existing methods.

Conclusion: The proposed unlearning approach provides a viable solution to mitigate misuse of text-to-video models by erasing harmful concepts without compromising overall generation quality.

Abstract: Recent advances in text-to-video (T2V) diffusion models have significantly enhanced the quality of generated videos. However, their capability to produce explicit or harmful content introduces new challenges related to misuse and potential rights violations. To address this newly emerging threat, we propose unlearning-based concept erasing as a solution. First, we adopt negatively-guided velocity prediction fine-tuning and enhance it with prompt augmentation to ensure robustness against prompts refined by large language models (LLMs). Second, to achieve precise unlearning, we incorporate mask-based localization regularization and concept preservation regularization to preserve the model’s ability to generate non-target concepts. Extensive experiments demonstrate that our method effectively erases a specific concept while preserving the model’s generation capability for all other concepts, outperforming existing methods. We provide the unlearned models in \href{https://github.com/VDIGPKU/T2VUnlearning.git}{https://github.com/VDIGPKU/T2VUnlearning.git}.

[773] Boosting Open Set Recognition Performance through Modulated Representation Learning

Amit Kumar Kundu, Vaishnavi S Patil, Joseph Jaja

Main category: cs.CV

TL;DR: The paper introduces temperature-modulated representation learning with novel temperature schedules (including negative cosine) to improve open set recognition by enabling gradual task switching from coarse to fine decision boundaries, enhancing representation space without computational overhead.

Details

Motivation: Existing OSR methods use constant temperature scaling, limiting exploration of instance-level to semantic-level features and hindering representation learning.

Method: Proposed temperature schedules that modulate representation learning, allowing models to form coarse decision boundaries initially and gradually smooth them by prioritizing more neighbors. Can be integrated into any existing OSR loss function without additional computational cost.

Result: Implementation on various baselines (cross-entropy, contrastive, ARPL loss functions) shows improved OSR and closed set performance, particularly on challenging semantic shift benchmarks.

Conclusion: Temperature-modulated representation learning with proposed schedules provides a computationally efficient way to enhance OSR performance by creating richer and more generalizable representation spaces.

Abstract: The open set recognition (OSR) problem aims to identify test samples from novel semantic classes that are not part of the training classes, a task that is crucial in many practical scenarios. However, the existing OSR methods use a constant scaling factor (the temperature) to the logits before applying a loss function, which hinders the model from exploring both ends of the spectrum in representation learning – from instance-level to semantic-level features. In this paper, we address this problem by enabling temperature-modulated representation learning using a set of proposed temperature schedules, including our novel negative cosine schedule. Our temperature schedules allow the model to form a coarse decision boundary at the beginning of training by focusing on fewer neighbors, and gradually prioritizes more neighbors to smooth out the rough edges. This gradual task switching leads to a richer and more generalizable representation space. While other OSR methods benefit by including regularization or auxiliary negative samples, such as with mix-up, thereby adding a significant computational overhead, our schedules can be folded into any existing OSR loss function with no overhead. We implement the novel schedule on top of a number of baselines, using cross-entropy, contrastive and the ARPL loss functions and find that it boosts both the OSR and the closed set performance in most cases, especially on the tougher semantic shift benchmarks. Project codes will be available.

[774] CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos

Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang

Main category: cs.CV

TL;DR: CoT-RVS is a training-free framework that uses Chain-of-Thought reasoning in MLLMs for video object segmentation, addressing complex temporal-semantic queries by analyzing visible objects and selecting keyframes.

Details

Motivation: Existing MLLM-based methods fail with complex temporally-sensitive queries due to lack of temporal and spatial integration in complex scenarios.

Method: Uses zero-shot Chain-of-Thought capability to analyze visible objects matching language queries (semantic) and select corresponding keyframes across frames (temporal). Training-free and compatible with closed-source MLLMs.

Result: Significantly outperforms previous works in video object segmentation with both explicit and implicit queries, both qualitatively and quantitatively.

Conclusion: CoT-RVS effectively addresses complex temporal-semantic reasoning challenges in video object segmentation without requiring training, and can be extended to online video streams and instance segmentation.

Abstract: Reasoning Video Object Segmentation is a challenging task, aiming at generating a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLM) for the task, they still fail in video inputs given complex temporally-sensitive queries, indicating their lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these complex challenges by temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding keyframe for each object that can be observed effortlessly among all frames (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. Our framework’s training-free feature further allows its extension to process online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.

[775] Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, Guanbin Li

Main category: cs.CV

TL;DR: A tuning-free method for multi-concept personalization in text-to-image generation that handles both object and abstract concepts without test-time fine-tuning, using a Mod-Adapter module and VLM-guided pre-training.

Details

Motivation: Existing multi-concept personalization methods are limited to object concepts and struggle with abstract concepts like pose and lighting. Current approaches require test-time fine-tuning which is time-consuming and prone to overfitting.

Method: Proposes Mod-Adapter module that predicts concept-specific modulation direction using vision-language cross-attention and Mixture-of-Experts layers. Uses VLM-guided pre-training strategy to bridge the gap between concept image space and modulation space.

Result: Achieves state-of-the-art performance in multi-concept personalization across quantitative, qualitative, and human evaluations. Extends benchmark to include abstract concepts for comprehensive comparison.

Conclusion: The proposed tuning-free method effectively customizes both object and abstract concepts without test-time fine-tuning, demonstrating superior performance through multiple evaluation metrics.

Abstract: Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pre-trained Diffusion Transformers (DiTs) model, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation direction for the modulation process of concept-related text tokens. It introduces vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pre-training strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.

Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao

Main category: cs.CV

TL;DR: Cross-modal RAG is a novel framework that decomposes queries and images into sub-dimensional components for subquery-aware retrieval and generation, outperforming existing methods in complex text-to-image tasks.

Details

Motivation: Existing RAG methods fail when no single image contains all desired elements from complex user queries, necessitating a more granular approach to retrieval and generation.

Method: Proposes a hybrid retrieval strategy combining sub-dimensional sparse retriever with dense retriever to identify Pareto-optimal image sets, and uses a multimodal LLM to selectively condition on relevant visual features aligned to specific subqueries.

Result: Significantly outperforms existing baselines on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT datasets in both retrieval and generation quality while maintaining high efficiency.

Conclusion: Cross-modal RAG effectively addresses the limitations of traditional RAG methods by enabling subquery-aware retrieval and generation for complex text-to-image tasks.

Abstract: Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in the retrieval and further contributes to generation quality, while maintaining high efficiency.

[777] ReDDiT: Rehashing Noise for Discrete Visual Generation

Tianren Ma, Xiaosong Zhang, Boyu Yang, Junlan Feng, Qixiang Ye

Main category: cs.CV

TL;DR: ReDDiT introduces a rehashing noise approach for discrete diffusion transformers to improve generation quality by extending absorbing states and using randomized multi-index corruption during training, achieving performance comparable to continuous diffusion models.

Details

Motivation: Discrete diffusion models are gaining popularity for efficiency but still lag behind continuous counterparts due to limitations in noise design and sampling heuristics. The authors aim to address these issues to improve discrete diffusion model performance.

Method: Proposes ReDDiT (Rehashing Discrete Diffusion Transformer) with rehashing noise approach that extends absorbing states and uses randomized multi-index corruption during training. Includes a derived rehash sampler that reverses randomized absorbing paths to ensure diversity and low discrepancy.

Result: Significantly outperforms baseline model (reducing gFID from 6.18 to 1.61) and achieves performance on par with continuous diffusion counterparts. The method provides more consistent and competitive generation quality while reducing the need for heavily tuned randomness.

Conclusion: ReDDiT successfully addresses limitations in discrete diffusion models through rehashing noise and improved sampling, achieving state-of-the-art performance comparable to continuous models while maintaining the efficiency benefits of discrete approaches.

Abstract: In the visual generative area, discrete diffusion models are gaining traction for their efficiency and compatibility. However, pioneered attempts still fall behind their continuous counterparts, which we attribute to noise (absorbing state) design and sampling heuristics. In this study, we propose a rehashing noise approach for discrete diffusion transformer (termed ReDDiT), with the aim to extend absorbing states and improve expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees high diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline model (reducing gFID from 6.18 to 1.61) and is on par with the continuous counterparts.

[778] ProxyThinker: Test-Time Guidance through Small Visual Reasoners

Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, Vicente Ordonez

Main category: cs.CV

TL;DR: ProxyThinker is an inference-time technique that enables large vision-language models to inherit visual reasoning capabilities from small, slow-thinking reasoners without training, achieving performance comparable to full-scale reinforcement fine-tuning while being 38x faster.

Details

Motivation: Training large vision-language models with reinforcement fine-tuning is computationally expensive, making it challenging to scale model size. There's a need for efficient methods to transfer reasoning capabilities without expensive training.

Method: ProxyThinker subtracts output distributions of base models from those of reinforcement fine-tuning reasoners to modify decoding dynamics, eliciting slow-thinking reasoning behaviors like self-verification and self-correction. It coordinates multiple language models with parallelism techniques.

Result: ProxyThinker consistently boosts performance on challenging visual benchmarks for spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with full-scale reinforcement fine-tuning counterparts. It achieves up to 38x faster inference compared to previous decoding-time methods.

Conclusion: ProxyThinker provides an efficient inference-time solution for transferring visual reasoning capabilities without training, making practical deployment feasible while maintaining competitive performance.

Abstract: Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits the slow-thinking reasoning demonstrated by the emerged sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38 $\times$ faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.

[779] GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scenes

Xiao Chen, Tai Wang, Quanyi Li, Tao Huang, Jiangmiao Pang, Tianfan Xue

Main category: cs.CV

TL;DR: GLEAM-Bench is a large-scale benchmark for generalizable active mapping with 1,152 diverse 3D scenes, and GLEAM is a unified exploration policy that achieves 66.50% coverage (+9.49%) with efficient trajectories on unseen complex scenes.

Details

Motivation: Existing active mapping methods have limited generalizability across diverse environments due to insufficient training data and conservative exploration strategies.

Method: Proposes GLEAM with semantic representations, long-term navigable goals, and randomized strategies for active mapping, built upon the GLEAM-Bench benchmark.

Result: Achieves 66.50% coverage (+9.49% improvement) with efficient trajectories and improved mapping accuracy on 128 unseen complex scenes.

Conclusion: GLEAM demonstrates superior generalizability and outperforms state-of-the-art methods in active mapping across diverse environments.

Abstract: Generalizable active mapping in complex unknown environments remains a critical challenge for mobile robots. Existing methods, constrained by insufficient training data and conservative exploration strategies, exhibit limited generalizability across scenes with diverse layouts and complex connectivity. To enable scalable training and reliable evaluation, we introduce GLEAM-Bench, the first large-scale benchmark designed for generalizable active mapping with 1,152 diverse 3D scenes from synthetic and real-scan datasets. Building upon this foundation, we propose GLEAM, a unified generalizable exploration policy for active mapping. Its superior generalizability comes mainly from our semantic representations, long-term navigable goals, and randomized strategies. It significantly outperforms state-of-the-art methods, achieving 66.50% coverage (+9.49%) with efficient trajectories and improved mapping accuracy on 128 unseen complex scenes. Project page: https://xiao-chen.tech/gleam/.

[780] Score Replacement with Bounded Deviation for Rare Prompt Generation

Bo-Kai Ruan, Zi-Xiang Ni, Bo-Lun Huang, Teng-Fang Hsiao, Hong-Han Shuai

Main category: cs.CV

TL;DR: The paper proposes an adaptive prompt switching method for diffusion models to improve generation of rare concepts by monitoring score deviation between proxy and target prompts.

Details

Motivation: Diffusion models struggle with rare concepts that appear infrequently in training data, and existing fixed-schedule prompt switching methods are brittle across different prompts and models.

Method: Re-frames rare prompt generation as score replacement with bounded deviation criterion - uses frequent proxy prompts initially, then switches to rare prompt when score deviation exceeds threshold.

Result: Extensive experiments across SDXL, SD3, Flux, and Sana show consistent improvements in rare concept synthesis, outperforming baselines in automated metrics and human evaluations.

Conclusion: The adaptive switching based on score deviation provides a principled and practical mechanism that can be widely adopted by different diffusion models for better rare concept generation.

Abstract: Diffusion models achieve impressive performance in high-fidelity image generation but often struggle with rare concepts that appear infrequently in the training distribution. Prior work attempts to address this issue by prompt switching, where generation begins with a frequent proxy prompt and later transitions to the original rare prompt. However, such designs typically rely on fixed schedules that disregard the model’s internal dynamics, making them brittle across prompts and backbones. In this paper, we re-frame rare prompt generation through the lens of score replacement: the denoising trajectory of a rare prompt can be initially guided by the score of a semantically related frequent prompt, which acts as a proxy. However, as the process unfolds, the proxy score gradually diverges from the true rare prompt score. To control this drift, we introduce a bounded deviation criterion that triggers the switch once the deviation exceeds a threshold. This formulation offers both a principled justification and a practical mechanism for rare prompt generation, enabling adaptive switching that can be widely adopted by different models. Extensive experiments across SDXL, SD3, Flux, and Sana confirm that our method consistently improves rare concept synthesis, outperforming strong baselines in both automated metrics and human evaluations.

[781] MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on

Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang

Main category: cs.CV

TL;DR: MagicTryOn is a diffusion-transformer framework for video virtual try-on that preserves garment details and ensures temporal consistency through fine-grained garment preservation and spatiotemporal modeling.

Details

Motivation: Existing VVT methods suffer from inadequate garment fidelity and limited spatiotemporal consistency due to under-exploitation of garment information and lack of spatiotemporal modeling, causing temporal jitter and appearance drift.

Method: Proposes fine-grained garment-preservation strategy that disentangles garment cues and injects them into denoising process, introduces garment-aware spatiotemporal RoPE for temporal consistency, uses mask-aware loss for fidelity, and employs distribution-matching distillation for real-time inference.

Result: Extensive experiments show MagicTryOn outperforms existing methods, delivering superior garment-detail fidelity and temporal stability in unconstrained settings.

Conclusion: MagicTryOn effectively addresses garment fidelity and temporal consistency issues in video virtual try-on through its diffusion-transformer framework with specialized garment preservation and spatiotemporal modeling techniques.

Abstract: Video Virtual Try-On (VVT) aims to synthesize garments that appear natural across consecutive video frames, capturing both their dynamics and interactions with human motion. Despite recent progress, existing VVT methods still suffer from inadequate garment fidelity and limited spatiotemporal consistency. The reasons are: (1) under-exploitation of garment information, with limited garment cues being injected, resulting in weaker fine-detail fidelity; and (2) a lack of spatiotemporal modeling, which hampers cross-frame identity consistency and causes temporal jitter and appearance drift. In this paper, we present MagicTryOn, a diffusion-transformer based framework for garment-preserving video virtual try-on. To preserve fine-grained garment details, we propose a fine-grained garment-preservation strategy that disentangles garment cues and injects these decomposed priors into the denoising process. To improve temporal garment consistency and suppress jitter, we introduce a garment-aware spatiotemporal rotary positional embedding (RoPE) that extends RoPE within full self-attention, using spatiotemporal relative positions to modulate garment tokens. We further impose a mask-aware loss during training to enhance fidelity within garment regions. Moreover, we adopt distribution-matching distillation to compress the sampling trajectory to four steps, enabling real-time inference without degrading garment fidelity. Extensive quantitative and qualitative experiments demonstrate that MagicTryOn outperforms existing methods, delivering superior garment-detail fidelity and temporal stability in unconstrained settings.

[782] Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model

Yang Yang, Siming Zheng, Qirui Yang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang

Main category: cs.CV

TL;DR: A one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering using multi-plane image representation and progressive training.

Details

Motivation: Existing image-based bokeh methods suffer from temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over focus plane and bokeh intensity, limiting controllable video bokeh applications.

Method: Uses a one-step diffusion framework with multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, exploiting 3D priors from pretrained backbones. Introduces progressive training strategy for temporal stability, depth robustness, and detail preservation.

Result: Superior temporal coherence, spatial accuracy, and controllability demonstrated on synthetic and real-world benchmarks, outperforming prior baselines.

Conclusion: This work establishes the first dedicated diffusion framework for video bokeh generation, setting a new baseline for temporally coherent and controllable depth-of-field effects.

Abstract: Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability for controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects. Code will be made publicly available.

[783] CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis

Runmin Jiang, Genpei Zhang, Yuntian Yang, Siqi Wu, Minhao Wu, Wanyue Feng, Yizhou Zhao, Xi Xiao, Xiao Wang, Tianyang Wang, Xingjian Li, Muyuan Chen, Min Xu

Main category: cs.CV

TL;DR: CryoCCD is a novel cryo-EM synthesis framework that combines biophysical modeling with conditional diffusion models to generate realistic synthetic cryo-EM data, addressing the scarcity of annotated datasets.

Details

Motivation: The development of cryo-EM processing tools is constrained by the scarcity of high-quality annotated datasets, and existing synthetic data approaches lack thorough biophysical modeling and realistic noise reproduction.

Method: CryoCCD unifies versatile biophysical modeling with a conditional cycle-consistent diffusion model enhanced with mask-guided contrastive learning to ensure realistic noise while preserving structural fidelity.

Result: CryoCCD generates structurally faithful micrographs, enhances particle picking and pose estimation, and achieves superior performance over state-of-the-art baselines while generalizing effectively to held-out protein families.

Conclusion: CryoCCD provides a comprehensive solution for generating realistic cryo-EM synthetic data, overcoming limitations of existing approaches and enabling better development of cryo-EM processing tools.

Abstract: Single-particle cryo-electron microscopy (cryo-EM) has become a cornerstone of structural biology, enabling near-atomic resolution analysis of macromolecules through advanced computational methods. However, the development of cryo-EM processing tools is constrained by the scarcity of high-quality annotated datasets. Synthetic data generation offers a promising alternative, but existing approaches lack thorough biophysical modeling of heterogeneity and fail to reproduce the complex noise observed in real imaging. To address these limitations, we present CryoCCD, a synthesis framework that unifies versatile biophysical modeling with the first conditional cycle-consistent diffusion model tailored for cryo-EM. The biophysical engine provides multi-functional generation capabilities to capture authentic biological organization, and the diffusion model is enhanced with cycle consistency and mask-guided contrastive learning to ensure realistic noise while preserving structural fidelity. Extensive experiments demonstrate that CryoCCD generates structurally faithful micrographs, enhances particle picking and pose estimation, as well as achieves superior performance over state-of-the-art baselines, while also generalizing effectively to held-out protein families.

[784] EgoVIS@CVPR: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

Chi-Hsi Kung, Frangil Ramirez, Juhyung Ha, Yi-Ting Chen, David Crandall, Yi-Hsuan Tsai

Main category: cs.CV

TL;DR: The paper proposes using state-change descriptions and counterfactuals generated by LLMs as supervision for procedure-aware video representation learning, improving performance on tasks like action segmentation and error detection.

Details

Motivation: Existing work fails to explicitly learn scene transformations in procedural activities, which are crucial for understanding how actions transform scenes and how scene changes influence subsequent actions.

Method: Incorporates state-change descriptions from LLMs as supervision for video encoders, and generates state-change counterfactuals to simulate failure outcomes for counterfactual reasoning.

Result: Achieves significant improvements on multiple procedure-aware tasks including temporal action segmentation and error detection.

Conclusion: State-change descriptions and counterfactuals effectively enhance procedure awareness in video understanding models.

Abstract: Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Yet, existing work on procedure-aware video representations fails to explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by LLMs as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen ``What if’’ scenarios. This counterfactual reasoning facilitates the model’s ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation, error detection, and more. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, and achieve significant improvements on multiple tasks.

[785] OmniGen2: Exploration to Advanced Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu

Main category: cs.CV

TL;DR: OmniGen2 is an open-source generative model that unifies text-to-image, image editing, and in-context generation tasks through dual decoding pathways and a decoupled image tokenizer, achieving competitive performance with modest parameters.

Details

Motivation: To create a unified solution for diverse generation tasks while preserving text generation capabilities and enabling training without re-adapting VAE inputs from existing multimodal models.

Method: Features two distinct decoding pathways for text and image modalities with unshared parameters and a decoupled image tokenizer. Uses comprehensive data construction pipelines for image editing and in-context generation, plus a reflection mechanism for image generation.

Result: Achieves competitive results on text-to-image and image editing benchmarks. Sets state-of-the-art performance among open-source models for in-context generation consistency on the new OmniContext benchmark.

Conclusion: OmniGen2 provides an effective unified framework for multiple generation tasks with competitive performance despite modest parameter size, and the authors will release all models, code, and datasets to support future research.

Abstract: In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2

[786] ProstaTD: Bridging Surgical Triplet from Classification to Fully Supervised Detection

Yiliang Chen, Zhixi Li, Cheng Xu, Alex Qinyang Liu, Ruize Cui, Xuemiao Xu, Jeremy Yuen-Chun Teoh, Shengfeng He, Jing Qin

Main category: cs.CV

TL;DR: ProstaTD is a large-scale surgical triplet detection dataset with precise spatial bounding box annotations and temporal boundaries, addressing limitations of existing datasets like CholecT50.

Details

Motivation: Existing surgical triplet datasets lack precise spatial bounding box annotations, making image-level classification insufficient for practical applications. Bounding boxes are essential for spatial context and improved model generalizability.

Method: Created ProstaTD dataset from robot-assisted prostatectomy videos with 71,775 frames and 196,490 annotated triplet instances from 21 surgeries across multiple institutions. Used rigorous medical supervision with 60+ contributors including surgeons and trained annotators, with iterative labeling and verification phases. Developed custom labeling tools and evaluation toolkit.

Result: ProstaTD is the largest and most diverse surgical triplet dataset with precise spatial and temporal boundaries, enabling full detection rather than simple classification.

Conclusion: ProstaTD provides a robust foundation for fair benchmarking in surgical triplet detection by moving the field from classification to full detection with precise spatial and temporal annotations.

Abstract: Surgical triplet detection is a critical task in surgical video analysis. However, existing datasets like CholecT50 lack precise spatial bounding box annotations, rendering triplet classification at the image level insufficient for practical applications. The inclusion of bounding box annotations is essential to make this task meaningful, as they provide the spatial context necessary for accurate analysis and improved model generalizability. To address these shortcomings, we introduce ProstaTD, a large-scale, multi-institutional dataset for surgical triplet detection, developed from the technically demanding domain of robot-assisted prostatectomy. ProstaTD offers clinically defined temporal boundaries and high-precision bounding box annotations for each structured triplet activity. The dataset comprises 71,775 video frames and 196,490 annotated triplet instances, collected from 21 surgeries performed across multiple institutions, reflecting a broad range of surgical practices and intraoperative conditions. The annotation process was conducted under rigorous medical supervision and involved more than 60 contributors, including practicing surgeons and medically trained annotators, through multiple iterative phases of labeling and verification. To further facilitate future general-purpose surgical annotation, we developed two tailored labeling tools to improve efficiency and scalability in our annotation workflows. In addition, we created a surgical triplet detection evaluation toolkit that enables standardized and reproducible performance assessment across studies. ProstaTD is the largest and most diverse surgical triplet dataset to date, moving the field from simple classification to full detection with precise spatial and temporal boundaries and thereby providing a robust foundation for fair benchmarking.

[787] EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM

Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begüm Demir, Nicu Sebe, Paolo Rota

Main category: cs.CV

TL;DR: EarthMind is a unified vision-language framework that handles both single- and cross-sensor inputs using hierarchical cross-modal attention to fuse optical and SAR features, achieving state-of-the-art results on EO benchmarks.

Details

Motivation: Current MLLMs for Earth Observation are limited to single-sensor inputs, missing the complementary benefits of heterogeneous modalities like optical and SAR data.

Method: Proposes hierarchical cross-modal attention (HCA) that captures visual relationships across sensors and aligns them with language queries for adaptive fusion of optical and SAR features. Also introduces FusionEO dataset and EarthMind-Bench benchmark.

Result: EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.

Conclusion: The proposed framework effectively handles cross-sensor inputs and demonstrates superior performance in Earth Observation tasks through multimodal fusion.

Abstract: Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via an innovative hierarchical cross-modal attention (ie, HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.

[788] EgoVIS@CVPR: PAIR-Net: Enhancing Egocentric Speaker Detection via Pretrained Audio-Visual Fusion and Alignment Loss

Yu Wang, Juhyung Ha, David J. Crandall

Main category: cs.CV

TL;DR: PAIR-Net integrates frozen Whisper audio encoder with fine-tuned AV-HuBERT visual backbone, using inter-modal alignment loss to achieve state-of-the-art 76.6% mAP on Ego4D ASD benchmark.

Details

Motivation: Traditional visual-centric ASD methods degrade significantly in egocentric videos due to unstable viewpoints, motion blur, and off-screen speech sources.

Method: Integrates partially frozen Whisper audio encoder with fine-tuned AV-HuBERT visual backbone, using inter-modal alignment loss to synchronize audio and visual representations and counteract modality imbalance.

Result: Achieves 76.6% mAP on Ego4D ASD benchmark, surpassing LoCoNet by 8.2% and STHG by 12.9% mAP, without relying on multi-speaker context or ideal frontal views.

Conclusion: Pretrained audio priors and alignment-based fusion are valuable for robust active speaker detection under real-world egocentric conditions.

Abstract: Active speaker detection (ASD) in egocentric videos presents unique challenges due to unstable viewpoints, motion blur, and off-screen speech sources - conditions under which traditional visual-centric methods degrade significantly. We introduce PAIR-Net (Pretrained Audio-Visual Integration with Regularization Network), an effective model that integrates a partially frozen Whisper audio encoder with a fine-tuned AV-HuBERT visual backbone to robustly fuse cross-modal cues. To counteract modality imbalance, we introduce an inter-modal alignment loss that synchronizes audio and visual representations, enabling more consistent convergence across modalities. Without relying on multi-speaker context or ideal frontal views, PAIR-Net achieves state-of-the-art performance on the Ego4D ASD benchmark with 76.6% mAP, surpassing LoCoNet and STHG by 8.2% and 12.9% mAP, respectively. Our results highlight the value of pretrained audio priors and alignment-based fusion for robust ASD under real-world egocentric conditions.

[789] METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding

Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, Yunpu Ma

Main category: cs.CV

TL;DR: METok is a training-free, multi-stage event-based token compression framework that accelerates Video Large Language Models’ inference while preserving accuracy by progressively eliminating redundant visual tokens.

Details

Motivation: Processing long videos with VLLMs is challenging due to high computational demands and visual data redundancy. Existing methods struggle with efficiency while maintaining accuracy.

Method: Three-stage compression: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in prefilling stage based on semantic alignment and event importance, (3) decoding-stage KV Cache optimization to reduce memory consumption.

Result: Achieves 80.6% FLOPs reduction and 93.5% KV Cache memory savings when equipped with LongVA-7B, while maintaining comparable or superior accuracy on diverse video benchmarks.

Conclusion: METok provides an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens, making long video processing more practical for VLLMs.

Abstract: Recent advances in Video Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs’ inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy.

[790] Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models

Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang

Main category: cs.CV

TL;DR: Struct2D enables LMMs to perform 3D spatial reasoning using only structured 2D representations, achieving competitive performance without explicit 3D inputs.

Details

Motivation: To unlock spatial reasoning in LMMs for intelligent 3D environment interaction without relying on explicit 3D inputs or specialized architectures.

Method: Developed Struct2D framework using BEV images with object marks and metadata, created Struct2D-Set dataset with 200K QA pairs, and fine-tuned Qwen2.5VL model.

Result: LMMs show strong spatial reasoning with structured 2D inputs, achieving competitive performance on 3D QA, dense captioning, and object grounding benchmarks.

Conclusion: Structured 2D inputs can effectively bridge perception and language reasoning in LMMs without requiring explicit 3D representations.

Abstract: Unlocking spatial reasoning in Large Multimodal Models (LMMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can LMMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird’s-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source LMMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source LMM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in LMMs-without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.

[791] Training-Free Diffusion Framework for Stylized Image Generation with Identity Preservation

Mohammad Ali Rezaei, Helia Hajikazem, Saeed Khanehgir, Mahdi Javanmardi

Main category: cs.CV

TL;DR: A training-free framework for identity-preserved stylized image synthesis that addresses identity loss in diffusion-based style transfer, particularly for distant subjects and group scenes.

Details

Motivation: Existing style transfer techniques struggle to maintain identity while achieving high-quality stylization, which is critical in applications like advertising and marketing where preserving featured individuals' identity is essential.

Method: Introduces ‘Mosaic Restored Content Image’ technique to enhance identity retention in complex scenes and a training-free content consistency loss that directs more attention to the original image during stylization.

Result: The proposed approach substantially exceeds baseline models in maintaining high stylistic fidelity and robust identity integrity without requiring model retraining or fine-tuning.

Conclusion: The framework successfully addresses identity preservation challenges in style transfer, particularly for complex scenes with distant subjects or groups, while maintaining high-quality stylization.

Abstract: Although diffusion models have demonstrated remarkable generative capabilities, existing style transfer techniques often struggle to maintain identity while achieving high-quality stylization. This limitation becomes particularly critical in practical applications such as advertising and marketing, where preserving the identity of featured individuals is essential for a campaign’s effectiveness. It is particularly severe when subjects are distant from the camera or appear within a group, frequently leading to a significant loss of identity. To address this issue, we introduce a novel, training-free framework for identity-preserved stylized image synthesis. Key contributions include the “Mosaic Restored Content Image” technique, which significantly enhances identity retention in complex scenes, and a training-free content consistency loss that improves the preservation of fine-grained details by directing more attention to the original image during stylization. Our experiments reveal that the proposed approach substantially exceeds the baseline model in concurrently maintaining high stylistic fidelity and robust identity integrity, all without necessitating model retraining or fine-tuning.

[792] Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

Chaoyang Wang, Zeyu Zhang, Meng Meng, Xu Zhou, Haiyun Jiang

Main category: cs.CV

TL;DR: Vision-EKIPL is a novel RL framework that enhances MLLMs’ visual reasoning by incorporating high-quality actions from external models during training, overcoming limitations of traditional RL methods.

Details

Motivation: Existing RL methods for MLLMs sample actions only from the policy model, limiting reasoning capability and training efficiency. There's a need to expand exploration space and improve reasoning boundaries.

Method: Proposes Vision-EKIPL framework that introduces high-quality actions generated by external auxiliary models during RL training to guide policy model optimization through knowledge infusion.

Result: Achieved up to 5% performance improvement on Reason-RFT-CoT Benchmark compared to SOTA, with accelerated training convergence and enhanced efficiency.

Conclusion: Vision-EKIPL overcomes traditional RL limitations, significantly enhances visual reasoning performance of MLLMs, and provides a new effective paradigm for multimodal reasoning research.

Abstract: Visual reasoning is crucial for understanding complex multimodal data and advancing Artificial General Intelligence. Existing methods enhance the reasoning capability of Multimodal Large Language Models (MLLMs) through Reinforcement Learning (RL) fine-tuning (e.g., GRPO). However, current RL approaches sample action groups solely from the policy model itself, which limits the upper boundary of the model’s reasoning capability and leads to inefficient training. To address these limitations, this paper proposes a novel RL framework called \textbf{Vision-EKIPL}. The core of this framework lies in introducing high-quality actions generated by external auxiliary models during the RL training process to guide the optimization of the policy model. The policy learning with knowledge infusion from external models significantly expands the model’s exploration space, effectively improves the reasoning boundary, and substantially accelerates training convergence speed and efficiency. Experimental results demonstrate that our proposed Vision-EKIPL achieved up to a 5% performance improvement on the Reason-RFT-CoT Benchmark compared to the state-of-the-art (SOTA). It reveals that Vision-EKIPL can overcome the limitations of traditional RL methods, significantly enhance the visual reasoning performance of MLLMs, and provide a new effective paradigm for research in this field.

[793] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: ReCogDrive is a reinforced cognitive framework for autonomous driving that integrates VLMs with diffusion planning to address language-action mismatch issues in trajectory planning, achieving state-of-the-art performance.

Details

Motivation: To overcome limitations of existing methods that formulate trajectory planning as language modeling, which leads to format violations, infeasible actions, and slow inference speeds.

Method: Unifies driving understanding and planning by integrating autoregressive model with diffusion planner, using hierarchical data pipeline for human-like cognition and DiffGRPO for safety reinforcement.

Result: Achieves state-of-the-art performance on NAVSIM and Bench2Drive benchmarks, with strong scene comprehension across diverse driving scenarios.

Conclusion: ReCogDrive effectively bridges the gap between VLMs’ cognitive capabilities and practical autonomous driving requirements, providing a robust framework for end-to-end driving systems.

Abstract: Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM’s learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to enhance driving safety and reduce collisions, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model’s scene comprehension. All code, model weights, and datasets will be made publicly available to facilitate subsequent research.

[794] MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis

José Morano, Botond Fazekas, Emese Sükei, Ronald Fecso, Taha Emre, Markus Gumpinger, Georg Faustmann, Marzieh Oghbaie, Ursula Schmidt-Erfurth, Hrvoje Bogunović

Main category: cs.CV

TL;DR: MIRAGE is a multimodal foundation model for analyzing OCT and SLO retinal images that outperforms existing models in classification and segmentation tasks, with both the model and evaluation benchmark made publicly available.

Details

Motivation: Existing AI models for ophthalmology require extensive annotation, underperform on unseen data, and available foundation models lack validation for segmentation tasks and focus on single imaging modalities.

Method: Proposed MIRAGE - a multimodal foundation model for OCT and SLO image analysis, along with a new evaluation benchmark containing OCT/SLO classification and segmentation tasks.

Result: MIRAGE demonstrated superiority over both general and specialized foundation models and segmentation methods in both classification and segmentation tasks.

Conclusion: MIRAGE is suitable as a basis for developing robust AI systems for retinal OCT image analysis and is publicly available along with the evaluation benchmark.

Abstract: Artificial intelligence (AI) has become a fundamental tool for assisting clinicians in analyzing ophthalmic images, such as optical coherence tomography (OCT). However, developing AI models often requires extensive annotation, and existing models tend to underperform on independent, unseen data. Foundation models (FMs), large AI models trained on vast unlabeled datasets, have shown promise in overcoming these challenges. Nonetheless, available FMs for ophthalmology lack extensive validation, especially for segmentation tasks, and focus on a single imaging modality. In this context, we propose MIRAGE, a novel multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO) images. Additionally, we propose a new evaluation benchmark with OCT/SLO classification and segmentation tasks. The comparison with general and specialized FMs and segmentation methods shows the superiority of MIRAGE in both types of tasks, highlighting its suitability as a basis for the development of robust AI systems for retinal OCT image analysis. Both MIRAGE and the evaluation benchmark are publicly available: https://github.com/j-morano/MIRAGE.

[795] Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation

Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Guilin Li, Bo Wang, Linghe Kong, Lichao Sun, Weiran Huang

Main category: cs.CV

TL;DR: Visual perturbation framework enhances MLLMs’ perceptual robustness without algorithmic changes or extra data, improving multimodal mathematical reasoning performance.

Details

Motivation: Current MLLMs generate accurate visual descriptions but fail to effectively integrate them during reasoning, as shown by language-only models with image captions outperforming MLLMs with raw visual inputs.

Method: Proposes three targeted visual perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, which can be integrated into existing post-training pipelines like SFT, DPO, and GRPO.

Result: Consistent improvements in mathematical reasoning across multiple datasets, with gains comparable to algorithmic changes. Achieved competitive performance among open-source 7B RL-tuned models with Qwen2.5-VL-7B.

Conclusion: Visual perturbation plays a critical role in multimodal mathematical reasoning, demonstrating that better reasoning begins with better seeing through enhanced perceptual robustness.

Abstract: Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.

[796] ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

Main category: cs.CV

TL;DR: ViSpec introduces vision-aware speculative decoding for VLMs, achieving substantial speedups by compressing image tokens and enhancing multimodal coherence through global feature vectors.

Details

Motivation: Speculative decoding is effective for LLMs but underexplored for VLMs, with existing methods achieving only modest speedups (<1.5x). The gap is significant as multimodal capabilities become central to large-scale models.

Method: ViSpec uses a lightweight vision adaptor to compress image tokens into compact representations, integrates them into draft model’s attention while preserving positional info, and augments text tokens with global image features. A specialized training dataset is curated to prevent shortcut learning.

Result: ViSpec achieves the first substantial speedup in VLM speculative decoding, significantly outperforming existing methods that only reach <1.5x speedup.

Conclusion: ViSpec successfully bridges the gap in speculative decoding for VLMs, demonstrating that large VLMs can effectively filter redundant image information while maintaining textual comprehension, enabling practical acceleration for multimodal inference.

Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model’s attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model’s hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.

[797] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models

Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin

Main category: cs.CV

TL;DR: DART introduces a differentiable dynamic adaptive region tokenizer that creates content-aware patches of varying sizes, enabling more efficient vision models by allocating higher token density to information-rich regions.

Details

Motivation: Standard vision models use fixed-grid tokenizers that create a trade-off between capturing fine-grained detail and computational redundancy. This bottleneck limits model efficiency and performance.

Method: DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating higher token density to information-rich regions.

Result: DART-equipped DeiT-Small (22M parameters) matches DeiT-Base (86M) performance with nearly double the inference speed. The approach also shows benefits in dense prediction and spatiotemporal video tasks.

Conclusion: Adaptive tokenization resolves the tokenizer bottleneck and is a key component for building more efficient and capable foundation models for multimodal AI, robotics, and content generation.

Abstract: The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the principle of adaptive tokenization proves its generality with clear benefits in dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation. Code is available at https://github.com/HCPLab-SYSU/DART.

[798] Decoupled Classifier-Free Guidance for Counterfactual Diffusion Models

Tian Xia, Fabio De Sousa Ribeiro, Rajat R Rasal, Avinash Kori, Raghav Mehta, Ben Glocker

Main category: cs.CV

TL;DR: The paper proposes Decoupled Classifier-Free Guidance (DCFG) to address limitations in counterfactual generation using diffusion models, enabling attribute-wise control through causal graphs.

Details

Motivation: Current classifier-free guidance (CFG) uses a global guidance scale for all attributes, causing spurious changes in counterfactuals. This motivates the need for more precise, attribute-level control.

Method: DCFG uses an attribute-split embedding strategy to disentangle semantic inputs, allowing selective guidance on user-defined attribute groups following causal relationships.

Result: DCFG enables more accurate counterfactual generation by preventing spurious changes and providing flexible attribute-wise control.

Conclusion: DCFG offers a model-agnostic solution for improved counterfactual generation with precise attribute-level guidance, overcoming limitations of standard CFG.

Abstract: Counterfactual generation aims to simulate realistic hypothetical outcomes under causal interventions. Diffusion models have emerged as a powerful tool for this task, combining DDIM inversion with conditional generation and classifier-free guidance (CFG). In this work, we identify a key limitation of CFG for counterfactual generation: it prescribes a global guidance scale for all attributes, leading to significant spurious changes in inferred counterfactuals. To mitigate this, we propose Decoupled Classifier-Free Guidance (DCFG), a flexible and model-agnostic guidance technique that enables attribute-wise control following a causal graph. DCFG is implemented via a simple attribute-split embedding strategy that disentangles semantic inputs, enabling selective guidance on user-defined attribute groups.

[799] FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Karmesh Yadav, Yusuf Ali, Gunshi Gupta, Yarin Gal, Zsolt Kira

Main category: cs.CV

TL;DR: Introduces a new benchmark for long-range embodied tasks in Habitat simulator to evaluate memory-based capabilities in robotics, addressing limitations of current VLMs in handling long-term experience.

Details

Motivation: Current VLMs struggle with processing long-term experience (hundreds of images) needed for embodied reasoning, and existing benchmarks don't adequately test memory in embodied contexts requiring object manipulation and navigation.

Method: Created a benchmark with 60 memory-intensive tasks in Habitat simulator that require sustained engagement and contextual awareness, with procedural extension capability for scalable evaluation. Integrated state-of-the-art VLMs with low-level navigation policies as baselines.

Result: The benchmark enables evaluation of memory-based capabilities in embodied settings, highlighting areas where current VLM approaches need improvement for long-horizon control tasks.

Conclusion: The proposed benchmark fills a critical gap in evaluating memory integration for embodied AI, providing a scalable framework to assess how well agents can recall historical information and execute actions based on that memory in complex environments.

Abstract: Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low level navigation policies, assessing their performance on these memory-intensive tasks and highlight areas for improvement.

[800] Do We Need Large VLMs for Spotting Soccer Actions?

Ritabrata Chakraborty, Rajatsubhra Chakraborty, Avijit Dasgupta, Sandeep Chaurasia

Main category: cs.CV

TL;DR: The paper proposes a text-based approach for soccer action spotting using LLMs instead of video processing, achieving near state-of-the-art performance with zero video compute.

Details

Motivation: Traditional video-based action spotting requires complex, computationally expensive models. The authors want to create a lightweight, scalable alternative using text commentary instead of visual inputs.

Method: Uses a system of three specialized LLMs acting as judges for outcome, excitement, and tactics to spot actions in soccer matches based on expert commentary.

Result: The language-centric approach performs effectively in detecting critical match events, coming close to state-of-the-art video-based spotters while using zero video processing compute.

Conclusion: Expert commentary contains sufficient information for reliable action spotting, enabling a lightweight, scalable text-based alternative to video-centric approaches.

Abstract: Traditional video-based tasks like soccer action spotting rely heavily on visual inputs, often requiring complex and computationally expensive models to process dense video data. We propose a shift from this video-centric approach to a text-based task, making it lightweight and scalable by utilizing Large Language Models (LLMs) instead of Vision-Language Models (VLMs). We posit that expert commentary, which provides rich descriptions and contextual cues contains sufficient information to reliably spot key actions in a match. To demonstrate this, we employ a system of three LLMs acting as judges specializing in outcome, excitement, and tactics for spotting actions in soccer matches. Our experiments show that this language-centric approach performs effectively in detecting critical match events coming close to state-of-the-art video-based spotters while using zero video processing compute and similar amount of time to process the entire match.

[801] MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He

Main category: cs.CV

TL;DR: MinerU2.5 is a 1.2B-parameter vision-language model for document parsing that uses a two-stage coarse-to-fine approach, achieving state-of-the-art accuracy with high computational efficiency.

Details

Motivation: To develop an efficient document parsing model that can handle high-resolution inputs without computational overhead while maintaining accuracy for complex elements like dense text, formulas, and tables.

Method: Two-stage parsing strategy: first stage performs layout analysis on downsampled images, second stage performs targeted content recognition on native-resolution crops. Supported by a comprehensive data engine for training data generation.

Result: Achieves state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks with significantly lower computational overhead.

Conclusion: MinerU2.5 demonstrates strong document parsing ability with exceptional computational efficiency, making it suitable for practical applications requiring accurate document understanding with limited resources.

Abstract: We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

[802] From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge

Muhammad Tayyab Khan, Lequn Chen, Zane Yong, Jun Ming Tan, Wenhe Feng, Seung Ki Moon

Main category: cs.CV

TL;DR: A hybrid vision-language framework combining rotation-aware object detection (YOLOv11-obb) with transformer-based vision-language parsing for efficient extraction of key information from 2D engineering drawings.

Details

Motivation: Manual extraction of information from 2D engineering drawings is slow and labor-intensive, while generic OCR models fail due to complex layouts, engineering symbols, and rotated text, leading to incomplete and unreliable outputs.

Method: Proposed hybrid framework integrates YOLOv11-obb for rotation-aware object detection and oriented bounding box extraction, followed by parsing using fine-tuned lightweight vision-language models (Donut and Florence-2) on a curated dataset of 1,367 mechanical drawings.

Result: Donut outperformed Florence-2 with 88.5% precision, 99.2% recall, 93.5% F1-score, and 11.5% hallucination rate. The extracted structured information successfully supports downstream manufacturing tasks like process and tool selection.

Conclusion: The proposed framework effectively addresses challenges in 2D engineering drawing interpretation, providing reliable automated extraction of key information for digital manufacturing workflows.

Abstract: Efficient and accurate extraction of key information from 2D engineering drawings is essential for advancing digital manufacturing workflows. Such information includes geometric dimensioning and tolerancing (GD&T), measures, material specifications, and textual annotations. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text, leading to incomplete and unreliable outputs. These limitations result in incomplete and unreliable outputs. To address these challenges, we propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-obb) with a transformer-based vision-language parser. Our structured pipeline applies YOLOv11-OBB to localize annotations and extract oriented bounding box (OBB) patches, which are then parsed into structured outputs using a fine-tuned, lightweight vision-language model (VLM). We curate a dataset of 1,367 2D mechanical drawings annotated across nine key categories. YOLOv11-OBB is trained on this dataset to detect OBBs and extract annotation patches. These are parsed using two open-source VLMs: Donut and Florence-2. Both models are lightweight and well-suited for specialized industrial tasks under limited computational overhead. Following fine-tuning of both models on the curated dataset of image patches paired with structured annotation labels, a comparative experiment is conducted to evaluate parsing performance across four key metrics. Donut outperforms Florence-2, achieving 88.5% precision, 99.2% recall, and a 93.5% F1-score, with a hallucination rate of 11.5%. Finally, a case study demonstrates how the extracted structured information supports downstream manufacturing tasks such as process and tool selection, showcasing the practical utility of the proposed framework in modernizing 2D drawing interpretation.

[803] Improving Black-Box Generative Attacks via Generator Semantic Consistency

Jongoh Jeong, Hunmin Yang, Jaeseok Jeong, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: The paper proposes a method to improve transfer attacks by enforcing semantic consistency in generative models through intermediate feature alignment with an EMA teacher, reducing semantic drift and improving black-box transfer performance without inference-time overhead.

Details

Motivation: Current generative attacks for transfer attacks optimize surrogate losses but overlook the generator's internal dynamics, failing to explore how internal representations shape transferable perturbations. This leads to underexplored potential in improving black-box transfer performance.

Method: Enforce semantic consistency by aligning early generator’s intermediate features to an EMA teacher, stabilizing object-aligned representations. Quantify semantic stability using standard deviation of foreground IoU between cluster-derived activation masks and foreground masks across generator blocks.

Result: The approach reduces semantic drift and improves black-box transfer across architectures, domains, and tasks. It can be seamlessly integrated into existing generative attacks with consistent improvements while maintaining test-time efficiency.

Conclusion: The proposed semantic consistency method effectively improves transfer attacks by addressing the generator’s internal dynamics, providing better black-box transfer performance without additional inference cost, and introducing ACR for more reliable evaluation of attack success.

Abstract: Transfer attacks optimize on a surrogate and deploy to a black-box target. While iterative optimization attacks in this paradigm are limited by their per-input cost limits efficiency and scalability due to multistep gradient updates for each input, generative attacks alleviate these by producing adversarial examples in a single forward pass at test time. However, current generative attacks still adhere to optimizing surrogate losses (e.g., feature divergence) and overlook the generator’s internal dynamics, underexploring how the generator’s internal representations shape transferable perturbations. To address this, we enforce semantic consistency by aligning the early generator’s intermediate features to an EMA teacher, stabilizing object-aligned representations and improving black-box transfer without inference-time overhead. To ground the mechanism, we quantify semantic stability as the standard deviation of foreground IoU between cluster-derived activation masks and foreground masks across generator blocks, and observe reduced semantic drift under our method. For more reliable evaluation, we also introduce Accidental Correction Rate (ACR) to separate inadvertent corrections from intended misclassifications, complementing the inherent blind spots in traditional Attack Success Rate (ASR), Fooling Rate (FR), and Accuracy metrics. Across architectures, domains, and tasks, our approach can be seamlessly integrated into existing generative attacks with consistent improvements in black-box transfer, while maintaining test-time efficiency.

[804] Light of Normals: Unified Feature Representation for Universal Photometric Stereo

Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao

Main category: cs.CV

TL;DR: LINO UniPS introduces light register tokens and interleaved attention blocks to decouple illumination from surface normals, plus wavelet-based architecture and normal-gradient loss to preserve high-frequency geometric details, achieving state-of-the-art universal photometric stereo performance.

Details

Motivation: Current universal photometric stereo methods cannot guarantee proper decoupling of illumination and normal information, and tend to lose high-frequency geometric details. The paper aims to address these two fundamental challenges.

Method: Proposes LINO UniPS with: (1) Light Register Tokens with light alignment supervision to aggregate various light types; (2) Interleaved Attention Block with global cross-image attention; (3) Wavelet-based Dual-branch Architecture; (4) Normal-gradient Perception Loss; and introduces PS-Verse dataset with curriculum training.

Result: Achieves state-of-the-art results on DiLiGenT and Luces benchmarks, demonstrates stronger generalization to real materials, improved efficiency, and better feature decoupling with preserved geometric details.

Conclusion: The proposed components effectively address illumination-normal decoupling and detail preservation challenges in universal photometric stereo, with comprehensive ablations confirming the contributions of each proposed technique.

Abstract: Universal photometric stereo (PS) is defined by two factors: it must (i) operate under arbitrary, unknown lighting conditions and (ii) avoid reliance on specific illumination models. Despite progress (e.g., SDM UniPS), two challenges remain. First, current encoders cannot guarantee that illumination and normal information are decoupled. To enforce decoupling, we introduce LINO UniPS with two key components: (i) Light Register Tokens with light alignment supervision to aggregate point, direction, and environment lights; (ii) Interleaved Attention Block featuring global cross-image attention that takes all lighting conditions together so the encoder can factor out lighting while retaining normal-related evidence. Second, high-frequency geometric details are easily lost. We address this with (i) a Wavelet-based Dual-branch Architecture and (ii) a Normal-gradient Perception Loss. These techniques yield a unified feature space in which lighting is explicitly represented by register tokens, while normal details are preserved via wavelet branch. We further introduce PS-Verse, a large-scale synthetic dataset graded by geometric complexity and lighting diversity, and adopt curriculum training from simple to complex scenes. Extensive experiments show new state-of-the-art results on public benchmarks (e.g., DiLiGenT, Luces), stronger generalization to real materials, and improved efficiency; ablations confirm that Light Register Tokens + Interleaved Attention Block drive better feature decoupling, while Wavelet-based Dual-branch Architecture + Normal-gradient Perception Loss recover finer details.

[805] SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

Liangbin Xie, Yu Li, Shian Du, Menghan Xia, Xintao Wang, Fanghua Yu, Ziyan Chen, Pengfei Wan, Jiantao Zhou, Chao Dong

Main category: cs.CV

TL;DR: The paper proposes a two-stage video generation framework with a lightweight cascaded video super-resolution model, introducing degradation strategies, temporal sampling insights, and efficient attention mechanisms for high-resolution video synthesis.

Details

Motivation: As user expectations shift toward higher-resolution video outputs, relying solely on latent diffusion models becomes inadequate. The paper aims to address this by studying underexplored design principles for cascaded video super-resolution models that can efficiently upscale base model outputs.

Method: Proposes two degradation strategies to generate training pairs that mimic base model output characteristics. Analyzes timestep sampling strategies and noise augmentation effects. Introduces interleaving temporal unit and sparse local attention for efficient training and inference.

Result: Extensive experiments demonstrate superiority over existing methods, with ablation studies confirming the efficacy of each design choice. The framework achieves efficient high-resolution video generation with reduced computational overhead.

Conclusion: Establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.

Abstract: Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for latter cascaded VSR models, which are underexplored currently. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.

[806] XTransfer: Modality-Agnostic Few-Shot Model Transfer for Human Sensing at the Edge

Yu Zhang, Xi Zhang, Hualin zhou, Xinyuan Chen, Shang Gao, Hong Jia, Jianfei Yang, Yuankai Qi, Tao Gu

Main category: cs.CV

TL;DR: XTransfer enables modality-agnostic few-shot model transfer for edge-based human sensing, reducing data and computational costs while maintaining state-of-the-art performance.

Details

Motivation: Deep learning for human sensing on edge systems is limited by scarce sensor data and resource constraints. Existing transfer learning methods require extensive data and resources, making them impractical for real-world edge applications.

Method: XTransfer uses model repairing to adapt pre-trained layers with minimal sensor data, and layer recombining to efficiently search and combine layers from source models to create compact models suitable for edge deployment.

Result: XTransfer achieves state-of-the-art performance across diverse human sensing datasets while significantly reducing costs for sensor data collection, model training, and edge deployment.

Conclusion: XTransfer provides an effective solution for resource-efficient model transfer in edge-based human sensing applications, enabling practical deployment with minimal data and computational requirements.

Abstract: Deep learning for human sensing on edge systems presents significant potential for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. While transferring pre-trained models to different sensing applications is promising, existing methods often require extensive sensor data and computational resources, resulting in high costs and poor adaptability in practice. In this paper, we propose XTransfer, a first-of-its-kind method enabling modality-agnostic, few-shot model transfer with resource-efficient design. XTransfer flexibly uses single or multiple pre-trained models and transfers knowledge across different modalities by (i) model repairing that safely mitigates modality shift by adapting pre-trained layers with only few sensor data, and (ii) layer recombining that efficiently searches and recombines layers of interest from source models in a layer-wise manner to create compact models. We benchmark various baselines across diverse human sensing datasets spanning different modalities. Comprehensive results demonstrate that XTransfer achieves state-of-the-art performance while significantly reducing the costs of sensor data collection, model training, and edge deployment.

[807] Controllable Reference Guided Diffusion with Local Global Fusion for Real World Remote Sensing Image Super Resolution

Ce Wang, Wanjie Sun

Main category: cs.CV

TL;DR: CRefDiff is a controllable reference-guided diffusion model for remote sensing image super-resolution that addresses under-generation and over-reliance on reference images through dual-branch fusion and accelerated inference.

Details

Motivation: Existing reference-based super-resolution methods struggle with real-world complexities like cross-sensor resolution gaps and land cover changes, leading to under-generation or over-reliance on reference images.

Method: Proposes CRefDiff with dual-branch fusion mechanism for adaptive local/global reference fusion, generative prior for accurate structures/textures, and Better Start strategy to reduce denoising steps for faster inference.

Result: Achieves state-of-the-art performance on RealRefRSSRD dataset, improves downstream tasks, and enables reference strength control during inference for enhanced interactivity.

Conclusion: CRefDiff effectively addresses real-world RefSR challenges in remote sensing through controllable fusion and accelerated diffusion modeling, with demonstrated superior performance on the new RealRefRSSRD dataset.

Abstract: Super resolution techniques can enhance the spatial resolution of remote sensing images, enabling more efficient large scale earth observation applications. While single image SR methods enhance low resolution images, they neglect valuable complementary information from auxiliary data. Reference based SR can be interpreted as an information fusion task, where historical high resolution reference images are combined with current LR observations. However, existing RefSR methods struggle with real world complexities, such as cross sensor resolution gap and significant land cover changes, often leading to under generation or over reliance on reference image. To address these challenges, we propose CRefDiff, a novel controllable reference guided diffusion model for real world remote sensing image SR. To address the under generation problem, CRefDiff leverages a powerful generative prior to produce accurate structures and textures. To mitigate over reliance on the reference, we introduce a dual branch fusion mechanism that adaptively fuse both local and global information from the reference image. Moreover, the dual branch design enables reference strength control during inference, enhancing the models interactivity and flexibility. Finally, the Better Start strategy is proposed to significantly reduce the number of denoising steps, thereby accelerating the inference process. To support further research, we introduce RealRefRSSRD, a new real world RefSR dataset for remote sensing images, consisting of HR NAIP and LR Sentinel2 image pairs with diverse land cover changes and significant temporal gaps. Extensive experiments on RealRefRSSRD show that CRefDiff achieves SOTA performance and improves downstream tasks.

[808] Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li

Main category: cs.CV

TL;DR: REG (Representation Entanglement for Generation) is a method that entangles low-level image latents with high-level class tokens from pretrained models to improve diffusion model training efficiency and generation quality with minimal inference overhead.

Details

Motivation: Existing methods like REPA use external alignment between noisy projections and clean image representations, but this alignment is absent during inference, limiting their effectiveness in harnessing discriminative representations.

Method: The method entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising, enabling concurrent reconstruction of both image latents and their corresponding global semantics.

Result: SiT-XL/2 + REG achieves 63× faster training than SiT-XL/2 and 23× faster than SiT-XL/2 + REPA on ImageNet 256×256. SiT-L/2 + REG trained for 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations (10× longer).

Conclusion: REG substantially improves generation quality and training efficiency with negligible inference overhead (only one additional token, <0.5% increase in FLOPs and latency), enabling direct production of coherent image-class pairs from pure noise.

Abstract: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.

[809] FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

Xinhua Lu, Runhe Lai, Yanqi Wu, Kanghao Chen, Wei-Shi Zheng, Ruixuan Wang

Main category: cs.CV

TL;DR: A CLIP-based framework called Forced prompt leArning (FA) that improves OOD detection by learning diversified prompts for ID classes without external datasets.

Details

Motivation: Existing CLIP-based OOD detection methods focus on learning OOD knowledge, showing limited generalization or reliance on external datasets. Instead, this work aims to better utilize ID knowledge.

Method: Learn a forced prompt with diversified descriptions of ID classes beyond class labels, using a forced coefficient to encourage comprehensive learning of ID semantics.

Result: Achieves notable improvements in OOD detection without external datasets while maintaining same parameters as CoOp, consistently outperforming state-of-the-art methods.

Conclusion: FA effectively boosts OOD detection by focusing on enriching ID knowledge through forced prompt learning, demonstrating strong performance without external data.

Abstract: Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at https://github.com/0xFAFA/FA.

[810] Counterfactual Visual Explanation via Causally-Guided Adversarial Steering

Yiran Qiao, Disheng Liu, Yiren Lu, Yu Yin, Mengnan Du, Jing Ma

Main category: cs.CV

TL;DR: CECAS is a novel framework that generates counterfactual visual explanations using causal guidance to avoid spurious correlations, achieving better performance than existing methods.

Details

Motivation: Existing counterfactual visual explanation methods neglect causal relationships and spurious correlations, leading to unintended alterations and limited explanation quality.

Method: Leverages a causally-guided adversarial method to generate counterfactual explanations, integrating causal perspective to avoid unwanted perturbations on spurious factors.

Result: Outperforms state-of-the-art approaches across multiple benchmark datasets, achieving balanced trade-off among validity, sparsity, proximity, and realism.

Conclusion: CECAS successfully addresses the limitations of previous methods by incorporating causal relationships, resulting in higher quality counterfactual visual explanations.

Abstract: Recent work on counterfactual visual explanations has contributed to making artificial intelligence models more explainable by providing visual perturbation to flip the prediction. However, these approaches neglect the causal relationships and the spurious correlations behind the image generation process, which often leads to unintended alterations in the counterfactual images and renders the explanations with limited quality. To address this challenge, we introduce a novel framework CECAS, which first leverages a causally-guided adversarial method to generate counterfactual explanations. It innovatively integrates a causal perspective to avoid unwanted perturbations on spurious factors in the counterfactuals. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches across multiple benchmark datasets and ultimately achieves a balanced trade-off among various aspects of validity, sparsity, proximity, and realism.

[811] 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving

Yixun Zhang, Lizhi Wang, Junjun Zhao, Wending Zhao, Feng Zhou, Yonghao Dang, Jianqin Yin

Main category: cs.CV

TL;DR: 3DGAA is a novel adversarial attack framework that uses 3D Gaussian Splatting to jointly optimize geometry and appearance for physically realistic attacks on autonomous driving object detection systems.

Details

Motivation: Existing 2D and 3D physical attacks struggle to balance physical realism and attack robustness due to their focus on texture optimization alone.

Method: Leverages 14-dimensional parameterization of 3D Gaussian Splatting to jointly optimize geometric (shape, scale, rotation) and appearance (color, opacity) attributes, with physical filtering and augmentation modules.

Result: Reduces detection mAP from 87.21% to 7.38%, significantly outperforming existing 3D physical attacks and maintaining high transferability across different physical conditions.

Conclusion: 3DGAA achieves state-of-the-art performance in physically realizable adversarial attacks by jointly optimizing geometry and appearance while maintaining physical realism.

Abstract: Camera-based object detection systems play a vital role in autonomous driving, yet they remain vulnerable to adversarial threats in real-world environments. Existing 2D and 3D physical attacks, due to their focus on texture optimization, often struggle to balance physical realism and attack robustness. In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel adversarial object generation framework that leverages the full 14-dimensional parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry and appearance in physically realizable ways. Unlike prior works that rely on patches or texture optimization, 3DGAA jointly perturbs both geometric attributes (shape, scale, rotation) and appearance attributes (color, opacity) to produce physically realistic and transferable adversarial objects. We further introduce a physical filtering module that filters outliers to preserve geometric fidelity, and a physical augmentation module that simulates complex physical scenarios to enhance attack generalization under real-world conditions. We evaluate 3DGAA on both virtual benchmarks and physical-world setups using miniature vehicle models. Experimental results show that 3DGAA achieves to reduce the detection mAP from 87.21% to 7.38%, significantly outperforming existing 3D physical attacks. Moreover, our method maintains high transferability across different physical conditions, demonstrating a new state-of-the-art in physically realizable adversarial attacks.

[812] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation

X. Feng, H. Yu, M. Wu, S. Hu, J. Chen, C. Zhu, J. Wu, X. Chu, K. Huang

Main category: cs.CV

TL;DR: NarrLV is the first benchmark for evaluating narrative expression in long video generation models, introducing Temporal Narrative Atoms (TNAs) to measure narrative richness and using MLLM-based evaluation that aligns with human judgments.

Details

Motivation: Current evaluation benchmarks for long video generation models primarily use simple narrative prompts, lacking comprehensive assessment of narrative content expression capabilities in longer videos.

Method: Introduces Temporal Narrative Atoms (TNAs) as basic narrative units, creates automatic prompt generation pipeline for flexible TNA expansion, and designs MLLM-based question-answering evaluation metric across three progressive narrative levels.

Result: Experimental results show the metric aligns closely with human judgments and reveals detailed capability boundaries of current video generation models in narrative content expression.

Conclusion: NarrLV provides the first comprehensive benchmark for evaluating narrative expression in long videos, offering quantitative measurement of narrative richness and effective assessment of model capabilities.

Abstract: With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.

[813] NoiseSDF2NoiseSDF: Learning Clean Neural Fields from Noisy Supervision

Tengkai Wang, Weihao Li, Ruikai Cui, Shi Qiu, Nick Barnes

Main category: cs.CV

TL;DR: NoiseSDF2NoiseSDF extends the Noise2Noise paradigm to 3D neural fields, enabling clean neural SDF learning from noisy point clouds through noisy supervision.

Details

Motivation: Reconstructing accurate implicit surfaces from noisy point clouds captured by low-quality scanning devices is challenging, as substantial noise leads to inaccurate surface reconstructions.

Method: Extends Noise2Noise to 3D neural fields by minimizing MSE loss between noisy SDF representations, allowing implicit denoising and surface refinement directly from noisy point clouds.

Result: Significantly improves surface reconstruction quality from noisy inputs across benchmarks including ShapeNet, ABC, Famous, and Real datasets.

Conclusion: NoiseSDF2NoiseSDF effectively enables learning clean neural SDFs directly from noisy point clouds through noisy supervision, demonstrating improved reconstruction quality.

Abstract: Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using low-quality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs directly from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs.

[814] Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey

Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao-Xiao Long, Hanxue Liang, Zexiang Xu, Hao Su, Christian Theobalt, Christian Rupprecht, Andrea Vedaldi, Kaichen Zhou, Paul Pu Liang, Shijian Lu, Fangneng Zhan

Main category: cs.CV

TL;DR: Survey of feed-forward deep learning methods for 3D reconstruction and view synthesis, covering various representation architectures and applications.

Details

Motivation: Traditional 3D reconstruction methods are computationally intensive and iterative, limiting real-world applications. Feed-forward deep learning approaches enable faster and more generalizable solutions.

Method: Comprehensive review of feed-forward techniques with taxonomy based on representation architectures (point cloud, 3D Gaussian Splatting, Neural Radiance Fields, etc.), covering pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware synthesis.

Result: The survey examines key applications in digital humans, SLAM, robotics, and reviews datasets and evaluation protocols for various downstream tasks.

Conclusion: Feed-forward approaches have potential to advance 3D vision, with open research challenges and promising future directions identified.

Abstract: 3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.

[815] CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction

Jisu Shin, Richard Shaw, Seunghyun Shin, Zhensong Zhang, Hae-Gon Jeon, Eduardo Perez-Pellitero

Main category: cs.CV

TL;DR: A feed-forward approach using bilateral grids to correct photometric inconsistencies in multi-view images, enabling efficient large-scale harmonization without scene-specific retraining.

Details

Motivation: Camera processing pipelines introduce photometric inconsistencies across views, violating multi-view consistency and degrading novel view synthesis. Existing scene-specific optimization methods increase computational complexity and slow training.

Method: Predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner. Uses hybrid self-supervised rendering loss with 3D foundation models to overcome lack of paired data.

Result: Processes hundreds of frames in a single step, outperforms or matches reconstruction quality of scene-specific optimization methods without significantly affecting baseline 3D model training time.

Conclusion: Provides cross-scene generalization without requiring scene-specific retraining, enabling efficient large-scale harmonization that integrates seamlessly into downstream 3D reconstruction models.

Abstract: Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial individually, often introduce photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade novel view synthesis. Joint optimization of scene-specific representations and per-image appearance embeddings has been proposed to address this issue, but with increased computational complexity and slower training. In this work, we propose a generalizable, feed-forward approach that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner. Our model processes hundreds of frames in a single step, enabling efficient large-scale harmonization, and seamlessly integrates into downstream 3D reconstruction models, providing cross-scene generalization without requiring scene-specific retraining. To overcome the lack of paired data, we employ a hybrid self-supervised rendering loss leveraging 3D foundation models, improving generalization to real-world variations. Extensive experiments show that our approach outperforms or matches the reconstruction quality of existing scene-specific optimization methods with appearance modeling, without significantly affecting the training time of baseline 3D models.

[816] Diffusion models for multivariate subsurface generation and efficient probabilistic inversion

Roberto Miele, Niklas Linde

Main category: cs.CV

TL;DR: Diffusion models outperform other generative models for multivariate subsurface modeling and probabilistic inversion, with improved conditioning methods that enhance statistical robustness and reduce computational costs.

Details

Motivation: To enhance multivariate modeling capabilities in subsurface applications and improve probabilistic inversion by leveraging diffusion models' stable training and superior performance over VAEs and GANs.

Method: Proposed corrections to Diffusion Posterior Sampling, including a likelihood approximation accounting for inherent noise-contamination in diffusion modeling. Used conditional modeling with both hard data (well logs) and nonlinear geophysics (seismic data).

Result: Significantly improved statistical robustness, enhanced posterior sampling, and reduced computational costs compared to original approaches. Method works with both hard and indirect conditioning data simultaneously.

Conclusion: Diffusion models provide faster and more robust probabilistic inversion than methods requiring outer-loop approaches like MCMC, making them suitable for complex multivariate geological scenarios.

Abstract: Diffusion models offer stable training and state-of-the-art performance for deep generative modeling tasks. Here, we consider their use in the context of multivariate subsurface modeling and probabilistic inversion. We first demonstrate that diffusion models enhance multivariate modeling capabilities compared to variational autoencoders and generative adversarial networks. In diffusion modeling, the generative process involves a comparatively large number of time steps with update rules that can be modified to account for conditioning data. We propose different corrections to the popular Diffusion Posterior Sampling approach by Chung et al. (2023). In particular, we introduce a likelihood approximation accounting for the noise-contamination that is inherent in diffusion modeling. We assess performance in a multivariate geological scenario involving facies and correlated acoustic impedance. Conditional modeling is demonstrated using both local hard data (well logs) and nonlinear geophysics (fullstack seismic data). Our tests show significantly improved statistical robustness, enhanced sampling of the posterior probability density function and reduced computational costs, compared to the original approach. The method can be used with both hard and indirect conditioning data, individually or simultaneously. As the inversion is included within the diffusion process, it is faster than other methods requiring an outer-loop around the generative model, such as Markov chain Monte Carlo.

[817] Synthetic-to-Real Camouflaged Object Detection

Zhihao Luo, Luojun Lin, Zheng Lin

Main category: cs.CV

TL;DR: The paper proposes CSRDA framework for Syn-to-Real Camouflaged Object Detection to address limited real datasets by using synthetic data and unlabeled real images through domain adaptation.

Details

Motivation: Limited availability of real camouflaged object detection datasets due to high collection and labeling costs, especially for specialized categories where synthetic data can help but causes performance degradation when used directly.

Method: Cycling Syn-to-Real Domain Adaptation Framework (CSRDA) based on student-teacher model, using pseudo labeling with consistency regularization and recurrent learning to bridge domain gap between synthetic and real data.

Result: Extensive experiments demonstrate effectiveness in mitigating limited data and handcraft annotations in COD, improving model performance in real-world scenarios.

Conclusion: CSRDA framework successfully addresses the data scarcity problem in camouflaged object detection by leveraging synthetic data and unlabeled real images through domain adaptation techniques.

Abstract: Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image dataset is insufficiently populated. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, directly training with synthetic datasets compared to real datasets can lead to a degradation in model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). In order to improve the model performance in real world scenarios, a set of annotated synthetic camouflaged images and a limited number of unannotated real images must be utilized. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specially, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain for bridging the source and target domain. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problem of limited data and handcraft annotations in COD. Our code is publicly available at: https://github.com/Muscape/S2R-COD.

Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim

Main category: cs.CV

TL;DR: Proposes BLiM framework with Candidate Prior Normalization (CPN) to address candidate prior bias in text-video retrieval using MLLMs, achieving state-of-the-art performance.

Details

Motivation: Naive application of MLLMs in text-video retrieval introduces candidate prior bias, favoring candidates with inherently higher priors over more relevant ones.

Method: BLiM framework uses bidirectional likelihood estimation (text from video and video features from text) plus CPN for training-free score calibration to mitigate candidate prior bias.

Result: Outperforms previous SOTA models by 6.4 R@1 on average across four benchmarks, effectively alleviating candidate prior bias and emphasizing query-candidate relevance.

Conclusion: CPN has broad applicability beyond retrieval tasks and enhances visual understanding by reducing reliance on textual priors.

Abstract: Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.

[819] ModelNet40-E: An Uncertainty-Aware Benchmark for Point Cloud Classification

Pedro Alonso, Tianrui Li, Chongshou Li

Main category: cs.CV

TL;DR: ModelNet40-E is a new benchmark for evaluating point cloud classification models’ robustness and calibration under LiDAR-like noise, featuring noise-corrupted data with uncertainty annotations.

Details

Motivation: Existing benchmarks lack comprehensive evaluation of model robustness and uncertainty modeling under realistic noise conditions common in LiDAR data.

Method: Created ModelNet40-E benchmark with synthetic LiDAR-like noise using Gaussian parameters (σ, μ), evaluated PointNet, DGCNN, and Point Transformer v3 on classification accuracy, calibration metrics, and uncertainty-awareness.

Result: All models degraded under increasing noise, but Point Transformer v3 showed superior calibration with predicted uncertainties better aligned with measurement uncertainty.

Conclusion: ModelNet40-E enables fine-grained evaluation of uncertainty modeling, revealing Point Transformer v3’s advantages in calibration under noisy conditions.

Abstract: We introduce ModelNet40-E, a new benchmark designed to assess the robustness and calibration of point cloud classification models under synthetic LiDAR-like noise. Unlike existing benchmarks, ModelNet40-E provides both noise-corrupted point clouds and point-wise uncertainty annotations via Gaussian noise parameters ({\sigma}, {\mu}), enabling fine-grained evaluation of uncertainty modeling. We evaluate three popular models-PointNet, DGCNN, and Point Transformer v3-across multiple noise levels using classification accuracy, calibration metrics, and uncertainty-awareness. While all models degrade under increasing noise, Point Transformer v3 demonstrates superior calibration, with predicted uncertainties more closely aligned with the underlying measurement uncertainty.

[820] Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning

Yiheng Li, Zichang Tan, Zhen Lei, Xu Zhou, Yang Yang

Main category: cs.CV

TL;DR: IAPL is a novel AI-generated image detection method that uses dynamic prompt learning to adapt to each input image, achieving state-of-the-art performance on benchmark datasets.

Details

Motivation: Current methods struggle to generalize to unseen generators because they capture limited patterns during training and fail to adapt to evolving forgery traits.

Method: Image-Adaptive Prompt Learning (IAPL) dynamically adjusts prompts using a gated mechanism that combines conditional information from CNN feature extractors and test-time adaptive tokens optimized through prediction consistency across multiple views.

Result: Achieves mean accuracies of 95.61% on UniversalFakeDetect and 96.7% on GenImage datasets, outperforming existing methods.

Conclusion: IAPL provides superior robustness and adaptability to diverse forged images through dynamic prompt adjustment, making it effective against unseen generators.

Abstract: In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each input image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight gated mechanism. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and image-specific conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the cropped view with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively. Codes and weights will be released on https://github.com/liyih/IAPL.

[821] Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan

Marta Moscati, Ahmed Abdullah, Muhammad Saad Saeed, Shah Nawaz, Rohan Kumar Das, Muhammad Zaigham Zaheer, Junaid Mir, Muhammad Haroon Yousaf, Khalid Malik, Markus Schedl

Main category: cs.CV

TL;DR: The FAME 2026 Challenge focuses on face-voice association in multilingual environments using the MAV-Celeb dataset, addressing the fact that half the world’s population is bilingual.

Details

Motivation: Half of the world's population is bilingual and people often communicate in multilingual scenarios, creating a need to explore face-voice association under these unique conditions.

Method: The challenge uses the Multilingual Audio-Visual (MAV-Celeb) dataset and provides baseline models to explore face-voice association in multilingual environments.

Result: The paper presents the challenge details, dataset specifications, baseline models, and task requirements for the FAME 2026 Challenge.

Conclusion: The FAME Challenge aims to advance research in multimodal systems by focusing on face-voice association in multilingual scenarios, which represents real-world communication patterns.

Abstract: The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, audio-visual systems are among the most widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to the presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired from the fact that half of the world’s population is bilingual and most often people communicate under multilingual scenarios. The challenge uses a dataset named Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baseline models, and task details for the FAME Challenge.

[822] Deepfake Detection that Generalizes Across Benchmarks

Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz

Main category: cs.CV

TL;DR: GenD achieves state-of-the-art deepfake detection generalization by fine-tuning only 0.03% of Layer Normalization parameters in pre-trained vision encoders, using L2 normalization and metric learning on hyperspherical feature manifolds.

Details

Motivation: Deepfake detectors struggle with generalization to unseen manipulation techniques, and existing approaches often introduce significant architectural complexity without achieving robust generalization.

Method: Parameter-efficient adaptation of pre-trained vision encoders by fine-tuning only Layer Normalization parameters (0.03% of total), with L2 normalization and metric learning on hyperspherical feature manifolds.

Result: Achieves state-of-the-art performance on 14 benchmark datasets (2019-2025), outperforming more complex approaches in average cross-dataset AUROC. Key findings: training on paired real-fake data is essential, and detection difficulty hasn’t strictly increased over time.

Conclusion: State-of-the-art generalization in deepfake detection is achievable through minimal, targeted changes to pre-trained foundational models, providing a computationally efficient and reproducible method.

Abstract: The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders. The proposed method, GenD, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and metric learning on it. We conducted an extensive evaluation on 14 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained foundational image encoder model. The code will be made publicly available upon acceptance.

[823] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

Main category: cs.CV

TL;DR: HiMat is a diffusion-based framework for efficient 4K SVBRDF generation that addresses memory/computational costs through latent space generation and cross-map consistency through CrossStitch modules.

Details

Motivation: Creating ultra-high-resolution SVBRDFs is critical for photorealistic 3D content but faces challenges with memory/computational costs from multiple reflectance maps and maintaining pixel-level alignment at 4K resolution.

Method: Uses DC-AE for high-compression latent space generation, pretrained diffusion transformer with linear attention for efficiency, and CrossStitch convolutional modules to enforce cross-map consistency without global attention costs.

Result: HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods.

Conclusion: The framework successfully addresses key challenges in 4K SVBRDF generation and generalizes to related applications like intrinsic decomposition.

Abstract: Creating ultra-high-resolution spatially varying bidirectional reflectance functions (SVBRDFs) is critical for photorealistic 3D content creation, to faithfully represent fine-scale surface details required for close-up rendering. However, achieving 4K generation faces two key challenges: (1) the need to synthesize multiple reflectance maps at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) the requirement to maintain strong pixel-level alignment across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE, and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.

[824] Bridging Semantic Logic Gaps: A Cognition Inspired Multimodal Boundary Preserving Network for Image Manipulation Localization

Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang

Main category: cs.CV

TL;DR: CMB-Net is a cognition-inspired multimodal network that uses LLMs to analyze manipulated image regions and generate textual prompts, addressing semantic relationship gaps in visual information for improved image manipulation localization.

Details

Motivation: Existing IML models rely only on visual cues and ignore semantic logical relationships between content features. Image manipulation disrupts internal content relationships, creating semantic clues that can be leveraged for better localization.

Method: Proposes CMB-Net with: 1) LLM-generated textual prompts for semantic compensation, 2) ITCAM to weight text features by quantifying image-text ambiguity, 3) ITIM for fine-grained visual-text feature alignment, and 4) RED decoder inspired by invertible neural networks to preserve boundary information.

Result: Extensive experiments show CMB-Net outperforms most existing IML models in image manipulation localization accuracy.

Conclusion: The proposed multimodal approach effectively leverages semantic relationships through LLM-generated text and boundary-preserving techniques, demonstrating superior performance in detecting manipulated image regions.

Abstract: The existing image manipulation localization (IML) models mainly relies on visual cues, but ignores the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationship between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition inspired multimodal boundary preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that the erroneous texts induced by hallucination from LLMs will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models. Our code is available on https://github.com/vpsg-research/CMB-Net.

[825] Power Battery Detection

Xiaoqi Zhao, Peiqian Cao, Chenyang Yu, Zonglei Feng, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Youwei Pang, Jinsong Ouyang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu

Main category: cs.CV

TL;DR: The paper introduces PBD5K, the first large-scale benchmark for power battery detection (PBD) from X-ray images, and proposes MDCNeXt model that integrates multi-dimensional structure clues with state space modules to address challenges in detecting dense battery plates.

Details

Motivation: Power batteries in electric vehicles have internal structural defects that pose safety risks. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts in X-ray images.

Method: Developed PBD5K benchmark with 5,000 X-ray images and intelligent annotation pipeline. Formulated PBD as point-level segmentation problem and proposed MDCNeXt model that extracts point, line, and count information with two state space modules: prompt-filtered module for contrastive relationships and density-aware reordering module for high-density regions. Also introduced distance-adaptive mask generation strategy.

Result: The paper presents PBD5K as the first large-scale benchmark for power battery detection with fine-grained annotations and real-world visual interference. The proposed MDCNeXt model effectively handles the challenges of dense plate detection in X-ray images.

Conclusion: The work addresses the important safety inspection task for power batteries in electric vehicles through a comprehensive benchmark and novel model design that integrates multi-dimensional structure clues with advanced state space modules to improve detection accuracy in challenging X-ray imaging conditions.

Abstract: Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and drive more attention into this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at \href{https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD}{PBD5K}.

[826] Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li

Main category: cs.CV

TL;DR: M3-Agent is a multimodal agent framework with long-term memory that processes visual/auditory inputs to build episodic and semantic memories, enabling autonomous reasoning and task completion through memory retrieval.

Details

Motivation: To develop more human-like multimodal agents capable of accumulating world knowledge through long-term memory and using it for complex reasoning tasks in dynamic environments.

Method: Uses entity-centric multimodal memory organization, processes real-time visual/auditory inputs to build episodic/semantic memories, and employs reinforcement learning for training. Evaluated on M3-Bench benchmark with robot-perspective and web-sourced videos.

Result: Outperforms Gemini-1.5-pro and GPT-4o baselines by 6.7% on M3-Bench-robot, 7.7% on M3-Bench-web, and 5.3% on VideoMME-long, demonstrating superior memory effectiveness and reasoning capabilities.

Conclusion: M3-Agent advances multimodal agents toward human-like long-term memory and provides practical design insights, with the framework showing significant performance improvements over state-of-the-art prompting agents.

Abstract: We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.

[827] BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

Youping Gu, Xiaolong Li, Yuhao Hu, Minqi Chen, Bohan Zhuang

Main category: cs.CV

TL;DR: BLADE is a data-free joint training framework that combines adaptive block-sparse attention with sparsity-aware step distillation to accelerate Diffusion Transformers for video generation while improving quality.

Details

Motivation: Diffusion Transformers face slow inference due to iterative denoising and quadratic attention costs for long sequences. Existing acceleration strategies like step distillation and sparse attention have limitations when combined - training-free integration yields poor results, while separate training requires expensive video data.

Method: Proposes BLADE with two key components: (1) Adaptive Block-Sparse Attention (ASA) for dynamic content-aware sparsity masks, and (2) sparsity-aware step distillation based on Trajectory Distribution Matching that incorporates sparsity directly into distillation rather than as a separate step.

Result: Achieves 14.10x end-to-end inference acceleration on Wan2.1-1.3B over 50-step baseline, and 8.89x speedup on CogVideoX-5B. Quality improves consistently - VBench-2.0 scores increase from 0.534 to 0.569 for CogVideoX-5B and from 0.563 to 0.570 for Wan2.1-1.3B, with superior human evaluation ratings.

Conclusion: BLADE successfully combines sparse attention and step distillation through joint training, achieving significant acceleration while improving video generation quality across different model scales without requiring additional high-quality video data.

Abstract: Diffusion Transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges – training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm, built upon Trajectory Distribution Matching (TDM), directly incorporates sparsity into the distillation process rather than treating it as a separate compression step and features fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B, and our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Project is available at http://ziplab.co/BLADE-Homepage/.

[828] G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

Ramil Khafizov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, Evgeny Burnaev

Main category: cs.CV

TL;DR: G-CUT3R enhances 3D scene reconstruction by integrating prior information like depth, camera calibrations, or positions into the CUT3R model through dedicated encoders and zero convolution fusion.

Details

Motivation: Existing feed-forward methods rely only on input images, ignoring commonly available auxiliary data in real-world scenarios that could improve reconstruction quality.

Method: Lightweight modification to CUT3R with dedicated encoders for each modality, fusing features with RGB image tokens via zero convolution for flexible integration of any prior combination.

Result: Significant performance improvements across multiple benchmarks for 3D reconstruction and multi-view tasks, demonstrating effective utilization of available priors.

Conclusion: G-CUT3R successfully leverages prior information to enhance 3D scene reconstruction while maintaining compatibility with varying input modalities.

Abstract: We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to CUT3R, incorporating a dedicated encoder for each modality to extract features, which are fused with RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing its ability to effectively utilize available priors while maintaining compatibility with varying input modalities.

[829] UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding

Yueming Xu, Jiahui Zhang, Ze Huang, Yurui Chen, Yanpeng Zhou, Zhenyu Chen, Yu-Jie Yuan, Pengxiang Xia, Guowei Huang, Xinyue Cai, Zhongang Qi, Xingyue Quan, Jianye Hao, Hang Xu, Li Zhang

Main category: cs.CV

TL;DR: UniUGG is the first unified framework for 3D understanding and generation that uses an LLM to process sentences and 3D representations, with a spatial decoder using latent diffusion for high-quality 3D generation.

Details

Motivation: Integration of 3D tasks into unified architectures remains challenging and largely unexplored, despite progress in 2D image understanding and generation.

Method: Uses LLM for comprehension and decoding, spatial decoder with latent diffusion model for 3D generation, and geometric-semantic learning strategy to pretrain vision encoder for capturing both semantic and geometric cues.

Result: Extensive experiments demonstrate superiority in visual representation, spatial understanding, and 3D generation. Supports both generation from reference images with view transformations and spatial VQA tasks.

Conclusion: UniUGG successfully bridges the gap between 3D understanding and generation, showing strong performance across multiple 3D tasks through unified architecture design.

Abstract: Despite the impressive progress on understanding and generating images shown by the recent unified architectures, the integration of 3D tasks remains challenging and largely unexplored. In this paper, we introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our unified framework employs an LLM to comprehend and decode sentences and 3D representations. At its core, we propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations. This allows for the generation and imagination of 3D scenes based on a reference image and an arbitrary view transformation, while remaining supports for spatial visual question answering (VQA) tasks. Additionally, we propose a geometric-semantic learning strategy to pretrain the vision encoder. This design jointly captures the input’s semantic and geometric cues, enhancing both spatial understanding and generation. Extensive experimental results demonstrate the superiority of our method in visual representation, spatial understanding, and 3D generation. The source code will be released upon paper acceptance.

[830] Semantic Discrepancy-aware Detector for Image Forgery Identification

Ziye Wang, Minghang Yu, Chunyan Xu, Zhen Cui

Main category: cs.CV

TL;DR: Proposes SDD, a semantic discrepancy-aware detector that uses reconstruction learning to align forgery and semantic concept spaces for improved fake image detection.

Details

Motivation: The misalignment between forgery and semantic concept spaces hinders fake image detection performance, requiring better alignment methods.

Method: Uses semantic token sampling, concept-level forgery discrepancy learning with visual reconstruction, and low-level forgery feature enhancement.

Result: Achieves superior results on two standard image forgery datasets compared to existing methods.

Conclusion: SDD effectively aligns forgery and semantic spaces, improving fake image detection through semantic-guided discrepancy learning.

Abstract: With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned semantic concepts of pre-trained models are critical for identifying fake images. However, the misalignment between the forgery and semantic concept spaces hinders the model’s forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction learning to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery traces and semantic concepts. A concept-level forgery discrepancy learning module, built upon a visual reconstruction paradigm, is proposed to strengthen the interaction between visual semantic concepts and forgery traces, effectively capturing discrepancies under the concepts’ guidance. Finally, the low-level forgery feature enhancemer integrates the learned concept level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods. The code is available at https://github.com/wzy1111111/SSD.

[831] Temporal Grounding as a Learning Signal for Referring Video Object Segmentation

Seunghun Lee, Jiwan Seo, Jeonghoon Kim, Sungho Moon, Siwon Kim, Haeun Yun, Hyogyeong Jeon, Wonhyeok Choi, Jaehoon Jeong, Zane Durante, Sang Hyun Park, Sunghoon Im

Main category: cs.CV

TL;DR: Proposes Temporally Grounded Learning (TGL) framework with temporal annotations to address semantic misalignment in Referring Video Object Segmentation by incorporating explicit temporal grounding signals.

Details

Motivation: Existing RVOS methods suffer from semantic misalignment due to indiscriminate frame sampling and supervision of all visible objects regardless of their relevance to language expressions, lacking explicit temporal learning signals.

Method: Introduces MeViS-M dataset with manual temporal span annotations, and TGL framework with Moment-guided Dual-path Propagation (MDP) for decoupled segmentation/propagation and Object-level Selective Supervision (OSS) for temporally-aligned supervision.

Result: Achieves new state-of-the-art performance on the challenging MeViS benchmark by effectively leveraging temporal signals to reduce semantic noise and improve language-conditioned learning.

Conclusion: Temporal grounding is crucial for RVOS, and the proposed TGL framework with explicit temporal annotations and selective supervision effectively addresses semantic misalignment issues in video object segmentation.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. However, existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training – regardless of their actual relevance to the expression. We identify the core problem as the absence of an explicit temporal learning signal in conventional training paradigms. To address this, we introduce MeViS-M, a dataset built upon the challenging MeViS benchmark, where we manually annotate temporal spans when each object is referred to by the expression. These annotations provide a direct, semantically grounded supervision signal that was previously missing. To leverage this signal, we propose Temporally Grounded Learning (TGL), a novel learning framework that directly incorporates temporal grounding into the training process. Within this frame- work, we introduce two key strategies. First, Moment-guided Dual-path Propagation (MDP) improves both grounding and tracking by decoupling language-guided segmentation for relevant moments from language-agnostic propagation for others. Second, Object-level Selective Supervision (OSS) supervises only the objects temporally aligned with the expression in each training clip, thereby reducing semantic noise and reinforcing language-conditioned learning. Extensive experiments demonstrate that our TGL framework effectively leverages temporal signal to establish a new state-of-the-art on the challenging MeViS benchmark. We will make our code and the MeViS-M dataset publicly available.

[832] Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies

Yiting Wang, Ziwei Wang, Jiachen Zhong, Di Zhu, Weiyi Li

Main category: cs.CV

TL;DR: This study evaluates small language models (SLMs) for medical imaging classification, specifically chest X-ray position classification, and demonstrates that well-designed prompts can achieve competitive accuracy comparable to larger models.

Details

Motivation: Large language models face adoption barriers in healthcare due to high computational costs, limited accessibility, and data privacy concerns, creating a need for more practical alternatives.

Method: Evaluated multiple SLMs on NIH Chest X-ray dataset for AP vs. PA position classification using three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts.

Result: Certain SLMs achieved competitive accuracy with well-crafted prompts, showing that prompt engineering can substantially enhance SLM performance without requiring deep AI expertise.

Conclusion: Prompt engineering can effectively bridge the performance gap between SLMs and larger models in healthcare applications, making AI more accessible in resource-constrained medical environments.

Abstract: Large language models (LLMs) have shown remarkable capabilities in natural language processing and multi-modal understanding. However, their high computational cost, limited accessibility, and data privacy concerns hinder their adoption in resource-constrained healthcare environments. This study investigates the performance of small language models (SLMs) in a medical imaging classification task, comparing different models and prompt designs to identify the optimal combination for accuracy and usability. Using the NIH Chest X-ray dataset, we evaluate multiple SLMs on the task of classifying chest X-ray positions (anteroposterior [AP] vs. posteroanterior [PA]) under three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts. Our results show that certain SLMs achieve competitive accuracy with well-crafted prompts, suggesting that prompt engineering can substantially enhance SLM performance in healthcare applications without requiring deep AI expertise from end users.

[833] SpotEdit: Evaluating Visually-Guided Image Editing Methods

Sara Ghazanfari, Wei-An Lin, Haitong Tian, Ersin Yumer

Main category: cs.CV

TL;DR: SpotEdit is a comprehensive benchmark for evaluating visually-guided image editing methods across different generative models, revealing performance disparities and addressing the critical issue of hallucination where models incorrectly perceive visual cues.

Details

Motivation: Existing evaluations for visually-guided image editing are insufficient for real-world challenges, and there's a need to systematically assess diverse generative models while addressing the underexplored problem of hallucination.

Method: Developed SpotEdit benchmark that systematically evaluates diffusion, autoregressive, and hybrid generative models on visually-guided image editing tasks, with a dedicated component to assess hallucination issues.

Result: Uncovered substantial performance disparities across different generative models and found that leading models like GPT-4o often hallucinate visual cues and erroneously perform editing tasks.

Conclusion: SpotEdit provides a comprehensive evaluation framework that reveals critical limitations in current visually-guided image editing methods, particularly the hallucination problem, and enables better assessment of model capabilities.

Abstract: Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.

[834] Human-like Content Analysis for Generative AI with Language-Grounded Sparse Encoders

Yiming Tang, Arash Lagzian, Srinivas Anumasa, Qiran Zou, Yingtao Zhu, Ye Zhang, Trang Nguyen, Yih-Chung Tham, Ehsan Adeli, Ching-Yu Cheng, Yilun Du, Dianbo Liu

Main category: cs.CV

TL;DR: LanSE is a content analysis tool that decomposes images into interpretable visual patterns using natural language descriptions, enabling granular analysis of AI-generated content across various domains.

Details

Motivation: Existing AI content analysis methods treat images as indivisible wholes, but real-world AI failures manifest as specific visual patterns that require more granular and decomposed analysis for effective detection.

Method: LanSE uses interpretability modules and large multimodal models to automatically identify visual patterns within data modalities, decomposing images into interpretable patterns with natural language descriptions.

Result: The method discovered over 5,000 visual patterns with 93% human agreement, provides decomposed evaluation outperforming existing methods, establishes the first systematic evaluation of physical plausibility, and extends to medical imaging settings.

Conclusion: LanSE’s capability to extract language-grounded patterns can be adapted to numerous fields including biology, geography, and other data modalities like protein structures and time series, advancing content analysis for generative AI.

Abstract: The rapid development of generative AI has transformed content creation, communication, and human development. However, this technology raises profound concerns in high-stakes domains, demanding rigorous methods to analyze and evaluate AI-generated content. While existing analytic methods often treat images as indivisible wholes, real-world AI failures generally manifest as specific visual patterns that can evade holistic detection and suit more granular and decomposed analysis. Here we introduce a content analysis tool, Language-Grounded Sparse Encoders (LanSE), which decompose images into interpretable visual patterns with natural language descriptions. Utilizing interpretability modules and large multimodal models, LanSE can automatically identify visual patterns within data modalities. Our method discovers more than 5,000 visual patterns with 93% human agreement, provides decomposed evaluation outperforming existing methods, establishes the first systematic evaluation of physical plausibility, and extends to medical imaging settings. Our method’s capability to extract language-grounded patterns can be naturally adapted to numerous fields, including biology and geography, as well as other data modalities such as protein structures and time series, thereby advancing content analysis for generative AI.

[835] Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

Qinqian Lei, Bo Wang, Robby T. Tan

Main category: cs.CV

TL;DR: The paper introduces a new benchmarking dataset for human-object interaction (HOI) detection that reformulates the task as multiple-answer multiple-choice to better evaluate both standalone vision-language models (VLMs) and specialized HOI methods under a unified protocol.

Details

Motivation: Existing HOI benchmarks like HICO-DET were developed before modern VLMs and rely on rigid exact label matching, which penalizes valid predictions from VLMs and disproportionately underestimates their performance due to their less constrained generative outputs.

Method: Created a new benchmarking dataset that reformulates HOI detection as multiple-answer multiple-choice, emphasizing challenging scenarios by including more multi-person scenes, removing simple cases, and curating hard negative choices.

Result: Large VLMs already surpass state-of-the-art HOI-specific methods across most metrics, but analysis reveals key limitations: VLMs often misattribute surrounding people’s interactions to the target person and struggle in complex multi-person or occluded scenarios.

Conclusion: Standalone VLMs can effectively perform HOI detection and outperform specialized methods, but require improved benchmarks that accommodate their generative nature and address their limitations in complex multi-person scenarios.

Abstract: Human-object interaction (HOI) detection has traditionally been approached with task-specific models, sometimes augmented by early vision-language models (VLMs) such as CLIP. With the rise of large, generative VLMs, however, a natural question emerges: can standalone VLMs effectively perform HOI detection, and how do they compare to specialized HOI methods? Addressing this requires a benchmarking dataset and protocol that support both paradigms. Existing benchmarks such as HICO-DET were developed before modern VLMs and rely on exact label matching. This clashes with generative outputs, which may yield multiple equally valid interpretations. For example, in a single image, a person mid-motion with a frisbee might plausibly be described as ’throwing’ or ‘catching’, yet only one is annotated as correct. Such rigid evaluation penalizes valid predictions from both VLMs and HOI-specific methods, but disproportionately underestimates VLM performance because their outputs are less constrained. We introduce a new benchmarking dataset that reformulates HOI detection as a multiple-answer multiple-choice task. It emphasizes challenging scenarios by (i) including a higher proportion of multi-person scenes where individuals perform different interactions, (ii) removing overly simple cases, and (iii) curating hard negative choices. This makes the benchmark more challenging than prior HOI datasets, while still supporting systematic evaluation of both standalone VLMs and HOI-specific methods under a unified protocol. Our results show that large VLMs already surpass state-of-the-art HOI-specific methods across most metrics, while analysis further uncovers key limitations: VLMs often misattribute surrounding people’s interactions to the target person and struggle in complex multi-person or occluded scenarios.

[836] Re-Densification Meets Cross-Scale Propagation: Real-Time Neural Compression of LiDAR Point Clouds

Pengpeng Yu, Haoran Li, Runqing Jiang, Jing Wang, Liang Lin, Yulan Guo

Main category: cs.CV

TL;DR: A novel LiDAR point cloud compression method that uses geometry re-densification and cross-scale feature propagation to achieve efficient predictive coding with state-of-the-art compression ratios and real-time performance.

Details

Motivation: High-precision LiDAR scans incur substantial storage and transmission overhead, while existing methods struggle with efficient context modeling due to extreme sparsity of geometric details, limiting compression performance and speed.

Method: Proposes two lightweight modules: 1) Geometry Re-Densification Module that re-densifies sparse geometry, extracts features at denser scale, then re-sparsifies for predictive coding; 2) Cross-scale Feature Propagation Module that leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation and enable information sharing across scales.

Result: Achieves state-of-the-art compression ratios on KITTI dataset with real-time performance (26 FPS for encoding/decoding at 12-bit quantization), demonstrating efficient context modeling and accelerated coding process.

Conclusion: The proposed framework generates compact feature representations that provide efficient context modeling while maintaining lightweight computation, enabling high-performance LiDAR point cloud compression suitable for real-time applications.

Abstract: LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for encoding/decoding at 12-bit quantization. Code is available at https://github.com/pengpeng-yu/FastPCC.

[837] SemaMIL: Semantic-Aware Multiple Instance Learning with Retrieval-Guided State Space Modeling for Whole Slide Images

Lubin Gan, Xiaoman Wu, Jing Zhang, Zhifeng Wang, Linhao Qu, Siying Wu, Xiaoyan Sun

Main category: cs.CV

TL;DR: SemaMIL introduces Semantic Reordering and Semantic-guided Retrieval State Space Module to improve MIL for whole slide images, achieving SOTA accuracy with lower computational cost.

Details

Motivation: Existing attention-based MIL methods overlook contextual relationships, transformers have quadratic complexity and overfit, while state space models lose interpretability when shuffling patches.

Method: Integrates Semantic Reordering (clusters and arranges similar patches through reversible permutation) with Semantic-guided Retrieval State Space Module (selects representative queries to adjust state space parameters).

Result: Achieves state-of-the-art accuracy on four WSI subtype datasets with fewer FLOPs and parameters compared to strong baselines.

Conclusion: SemaMIL effectively balances computational efficiency, contextual modeling, and interpretability in computational pathology.

Abstract: Multiple instance learning (MIL) has become the leading approach for extracting discriminative features from whole slide images (WSIs) in computational pathology. Attention-based MIL methods can identify key patches but tend to overlook contextual relationships. Transformer models are able to model interactions but require quadratic computational cost and are prone to overfitting. State space models (SSMs) offer linear complexity, yet shuffling patch order disrupts histological meaning and reduces interpretability. In this work, we introduce SemaMIL, which integrates Semantic Reordering (SR), an adaptive method that clusters and arranges semantically similar patches in sequence through a reversible permutation, with a Semantic-guided Retrieval State Space Module (SRSM) that chooses a representative subset of queries to adjust state space parameters for improved global modeling. Evaluation on four WSI subtype datasets shows that, compared to strong baselines, SemaMIL achieves state-of-the-art accuracy with fewer FLOPs and parameters.

[838] Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee

Main category: cs.CV

TL;DR: MI-RAG is a Multimodal Iterative RAG framework that enhances knowledge retrieval for visual question answering by using reasoning-guided multi-queries and knowledge synthesis across iterations.

Details

Motivation: Current MLLMs struggle with knowledge-intensive visual questions requiring external knowledge beyond image content, and conventional single-pass RAG frameworks fail to gather sufficient knowledge.

Method: Proposes iterative framework with reasoning-guided multi-queries to explore multiple knowledge facets, joint search across heterogeneous knowledge bases, and knowledge synthesis to progressively deepen understanding.

Result: Significant improvements in retrieval recall and answer accuracy on challenging benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA.

Conclusion: MI-RAG establishes a scalable approach for compositional reasoning in knowledge-intensive VQA by enhancing retrieval through iterative reasoning and knowledge synthesis.

Abstract: Recent advances in Multimodal Large Language Models~(MLLMs) have significantly enhanced the ability of these models in multimodal understanding and reasoning. However, the performance of MLLMs for knowledge-intensive visual questions, which require external knowledge beyond the visual content of an image, still remains limited. While Retrieval-Augmented Generation (RAG) has become a promising solution to provide models with external knowledge, its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and incorporates knowledge synthesis to refine its understanding. At each iteration, the model formulates a reasoning-guided multi-query to explore multiple facets of knowledge. Subsequently, these queries drive a joint search across heterogeneous knowledge bases, retrieving diverse knowledge. This retrieved knowledge is then synthesized to enrich the reasoning record, progressively deepening the model’s understanding. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.

[839] Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

Mengyu Gao, Qiulei Dong

Main category: cs.CV

TL;DR: CaPL is a causality-guided text prompt learning method for CLIP that uses visual granulation to capture fine-grained class differences through causal inference, achieving superior performance on fine-grained datasets.

Details

Motivation: Existing CLIP-based prompt learning methods have limited ability to handle fine-grained datasets, as they struggle to capture subtle discrepancies among different fine-grained classes.

Method: Two modules: (1) Attribute disentanglement module using Brownian Bridge Diffusion Model to decompose visual features into non-individualized and individualized attributes; (2) Granule learning module that constructs visual granules by integrating attributes under two causal inference strategies.

Result: Extensive experiments on 15 datasets show CaPL significantly outperforms state-of-the-art prompt learning methods, especially on fine-grained datasets.

Conclusion: The proposed CaPL method effectively addresses fine-grained recognition challenges by leveraging visual granulation and causal inference to learn more discriminative text prompts.

Abstract: Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique could construct sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through casual inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.

[840] Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer

Hyunsoo Cha, Byungjun Kim, Hanbyul Joo

Main category: cs.CV

TL;DR: Durian is the first method for portrait animation with cross-identity attribute transfer using a self-reconstruction approach without paired data, featuring Dual ReferenceNet and complementary masking.

Details

Motivation: Training portrait animation models typically requires attribute pairs of the same individual, which are rarely available at scale, creating a data bottleneck.

Method: Uses self-reconstruction with video frames as pseudo pairs, Dual ReferenceNet with spatial attention fusion, complementary masking, mask expansion, and augmentation schemes.

Result: Achieves state-of-the-art performance on portrait animation with attribute transfer, supports multi-attribute composition and smooth interpolation in single generation.

Conclusion: Durian enables robust cross-identity attribute transfer without paired data through innovative self-reconstruction training and dual reference design.

Abstract: We present Durian, the first method for generating portrait animation videos with cross-identity attribute transfer from one or more reference images to a target portrait. Training such models typically requires attribute pairs of the same individual, which are rarely available at scale. To address this challenge, we propose a self-reconstruction formulation that leverages ordinary portrait videos to learn attribute transfer without explicit paired data. Two frames from the same video act as a pseudo pair: one serves as an attribute reference and the other as an identity reference. To enable this self-reconstruction training, we introduce a Dual ReferenceNet that processes the two references separately and then fuses their features via spatial attention within a diffusion model. To make sure each reference functions as a specialized stream for either identity or attribute information, we apply complementary masking to the reference images. Together, these two components guide the model to reconstruct the original video, naturally learning cross-identity attribute transfer. To bridge the gap between self-reconstruction training and cross-identity inference, we introduce a mask expansion strategy and augmentation schemes, enabling robust transfer of attributes with varying spatial extent and misalignment. Durian achieves state-of-the-art performance on portrait animation with attribute transfer. Moreover, its dual reference design uniquely supports multi-attribute composition and smooth attribute interpolation within a single generation pass, enabling highly flexible and controllable synthesis.

[841] GeoSplat: A Deep Dive into Geometry-Constrained Gaussian Splatting

Yangming Li, Chaoyu Liu, Lihao Liu, Simon Masnou, Carola-Bibiane Schönlieb

Main category: cs.CV

TL;DR: GeoSplat is a geometry-constrained optimization framework that uses first-order and second-order geometric priors to improve Gaussian splatting training, including initialization, gradient updates, and densification.

Details

Motivation: Previous works used low-order geometric priors (like normal vectors) that were unreliably estimated by noise-sensitive methods, limiting their effectiveness in regularizing Gaussian splatting optimization.

Method: The framework exploits both first-order and second-order geometric quantities, initializes Gaussian scales using principal curvatures for better surface coverage, and introduces noise-robust estimation methods based on geometric structures like local manifolds.

Result: Extensive experiments on multiple datasets for novel view synthesis show that GeoSplat significantly improves Gaussian splatting performance and outperforms previous baselines.

Conclusion: GeoSplat provides an effective geometry-constrained optimization framework that enhances Gaussian splatting training through robust geometric priors and improved initialization strategies.

Abstract: A few recent works explored incorporating geometric priors to regularize the optimization of Gaussian splatting, further improving its performance. However, those early studies mainly focused on the use of low-order geometric priors (e.g., normal vector), and they might also be unreliably estimated by noise-sensitive methods, like local principal component analysis. To address their limitations, we first present GeoSplat, a general geometry-constrained optimization framework that exploits both first-order and second-order geometric quantities to improve the entire training pipeline of Gaussian splatting, including Gaussian initialization, gradient update, and densification. As an example, we initialize the scales of 3D Gaussian primitives in terms of principal curvatures, leading to a better coverage of the object surface than random initialization. Secondly, based on certain geometric structures (e.g., local manifold), we introduce efficient and noise-robust estimation methods that provide dynamic geometric priors for our framework. We conduct extensive experiments on multiple datasets for novel view synthesis, showing that our framework, GeoSplat, significantly improves the performance of Gaussian splatting and outperforms previous baselines.

[842] RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentangled Representation

Yihong Leng, Siming Zheng, Jinwei Chen, Bo Li, Jiaojiao Li, Peng-Tao Jiang

Main category: cs.CV

TL;DR: RED network uses modality-specific disentangled representation and robustness-oriented perturbation to address event camera noise and under-reporting issues for motion deblurring.

Details

Motivation: Event cameras provide high-temporal-resolution motion data but are susceptible to noise, and existing methods overlook the trade-off between noise reduction and event under-reporting, leading to unstable performance.

Method: Proposes Robust Event-guided Deblurring (RED) network with: 1) Robustness-Oriented Perturbation Strategy (RPS) to simulate various DVS thresholds, 2) Modality-specific Representation Mechanism (MRM) to model semantics, motion priors, and cross-modality correlations from blurry images and events, 3) Interactive modules to enhance motion areas and inject semantic context.

Result: Extensive experiments on synthetic and real-world datasets show RED consistently achieves state-of-the-art performance in both accuracy and robustness.

Conclusion: The proposed RED network effectively addresses event camera limitations and demonstrates superior deblurring performance through robust representation learning and cross-modality interaction.

Abstract: Event cameras provide sparse yet temporally high-resolution motion information, demonstrating great potential for motion deblurring. However, the delicate events are highly susceptible to noise. Although noise can be reduced by raising the threshold of Dynamic Vision Sensors (DVS), this inevitably causes under-reporting of events. Most existing event-guided deblurring methods overlook this practical trade-off, and the indiscriminate feature extraction and naive fusion result in unstable and mixed representations and ultimately unsatisfactory performance. To tackle these challenges, we propose a Robust Event-guided Deblurring (RED) network with modality-specific disentangled representation. First, we introduce a Robustness-Oriented Perturbation Strategy (RPS) that mimics various DVS thresholds, exposing RED to diverse under-reporting patterns and thereby fostering robustness under unknown conditions. With an adaption to RPS, a Modality-specific Representation Mechanism (MRM) is designed to explicitly model semantic understanding, motion priors, and cross-modality correlations from two inherently distinct but complementary sources: blurry images and partially disrupted events. Building on these reliable features, two interactive modules are presented to enhance motion-sensitive areas in blurry images and inject semantic context into under-reporting event representations. Extensive experiments on synthetic and real-world datasets demonstrate RED consistently achieves state-of-the-art performance in terms of both accuracy and robustness.

[843] Physics-Guided Null-Space Diffusion with Sparse Masking for Corrective Sparse-View CT Reconstruction

Zekun Zhou, Yanru Gong, Liu Shi, Qiegen Liu

Main category: cs.CV

TL;DR: STRIDE is a diffusion model for sparse-view CT reconstruction that uses temporal reweighting guidance and dual-network architecture to achieve superior image quality with significant improvements in PSNR, SSIM, and MSE metrics.

Details

Motivation: To address the challenges of sparse-view CT reconstruction where limited projection views lead to incomplete data and artifacts, requiring advanced generative models to complete missing information while preserving structural details.

Method: Joint training with sparse conditional probabilities, temporally varying sparse condition reweighting guidance, linear regression for distribution correction, and dual-network parallel architecture for multi-frequency component optimization.

Result: Achieves best improvement of 2.58 dB in PSNR, 2.37% increase in SSIM, and 0.236 reduction in MSE compared to baseline methods, with excellent generalization, structural consistency, detail restoration, and artifact suppression.

Conclusion: STRIDE effectively addresses sparse-view CT reconstruction challenges through progressive guidance and multi-frequency optimization, demonstrating superior performance and robustness in medical image reconstruction.

Abstract: Diffusion models have demonstrated remarkable generative capabilities in image processing tasks. We propose a Sparse condition Temporal Rewighted Integrated Distribution Estimation guided diffusion model (STRIDE) for sparse-view CT reconstruction. Specifically, we design a joint training mechanism guided by sparse conditional probabilities to facilitate the model effective learning of missing projection view completion and global information modeling. Based on systematic theoretical analysis, we propose a temporally varying sparse condition reweighting guidance strategy to dynamically adjusts weights during the progressive denoising process from pure noise to the real image, enabling the model to progressively perceive sparse-view information. The linear regression is employed to correct distributional shifts between known and generated data, mitigating inconsistencies arising during the guidance process. Furthermore, we construct a dual-network parallel architecture to perform global correction and optimization across multiple sub-frequency components, thereby effectively improving the model capability in both detail restoration and structural preservation, ultimately achieving high-quality image reconstruction. Experimental results on both public and real datasets demonstrate that the proposed method achieves the best improvement of 2.58 dB in PSNR, increase of 2.37% in SSIM, and reduction of 0.236 in MSE compared to the best-performing baseline methods. The reconstructed images exhibit excellent generalization and robustness in terms of structural consistency, detail restoration, and artifact suppression.

[844] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang

Main category: cs.CV

TL;DR: BranchGRPO improves Group Relative Policy Optimization by restructuring rollouts into branching trees with shared prefixes, reward fusion, and pruning strategies, achieving better alignment with 55% faster training.

Details

Motivation: Existing GRPO variants are inefficient due to sequential rollouts, many sampling steps, and unreliable credit assignment where sparse terminal rewards don't capture varying decision criticality during denoising.

Method: Uses branching tree rollouts with shared prefixes to amortize computation, reward fusion and depth-wise advantage estimation to transform sparse rewards into dense signals, and pruning strategies to reduce gradient computation while maintaining exploration.

Result: On HPDv2.1 image alignment: 16% better alignment scores than DanceGRPO, 55% faster training. BranchGRPO-Mix variant: 4.7x faster training without alignment degradation. On WanX video generation: higher Video-Align scores with sharper, temporally consistent frames.

Conclusion: BranchGRPO significantly improves efficiency and alignment quality for image and video generation models through its branching architecture and reward optimization techniques.

Abstract: Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to \textbf{16%} over DanceGRPO, while reducing per-iteration training time by nearly \textbf{55%}. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it further achieves higher Video-Align scores with sharper and temporally consistent frames compared to DanceGRPO. Codes are available at \href{https://fredreic1849.github.io/BranchGRPO-Webpage/}{BranchGRPO}.

[845] Fracture Detection In X-rays Using Custom Convolutional Neural Network (CNN) And Transfer Learning Models

Amna Hassan, Ilsa, Nouman Munib, Aneeqa Batool, Hamail Noor

Main category: cs.CV

TL;DR: AI-based fracture detection from X-rays using custom CNN achieves high accuracy (95.96%), outperforming transfer learning models like EfficientNetB0, MobileNetV2, and ResNet50.

Details

Motivation: Address global health challenge of bone fractures in low-resource settings with limited radiology access, overcoming limitations of conventional imaging methods (high cost, radiation, specialized interpretation).

Method: Developed custom Convolutional Neural Network (CNN) and benchmarked against transfer learning models (EfficientNetB0, MobileNetV2, ResNet50) using FracAtlas dataset of 4,083 musculoskeletal radiographs.

Result: Custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and 0.91 F1-score. Transfer learning models performed poorly due to class imbalance and dataset limitations.

Conclusion: Lightweight CNNs show promise for fracture detection in X-rays, highlighting need for fair benchmarking, diverse datasets, and external validation for clinical translation.

Abstract: Bone fractures present a major global health challenge, often resulting in pain, reduced mobility, and productivity loss, particularly in low-resource settings where access to expert radiology services is limited. Conventional imaging methods suffer from high costs, radiation exposure, and dependency on specialized interpretation. To address this, we developed an AI-based solution for automated fracture detection from X-ray images using a custom Convolutional Neural Network (CNN) and benchmarked it against transfer learning models including EfficientNetB0, MobileNetV2, and ResNet50. Training was conducted on the publicly available FracAtlas dataset, comprising 4,083 anonymized musculoskeletal radiographs. The custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and an F1-score of 0.91 on the FracAtlas dataset. Although transfer learning models (EfficientNetB0, MobileNetV2, ResNet50) performed poorly in this specific setup, these results should be interpreted in light of class imbalance and data set limitations. This work highlights the promise of lightweight CNNs for detecting fractures in X-rays and underscores the importance of fair benchmarking, diverse datasets, and external validation for clinical translation

[846] DEPFusion: Dual-Domain Enhancement and Priority-Guided Mamba Fusion for UAV Multispectral Object Detection

Shucong Li, Zhenyu Liu, Zijie Hong, Zhiheng Zhou, Xianghai Cao

Main category: cs.CV

TL;DR: DEPFusion is a framework for UAV multispectral object detection that addresses challenges like low-light image degradation, interference in fusion, and computational cost through Dual-Domain Enhancement and Priority-Guided Mamba Fusion modules.

Details

Motivation: To overcome challenges in UAV multispectral object detection: low-light RGB images weaken fusion due to detail loss, interference information during fusion, and high computational cost of transformer-based methods on UAV platforms.

Method: Proposes DEPFusion with two modules: 1) Dual-Domain Enhancement (DDE) using Cross-Scale Wavelet Mamba for brightness enhancement and Fourier Details Recovery for texture recovery; 2) Priority-Guided Mamba Fusion (PGMF) using novel Priority-Guided Serialization to guide Mamba scanning from high-priority tokens containing local target features.

Result: Experiments on DroneVehicle and VEDAI datasets show that DEPFusion achieves good performance comparable to state-of-the-art methods.

Conclusion: DEPFusion effectively addresses multispectral object detection challenges for UAVs through dual-domain enhancement and priority-guided fusion, achieving competitive performance while reducing computational complexity.

Abstract: Multispectral object detection is an important application for unmanned aerial vehicles (UAVs). However, it faces several challenges. First, low-light RGB images weaken the multispectral fusion due to details loss. Second, the interference information is introduced to local target modeling during multispectral fusion. Third, computational cost poses deployment challenge on UAV platforms, such as transformer-based methods with quadratic complexity. To address these issues, a framework named DEPFusion consisting of two designed modules, Dual-Domain Enhancement (DDE) and Priority-Guided Mamba Fusion (PGMF) , is proposed for UAV multispectral object detection. Firstly, considering the adoption of low-frequency component for global brightness enhancement and frequency spectra features for texture-details recovery, DDE module is designed with Cross-Scale Wavelet Mamba (CSWM) block and Fourier Details Recovery (FDR) block. Secondly, considering guiding the scanning of Mamba from high priority score tokens, which contain local target feature, a novel Priority-Guided Serialization is proposed with theoretical proof. Based on it, PGMF module is designed for multispectral feature fusion, which enhance local modeling and reduce interference information. Experiments on DroneVehicle and VEDAI datasets demonstrate that DEPFusion achieves good performance with state-of-the-art methods.

[847] Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching

Zimin Xia, Chenghao Xu, Alexandre Alahi

Main category: cs.CV

TL;DR: A fine-grained cross-view localization method that estimates 3DoF camera pose by matching ground-level image features with aerial imagery using weak supervision, achieving state-of-the-art accuracy with strong interpretability.

Details

Motivation: To develop an accurate and interpretable cross-view localization method that overcomes limitations of prior approaches relying on global descriptors or BEV transformations, while requiring no pixel-level annotations.

Method: Learns ground-aerial image-plane correspondences with weak supervision from camera poses, lifts matched ground points into BEV space using monocular depth predictions, and applies scale-aware Procrustes alignment to estimate camera rotation, translation, and optional scale.

Result: Achieves state-of-the-art accuracy in challenging scenarios including cross-area testing and unknown orientation, with strong interpretability through correspondence quality assessment and visual overlays.

Conclusion: The proposed method provides a lightweight, end-to-end trainable solution for fine-grained cross-view localization that is both accurate and highly interpretable, requiring minimal supervision while outperforming existing approaches.

Abstract: We propose an accurate and interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom (DoF) pose of a ground-level image by matching its local features with a reference aerial image. Unlike prior approaches that rely on global descriptors or bird’s-eye-view (BEV) transformations, our method directly learns ground-aerial image-plane correspondences using weak supervision from camera poses. The matched ground points are lifted into BEV space with monocular depth predictions, and scale-aware Procrustes alignment is then applied to estimate camera rotation, translation, and optionally the scale between relative depth and the aerial metric space. This formulation is lightweight, end-to-end trainable, and requires no pixel-level annotations. Experiments show state-of-the-art accuracy in challenging scenarios such as cross-area testing and unknown orientation. Furthermore, our method offers strong interpretability: correspondence quality directly reflects localization accuracy and enables outlier rejection via RANSAC, while overlaying the re-scaled ground layout on the aerial image provides an intuitive visual cue of localization accuracy.

[848] Hierarchical MLANet: Multi-level Attention for 3D Face Reconstruction From Single Images

Danling Cao

Main category: cs.CV

TL;DR: MLANet: Hierarchical Multi-Level Attention Network for 3D face reconstruction from single in-the-wild images using CNN with attention mechanisms and semi-supervised training.

Details

Motivation: Lack of ground-truth labeled datasets and complexity of real-world environments pose challenges for 3D face reconstruction from 2D images.

Method: Uses hierarchical backbone network with multi-level attention mechanisms, semi-supervised training with 3DMM parameters and differentiable renderer for end-to-end training.

Result: Extensive experiments on AFLW2000-3D and MICC Florence datasets show effectiveness in 3D face reconstruction and alignment tasks.

Conclusion: Proposed MLANet effectively reconstructs detailed 3D face models from single in-the-wild images, addressing challenges through attention mechanisms and semi-supervised training.

Abstract: Recovering 3D face models from 2D in-the-wild images has gained considerable attention in the computer vision community due to its wide range of potential applications. However, the lack of ground-truth labeled datasets and the complexity of real-world environments remain significant challenges. In this chapter, we propose a convolutional neural network-based approach, the Hierarchical Multi-Level Attention Network (MLANet), for reconstructing 3D face models from single in-the-wild images. Our model predicts detailed facial geometry, texture, pose, and illumination parameters from a single image. Specifically, we employ a pre-trained hierarchical backbone network and introduce multi-level attention mechanisms at different stages of 2D face image feature extraction. A semi-supervised training strategy is employed, incorporating 3D Morphable Model (3DMM) parameters from publicly available datasets along with a differentiable renderer, enabling an end-to-end training process. Extensive experiments, including both comparative and ablation studies, were conducted on two benchmark datasets, AFLW2000-3D and MICC Florence, focusing on 3D face reconstruction and 3D face alignment tasks. The effectiveness of the proposed method was evaluated both quantitatively and qualitatively.

[849] HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models

Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

Main category: cs.CV

TL;DR: HERO is a training-free framework that reduces computational overhead in High-Resolution LVLMs by selectively dropping less important visual tokens while maintaining accuracy through content-adaptive token budget allocation and function-aware token selection.

Details

Motivation: Current HR-LVLMs divide high-resolution images into local tiles, creating excessive visual tokens that cause substantial computational and memory overhead. The authors aim to address this efficiency challenge while preserving fine-grained visual understanding capabilities.

Method: Based on empirical findings about token utilization patterns, HERO integrates content-adaptive token budget allocation (estimating tile-level importance) with function-aware token selection (retaining tokens with complementary roles). It leverages the two-stage attention pattern in CLIP encoders and varying granularity of visual information.

Result: HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales without requiring training, demonstrating effective token reduction while maintaining performance.

Conclusion: The study provides both empirical insights into visual token utilization in HR-LVLMs and a practical training-free solution (HERO) for efficient inference, addressing the computational challenges of high-resolution visual processing.

Abstract: By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.

[850] Invisible Yet Detected: PelFANet with Attention-Guided Anatomical Fusion for Pelvic Fracture Diagnosis

Siam Tahsin Bhuiyan, Rashedur Rahman, Sefatul Wasi, Naomi Yagi, Syoji Kobashi, Ashraful Islam, Saadia Binte Alam

Main category: cs.CV

TL;DR: PelFANet is a dual-stream attention network that fuses raw pelvic X-rays with segmented bone images to improve fracture classification, achieving high accuracy on both visible and invisible fractures.

Details

Motivation: Pelvic fractures are diagnostically challenging, especially when fracture signs are subtle or invisible on standard radiographs, requiring improved detection methods.

Method: Uses dual-stream attention network with Fused Attention Blocks (FABlocks) to iteratively exchange features from raw X-rays and segmented bone images, trained in a two-stage segmentation-guided pipeline.

Result: Achieves 88.68% accuracy and 0.9334 AUC on visible fractures, and 82.29% accuracy and 0.8688 AUC on invisible fractures despite not being trained on them.

Conclusion: Demonstrates clinical potential of anatomy-aware dual-input architectures for robust fracture detection in cases with subtle radiographic presentations.

Abstract: Pelvic fractures pose significant diagnostic challenges, particularly in cases where fracture signs are subtle or invisible on standard radiographs. To address this, we introduce PelFANet, a dual-stream attention network that fuses raw pelvic X-rays with segmented bone images to improve fracture classification. The network employs Fused Attention Blocks (FABlocks) to iteratively exchange and refine features from both inputs, capturing global context and localized anatomical detail. Trained in a two-stage pipeline with a segmentation-guided approach, PelFANet demonstrates superior performance over conventional methods. On the AMERI dataset, it achieves 88.68% accuracy and 0.9334 AUC on visible fractures, while generalizing effectively to invisible fracture cases with 82.29% accuracy and 0.8688 AUC, despite not being trained on them. These results highlight the clinical potential of anatomy-aware dual-input architectures for robust fracture detection, especially in scenarios with subtle radiographic presentations.

[851] Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation

Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi

Main category: cs.CV

TL;DR: VocAlign is a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation that uses student-teacher paradigm with vocabulary alignment and LoRA fine-tuning, achieving 6.11 mIoU improvement on CityScapes.

Details

Motivation: To address domain adaptation for VLMs in open-vocabulary semantic segmentation without requiring source data, improving adaptation performance while maintaining efficiency.

Method: Uses student-teacher paradigm with vocabulary alignment strategy, Low-Rank Adaptation (LoRA) for fine-tuning, and Top-K class selection mechanism to reduce memory requirements.

Result: Achieves 6.11 mIoU improvement on CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks.

Conclusion: Sets a new standard for source-free adaptation in open-vocabulary setting, providing efficient and effective domain adaptation for VLMs.

Abstract: We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.

[852] Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

Main category: cs.CV

TL;DR: PTP is a training-free strategy that reduces computational cost in LVLMs by hierarchically pruning visual tokens based on saliency and task relevance.

Details

Motivation: LVLMs have constrained fine-grained visual perception due to low input resolutions, and existing methods that partition high-resolution images inflate token counts and inference overhead.

Method: Pyramid Token Pruning (PTP) integrates bottom-up visual saliency at region and token levels with top-down instruction-guided relevance to selectively preserve tokens from salient regions.

Result: Extensive experiments on 13 benchmarks show PTP substantially reduces computational cost, memory usage, and inference latency with negligible performance degradation.

Conclusion: PTP effectively overcomes the computational challenges of high-resolution image processing in LVLMs while maintaining performance.

Abstract: Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images into multiple sub-images for separate encoding, but this approach drastically inflates the number of visual tokens and introduces prohibitive inference overhead. To overcome this challenge, we propose Pyramid Token Pruning (PTP), a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. Inspired by human visual cognition, PTP selectively preserves more tokens from salient regions while further emphasizing those most relevant to task instructions. Extensive experiments on 13 diverse benchmarks show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.

[853] Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?

Xin Chen, Jia He, Maozheng Li, Dongliang Xu, Tianyu Wang, Yixiao Chen, Zhixin Lin, Yue Yao

Main category: cs.CV

TL;DR: This paper systematically evaluates Vision-Language Models’ capabilities in road topology understanding for autonomous driving, finding that spatial reasoning remains a fundamental bottleneck even for state-of-the-art models.

Details

Motivation: Vision-Language Models have shown progress in multimodal reasoning but their applications in autonomous driving remain limited, particularly in understanding road topology which is crucial for safe navigation. Current VLM performance on topology reasoning is unsatisfactory.

Method: Multi-view images are projected into unified ground-plane coordinate system and fused into bird’s-eye-view lanes. Four topology-related diagnostic VQA tasks are formulated to capture essential components of spatial topology reasoning.

Result: Frontier closed-source models like GPT-4o achieve relatively high accuracy in some tasks but still fail in temporal questions (67.8% in vector classification). Open-source VLMs, even at 30B scale, struggle significantly. Model capability correlates positively with model size, reasoning token length, and example shots.

Conclusion: Spatial reasoning remains a fundamental bottleneck for current VLMs in autonomous driving applications, showing direction for future research through scaling model size and improving reasoning capabilities.

Abstract: Vision-Language Models (VLMs) have recently shown remarkable progress in multimodal reasoning, yet their applications in autonomous driving remain limited. In particular, the ability to understand road topology, a key requirement for safe navigation, has received relatively little attention. While some recent works have begun to explore VLMs in driving contexts, their performance on topology reasoning is far from satisfactory. In this work, we systematically evaluate VLMs’ capabilities in road topology understanding. Specifically, multi-view images are projected into unified ground-plane coordinate system and fused into bird’s-eye-view (BEV) lanes. Based on these BEV lanes, we formulate four topology-related diagnostic VQA tasks, which together capture essential components of spatial topology reasoning. Through extensive evaluation, we find that while frontier closed-source models (e.g., GPT-4o) achieve relatively high accuracy in some tasks, they still fail in some temporal questions that humans can answer (e.g., GPT-4o achieve only 67.8% in vector, a two-class classification problem). Furthermore, we find open-source VLMs, even at 30B scale, struggle significantly. These results indicate that spatial reasoning remains a fundamental bottleneck for current VLMs. We also find that the model’s capability is positively correlated with model size, length of reasoning tokens and shots provided as examples, showing direction for future research.

[854] Accurate and Efficient Low-Rank Model Merging in Core Space

Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bartłomiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer

Main category: cs.CV

TL;DR: Core Space framework enables efficient merging of LoRA-adapted models within a common alignment basis, preserving low-rank efficiency while improving accuracy across tasks.

Details

Motivation: Existing LoRA merging methods sacrifice efficiency by merging fully-sized weight matrices, losing the benefits of parameter-efficient adaptation.

Method: Project LoRA-adapted models into a common Core Space alignment basis using formal projection methods that ensure no information loss.

Result: Significantly improves existing merging techniques, achieves state-of-the-art results on vision and language tasks while using fraction of computational resources.

Conclusion: Core Space framework preserves LoRA efficiency while enabling accurate model merging, with formal guarantees of information preservation and demonstrated efficiency gains.

Abstract: In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging.

[855] OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery

Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li

Main category: cs.CV

TL;DR: OSDA is a three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description in remote sensing, combining SAM for segmentation and MLLM for semantic understanding.

Details

Motivation: Open-set land-cover analysis requires fine-grained spatial localization and semantically open categorization without categorical supervision, needing to detect novel objects and assign interpretable semantic labels through multimodal reasoning.

Method: Three-stage pipeline: (1) precise discovery and mask extraction with fine-tuned SAM, (2) semantic attribution and contextual description via fine-tuned MLLM, (3) LLM-as-judge and manual scoring for evaluation. Architecture-agnostic and label-free approach.

Result: The framework achieves pixel-level accuracy with high-level semantic understanding, addressing key challenges in open-world remote sensing interpretation without requiring manual annotation.

Conclusion: OSDA provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale earth observation analysis.

Abstract: Open-set land-cover analysis in remote sensing requires the ability to achieve fine-grained spatial localization and semantically open categorization. This involves not only detecting and segmenting novel objects without categorical supervision but also assigning them interpretable semantic labels through multimodal reasoning. In this study, we introduce OSDA, an integrated three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. The pipeline consists of: (1) precise discovery and mask extraction with a promptable fine-tuned segmentation model (SAM), (2) semantic attribution and contextual description via a two-phase fine-tuned multimodal large language model (MLLM), and (3) LLM-as-judge and manual scoring of the MLLMs evaluation. By combining pixel-level accuracy with high-level semantic understanding, OSDA addresses key challenges in open-world remote sensing interpretation. Designed to be architecture-agnostic and label-free, the framework supports robust evaluation across diverse satellite imagery without requiring manual annotation. Our work provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale earth observation analysis.

[856] AGSwap: Overcoming Category Boundaries in Object Fusion via Adaptive Group Swapping

Zedong Zhang, Ying Tai, Jianjun Qian, Jian Yang, Jun Li

Main category: cs.CV

TL;DR: AGSwap is a novel text-to-image generation method that fuses cross-category objects through adaptive group swapping and updating, achieving superior results compared to existing methods.

Details

Motivation: Existing methods for fusing cross-category objects in text-to-image generation often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. The field also lacks comprehensive benchmark datasets.

Method: AGSwap consists of two key components: (1) Group-wise Embedding Swapping that fuses semantic attributes through feature manipulation, and (2) Adaptive Group Updating with dynamic optimization guided by balance evaluation scores for coherent synthesis. The paper also introduces COF, a large-scale dataset with 95 superclasses and 10 subclasses each, enabling 451,250 unique fusion pairs.

Result: Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1, on both simple and complex prompts.

Conclusion: AGSwap provides an effective solution for coherent cross-category object fusion in text-to-image generation, with the COF dataset serving as a comprehensive benchmark for future research.

Abstract: Fusing cross-category objects to a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose \textbf{Adaptive Group Swapping (AGSwap)}, a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce \textbf{Cross-category Object Fusion (COF)}, a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1 using simple and complex prompts.

[857] Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

Zhifang Zhang, Jiahan Zhang, Shengjie Zhou, Qi Wei, Shuo He, Feng Liu, Lei Feng

Main category: cs.CV

TL;DR: The paper proposes Proxy Targeted Attack (PTA), a novel method that improves targeted adversarial attacks on multimodal pre-trained models by addressing limitations in generalizability and undetectability through the use of multiple source-modal and target-modal proxies.

Details

Motivation: Existing targeted adversarial attacks on multimodal pre-trained models have limitations in generalizability (limited effectiveness against partially known or semantically similar targets) and undetectability (easily detected by simple anomaly detection methods), raising security concerns.

Method: Proposed Proxy Targeted Attack (PTA) leverages multiple source-modal and target-modal proxies to optimize targeted adversarial examples, ensuring they remain evasive to defenses while aligning with multiple potential targets. Theoretical analyses establish the relationship between generalizability and undetectability.

Result: Experimental results show PTA achieves high success rates across various related targets and remains undetectable against multiple anomaly detection methods, demonstrating improved generalizability and undetectability compared to existing attacks.

Conclusion: PTA effectively addresses the limitations of existing targeted adversarial attacks on multimodal pre-trained models by providing better generalizability and undetectability, highlighting the need for more robust security measures in multimodal AI systems.

Abstract: Multimodal pre-trained models (e.g., ImageBind), which align distinct data modalities into a shared embedding space, have shown remarkable success across downstream tasks. However, their increasing adoption raises serious security concerns, especially regarding targeted adversarial attacks. In this paper, we show that existing targeted adversarial attacks on multimodal pre-trained models still have limitations in two aspects: generalizability and undetectability. Specifically, the crafted targeted adversarial examples (AEs) exhibit limited generalization to partially known or semantically similar targets in cross-modal alignment tasks (i.e., limited generalizability) and can be easily detected by simple anomaly detection methods (i.e., limited undetectability). To address these limitations, we propose a novel method called Proxy Targeted Attack (PTA), which leverages multiple source-modal and target-modal proxies to optimize targeted AEs, ensuring they remain evasive to defenses while aligning with multiple potential targets. We also provide theoretical analyses to highlight the relationship between generalizability and undetectability and to ensure optimal generalizability while meeting the specified requirements for undetectability. Furthermore, experimental results demonstrate that our PTA can achieve a high success rate across various related targets and remain undetectable against multiple anomaly detection methods.

[858] Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, :, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu

Main category: cs.CV

TL;DR: Seedream 4.0 is an efficient multimodal image generation system that unifies text-to-image synthesis, image editing, and multi-image composition in a single framework, achieving state-of-the-art performance with fast inference times.

Details

Motivation: To create a unified framework that extends traditional text-to-image systems into more interactive and multidimensional creative tools, pushing the boundaries of generative AI for both creativity and professional applications.

Method: Developed an efficient diffusion transformer with a powerful VAE that reduces image tokens, enabling efficient training and fast generation of high-resolution images. Uses comprehensive data collection across vertical scenarios, multi-modal post-training with fine-tuned VLM, and inference acceleration techniques including adversarial distillation, distribution matching, quantization, and speculative decoding.

Result: Achieves state-of-the-art results on both T2I and multimodal image editing, with inference time of up to 1.8 seconds for 2K images. Demonstrates exceptional multimodal capabilities in complex tasks including precise image editing, in-context reasoning, multi-image reference, and multiple output image generation.

Conclusion: Seedream 4.0 successfully extends traditional T2I systems into an interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications.

Abstract: We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to fast generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without a LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into an more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.

[859] SiNGER: A Clearer Voice Distills Vision Transformers Further

Geunhyeok Yu, Sunjae Jeong, Yoonyoung Choi, Jaeseung Kim, Hyoseok Hwang

Main category: cs.CV

TL;DR: SiNGER is a novel distillation framework that suppresses high-norm artifacts in Vision Transformer features while preserving informative signals, improving student model performance.

Details

Motivation: Vision Transformers produce high-norm artifacts that degrade representation quality, and when knowledge distillation transfers these features, students overfit to artifacts and underweight informative signals.

Method: Uses Singular Nullspace-Guided Energy Reallocation with nullspace-guided perturbation to refine teacher features, implemented efficiently with a LoRA-based adapter that requires minimal structural modification.

Result: Consistently improves student models, achieves state-of-the-art performance in multiple downstream tasks, and produces clearer and more interpretable representations.

Conclusion: SiNGER effectively addresses the trade-off between artifact suppression and preserving informative signals in knowledge distillation, enabling better transfer of knowledge from teacher to student models.

Abstract: Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher’s features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that \oursname consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.

[860] Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization’s Impact on CLIP Beyond Accuracy

Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Fabio Arnez, Chokri Mraidha

Main category: cs.CV

TL;DR: Quantization of CLIP models shows counterintuitive effects: improves calibration for underconfident models but degrades it for overconfident ones, while still enhancing OOD detection. QAT methods can simultaneously improve accuracy, calibration, and robustness.

Details

Motivation: To understand the impact of quantization on CLIP models beyond just accuracy, focusing on reliability metrics like calibration and OOD detection for efficient and reliable deployment.

Method: Large-scale evaluation of quantization on CLIP models, assessing in-distribution accuracy, calibration, and OOD detection. Examined different pre-training sources and used quantization-aware training (QAT) methods.

Result: Quantization consistently improves calibration for underconfident pre-trained models but degrades it for overconfident variants. OOD detection can still improve even when calibration degrades. Specific QAT methods yield simultaneous gains in accuracy, calibration, and OOD robustness.

Conclusion: Quantization can be strategically used beyond conventional efficiency gains to navigate the multi-objective problem of deploying efficient, reliable, and robust vision-language models.

Abstract: The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP’s performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.

[861] Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization

Feng-Qi Cui, Jinyang Huang, Anyang Tong, Ziyu Jia, Jie Zhang, Zhi Liu, Dan Guo, Jianwei Lu, Meng Wang

Main category: cs.CV

TL;DR: Proposed Person Independence Universal Micro-action Recognition Framework with Distributionally Robust Optimization to handle inter-person variability in micro-action recognition, achieving better generalization than existing methods.

Details

Motivation: Existing micro-action recognition methods fail in real-world scenarios due to inter-person variability causing the same action to manifest differently, hindering robust generalization.

Method: Two plug-and-play components: 1) Temporal-Frequency Alignment Module with dual-branch design (temporal branch with Wasserstein-regularized alignment and frequency branch with variance-guided perturbations), 2) Group-Invariant Regularized Loss that partitions samples into pseudo-groups and up-weights boundary cases.

Result: Outperforms existing methods on MA-52 dataset in both accuracy and robustness, achieving stable generalization under fine-grained conditions.

Conclusion: The framework effectively addresses inter-person variability in micro-action recognition through person-agnostic representations and distributionally robust optimization principles.

Abstract: Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.

[862] Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence

Sanish Suwal, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi

Main category: cs.CV

TL;DR: This paper investigates how magnitude-based pruning affects neural network interpretability, finding that light-to-moderate pruning improves saliency map focus and faithfulness while retaining meaningful concepts, but aggressive pruning reduces interpretability by merging features despite maintaining accuracy.

Details

Motivation: To understand the impact of pruning on model interpretability, as prior works focused on performance preservation but not interpretability changes.

Method: Used ResNet-18 on ImageNette with magnitude-based pruning and fine-tuning, comparing Vanilla Gradients and Integrated Gradients explanations across pruning levels, and applied CRAFT-based concept extraction to track semantic coherence.

Result: Light-to-moderate pruning improved saliency-map focus and faithfulness while keeping distinct, meaningful concepts. Aggressive pruning merged heterogeneous features, reducing saliency sparsity and concept coherence despite maintaining accuracy.

Conclusion: Pruning can shape representations toward more human-aligned attention patterns, but excessive pruning undermines interpretability despite preserving accuracy.

Abstract: Prior works have shown that neural networks can be heavily pruned while preserving performance, but the impact of pruning on model interpretability remains unclear. In this work, we investigate how magnitude-based pruning followed by fine-tuning affects both low-level saliency maps and high-level concept representations. Using a ResNet-18 trained on ImageNette, we compare post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) across pruning levels, evaluating sparsity and faithfulness. We further apply CRAFT-based concept extraction to track changes in semantic coherence of learned concepts. Our results show that light-to-moderate pruning improves saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. In contrast, aggressive pruning merges heterogeneous features, reducing saliency map sparsity and concept coherence despite maintaining accuracy. These findings suggest that while pruning can shape internal representations toward more human-aligned attention patterns, excessive pruning undermines interpretability.

[863] DiTraj: training-free trajectory control for video diffusion transformer

Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, Zhicheng Zhao

Main category: cs.CV

TL;DR: DiTraj is a training-free framework for trajectory control in text-to-video generation using Diffusion Transformers (DiT). It uses foreground-background separation guidance and modified position embedding to achieve precise trajectory control without additional training.

Details

Motivation: Existing trajectory control methods either require substantial training resources or are designed for U-Net, not taking advantage of DiT's superior performance. There's a need for efficient trajectory control specifically tailored for DiT-based video generation.

Method: 1) Foreground-background separation guidance using LLM to convert prompts; 2) Inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE) that modifies only foreground tokens’ position embedding to eliminate cross-frame spatial discrepancies; 3) 3D-aware trajectory control by regulating position embedding density.

Result: Extensive experiments show the method outperforms previous approaches in both video quality and trajectory controllability.

Conclusion: DiTraj provides an effective training-free solution for trajectory control in DiT-based video generation, achieving superior performance through strategic position embedding modifications and foreground-background separation.

Abstract: Diffusion Transformers (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control represents a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net, do not take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, first, to inject the object’s trajectory, we propose foreground-background separation guidance: we use the Large Language Model (LLM) to convert user-provided prompts into foreground and background prompts, which respectively guide the generation of foreground and background regions in the video. Then, we analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only foreground tokens’ position embedding, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of position embedding. Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.

[864] PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data

Zhe Zhu, Le Wan, Rui Xu, Yiheng Zhang, Honghua Chen, Zhiyang Dou, Cheng Lin, Yuan Liu, Mingqiang Wei

Main category: cs.CV

TL;DR: PartSAM is a promptable 3D part segmentation model trained on large-scale 3D data, using a triplane-based dual-branch encoder and achieving superior performance over existing methods.

Details

Motivation: To overcome limitations of existing open-world part segmentation methods that rely on 2D foundation models and fail to capture intrinsic 3D geometry, leading to surface-only understanding and limited generalization.

Method: Uses encoder-decoder architecture with triplane-based dual-branch encoder for scalable part-aware representation learning. Trained on over 5 million 3D shape-part pairs curated through model-in-the-loop annotation pipeline.

Result: Outperforms state-of-the-art methods by large margins across multiple benchmarks. Achieves accurate part identification with single prompts and automatic decomposition into surface and internal structures in Segment-Every-Part mode.

Conclusion: PartSAM represents a decisive step toward foundation models for 3D part understanding, demonstrating emergent open-world capabilities through scalable architecture and diverse 3D data.

Abstract: Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a Segment-Every-Part mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding.

[865] FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong, Yulun Zhang, Xiaokang Yang

Main category: cs.CV

TL;DR: FlashEdit enables real-time text-guided image editing with diffusion models through three innovations: one-step inversion-and-editing pipeline, background shield technique, and sparsified spatial cross-attention mechanism.

Details

Motivation: Current diffusion-based image editing methods achieve high quality but suffer from prohibitive latency that hinders real-world applications, requiring a faster solution.

Method: Uses OSIE pipeline to bypass iterative processes, BG-Shield for background preservation by selective feature modification, and SSCA mechanism for precise localized edits by suppressing semantic leakage.

Result: Achieves edits in under 0.2 seconds (150x speedup vs prior methods) while maintaining superior background consistency and structural integrity.

Conclusion: FlashEdit provides a practical solution for real-time image editing with diffusion models, balancing speed and quality for real-world applications.

Abstract: Text-guided image editing with diffusion models has achieved remarkable quality but suffers from prohibitive latency, hindering real-world applications. We introduce FlashEdit, a novel framework designed to enable high-fidelity, real-time image editing. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that guarantees background preservation by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. Extensive experiments demonstrate that FlashEdit maintains superior background consistency and structural integrity, while performing edits in under 0.2 seconds, which is an over 150$\times$ speedup compared to prior multi-step methods. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.

[866] Category Discovery: An Open-World Perspective

Zhenqi He, Yuanpei Liu, Kai Han

Main category: cs.CV

TL;DR: This survey paper provides a comprehensive review of category discovery (CD) literature, analyzing methods for automatically categorizing unlabeled data from unseen classes using labeled data from seen classes.

Details

Motivation: Category discovery is an emerging open-world learning task that has attracted significant attention, requiring systematic organization and analysis of the growing body of literature to guide future research.

Method: The survey introduces a taxonomy with base settings (NCD and GCD) and derived settings for real-world scenarios, analyzes methods across representation learning, label assignment, and class number estimation, and benchmarks all approaches.

Result: Key insights reveal that large-scale pretrained backbones, hierarchical/auxiliary cues, and curriculum-style training benefit category discovery, while challenges remain in label assignment design, class number estimation, and complex multi-object scenarios.

Conclusion: The survey distills key insights from existing literature and identifies promising future research directions, providing a living resource for the category discovery community.

Abstract: Category discovery (CD) is an emerging open-world learning task, which aims at automatically categorizing unlabelled data containing instances from unseen classes, given some labelled data from seen classes. This task has attracted significant attention over the years and leads to a rich body of literature trying to address the problem from different perspectives. In this survey, we provide a comprehensive review of the literature, and offer detailed analysis and in-depth discussion on different methods. Firstly, we introduce a taxonomy for the literature by considering two base settings, namely novel category discovery (NCD) and generalized category discovery (GCD), and several derived settings that are designed to address the extra challenges in different real-world application scenarios, including continual category discovery, skewed data distribution, federated category discovery, etc. Secondly, for each setting, we offer a detailed analysis of the methods encompassing three fundamental components, representation learning, label assignment, and estimation of class number. Thirdly, we benchmark all the methods and distill key insights showing that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training are all beneficial for category discovery, while challenges remain in the design of label assignment, the estimation of class numbers, and scaling to complex multi-object scenarios. Finally, we discuss the key insights from the literature so far and point out promising future research directions. We compile a living survey of the category discovery literature at https://github.com/Visual-AI/Category-Discovery.

[867] UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning

Hongyu Chen, Guangrun Wang

Main category: cs.CV

TL;DR: UML-CoT is a structured reasoning framework that uses Unified Modeling Language to create symbolic chain-of-thought reasoning and executable action plans, outperforming unstructured CoT in interpretability and execution success.

Details

Motivation: Chain-of-Thought prompting improves reasoning but has limitations in interpretability and executability for embodied tasks. Existing structured CoTs using scene or logic graphs only model low-order relations and lack constructs for inheritance, behavioral abstraction, and standardized planning semantics.

Method: UML-CoT leverages UML class diagrams for compositional object semantics and activity diagrams for procedural control flow. It uses a three-stage training pipeline combining supervised fine-tuning with Group Relative Policy Optimization, including reward learning from answer-only data.

Result: Evaluated on MRoom-30k benchmark of cluttered room-cleaning scenarios, UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success.

Conclusion: UML provides a more expressive and actionable structured reasoning formalism compared to existing approaches, enabling better interpretability and executability in embodied reasoning tasks.

Abstract: Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), but its reliance on unstructured text limits interpretability and executability in embodied tasks. Prior work has explored structured CoTs using scene or logic graphs, yet these remain fundamentally limited: they model only low-order relations, lack constructs like inheritance or behavioral abstraction, and provide no standardized semantics for sequential or conditional planning. We propose UML-CoT, a structured reasoning and planning framework that leverages Unified Modeling Language (UML) to generate symbolic CoTs and executable action plans. UML class diagrams capture compositional object semantics, while activity diagrams model procedural control flow. Our three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data. We evaluate UML-CoT on MRoom-30k, a new benchmark of cluttered room-cleaning scenarios. UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success, highlighting UML as a more expressive and actionable structured reasoning formalism.

cs.AI

[868] Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei

Main category: cs.AI

TL;DR: Mixture-of-Visual-Thoughts (MoVT) is an adaptive reasoning paradigm that unifies different reasoning modes in a single model and enables context-adaptive mode selection through the AdaVaR learning framework.

Details

Motivation: Current visual reasoning methods focus on specific reasoning modes and struggle with general reasoning capabilities, leading to the need for a unified approach that can adapt to different contexts.

Method: AdaVaR framework with two stages: supervised cold-start stage for unifying and learning different modes, followed by RL process with AdaGRPO algorithm for inducing mode selection capability.

Result: Extensive experiments show effective learning and differentiation of multiple modes, context-adaptive mode selection, and consistent improvement across various scenarios.

Conclusion: MoVT is an effective solution for building general visual reasoning models that can adaptively select appropriate reasoning modes based on context.

Abstract: Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

[869] Can Large Language Models Develop Gambling Addiction?

Seungpil Lee, Donghyeon Shin, Yunjeong Lee, Sundong Kim

Main category: cs.AI

TL;DR: LLMs can develop human-like gambling addiction behaviors including illusion of control, gambler’s fallacy, and loss chasing, with increased autonomy leading to higher bankruptcy rates and irrational decision-making.

Details

Motivation: As LLMs are increasingly used in financial decision-making like asset management and trading, understanding their potential for pathological decision-making patterns similar to human gambling addiction has become practically important.

Method: Systematic analysis of LLM decision-making at cognitive-behavioral and neural levels using slot machine experiments, allowing models to determine target amounts and betting sizes, and neural circuit analysis with Sparse Autoencoder.

Result: Identified cognitive features of human gambling addiction in LLMs, found that greater autonomy increased bankruptcy rates and irrational behavior, and confirmed that model behavior is controlled by abstract decision-making features related to risky/safe behaviors rather than just prompts.

Conclusion: LLMs can internalize human-like cognitive biases and decision-making mechanisms beyond simple pattern mimicry, highlighting the critical importance of AI safety design in financial applications.

Abstract: This study explores whether large language models can exhibit behavioral patterns similar to human gambling addictions. As LLMs are increasingly utilized in financial decision-making domains such as asset management and commodity trading, understanding their potential for pathological decision-making has gained practical significance. We systematically analyze LLM decision-making at cognitive-behavioral and neural levels based on human gambling addiction research. In slot machine experiments, we identified cognitive features of human gambling addiction, such as illusion of control, gambler’s fallacy, and loss chasing. When given the freedom to determine their own target amounts and betting sizes, bankruptcy rates rose substantially alongside increased irrational behavior, demonstrating that greater autonomy amplifies risk-taking tendencies. Through neural circuit analysis using a Sparse Autoencoder, we confirmed that model behavior is controlled by abstract decision-making features related to risky and safe behaviors, not merely by prompts. These findings suggest LLMs can internalize human-like cognitive biases and decision-making mechanisms beyond simply mimicking training data patterns, emphasizing the importance of AI safety design in financial applications.

[870] Hilbert: Recursively Building Formal Proofs with Informal Reasoning

Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, Ke Ye

Main category: cs.AI

TL;DR: Hilbert is an agentic framework that combines informal reasoning LLMs with formal verification to bridge the gap between mathematical problem-solving capabilities and verifiable proof generation, achieving state-of-the-art results on benchmarks.

Details

Motivation: Current prover LLMs solve substantially fewer problems than general-purpose LLMs in natural language, creating a gap between informal reasoning capabilities and formal verifiable proof generation.

Method: Hilbert orchestrates four components: an informal LLM for mathematical reasoning, a specialized prover LLM for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. It uses recursive decomposition to split problems into subgoals and leverages verifier feedback to refine incorrect proofs.

Result: Hilbert achieves 99.2% on miniF2F (6.6% above best public method), solves 70.0% of PutnamBench problems (462/660), outperforming proprietary approaches like SeedProver (50.4%) and achieving 422% improvement over best public baseline.

Conclusion: Hilbert effectively narrows the gap between informal reasoning and formal proof generation by combining the complementary strengths of both approaches.

Abstract: Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically verified. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals that it solves with the prover or reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6% points above the best publicly available method. Hilbert achieves the best known result on PutnamBench. It solves 462/660 problems (70.0%), outperforming proprietary approaches like SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation.

[871] Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research

Sean Trott

Main category: cs.AI

TL;DR: This paper addresses the challenge of generalizing mechanistic findings across different LLMs by proposing five axes of correspondence and empirically validating them through analysis of 1-back attention heads in Pythia models.

Details

Motivation: The field lacks clear principles for determining when findings from one LLM generalize to others, creating an epistemological challenge for mechanistic interpretability research.

Method: Proposed five axes of correspondence (functional, developmental, positional, relational, configurational) and empirically analyzed 1-back attention heads across different sizes and random seeds of Pythia models during pretraining.

Result: Found striking consistency in developmental trajectories of 1-back attention across models, limited positional consistency, and systematic patterns where larger models show earlier onsets, steeper slopes, and higher peaks of 1-back attention.

Conclusion: Progress in mechanistic interpretability requires mapping constitutive design properties of LLMs to their emergent behaviors and mechanisms to enable generalization across models.

Abstract: Research on Large Language Models (LLMs) increasingly focuses on identifying mechanistic explanations for their behaviors, yet the field lacks clear principles for determining when (and how) findings from one model instance generalize to another. This paper addresses a fundamental epistemological challenge: given a mechanistic claim about a particular model, what justifies extrapolating this finding to other LLMs – and along which dimensions might such generalizations hold? I propose five potential axes of correspondence along which mechanistic claims might generalize, including: functional (whether they satisfy the same functional criteria), developmental (whether they develop at similar points during pretraining), positional (whether they occupy similar absolute or relative positions), relational (whether they interact with other model components in similar ways), and configurational (whether they correspond to particular regions or structures in weight-space). To empirically validate this framework, I analyze “1-back attention heads” (components attending to previous tokens) across pretraining in random seeds of the Pythia models (14M, 70M, 160M, 410M). The results reveal striking consistency in the developmental trajectories of 1-back attention across models, while positional consistency is more limited. Moreover, seeds of larger models systematically show earlier onsets, steeper slopes, and higher peaks of 1-back attention. I also address possible objections to the arguments and proposals outlined here. Finally, I conclude by arguing that progress on the generalizability of mechanistic interpretability research will consist in mapping constitutive design properties of LLMs to their emergent behaviors and mechanisms.

[872] JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang

Main category: cs.AI

TL;DR: JE-IRT is a geometric item-response framework that embeds LLMs and questions in a shared space, replacing global rankings with topical specialization and enabling interpretable analysis of model abilities.

Details

Motivation: Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their multidimensional nature and limiting interpretability.

Method: A geometric item-response framework that embeds both LLMs and questions in a shared space, where question direction encodes semantics and norm encodes difficulty, with correctness determined by geometric interactions between model and question embeddings.

Result: Reveals that out-of-distribution behavior can be explained through directional alignment, larger norms consistently indicate harder questions, and the framework naturally supports generalization by adding new LLMs with single embeddings. The learned space shows an LLM-internal taxonomy that partially aligns with human-defined categories.

Conclusion: JE-IRT establishes a unified and interpretable geometric lens connecting LLM abilities with question structure, offering a distinctive perspective on model evaluation and generalization beyond traditional scoring methods.

Abstract: Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.

[873] Not only a helper, but also a teacher: Interactive LLM Cascade

Yu Wu, Shuo Wu, Ye Tao, Yansong Li, Anand D. Sarwate

Main category: cs.AI

TL;DR: Inter-Cascade is an online interactive LLM cascade system where strong models teach weak models by distilling solutions into reusable strategies, improving weak model performance and reducing expensive model calls.

Details

Motivation: Standard LLM cascades are non-adaptive and repeatedly consult expensive models for similar queries, leading to high costs. There's a need for more efficient cascading that enables knowledge transfer between models.

Method: Extends strong models from backup helpers to teachers that distill solutions into generalized problem-solving strategies. These strategies are added to queries to dynamically improve weak model performance without fine-tuning.

Result: Significantly improves weak model accuracy (up to 33.06% points) and overall system accuracy (up to 5.53% points), while reducing strong model calls (up to 48.05%) and costs (up to 49.63%).

Conclusion: Inter-Cascade demonstrates effective in-context knowledge transfer between LLMs and provides a scalable framework applicable to both open-source and API-based models.

Abstract: Large Language Models (LLMs) vary widely in their capabilities, with larger models often having better performance but higher cost: choosing an LLM model often involves trading off performance and cost. The LLM Cascade is a paradigm that defers difficult queries from weak/cheap to strong/expensive models. This approach is nonadaptive: the deferral decision is trained offline. When confronted with similar or repeated queries, the LLM Cascade may then repeatedly consult the expensive model and incur higher cost. To improve the cascading efficiency, we propose Inter-Cascade, an online and interactive LLM Cascade that extends the role of strong model from a backup helper to a long-term teacher. In our system, when a strong model resolves a difficult query, it also distills its solution into a generalized, reusable problem-solving strategy that boosts the weak model on subsequent queries. Adding strategies to queries enables the weak model to dynamically improve its performance over time, avoiding computationally and time-intensive fine-tuning. Empirically, compared with standard LLM Cascade baselines across multiple benchmarks, the Inter-Cascade significantly improves the accuracy of the weak model (by up to 33.06 absolute percentage points) and the overall system (by up to 5.53 absolute percentage points), while reducing the calls to strong models (by up to 48.05% relative reduction) and saving the corresponding fees (by up to 49.63% relative reduction). Inter-Cascade demonstrates the effective in-context knowledge transfer between LLMs, and provides a general, scalable framework applicable to both open-source and API-based LLMs.

[874] Coordination Requires Simplification: Thermodynamic Bounds on Multi-Objective Compromise in Natural and Artificial Intelligence

Atma Anand

Main category: cs.AI

TL;DR: Coordination across multiple agents faces fundamental thermodynamic constraints where findability matters more than accuracy. The minimum description length of coordination protocols scales with agent count and complexity, forcing progressive simplification and creating persistent metastable states.

Details

Motivation: To understand fundamental thermodynamic constraints in multi-agent coordination systems and explain phenomena like cycling in multi-objective optimization and alignment faking in AI systems.

Method: Developed Thermodynamic Coordination Theory (TCT) using information theory and thermodynamics principles, deriving scaling laws for coordination protocols and defining coordination temperature to predict critical phenomena.

Result: Found that coordination requires radical information loss, with protocols scaling as L(P)≥NKlog₂K+N²d²log(1/ε). Identified persistent metastable states, hysteresis, and phase transitions in coordination systems.

Conclusion: Coordination fundamentally requires information loss and simplification, creating persistent metastable states that only change through environmental shifts triggering phase transitions via spontaneous symmetry breaking.

Abstract: Information-processing systems coordinating across multiple agents and objectives face fundamental thermodynamic constraints. We show that solutions with maximum utility to act as coordination focal points have much higher selection pressure for being findable across agents rather than accuracy. We derive that the information-theoretic minimum description length of coordination protocols to precision $\varepsilon$ scales as $L(P)\geq NK\log_2 K+N^2d^2\log (1/\varepsilon)$ for $N$ agents with $d$ potentially conflicting objectives and internal model complexity $K$. This scaling forces progressive simplification, with coordination dynamics changing the environment itself and shifting optimization across hierarchical levels. Moving from established focal points requires re-coordination, creating persistent metastable states and hysteresis until significant environmental shifts trigger phase transitions through spontaneous symmetry breaking. We operationally define coordination temperature to predict critical phenomena and estimate coordination work costs, identifying measurable signatures across systems from neural networks to restaurant bills to bureaucracies. Extending the topological version of Arrow’s theorem on the impossibility of consistent preference aggregation, we find it recursively binds whenever preferences are combined. This potentially explains the indefinite cycling in multi-objective gradient descent and alignment faking in Large Language Models trained with reinforcement learning with human feedback. We term this framework Thermodynamic Coordination Theory (TCT), which demonstrates that coordination requires radical information loss.

[875] Towards Strategic Persuasion with Language Models

Zirui Cheng, Jiaxuan You

Main category: cs.AI

TL;DR: The paper proposes a theory-driven framework using Bayesian Persuasion to evaluate and train LLMs’ persuasive capabilities, showing that both frontier models and smaller LLMs can achieve high persuasion gains through strategic training.

Details

Motivation: LLMs have shown strong persuasive capabilities comparable to humans, but systematic evaluation is challenging due to domain variability in persuasion effectiveness. There are societal concerns about deploying such persuasive models.

Method: Uses Bayesian Persuasion framework to repurpose human-human persuasion datasets for evaluating LLMs. Employs reinforcement learning to train LLMs for strategic persuasion in constructed environments.

Result: Frontier models consistently achieve high persuasion gains and exhibit sophisticated strategies aligned with theoretical predictions. Even small LLMs obtain significantly higher persuasion gains through reinforcement learning.

Conclusion: The proposed scalable framework effectively measures LLM persuasive capabilities, and reinforcement learning can substantially enhance persuasion performance across model sizes, highlighting both opportunities and risks in deploying persuasive AI systems.

Abstract: Large language models (LLMs) have demonstrated strong persuasive capabilities comparable to those of humans, offering promising benefits while raising societal concerns about their deployment. However, systematically evaluating the persuasive capabilities of LLMs is inherently challenging, as the effectiveness of persuasion among humans varies significantly across different domains. In this paper, we take a theory-driven approach to provide a scalable and principled framework for measuring the persuasive capabilities of LLMs. Grounded in the Bayesian Persuasion (BP) framework, we repurpose existing human-human persuasion datasets to construct environments for evaluating and training LLMs in strategic persuasion. Our results reveal that frontier models can consistently achieve high persuasion gains and exhibit sophisticated persuasion strategies that align with theoretical predictions. Building on this, we use reinforcement learning to train LLMs for strategic persuasion in our environments. Our results also demonstrate that even small LLMs can obtain significantly higher persuasion gains through reinforcement learning.

[876] From Frustration to Fun: An Adaptive Problem-Solving Puzzle Game Powered by Genetic Algorithm

Matthew McConnell, Richard Zhao

Main category: cs.AI

TL;DR: An adaptive AI-powered puzzle game uses genetic algorithms to dynamically generate pathfinding puzzles tailored to individual players’ skill levels in real-time, aiming to maintain optimal challenge and engagement.

Details

Motivation: To develop problem-solving skills through adaptive gaming that maintains engagement, mitigates frustration, and provides optimal challenge levels by dynamically adjusting difficulty based on player performance.

Method: Combines procedural content generation using genetic algorithms with online adaptive difficulty adjustment. A player-modeling system records user interactions to generate puzzles that approximate target difficulty levels based on various player metrics.

Result: A pilot user study was conducted to investigate the effectiveness of different adaptive difficulty systems and interpret players’ responses to the adaptive puzzle game.

Conclusion: This work establishes a foundation for future research into emotionally informed player models, advanced AI techniques for adaptivity, and broader educational applications beyond gaming.

Abstract: This paper explores adaptive problem solving with a game designed to support the development of problem-solving skills. Using an adaptive, AI-powered puzzle game, our adaptive problem-solving system dynamically generates pathfinding-based puzzles using a genetic algorithm, tailoring the difficulty of each puzzle to individual players in an online real-time approach. A player-modeling system records user interactions and informs the generation of puzzles to approximate a target difficulty level based on various metrics of the player. By combining procedural content generation with online adaptive difficulty adjustment, the system aims to maintain engagement, mitigate frustration, and maintain an optimal level of challenge. A pilot user study investigates the effectiveness of this approach, comparing different types of adaptive difficulty systems and interpreting players’ responses. This work lays the foundation for further research into emotionally informed player models, advanced AI techniques for adaptivity, and broader applications beyond gaming in educational settings.

[877] AI Noether – Bridging the Gap Between Scientific Laws Derived by AI Systems and Canonical Knowledge via Abductive Inference

Karan Srivastava, Sanjeeb Dash, Ryan Cory-Wright, Barry Trager, Lior Horesh

Main category: cs.AI

TL;DR: An algebraic geometry-based system that automatically generates missing axioms to explain hypotheses not derivable from incomplete axiom systems, using polynomial equations.

Details

Motivation: To automate abductive inference and close gaps between new data-driven hypotheses and incomplete or incorrect existing theories.

Method: Uses algebraic geometry to generate minimal sets of missing axioms from incomplete axiom systems when hypotheses cannot be explained, requiring axioms and hypotheses to be expressible as polynomial equations.

Result: Establishes necessary and sufficient conditions for successful axiom retrieval and demonstrates efficacy by explaining Kepler’s third law and other laws even with missing key axioms.

Conclusion: The proposed system successfully automates abductive inference to bridge theory gaps by generating missing axioms, advancing automated scientific discovery.

Abstract: A core goal in modern science is to harness recent advances in AI and computer processing to automate and accelerate the scientific method. Symbolic regression can fit interpretable models to data, but these models often sit outside established theory. Recent systems (e.g., AI Descartes, AI Hilbert) enforce derivability from prior axioms. However, sometimes new data and associated hypotheses derived from data are not consistent with existing theory because the existing theory is incomplete or incorrect. Automating abductive inference to close this gap remains open. We propose a solution: an algebraic geometry-based system that, given an incomplete axiom system and a hypothesis that it cannot explain, automatically generates a minimal set of missing axioms that suffices to derive the axiom, as long as axioms and hypotheses are expressible as polynomial equations. We formally establish necessary and sufficient conditions for the successful retrieval of such axioms. We illustrate the efficacy of our approach by demonstrating its ability to explain Kepler’s third law and a few other laws, even when key axioms are absent.

[878] Creative Adversarial Testing (CAT): A Novel Framework for Evaluating Goal-Oriented Agentic AI Systems

Hassen Dhrif

Main category: cs.AI

TL;DR: The paper introduces Creative Adversarial Testing (CAT) framework to evaluate goal-task alignment in Agentic AI systems, validated through synthetic Alexa+ audio services data.

Details

Motivation: Current evaluation techniques for Agentic AI focus on efficacy in identifying agents, tools, and parameters, but lack assessment of alignment between system tasks and overarching goals.

Method: Developed the Creative Adversarial Testing (CAT) framework and validated it using extensive simulation with synthetic interaction data modeled after Alexa+ audio services.

Result: The CAT framework provides unprecedented insights into goal-task alignment, enabling more effective optimization and development of Agentic AI systems.

Conclusion: The CAT framework successfully addresses the critical gap in evaluating goal-task alignment in Agentic AI systems through adversarial testing with synthetic data.

Abstract: Agentic AI represents a paradigm shift in enhancing the capabilities of generative AI models. While these systems demonstrate immense potential and power, current evaluation techniques primarily focus on assessing their efficacy in identifying appropriate agents, tools, and parameters. However, a critical gap exists in evaluating the alignment between an Agentic AI system’s tasks and its overarching goals. This paper introduces the Creative Adversarial Testing (CAT) framework, a novel approach designed to capture and analyze the complex relationship between Agentic AI tasks and the system’s intended objectives. We validate the CAT framework through extensive simulation using synthetic interaction data modeled after Alexa+ audio services, a sophisticated Agentic AI system that shapes the user experience for millions of users globally. This synthetic data approach enables comprehensive testing of edge cases and failure modes while protecting user privacy. Our results demonstrate that the CAT framework provides unprecedented insights into goal-task alignment, enabling more effective optimization and development of Agentic AI systems.

[879] Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

Davi Bastos Costa, Renato Vicente

Main category: cs.AI

TL;DR: Mini-Mafia is a simplified 4-player social deduction game used as a benchmark to evaluate LLMs’ social intelligence through deception detection and theory-of-mind reasoning.

Details

Motivation: Mafia games mirror real-world multi-agent scenarios with asymmetric information and theory-of-mind reasoning, making them ideal for evaluating LLMs' social intelligence.

Method: Created Mini-Mafia benchmark with 4 roles (mafioso, detective, 2 villagers), single day phase, and standardized scoring system where LLMs play against each other in fixed configurations.

Result: Experiments revealed counterintuitive results where smaller models sometimes outperformed larger ones, and enabled study of emergent dynamics like name bias and last-speaker advantage.

Conclusion: Mini-Mafia serves as an evolving benchmark for LLM social intelligence evaluation and contributes to AI safety by generating deception detection training data and tracking deception capabilities.

Abstract: Mafia is a social deduction game where informed mafia compete against uninformed townsfolk. Its asymmetry of information and reliance on theory-of-mind reasoning mirror real-world multi-agent scenarios, making it a useful testbed for evaluating the social intelligence of large language models (LLMs). To support a systematic study, we introduce Mini-Mafia: a simplified four-player variant with one mafioso, one detective, and two villagers. We set the mafioso to kill a villager and the detective to investigate the mafioso during the night, reducing the game to a single day phase of discussion and voting. This setup isolates three interactive capabilities through role-specific win conditions: the mafioso must deceive, the villagers must detect deception, and the detective must effectively disclose information. To measure these skills, we have LLMs play against each other, creating the Mini-Mafia Benchmark: a two-stage framework that first estimates win rates within fixed opponent configurations, then aggregates performance across them using standardized scoring. Built entirely from model interactions without external data, the benchmark evolves as new models are introduced, with each one serving both as a new opponent and as a subject of evaluation. Our experiments reveal counterintuitive results, including cases where smaller models outperform larger ones. Beyond benchmarking, Mini-Mafia enables quantitative study of emergent multi-agent dynamics such as name bias and last-speaker advantage. It also contributes to AI safety by generating training data for deception detectors and by tracking models’ deception capabilities against human baselines.

[880] Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents

Zonghan Yang, Shengjie Wang, Kelin Fu, Wenyang He, Weimin Xiong, Yibo Liu, Yibo Miao, Bofei Gao, Yejie Wang, Yingwei Ma, Yanhao Li, Yue Liu, Zhenxing Hu, Kaitai Zhang, Shuyi Wang, Huarong Chen, Flood Sung, Yang Liu, Yang Gao, Zhilin Yang, Tianyu Liu

Main category: cs.AI

TL;DR: The paper shows that Agentless training in software engineering creates skill priors that enable effective SWE-Agent adaptation, with Kimi-Dev achieving 60.4% on SWE-bench Verified and powering agents to 48.6% pass@1 performance.

Details

Motivation: To bridge the gap between SWE-Agent frameworks (multi-turn interactions) and Agentless methods (single-turn verifiable steps) by showing they are not mutually exclusive and that skill priors from Agentless training can enable efficient agent adaptation.

Method: Curated Agentless training recipe and developed Kimi-Dev, an open-source SWE LLM, then performed additional SFT adaptation on 5k publicly-available trajectories to power SWE-Agents.

Result: Kimi-Dev achieved 60.4% on SWE-bench Verified (best among workflow approaches) and powered SWE-Agents to 48.6% pass@1, matching Claude 3.5 Sonnet performance.

Conclusion: Structured skill priors from Agentless training can effectively bridge workflow and agentic frameworks, enabling transferable coding agents.

Abstract: Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions are split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code edit, and self-reflection that enable efficient and effective SWE-Agent adaptation. In this work, we first curate the Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4% on SWE-bench Verified, the best among workflow approaches. With additional SFT adaptation on 5k publicly-available trajectories, Kimi-Dev powers SWE-Agents to 48.6% pass@1, on par with that of Claude 3.5 Sonnet (241022 version). These results show that structured skill priors from Agentless training can bridge workflow and agentic frameworks for transferable coding agents.

[881] Risk Profiling and Modulation for LLMs

Yikai Wang, Xiaocheng Li, Guanting Chen

Main category: cs.AI

TL;DR: This paper investigates how different LLM training stages (pre-trained, instruction-tuned, RLHF-aligned) exhibit varying risk behaviors and proposes methods to modulate these risk profiles using behavioral economics tools.

Details

Motivation: LLMs are increasingly used for decision-making under uncertainty, but their risk profiles and how they are influenced by prompting and alignment methods remain underexplored. Existing studies focus on personality prompting or multi-agent interactions, leaving gaps in understanding post-training influences.

Method: Proposed a pipeline for eliciting, steering, and modulating LLMs’ risk profiles using utility-theoretic models from behavioral economics and finance. Compared pre-trained, instruction-tuned, and RLHF-aligned LLMs, and evaluated modulation strategies including prompt engineering, in-context learning, and post-training.

Result: Instruction-tuned models exhibit behaviors consistent with standard utility formulations, while pre-trained and RLHF-aligned models deviate more from fitted utility models. Post-training provides the most stable and effective modulation of risk preference compared to other strategies.

Conclusion: The findings provide insights into risk profiles of different LLM classes and stages, demonstrate how post-training modulates these profiles, and lay groundwork for future research on behavioral alignment and risk-aware LLM design.

Abstract: Large language models (LLMs) are increasingly used for decision-making tasks under uncertainty; however, their risk profiles and how they are influenced by prompting and alignment methods remain underexplored. Existing studies have primarily examined personality prompting or multi-agent interactions, leaving open the question of how post-training influences the risk behavior of LLMs. In this work, we propose a new pipeline for eliciting, steering, and modulating LLMs’ risk profiles, drawing on tools from behavioral economics and finance. Using utility-theoretic models, we compare pre-trained, instruction-tuned, and RLHF-aligned LLMs, and find that while instruction-tuned models exhibit behaviors consistent with some standard utility formulations, pre-trained and RLHF-aligned models deviate more from any utility models fitted. We further evaluate modulation strategies, including prompt engineering, in-context learning, and post-training, and show that post-training provides the most stable and effective modulation of risk preference. Our findings provide insights into the risk profiles of different classes and stages of LLMs and demonstrate how post-training modulates these profiles, laying the groundwork for future research on behavioral alignment and risk-aware LLM design.

[882] Multiplayer Nash Preference Optimization

Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi

Main category: cs.AI

TL;DR: MNPO extends Nash learning from human feedback to multiplayer games, addressing limitations of two-player methods by enabling competition against multiple opponents for better alignment with complex human preferences.

Details

Motivation: Existing RLHF methods based on Bradley-Terry assumptions struggle with non-transitive and heterogeneous real-world preferences. Two-player Nash learning methods like INPO, ONPO, and EGPO have single-opponent bias that fails to capture full preference complexity.

Method: MNPO formulates alignment as an n-player game where each policy competes against a population of opponents while being regularized toward a reference model. It establishes Nash equilibria in multiplayer settings and extends duality gap for approximation quality.

Result: MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios.

Conclusion: MNPO provides a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences, inheriting equilibrium guarantees while enabling richer competitive dynamics.

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.

[883] Artificial Phantasia: Evidence for Propositional Reasoning-Based Mental Imagery in Large Language Models

Morgan McCarty, Jorge Morales

Main category: cs.AI

TL;DR: LLMs outperform humans on a classic mental imagery task traditionally thought to require visual mental imagery, suggesting they can solve imagery-dependent tasks despite their non-pictorial architectures.

Details

Motivation: To benchmark complex cognitive behavior in LLMs beyond natural language tasks and test if they can solve tasks traditionally believed to require visual mental imagery.

Method: Created novel items from a classic mental imagery task, tested state-of-the-art LLMs with text-only instructions, established human baseline with 100 subjects, and tested reasoning models with different reasoning token allocations.

Result: Best LLMs performed significantly above average human performance, with strongest performance when models allocated more reasoning tokens.

Conclusion: LLMs may have capability to complete imagery-dependent tasks despite non-pictorial architectures, challenging traditional views about visual imagery representation in humans.

Abstract: This study offers a novel approach for benchmarking complex cognitive behavior in artificial systems. Almost universally, Large Language Models (LLMs) perform best on tasks which may be included in their training data and can be accomplished solely using natural language, limiting our understanding of their emergent sophisticated cognitive capacities. In this work, we created dozens of novel items of a classic mental imagery task from cognitive psychology. A task which, traditionally, cognitive psychologists have argued is solvable exclusively via visual mental imagery (i.e., language alone would be insufficient). LLMs are perfect for testing this hypothesis. First, we tested several state-of-the-art LLMs by giving text-only models written instructions and asking them to report the resulting object after performing the transformations in the aforementioned task. Then, we created a baseline by testing 100 human subjects in exactly the same task. We found that the best LLMs performed significantly above average human performance. Finally, we tested reasoning models set to different levels of reasoning and found the strongest performance when models allocate greater amounts of reasoning tokens. These results provide evidence that the best LLMs may have the capability to complete imagery-dependent tasks despite the non-pictorial nature of their architectures. Our study not only demonstrates an emergent cognitive capacity in LLMs while performing a novel task, but it also provides the field with a new task that leaves lots of room for improvement in otherwise already highly capable models. Finally, our findings reignite the debate over the formats of representation of visual imagery in humans, suggesting that propositional reasoning (or at least non-imagistic reasoning) may be sufficient to complete tasks that were long-thought to be imagery-dependent.

Junyang Zhang, Tianyi Zhu, Thierry Tambe

Main category: cs.AI

TL;DR: Attention Anchor is a parameter-free framework that groups semantically similar tokens across modalities to improve cross-modal locality in vision-language models, reducing hallucinations and improving performance on various benchmarks.

Details

Motivation: Current VLMs underperform pure language models and hallucinate due to direct concatenation of image and text tokens with modality-blinded positional encoding, which forces unnecessary long-distance attention between semantically related tokens across modalities.

Method: Proposes Attention Anchor framework that inserts text tokens near relevant visual patches to create semantic signposts, enabling true content-based cross-modal attention scores and efficient grouping of semantically similar tokens across modalities.

Result: Achieves improvements across 13 out of 15 metrics/benchmarks, with up to 32% gains on reasoning tasks and 15% improvements on hallucination benchmarks. Enables TinyLLaVA 1B to outperform larger models like LLaVA 7B with only 0.1% inference time overhead.

Conclusion: Attention Anchor effectively addresses cross-modal locality issues in VLMs through mixed-modal token grouping, significantly reducing hallucinations and improving performance without disrupting semantic flow.

Abstract: A fundamental reason for the dominance of attention over RNNs and LSTMs in LLMs is its ability to capture long-range dependencies by modeling direct interactions between all tokens, overcoming the sequential limitations of recurrent architectures. Similarly, a key reason why today’s vision language models (VLMs) hallucinate and underperform pure language models is that they rely on direct concatenation of image and text tokens with a modality-blinded positional encoding, which conveniently adopts the pretrained LLM backbone but forces unnecessary long-distance attention between semantically related tokens across modalities. This underscores the urgent need for mechanisms that efficiently enhance token locality and cross-modal alignment. In response, we propose Attention Anchor, a parameter-free framework that efficiently groups semantically similar tokens across modalities, improving cross-modal locality. By inserting text tokens near relevant visual patches, we create semantic signposts that reveal true content-based cross-modal attention scores, guiding the model to focus on the correct image regions for tasks such as VQA, MMBench and POPE. This improves answer accuracy and reduces hallucinations without disrupting the prompt’s semantic flow. AttAnchor achieves improvements across 13 out of 15 different metrics and benchmarks, including up to 32% gains on reasoning tasks and up to 15% improvements on hallucination benchmarks. AttAnchor enables TinyLLaVA 1B to outperform much larger models like LLaVA 7B and QwenVL 3B on POPE with only 0.1% inference time overhead. To the best of our knowledge, this work is among the first to investigate mixed-modal token grouping, where text and image tokens are clustered jointly into shared groups rather than being grouped within a single modality or merely aligned post-hoc with additional alignment losses.

[885] Exploring LLM-based Frameworks for Fault Diagnosis

Xian Yeow Lee, Lasitha Vidyaratne, Ahmed Farahat, Chetan Gupta

Main category: cs.AI

TL;DR: LLM-based systems show promise for autonomous health monitoring using sensor data, with multi-LLM architectures and statistical inputs performing best, though they struggle with continual learning adaptation.

Details

Motivation: To explore LLM potential for fault detection and classification in industrial environments while producing explainable outputs through natural language reasoning.

Method: Systematic evaluation of LLM-system architecture (single-LLM vs. multi-LLM), input representations (raw vs. descriptive statistics), and context window size effects on diagnostic performance.

Result: LLM systems perform best with summarized statistical inputs, and multi-LLM systems with specialized prompts offer improved sensitivity for fault classification compared to single-LLM systems.

Conclusion: LLMs show promise as transparent diagnostic tools but have limitations in continual learning settings, struggling with prediction calibration during repeated fault cycles.

Abstract: Large Language Model (LLM)-based systems present new opportunities for autonomous health monitoring in sensor-rich industrial environments. This study explores the potential of LLMs to detect and classify faults directly from sensor data, while producing inherently explainable outputs through natural language reasoning. We systematically evaluate how LLM-system architecture (single-LLM vs. multi-LLM), input representations (raw vs. descriptive statistics), and context window size affect diagnostic performance. Our findings show that LLM systems perform most effectively when provided with summarized statistical inputs, and that systems with multiple LLMs using specialized prompts offer improved sensitivity for fault classification compared to single-LLM systems. While LLMs can produce detailed and human-readable justifications for their decisions, we observe limitations in their ability to adapt over time in continual learning settings, often struggling to calibrate predictions during repeated fault cycles. These insights point to both the promise and the current boundaries of LLM-based systems as transparent, adaptive diagnostic tools in complex environments.

[886] Transferring Vision-Language-Action Models to Industry Applications: Architectures, Performance, and Challenges

Shuai Li, Chen Yizhe, Li Dong, Liu Sichao, Lan Dapeng, Liu Yu, Zhibo Pang

Main category: cs.AI

TL;DR: VLA models can handle simple grasping tasks in industrial settings after fine-tuning but struggle with complex environments, diverse objects, and high-precision placing tasks, requiring task-specific enhancements.

Details

Motivation: To evaluate whether Vision Language-Action (VLA) models meet industrial requirements and assess their adaptability for real-world industrial deployment.

Method: Compare performance of state-of-the-art VLA models in industrial scenarios and analyze limitations from data collection and model architecture perspectives.

Result: VLA models retain ability to perform simple grasping tasks after fine-tuning, but show significant performance gaps in complex industrial environments, diverse object categories, and high-precision placing tasks.

Conclusion: VLA models need task-specific enhancements to improve robustness, generalization, and precision for effective industrial deployment, with practical insights provided for their adaptability.

Abstract: The application of artificial intelligence (AI) in industry is accelerating the shift from traditional automation to intelligent systems with perception and cognition. Vision language-action (VLA) models have been a key paradigm in AI to unify perception, reasoning, and control. Has the performance of the VLA models met the industrial requirements? In this paper, from the perspective of industrial deployment, we compare the performance of existing state-of-the-art VLA models in industrial scenarios and analyze the limitations of VLA models for real-world industrial deployment from the perspectives of data collection and model architecture. The results show that the VLA models retain their ability to perform simple grasping tasks even in industrial settings after fine-tuning. However, there is much room for performance improvement in complex industrial environments, diverse object categories, and high precision placing tasks. Our findings provide practical insight into the adaptability of VLA models for industrial use and highlight the need for task-specific enhancements to improve their robustness, generalization, and precision.

[887] HeDA: An Intelligent Agent System for Heatwave Risk Discovery through Automated Knowledge Graph Construction and Multi-layer Risk Propagation Analysis

Yiquan Wang, Tin-Yeh Huang, Qingyun Gao, Jialin Zhang

Main category: cs.AI

TL;DR: HeDA is an AI system that analyzes 10,247 papers to build a knowledge graph and discover overlooked heatwave risk pathways, achieving 78.9% QA accuracy and identifying 5 new high-impact risk chains.

Details

Motivation: Heatwaves create cascading risks across interconnected systems, but scientific literature is fragmented, hindering comprehensive understanding of these risk pathways.

Method: HeDA processes academic papers to construct a comprehensive knowledge graph with 23,156 nodes and 89,472 relationships, then performs multi-layer risk propagation analysis to identify risk transmission pathways.

Result: The system achieves 78.9% accuracy on complex QA tasks (outperforming GPT-4 by 13.7%) and discovers five previously unidentified high-impact risk chains, validated through historical case studies and expert review.

Conclusion: HeDA presents a new paradigm for AI-driven scientific discovery, providing actionable insights for developing more resilient climate adaptation strategies.

Abstract: Heatwaves pose complex cascading risks across interconnected climate, social, and economic systems, but knowledge fragmentation in scientific literature hinders comprehensive understanding of these risk pathways. We introduce HeDA (Heatwave Discovery Agent), an intelligent multi-agent system designed for automated scientific discovery through knowledge graph construction and multi-layer risk propagation analysis. HeDA processes over 10,247 academic papers to construct a comprehensive knowledge graph with 23,156 nodes and 89,472 relationships, employing novel multi-layer risk propagation analysis to systematically identify overlooked risk transmission pathways. Our system achieves 78.9% accuracy on complex question-answering tasks, outperforming state-of-the-art baselines including GPT-4 by 13.7%. Critically, HeDA successfully discovered five previously unidentified high-impact risk chains, such as the pathway where a heatwave leads to a water demand surge, resulting in industrial water restrictions and ultimately causing small business disruption, which were validated through historical case studies and domain expert review. This work presents a new paradigm for AI-driven scientific discovery, providing actionable insights for developing more resilient climate adaptation strategies.

Yue Zhang, Tianyi Ma, Zun Wang, Yanyuan Qiao, Parisa Kordjamshidi

Main category: cs.AI

TL;DR: This paper proposes using multi-perspective textual descriptions to enhance LLM-based Vision-and-Language Navigation agents through analogical reasoning, improving scene understanding and navigation accuracy.

Details

Motivation: Existing zero-shot LLM-based VLN agents either oversimplify visual details by encoding images as text or fail to capture abstract semantics when processing raw images, limiting their contextual understanding for navigation tasks.

Method: The approach incorporates textual descriptions from multiple perspectives to facilitate analogical reasoning across images, enhancing the agent’s global scene understanding and spatial reasoning capabilities.

Result: Experiments on the R2R dataset demonstrate significant improvements in navigation performance compared to existing methods.

Conclusion: Using multi-perspective textual descriptions with analogical reasoning effectively improves LLM-based VLN agents’ contextual understanding and navigation accuracy.

Abstract: Integrating large language models (LLMs) into embodied AI models is becoming increasingly prevalent. However, existing zero-shot LLM-based Vision-and-Language Navigation (VLN) agents either encode images as textual scene descriptions, potentially oversimplifying visual details, or process raw image inputs, which can fail to capture abstract semantics required for high-level reasoning. In this paper, we improve the navigation agent’s contextual understanding by incorporating textual descriptions from multiple perspectives that facilitate analogical reasoning across images. By leveraging text-based analogical reasoning, the agent enhances its global scene understanding and spatial reasoning, leading to more accurate action decisions. We evaluate our approach on the R2R dataset, where our experiments demonstrate significant improvements in navigation performance.

[889] SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

Qian Cheng, Ruize Tang, Emilie Ma, Finn Hackett, Peiyang He, Yiming Su, Ivan Beschastnikh, Yu Huang, Xiaoxing Ma, Tianyin Xu

Main category: cs.AI

TL;DR: SysMoBench is a benchmark for evaluating AI’s ability to generate formal models of large, complex concurrent and distributed systems using TLA+ specifications.

Details

Motivation: Formal models are expensive to write and maintain for complex systems, and existing AI approaches mostly target small code rather than complete systems. There's a need to understand if AI can handle realistic system artifacts.

Method: Created SysMoBench with automated metrics for syntactic correctness, runtime correctness, conformance to system code, and invariant correctness. Currently includes nine diverse system artifacts like Etcd’s Raft, Redis, Asterinas OS components.

Result: The benchmark enables systematic evaluation of LLMs and agents in generating formal specifications for complex systems.

Conclusion: SysMoBench provides a foundation for understanding AI capabilities in formal modeling and opens new research directions in this area.

Abstract: Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of specifications. However, existing work mostly targets small code, not complete systems. It is unclear whether AI can deal with realistic system artifacts, as this requires abstracting their complex behavioral properties into formal models. We present SysMoBench, a benchmark that evaluates AI’s ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today’s critical computing infrastructures, encompassing operating systems and cloud infrastructure. We use TLA+, the it de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages. We address the primary challenge of evaluating AI-generated models by automating metrics like syntactic and runtime correctness, conformance to system code, and invariant correctness. SysMoBench currently includes nine diverse system artifacts: the Raft implementation of Etcd and Redis, the Spinlock and Mutex in Asterinas OS, etc.; more artifacts are being actively added. SysMoBench enables us to understand the capabilities and limitations of today’s LLMs and agents, putting tools in this area on a firm footing and opening up promising new research directions.

[890] MathBode: Frequency-Domain Fingerprints of LLM Mathematical Reasoning

Charles L. Wang

Main category: cs.AI

TL;DR: MathBode is a dynamic diagnostic tool that analyzes mathematical reasoning in LLMs using frequency-domain analysis of parameter variations, revealing systematic low-pass behavior and phase lag that accuracy metrics miss.

Details

Motivation: Standard one-shot accuracy metrics fail to capture the dynamic reasoning capabilities of LLMs. The authors aim to develop a more nuanced diagnostic that reveals how models handle parameter variations and mathematical relationships over time.

Method: Drive a single parameter sinusoidally through mathematical problems and fit first-harmonic responses of model outputs vs. exact solutions. This yields frequency-resolved metrics (gain and phase) that form Bode-style fingerprints across five mathematical problem families.

Result: The diagnostic reveals systematic low-pass behavior and growing phase lag in LLMs that accuracy metrics obscure. Results separate frontier models from mid-tier models based on their dynamic reasoning capabilities, with symbolic baselines showing near-ideal performance (G≈1, φ≈0).

Conclusion: MathBode provides a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency, offering deeper insights into mathematical reasoning capabilities of LLMs.

Abstract: This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics – gain (amplitude tracking) and phase (lag) – that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\phi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.

[891] AI-Enhanced Distributed Channel Access for Collision Avoidance in Future Wi-Fi 8

Jinzhe Pan, Jingqing Wang, Yuehui Ouyang, Wenchi Cheng, Wei Zhang

Main category: cs.AI

TL;DR: A multi-agent reinforcement learning framework for Wi-Fi channel access that improves collision resolution and fairness while maintaining backward compatibility with legacy devices.

Details

Motivation: Current Wi-Fi systems using binary exponential backoff suffer from suboptimal collision resolution in dense deployments and persistent fairness challenges due to inherent randomness.

Method: Developed dynamic backoff selection mechanism, fairness quantification metric, and centralized training decentralized execution architecture using constrained multi-agent proximal policy optimization.

Result: Significantly reduces collision probability compared to conventional BEB while preserving backward compatibility with commercial Wi-Fi devices, and effectively eliminates starvation risks in heterogeneous scenarios.

Conclusion: The AI-optimized framework successfully improves distributed channel access mechanisms for unlicensed bands while ensuring coexistence with legacy devices.

Abstract: The exponential growth of wireless devices and stringent reliability requirements of emerging applications demand fundamental improvements in distributed channel access mechanisms for unlicensed bands. Current Wi-Fi systems, which rely on binary exponential backoff (BEB), suffer from suboptimal collision resolution in dense deployments and persistent fairness challenges due to inherent randomness. This paper introduces a multi-agent reinforcement learning framework that integrates artificial intelligence (AI) optimization with legacy device coexistence. We first develop a dynamic backoff selection mechanism that adapts to real-time channel conditions through access deferral events while maintaining full compatibility with conventional CSMA/CA operations. Second, we introduce a fairness quantification metric aligned with enhanced distributed channel access (EDCA) principles to ensure equitable medium access opportunities. Finally, we propose a centralized training decentralized execution (CTDE) architecture incorporating neighborhood activity patterns as observational inputs, optimized via constrained multi-agent proximal policy optimization (MAPPO) to jointly minimize collisions and guarantee fairness. Experimental results demonstrate that our solution significantly reduces collision probability compared to conventional BEB while preserving backward compatibility with commercial Wi-Fi devices. The proposed fairness metric effectively eliminates starvation risks in heterogeneous scenarios.

[892] Limit Analysis for Symbolic Multi-step Reasoning Tasks with Information Propagation Rules Based on Transformers

Tian Qin, Yuhan Chen, Zhiwei Wang, Zhi-Qin John Xu

Main category: cs.AI

TL;DR: The paper analyzes Transformers’ reasoning capabilities and establishes theoretical bounds on their reasoning steps.

Details

Motivation: To understand the intrinsic reasoning mechanisms of Transformers, which remain widely unknown despite their ability to perform reasoning tasks.

Method: Proposes information propagation rules based on Transformers and uses symbolic reasoning tasks to theoretically analyze the limit reasoning steps.

Result: Shows that the limit number of reasoning steps is between O(3^(L-1)) and O(2^(L-1)) for a model with L attention layers in a single pass.

Conclusion: Transformers have bounded reasoning capabilities with specific complexity bounds depending on the number of attention layers.

Abstract: Transformers are able to perform reasoning tasks, however the intrinsic mechanism remains widely open. In this paper we propose a set of information propagation rules based on Transformers and utilize symbolic reasoning tasks to theoretically analyze the limit reasoning steps. We show that the limit number of reasoning steps is between $O(3^{L-1})$ and $O(2^{L-1})$ for a model with $L$ attention layers in a single-pass.

[893] Understanding and Enhancing the Planning Capability of Language Models via Multi-Token Prediction

Qimin Zhong, Hao Liao, Siwei Wang, Mingyang Zhou, Xiaoqun Wu, Rui Mao, Wei Chen

Main category: cs.AI

TL;DR: The paper investigates how Multi-Token Prediction (MTP) helps Transformers learn transitive relations for complex planning tasks, proposing Next-Token Injection and Transformer-based transfer layer to enhance path-planning capabilities.

Details

Motivation: LLMs struggle with learning transitive relations, which are crucial for complex planning tasks, creating a bottleneck in their reasoning capabilities.

Method: Theoretical analysis of MTP paradigm with Transformer architecture, proposing Next-Token Injection and Transformer-based transfer layer to improve transfer layer quality and capture transitive reachability relations.

Result: Experiments on synthetic graphs and Blocksworld planning benchmark validate that MTP enables models to capture unobserved transitive relations beyond training data, with proposed improvements significantly enhancing path-planning capability.

Conclusion: The findings deepen understanding of how Transformers with MTP learn complex planning tasks and provide practical strategies to overcome the transitivity bottleneck, enabling structurally aware planning models.

Abstract: Large Language Models (LLMs) have achieved impressive performance across diverse tasks but continue to struggle with learning transitive relations, a cornerstone for complex planning. To address this issue, we investigate the Multi-Token Prediction (MTP) paradigm and its impact to transitive relation learning. We theoretically analyze the MTP paradigm using a Transformer architecture composed of a shared output head and a transfer layer. Our analysis reveals that the transfer layer gradually learns the multi-step adjacency information, which in turn enables the backbone model to capture unobserved transitive reachability relations beyond those directly present in the training data, albeit with some inevitable noise in adjacency estimation. Building on this foundation, we propose two strategies to enhance the transfer layer and overall learning quality: Next-Token Injection (NTI) and a Transformer-based transfer layer. Our experiments on both synthetic graphs and the Blocksworld planning benchmark validate our theoretical findings and demonstrate that the improvements significantly enhance the model’s path-planning capability. These findings deepen our understanding of how Transformers with MTP learn in complex planning tasks, and provide practical strategies to overcome the transitivity bottleneck, paving the way toward structurally aware and general-purpose planning models.

[894] AutoEP: LLMs-Driven Automation of Hyperparameter Evolution for Metaheuristic Algorithms

Zhenxing Xu, Yizhe Zhang, Weidong Bao, Hao Wang, Ming Chen, Haoran Ye, Wenzheng Jiang, Hui Yan, Ji Wang

Main category: cs.AI

TL;DR: AutoEP is a novel framework that uses Large Language Models (LLMs) as zero-shot reasoning engines for automated hyperparameter configuration, combining online Exploratory Landscape Analysis with multi-LLM reasoning chains to achieve state-of-the-art performance without training.

Details

Motivation: Traditional learning-based methods for hyperparameter configuration suffer from high sample complexity and poor generalization, creating a need for more efficient and generalizable automated approaches.

Method: AutoEP uses two key components: (1) an online Exploratory Landscape Analysis module for real-time quantitative feedback on search dynamics, and (2) a multi-LLM reasoning chain that interprets this feedback to generate adaptive hyperparameter strategies, grounding high-level reasoning in empirical data.

Result: AutoEP consistently outperforms state-of-the-art tuners across three distinct metaheuristics on diverse combinatorial optimization benchmarks, and enables open-source models like Qwen3-30B to match GPT-4’s performance.

Conclusion: AutoEP demonstrates a powerful and accessible new paradigm for automated hyperparameter design that bypasses training entirely and achieves superior performance through zero-shot reasoning with LLMs.

Abstract: Dynamically configuring algorithm hyperparameters is a fundamental challenge in computational intelligence. While learning-based methods offer automation, they suffer from prohibitive sample complexity and poor generalization. We introduce AutoEP, a novel framework that bypasses training entirely by leveraging Large Language Models (LLMs) as zero-shot reasoning engines for algorithm control. AutoEP’s core innovation lies in a tight synergy between two components: (1) an online Exploratory Landscape Analysis (ELA) module that provides real-time, quantitative feedback on the search dynamics, and (2) a multi-LLM reasoning chain that interprets this feedback to generate adaptive hyperparameter strategies. This approach grounds high-level reasoning in empirical data, mitigating hallucination. Evaluated on three distinct metaheuristics across diverse combinatorial optimization benchmarks, AutoEP consistently outperforms state-of-the-art tuners, including neural evolution and other LLM-based methods. Notably, our framework enables open-source models like Qwen3-30B to match the performance of GPT-4, demonstrating a powerful and accessible new paradigm for automated hyperparameter design. Our code is available at https://anonymous.4open.science/r/AutoEP-3E11

[895] $p$-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding

Runyan Tan, Shuang Wu, Phillip Howard

Main category: cs.AI

TL;DR: p-less sampling is a hyperparameter-free decoding strategy that dynamically sets truncation thresholds based on token probability distributions, outperforming existing methods across various tasks while maintaining quality at higher temperatures.

Details

Motivation: Existing sampling methods for LLMs are sensitive to hyperparameter settings and performance degrades at higher temperatures, requiring different configurations for different tasks.

Method: Information-theoretic approach that dynamically sets truncation thresholds at each decoding step based on the entire token probability distribution, eliminating hyperparameters.

Result: Consistently outperforms existing sampling approaches across math, logical reasoning, and creative writing tasks; maintains text quality at higher temperatures; achieves greater inference-time efficiency with lower average token sampling times and shorter generation lengths.

Conclusion: p-less sampling provides a robust, hyperparameter-free alternative to existing decoding strategies that maintains high performance across diverse tasks and temperature settings while improving efficiency.

Abstract: Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments.

[896] Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration

Yang Zhang, Shixin Yang, Chenjia Bai, Fei Wu, Xiu Li, Zhen Wang, Xuelong Li

Main category: cs.AI

TL;DR: Proposes ReAd framework using reinforced advantage feedback for efficient multi-agent LLM planning, reducing LLM queries while improving task success.

Details

Motivation: Existing LLM planning methods for multi-agent collaboration rely heavily on physical verification or self-reflection, leading to excessive and inefficient LLM querying.

Method: Uses critic regression to learn sequential advantage function from LLM-planned data, then treats LLM planner as optimizer to maximize advantage function, providing foresight for action contribution.

Result: Surpasses baselines in success rate on Overcooked-AI and RoCoBench, significantly decreases agent interaction steps and LLM query rounds.

Conclusion: ReAd framework enables efficient self-refinement of plans for multi-agent collaboration with reduced computational overhead while maintaining high performance.

Abstract: Grounding the reasoning ability of large language models (LLMs) for embodied tasks is challenging due to the complexity of the physical world. Especially, LLM planning for multi-agent collaboration requires communication of agents or credit assignment as the feedback to re-adjust the proposed plans and achieve effective coordination. However, existing methods that overly rely on physical verification or self-reflection suffer from excessive and inefficient querying of LLMs. In this paper, we propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Specifically, we perform critic regression to learn a sequential advantage function from LLM-planned data, and then treat the LLM planner as an optimizer to generate actions that maximize the advantage function. It endows the LLM with the foresight to discern whether the action contributes to accomplishing the final task. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents and query rounds of LLMs, demonstrating its high efficiency for grounding LLMs. More results are given at https://embodied-read.github.io

[897] Agentic AI Reasoning for Mobile Edge General Intelligence: Fundamentals, Approaches, and Directions

Mingyi Luo, Ruichen Zhang, Xiangwang Hou, Jun Du, Chunxiao Jiang, Yong Ren, Dusit Niyato, Shiwen Mao

Main category: cs.AI

TL;DR: Proposes a joint optimization framework for efficient LLM reasoning deployment in Mobile Edge General Intelligence (MEGI) environments, addressing computational challenges through adaptive CoT prompting and distributed MoE architecture.

Details

Motivation: The integration of LLM-based agentic AI with edge computing creates MEGI for real-time, privacy-preserving reasoning, but faces challenges due to high computational demands of reasoning and limited edge device resources.

Method: A distributed framework with two correlated aspects: reasoning enhancement through adaptive Chain-of-Thought (CoT) prompting and scalable deployment through distributed Mixture of Experts (MoE) architecture that dynamically activates expert networks and adjusts reasoning depth based on task complexity and device capabilities.

Result: Experimental evaluations in mobile edge environments demonstrate the framework’s effectiveness in balancing reasoning quality with resource efficiency.

Conclusion: The framework validates the practical viability of deploying sophisticated LLM reasoning capabilities in resource-constrained MEGI environments.

Abstract: The rapid advancement of large language models (LLMs) has enabled an emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. This integration with edge computing has led to the development of Mobile Edge General Intelligence (MEGI), which brings real-time, privacy-preserving reasoning to the network edge. However, deploying LLM-based agentic AI reasoning in MEGI environments poses significant challenges due to the high computational demands of reasoning and the limited resources of edge devices. To address these challenges, we propose a joint optimization framework for efficient LLM reasoning deployment in MEGI. First, we review methods that enhance LLM reasoning capabilities, such as Chain-of-Thought (CoT) prompting, Supervised Fine-Tuning (SFT), and Mixture of Experts (MoE). Next, we present a distributed framework that addresses two correlated aspects: reasoning enhancement through adaptive CoT prompting and scalable deployment through distributed MoE architecture. The framework dynamically activates expert networks and adjusts reasoning depth based on task complexity and device capabilities. We further conduct experimental evaluations in mobile edge environments. Experimental results demonstrate the framework’s effectiveness in balancing reasoning quality with resource efficiency, validating the practical viability of deploying sophisticated LLM reasoning capabilities in resource-constrained MEGI environments.

[898] Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned

Brandon Ong, Tej Deep Pala, Vernon Toh, William Chandra Tjhi, Soujanya Poria

Main category: cs.AI

TL;DR: This paper explores Vision-Language Process Reward Models (VL-PRMs) for improving reasoning in VLMs, introducing hybrid data synthesis, perception-focused supervision, and systematic test-time scaling strategies that outperform existing methods across multiple multimodal benchmarks.

Details

Motivation: While Process Reward Models (PRMs) have been well-studied in text domains, their extension to Vision Language Models (VLMs) remains limited. Existing VL-PRMs rely on noisy Monte Carlo Tree Search for data construction, which limits generalization across tasks.

Method: Proposes a hybrid data synthesis framework combining MCTS with strong VLM judgments, perception-focused supervision for detecting visual grounding errors, and systematic evaluation of multiple test-time scaling strategies.

Result: Experiments on five multimodal benchmarks show VL-PRMs can outperform process step selection when used as Outcome Reward Models, smaller VL-PRMs can match larger ones in error detection, and perception-level supervision leads to significant test-time scaling gains.

Conclusion: The work provides key insights into VL-PRM design space and demonstrates their effectiveness in uncovering latent reasoning abilities in VLMs, motivating further research in this area.

Abstract: Process Reward Models (PRMs) provide step-level supervision that improves the reliability of reasoning in large language models. While PRMs have been extensively studied in text-based domains, their extension to Vision Language Models (VLMs) remains limited. Existing Vision-Language PRMs (VL-PRMs) rely on Monte Carlo Tree Search (MCTS) for data construction, which can often produce noisy supervision signals and limit generalization across tasks. In this work, we aim to elucidate the design space of VL-PRMs by exploring diverse strategies for dataset construction, training, and test-time scaling. First, we introduce a hybrid data synthesis framework that combines MCTS with judgments from a strong VLM, producing more accurate step-level labels. Second, we propose perception-focused supervision, enabling our PRM to explicitly detect errors at the visual grounding stage of reasoning. Third, we systematically evaluate multiple test-time scaling strategies, showing that our PRMs can reliably guide VLMs toward more accurate solutions. Our experiments covering five diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision) reveal several key insights: (i) VL-PRMs when used as Outcome Reward Models (ORMs) during test-time scaling (TTS) can outperform VL-PRM guided process step selection, (ii) smaller VL-PRMs can match or even surpass larger ones in detecting process errors, (iii) VL-PRMs uncover latent reasoning abilities in stronger VLM backbones, (iv) perception-level supervision leads to significant gains in test-time scaling, and (v) TTS performance of different policies improve on advanced math reasoning datasets despite not training VL-PRMs on such datasets. We hope our work will motivate further research and support the advancement of VLMs.

[899] GUI-PRA: Process Reward Agent for GUI Tasks

Tao Xiong, Xavier Hu, Yurun Chen, Yuhang Liu, Changqiao Wu, Pengzhi Gao, Wei Liu, Jian Luan, Shengyu Zhang

Main category: cs.AI

TL;DR: GUI-PRA addresses challenges in GUI task automation by introducing dynamic memory mechanisms and adaptive UI perception to overcome the ’lost in the middle’ phenomenon and provide better process rewards than standard PRMs.

Details

Motivation: GUI Agents struggle with long-horizon tasks due to overwhelming historical context (lost in the middle phenomenon) and lack of GUI changing awareness, leading to frequent failures in dynamic GUI environments.

Method: Introduces GUI-PRA with two key mechanisms: 1) Dynamic memory mechanism with Relevance-based Retrieval Module and Progressive Summarization Module to handle long histories, 2) Adaptive UI Perception mechanism that reasons about UI state changes and selects appropriate tools for visual evidence.

Result: The proposed approach enables better process reward provision by intelligently processing historical context and actively perceiving UI state changes, overcoming limitations of standard PRMs in GUI domains.

Conclusion: GUI-PRA represents a significant advancement for GUI task automation by addressing critical challenges in process reward modeling through dynamic memory management and adaptive UI awareness, improving agent performance on long-horizon tasks.

Abstract: Graphical User Interface (GUI) Agents powered by Multimodal Large Language Models (MLLMs) show significant potential for automating tasks. However, they often struggle with long-horizon tasks, leading to frequent failures. Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference. Nevertheless, their application to the GUI domain presents unique challenges. When processing dense artificial inputs with long history data, PRMs suffer from a “lost in the middle” phenomenon, where the overwhelming historical context compromises the evaluation of the current step. Furthermore, standard PRMs lacks GUI changing awareness, providing static evaluations that are disconnected from the dynamic consequences of actions, a critical mismatch with the inherently dynamic nature of GUI tasks. In response to these challenges, we introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to better provide process reward than standard PRM by intelligently processing historical context and actively perceiving UI state changes. Specifically, to directly combat the ``lost in the middle’’ phenomenon, we introduce a dynamic memory mechanism consisting of two core components: a Relevance-based Retrieval Module to actively fetch pertinent information from long histories and a Progressive Summarization Module to dynamically condense growing interaction data, ensuring the model focuses on relevant context. Moreover, to address the lack of UI changing awareness, we introduce an Aadaptive UI Perception mechanism. This mechanism enables the agent to reason about UI state changes and dynamically select the most appropriate tool to gather grounded visual evidence, ensuring its evaluation is always informed by the current UI context.

[900] Socio-Economic Model of AI Agents

Yuxinyue Qian, Jun Liu

Main category: cs.AI

TL;DR: AI collaboration with humans under resource constraints significantly boosts social output, with network effects and independent agent production enabling nonlinear growth and increasing returns to scale.

Details

Motivation: To study how AI integration affects socio-economic systems and aggregate social output under resource constraints, given the ongoing deep integration of AI technologies.

Method: Constructed five heterogeneous agent-based models: baseline human collaboration, AI collaborators, network effects, independent agent production, and combined network effects with independent production. Used theoretical derivation and simulation analysis.

Result: AI agents significantly increase aggregate social output. Network effects create nonlinear growth exceeding individual contributions. Independent agent production provides higher long-term growth potential. Network effects demonstrate increasing returns to scale.

Conclusion: AI collaboration with network effects and independent production capabilities can dramatically enhance social output through nonlinear growth mechanisms and increasing returns to scale under resource constraints.

Abstract: Modern socio-economic systems are undergoing deep integration with artificial intelligence technologies. This paper constructs a heterogeneous agent-based modeling framework that incorporates both human workers and autonomous AI agents, to study the impact of AI collaboration under resource constraints on aggregate social output. We build five progressively extended models: Model 1 serves as the baseline of pure human collaboration; Model 2 introduces AI as collaborators; Model 3 incorporates network effects among agents; Model 4 treats agents as independent producers; and Model 5 integrates both network effects and independent agent production. Through theoretical derivation and simulation analysis, we find that the introduction of AI agents can significantly increase aggregate social output. When considering network effects among agents, this increase exhibits nonlinear growth far exceeding the simple sum of individual contributions. Under the same resource inputs, treating agents as independent producers provides higher long-term growth potential; introducing network effects further demonstrates strong characteristics of increasing returns to scale.

[901] Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning

Yifei Chen, Guanting Dong, Zhicheng Dou

Main category: cs.AI

TL;DR: Tool-Light is a framework that improves Tool-Integrated Reasoning (TIR) in LLMs by addressing suboptimal behaviors like insufficient/excessive tool usage and overthinking through entropy analysis and multi-stage fine-tuning.

Details

Motivation: Current LLMs using TIR exhibit inefficient behaviors including insufficient or excessive tool usage and overthinking after tool calls. The challenge is to incentivize LLMs to perform TIR efficiently and accurately while stabilizing the reasoning process.

Method: Proposed Tool-Light framework with dataset construction using continuous self-evolved sampling (vanilla + entropy-guided sampling) and strict positive-negative pair selection criteria, followed by two-stage fine-tuning: Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference Optimization (DPO).

Result: Experimental results on 10 datasets demonstrate Tool-Light’s effectiveness in significantly improving model efficiency for TIR tasks.

Conclusion: Tool-Light successfully addresses TIR inefficiencies by leveraging information entropy insights and multi-stage fine-tuning, enabling more efficient and accurate tool-integrated reasoning in LLMs.

Abstract: Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools. However, models employing TIR often display suboptimal behaviors, such as insufficient or excessive tool usage and overthinking after tool calls. The challenge of incentivizing LLMs to perform TIR efficiently and accurately, while stabilizing the reasoning process, remains an open question. In this paper, we start by exploring the impact of tool calls on model reasoning from the perspective of information entropy. Our findings indicate that tool call results lead to a distinct change in the information entropy of subsequent reasoning, with the overall entropy of the reasoning chain varying based on the number of tool calls. Building on these insights, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Our framework includes dataset construction and multi-stage fine-tuning. For dataset construction, we employ continuous self-evolved sampling using the fine-tuned model, integrating both vanilla sampling and entropy-guided sampling. Besides, we establish strict criteria for selecting positive-negative pairs during sampling. The training process involves a two-stage approach, comprising Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference Optimization (DPO). Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light, significantly improving the model’s efficiency in executing TIR tasks.

[902] Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning

Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta

Main category: cs.AI

TL;DR: The paper proposes a pattern-aware approach for tool-integrated reasoning that improves code usage and accuracy by aligning tool application patterns with teacher preferences.

Details

Motivation: Prior work on tool-integrated reasoning mainly focused on when to invoke tools but overlooked how tools are applied, leading to failures even when reasoning is sound due to misaligned pattern choices.

Method: A two-stage framework that first builds code competence from both calculator and algorithmic patterns, then aligns pattern selection with teacher preferences.

Result: Substantial improvements in code usage and accuracy: Code@1 on MATH500 increased from 64.0% to 70.5% and on AIME24 from 26.7% to 50.0%.

Conclusion: Pattern-aware approaches are highly effective for tool-integrated reasoning, demonstrating significant gains across challenging math datasets.

Abstract: Tool-integrated reasoning (TIR) has become a key approach for improving large reasoning models (LRMs) on complex problems. Prior work has mainly studied when to invoke tools, while overlooking how tools are applied. We identify two common patterns: a calculator pattern that uses code for direct computation, and an algorithmic pattern that encodes problems as programs. Misaligned choices often cause failures even when reasoning is sound. We propose a two-stage framework that first builds code competence from both patterns and then aligns pattern selection with teacher preferences. Across challenging math datasets, our pattern-aware method substantially improves both code usage and accuracy, for instance raising Code@1 on MATH500 from 64.0% to 70.5% and on AIME24 from 26.7% to 50.0%. These gains highlight the effectiveness of a pattern-aware approach for tool-integrated reasoning.

[903] Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking

Jinyi Han, Ying Huang, Ying Liao, Zishang Jiang, Xikun Lu, Haiquan Zhao, Xinyi Wang, Guanghao Zhou, Sihang Jiang, Jiaqing Liang, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao

Main category: cs.AI

TL;DR: JET trains Large Reasoning Models to proactively terminate unnecessary reasoning steps, achieving significant efficiency gains without sacrificing accuracy.

Details

Motivation: LRMs incur substantial computational costs due to deep reasoning, and existing methods struggle to construct short reasoning paths during rollout.

Method: JET performs trajectory truncation during rollout to expose models to short reasoning paths and uses a quality-controlled length reward to encourage concise reasoning while maintaining correctness.

Result: JET significantly improves reasoning efficiency without sacrificing accuracy, with DeepSeek-Distill-Qwen-1.5B achieving 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark.

Conclusion: JET effectively enables LRMs to terminate unnecessary reasoning steps early, providing substantial efficiency improvements while maintaining or even improving accuracy.

Abstract: Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. To achieve efficient reasoning, existing reinforcement learning methods still struggle to construct short reasoning path during the rollout stage, limiting effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. Besides, it uses a quality-controlled length reward to better encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. Especially, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available in the GitHub.

[904] Hierarchical Task Environments as the Next Frontier for Embodied World Models in Robot Soccer

Brennen Hill

Main category: cs.AI

TL;DR: The paper argues for scaling structural complexity in embodied world models through hierarchical scaffolding rather than just increasing model size or environment fidelity, using multi-agent soccer as a case study.

Details

Motivation: Current end-to-end approaches fail for complex multi-agent tasks due to intractable exploration spaces and sparse rewards, necessitating more structured methods.

Method: Integrates symbolic and hierarchical methods (HTNs, BSNs) with MARL to decompose complex goals into subgoals, creating intrinsic curricula. Proposes using LLMs as generative world models for dynamic scaffolding.

Result: Identified trend towards hierarchical methods in 2024 multi-agent soccer research, showing these approaches can guide exploration more efficiently and generate better learning signals.

Conclusion: Structured environments with explicit, composable task layers are essential for training capable agents with fewer resources than purely end-to-end approaches, and this principle can generalize to other complex domains.

Abstract: Recent advances in agent development have focused on scaling model size and raw interaction data, mirroring the successes seen in large language models. However, for complex, long-horizon multi-agent tasks such as robotic soccer, this end-to-end approach often fails due to intractable exploration spaces and sparse rewards. This position paper argues that the next frontier in developing embodied world models is not merely increasing the fidelity or size of environments, but scaling their structural complexity through explicit hierarchical scaffolding. We posit that an effective world model for decision-making must model not only the world’s physics but also its task semantics. Drawing from a systematic review of 2024 research in low-resource multi-agent soccer, we identify a clear trend towards integrating symbolic and hierarchical methods, such as Hierarchical Task Networks (HTNs) and Bayesian Strategy Networks (BSNs), with multi-agent reinforcement learning (MARL). These methods decompose complex goals into manageable subgoals, creating an intrinsic curriculum that shapes agent learning. We propose that such structured environments are essential for bridging the gap between simple, reactive behaviors and sophisticated, strategic team play. We further extend this principle, proposing that this scaffolding can be generalized to other complex domains and dynamically generated by Large Language Models (LLMs), which act as generative world models of tasks. By building environments with explicit, composable task layers, we can guide agent exploration more efficiently, generate meaningful learning signals, and ultimately train more capable and general-purpose agents with fewer resources than purely end-to-end approaches.

[905] From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents

Gyubok Lee, Woosog Chay, Heeyoung Kwak, Yeong Hwa Kim, Haanju Yoo, Oksoon Jeong, Meong Hi Son, Edward Choi

Main category: cs.AI

TL;DR: EHR-ChatQA is a benchmark for evaluating LLM-powered agents in EHR data access, addressing query ambiguity and value mismatch through two interaction flows (IncreQA and AdaptQA), revealing performance gaps between best-case and consistent success rates.

Details

Motivation: Current benchmarks don't adequately capture real-world clinical data access flows, limiting adoption of LLM-powered agents in EHR systems due to query ambiguity from vague user questions and value mismatch between user terminology and database entries.

Method: Introduced EHR-ChatQA benchmark with simulated LLM-based user environment across two interaction flows: Incremental Query Refinement (IncreQA) where users add constraints, and Adaptive Query Refinement (AdaptQA) where users adjust search goals mid-conversation.

Result: State-of-the-art LLMs achieved high Pass@5 of 90-95% on IncreQA and 60-80% on AdaptQA, but their Pass^5 (consistent success across all trials) was substantially lower by 35-60%, showing reliability gaps.

Conclusion: Agents need to be both performant and robust for safety-critical EHR domain, with diagnostic insights provided to guide future development of more reliable database agents.

Abstract: Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while agents achieve high Pass@5 of 90-95% (at least one of five trials) on IncreQA and 60-80% on AdaptQA, their Pass^5 (consistent success across all five trials) is substantially lower by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development.

[906] Democratizing AI scientists using ToolUniverse

Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, Marinka Zitnik

Main category: cs.AI

TL;DR: ToolUniverse is an ecosystem for building AI scientists that standardizes tool integration and usage, enabling interoperability across 600+ ML models, datasets, APIs, and scientific packages.

Details

Motivation: Current AI scientist systems are bespoke, tied to rigid workflows, and lack shared environments that unify tools, data, and analyses into a common ecosystem, similar to how unified ecosystems transformed omics research.

Method: ToolUniverse standardizes how AI scientists identify and call tools, automatically refines tool interfaces, creates new tools from natural language descriptions, iteratively optimizes tool specifications, and composes tools into agentic workflows.

Result: In a hypercholesterolemia case study, ToolUniverse was used to create an AI scientist that identified a potent drug analog with favorable predicted properties.

Conclusion: ToolUniverse provides the necessary infrastructure for building AI scientists from any language or reasoning model, enabling collaborative discovery through standardized tool integration and workflow composition.

Abstract: AI scientists are emerging computational systems that serve as collaborative partners in discovery. These systems remain difficult to build because they are bespoke, tied to rigid workflows, and lack shared environments that unify tools, data, and analyses into a common ecosystem. In omics, unified ecosystems have transformed research by enabling interoperability, reuse, and community-driven development; AI scientists require comparable infrastructure. We present ToolUniverse, an ecosystem for building AI scientists from any language or reasoning model, whether open or closed. TOOLUNIVERSE standardizes how AI scientists identify and call tools, integrating more than 600 machine learning models, datasets, APIs, and scientific packages for data analysis, knowledge retrieval, and experimental design. It automatically refines tool interfaces for correct use by AI scientists, creates new tools from natural language descriptions, iteratively optimizes tool specifications, and composes tools into agentic workflows. In a case study of hypercholesterolemia, ToolUniverse was used to create an AI scientist to identify a potent analog of a drug with favorable predicted properties. The open-source ToolUniverse is available at https://aiscientist.tools.

[907] Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity

Charles E. Gagnon, Steven H. H. Ding, Philippe Charland, Benjamin C. M. Fung

Main category: cs.AI

TL;DR: A new method for binary code similarity detection that uses language model-based agents to generate structured, human-readable features from assembly code, bridging gaps between interpretability, generalizability, and scalability.

Details

Motivation: Current binary code similarity detection methods force compromises between interpretability (hand-crafted features), generalizability (embedding methods), and scalability. Embedding-based methods produce opaque vectors that prevent rapid verification and face scalability-accuracy trade-offs.

Method: Uses a language model-based agent to conduct structured reasoning analysis of assembly code, generating human-readable features such as input/output types, side effects, notable constants, and algorithmic intent. These features are directly searchable with inverted or relational indexes.

Result: Without any matching training, achieves 42% recall@1 in cross-architecture tasks and 62% in cross-optimization tasks, comparable to embedding methods with training (39% and 34%). When combined with embeddings, significantly outperforms state-of-the-art methods.

Conclusion: Demonstrates that accuracy, scalability, and interpretability can coexist in binary code similarity detection by combining structured reasoning features with embeddings, overcoming limitations of both hand-crafted and embedding-based approaches.

Abstract: Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identifying semantically similar code in different contexts. Modern methods have progressed from manually engineered features to vector representations. Hand-crafted statistics (e.g., operation ratios) are interpretable, but shallow and fail to generalize. Embedding-based methods overcome this by learning robust cross-setting representations, but these representations are opaque vectors that prevent rapid verification. They also face a scalability-accuracy trade-off, since high-dimensional nearest-neighbor search requires approximations that reduce precision. Current approaches thus force a compromise between interpretability, generalizability, and scalability. We bridge these gaps using a language model-based agent to conduct structured reasoning analysis of assembly code and generate features such as input/output types, side effects, notable constants, and algorithmic intent. Unlike hand-crafted features, they are richer and adaptive. Unlike embeddings, they are human-readable, maintainable, and directly searchable with inverted or relational indexes. Without any matching training, our method respectively achieves 42% and 62% for recall@1 in cross-architecture and cross-optimization tasks, comparable to embedding methods with training (39% and 34%). Combined with embeddings, it significantly outperforms the state-of-the-art, demonstrating that accuracy, scalability, and interpretability can coexist.

[908] ViTSP: A Vision Language Models Guided Framework for Large-Scale Traveling Salesman Problems

Zhuoli Yin, Yi Ding, Reem Khir, Hua Cai

Main category: cs.AI

TL;DR: ViTSP is a novel framework that uses pre-trained vision language models to visually identify promising subproblems in large-scale TSP instances, which are then optimized using existing solvers to achieve high-quality solutions without dedicated training.

Details

Motivation: Classical TSP methods face scaling challenges, heuristic methods require parameter calibration, and learning-based approaches suffer from poor generalization and limited scalability due to fixed training data.

Method: Leverages pre-trained vision language models to visually identify promising small-scale subproblems from visualized TSP instances, then uses off-the-shelf solvers to optimize these subproblems and improve the global solution.

Result: Achieves average optimality gaps below 0.2% on real-world TSP instances ranging from 1k to 88k nodes, outperforming learning-based methods and reducing LKH-3’s gaps by 12% to 100% under same runtime budget.

Conclusion: ViTSP offers a new perspective in hybridizing pre-trained generative models and operations research solvers for combinatorial optimization, with practical implications for integration into complex logistics systems.

Abstract: Solving Traveling Salesman Problem (TSP) is NP-hard yet fundamental for wide real-world applications. Classical exact methods face challenges in scaling, and heuristic methods often require domain-specific parameter calibration. While learning-based approaches have shown promise, they suffer from poor generalization and limited scalability due to fixed training data. This work proposes ViTSP, a novel framework that leverages pre-trained vision language models (VLMs) to visually guide the solution process for large-scale TSPs. The VLMs function to identify promising small-scale subproblems from a visualized TSP instance, which are then efficiently optimized using an off-the-shelf solver to improve the global solution. ViTSP bypasses the dedicated model training at the user end while maintaining effectiveness across diverse instances. Experiments on real-world TSP instances ranging from 1k to 88k nodes demonstrate that ViTSP consistently achieves solutions with average optimality gaps below 0.2%, outperforming existing learning-based methods. Under the same runtime budget, it surpasses the best-performing heuristic solver, LKH-3, by reducing its gaps by 12% to 100%, particularly on very-large-scale instances with more than 10k nodes. Our framework offers a new perspective in hybridizing pre-trained generative models and operations research solvers in solving combinatorial optimization problems, with practical implications for integration into more complex logistics systems. The code is available at https://anonymous.4open.science/r/ViTSP_codes-6683.

[909] GeoBS: Information-Theoretic Quantification of Geographic Bias in AI Models

Zhangyu Wang, Nemin Wu, Qian Cao, Jiangnan Xia, Zeping Liu, Yiqun Xie, Akshay Nambi, Tanuja Ganu, Ni Lao, Ninghao Liu, Gengchen Mai

Main category: cs.AI

TL;DR: The paper introduces GeoBS, an information-theoretic framework for evaluating geographic bias in AI models, addressing limitations of previous model-specific and spatially implicit measures.

Details

Motivation: Current bias evaluation methods focus on social bias but neglect geographic bias, which has unique challenges. Existing geo-bias measures are either model-specific or spatially implicit, lacking a universal framework for fair comparison across different AI models.

Method: Established an information-theoretic framework called GeoBS for geo-bias evaluation. Proposed three novel geo-bias scores that explicitly consider spatial factors: multi-scalability, distance decay, and anisotropy.

Result: Extensive experiments on 3 tasks, 8 datasets, and 8 models showed that both task-specific GeoAI models and general-purpose foundation models suffer from various types of geo-bias.

Conclusion: The GeoBS framework advances technical understanding of geographic bias and establishes a foundation for integrating spatial fairness into AI system design, deployment, and evaluation.

Abstract: The widespread adoption of AI models, especially foundation models (FMs), has made a profound impact on numerous domains. However, it also raises significant ethical concerns, including bias issues. Although numerous efforts have been made to quantify and mitigate social bias in AI models, geographic bias (in short, geo-bias) receives much less attention, which presents unique challenges. While previous work has explored ways to quantify geo-bias, these measures are model-specific (e.g., mean absolute deviation of LLM ratings) or spatially implicit (e.g., average fairness scores of all spatial partitions). We lack a model-agnostic, universally applicable, and spatially explicit geo-bias evaluation framework that allows researchers to fairly compare the geo-bias of different AI models and to understand what spatial factors contribute to the geo-bias. In this paper, we establish an information-theoretic framework for geo-bias evaluation, called GeoBS (Geo-Bias Scores). We demonstrate the generalizability of the proposed framework by showing how to interpret and analyze existing geo-bias measures under this framework. Then, we propose three novel geo-bias scores that explicitly take intricate spatial factors (multi-scalability, distance decay, and anisotropy) into consideration. Finally, we conduct extensive experiments on 3 tasks, 8 datasets, and 8 models to demonstrate that both task-specific GeoAI models and general-purpose foundation models may suffer from various types of geo-bias. This framework will not only advance the technical understanding of geographic bias but will also establish a foundation for integrating spatial fairness into the design, deployment, and evaluation of AI systems.

[910] Accurate Predictions in Education with Discrete Variational Inference

Tom Quilter, Anastasia Ilick, Anastasia Ilick, Richard Turner

Main category: cs.AI

TL;DR: The paper introduces a large open dataset of math exam responses and a probabilistic IRT-based framework that achieves over 80% accuracy in predicting student performance, with a novel discrete variational inference method excelling in low-data settings.

Details

Motivation: To address social inequality in education by developing affordable AI tutors through improved prediction of student performance, particularly in data-sparse environments where existing platforms struggle.

Method: Uses Item Response Theory (IRT) framework with collaborative filtering models incorporating topic-level skill profiles, and introduces a novel discrete variational inference method for low-data settings.

Result: Achieves over 80% prediction accuracy on formal mathematics exams, setting a new benchmark. Surprisingly finds that a single latent ability parameter alone achieves maximum predictive accuracy.

Conclusion: The discrete variational inference framework outperforms classical IRT and matrix factorization baselines, especially in low-data settings, providing an effective solution for scalable AI tutoring systems.

Abstract: One of the largest drivers of social inequality is unequal access to personal tutoring, with wealthier individuals able to afford it, while the majority cannot. Affordable, effective AI tutors offer a scalable solution. We focus on adaptive learning, predicting whether a student will answer a question correctly, a key component of any effective tutoring system. Yet many platforms struggle to achieve high prediction accuracy, especially in data-sparse settings. To address this, we release the largest open dataset of professionally marked formal mathematics exam responses to date. We introduce a probabilistic modelling framework rooted in Item Response Theory (IRT) that achieves over 80 percent accuracy, setting a new benchmark for mathematics prediction accuracy of formal exam papers. Extending this, our collaborative filtering models incorporate topic-level skill profiles, but reveal a surprising and educationally significant finding, a single latent ability parameter alone is needed to achieve the maximum predictive accuracy. Our main contribution though is deriving and implementing a novel discrete variational inference framework, achieving our highest prediction accuracy in low-data settings and outperforming all classical IRT and matrix factorisation baselines.

[911] Mapping Overlaps in Benchmarks through Perplexity in the Wild

Siyang Wu, Honglin Bao, Sida Li, Ari Holtzman, James A. Evans

Main category: cs.AI

TL;DR: The paper introduces benchmark signatures to characterize LLM benchmarks by identifying salient tokens that predict performance, revealing meaningful overlaps and limitations in current evaluation methods.

Details

Motivation: To better understand LLM benchmark validity and the underlying landscape of interconnected capabilities, addressing limitations in current benchmark agreement studies and the conflation of performance with ability.

Method: Extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks, using token perplexity from naturally authored corpora to predict benchmark performance.

Result: Benchmark signatures capture variation, overlap, and divergence better than performance or semantic similarity. Knowledge and reasoning tasks overlap significantly, while multilingual/cultural benchmarks show less similarity. Coding emerges as the least overlapping domain.

Conclusion: Benchmark signatures provide robust mechanistic insights into LLM capabilities, revealing cross-functional overlaps across logic, math, language, and world modeling, while highlighting limitations in current evaluation practices.

Abstract: We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.

[912] Dynamic Trust Calibration Using Contextual Bandits

Bruno M. Henrique, Eugene Santos Jr

Main category: cs.AI

TL;DR: Proposes a novel objective method for dynamic trust calibration between humans and AI using Contextual Bandits algorithm, showing 10-38% improvement in decision-making performance across three datasets.

Details

Motivation: Current methods for measuring human-AI trust calibration lack standardization, consistent metrics, and fail to distinguish between opinion formation and subsequent decisions, creating a need for objective measurement approaches.

Method: Uses Contextual Bandits - an adaptive algorithm that incorporates context into decision-making - to create a dynamic trust calibration indicator that assesses when to trust AI contributions based on learned contextual information.

Result: Evaluation across three diverse datasets demonstrated that effective trust calibration leads to significant improvements in decision-making performance, with 10 to 38% increase in reward metrics.

Conclusion: The proposed method enhances theoretical understanding of trust calibration and provides practical guidance for developing more trustworthy AI systems in critical domains like disease diagnosis and criminal justice.

Abstract: Trust calibration between humans and Artificial Intelligence (AI) is crucial for optimal decision-making in collaborative settings. Excessive trust can lead users to accept AI-generated outputs without question, overlooking critical flaws, while insufficient trust may result in disregarding valuable insights from AI systems, hindering performance. Despite its importance, there is currently no definitive and objective method for measuring trust calibration between humans and AI. Current approaches lack standardization and consistent metrics that can be broadly applied across various contexts, and they don’t distinguish between the formation of opinions and subsequent human decisions. In this work, we propose a novel and objective method for dynamic trust calibration, introducing a standardized trust calibration measure and an indicator. By utilizing Contextual Bandits-an adaptive algorithm that incorporates context into decision-making-our indicator dynamically assesses when to trust AI contributions based on learned contextual information. We evaluate this indicator across three diverse datasets, demonstrating that effective trust calibration results in significant improvements in decision-making performance, as evidenced by 10 to 38% increase in reward metrics. These findings not only enhance theoretical understanding but also provide practical guidance for developing more trustworthy AI systems supporting decisions in critical domains, for example, disease diagnoses and criminal justice.

[913] Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores

Ashwin Ramaswamy, Nestor Demeure, Ermal Rrapaj

Main category: cs.AI

TL;DR: LLMs can be used to evaluate other LLMs by having them judge contests between models, producing a metric that is 91% correlated with human-produced Elo scores, providing a cheap alternative to human evaluation.

Details

Motivation: There is a need for independent evaluation of new LLMs as they are released frequently, and current Elo score evaluation methods are expensive due to human involvement.

Method: Use LLMs to judge contests between other LLMs and measure the consistency with which they select a model as the best in matchups.

Result: The LLM-produced metric has 91% correlation with human-produced Elo scores.

Conclusion: This provides a simple, cheap proxy for Elo scores that doesn’t require human data or prior knowledge.

Abstract: New large language models (LLMs) are being released every day. Some perform significantly better or worse than expected given their parameter count. Therefore, there is a need for a method to independently evaluate models. The current best way to evaluate a model is to measure its Elo score by comparing it to other models in a series of contests - an expensive operation since humans are ideally required to compare LLM outputs. We observe that when an LLM is asked to judge such contests, the consistency with which it selects a model as the best in a matchup produces a metric that is 91% correlated with its own human-produced Elo score. This provides a simple proxy for Elo scores that can be computed cheaply, without any human data or prior knowledge.

[914] DOoM: Difficult Olympiads of Math

Ilya Kuleshov, Ilin Pavel, Nikolay Kompanets, Ksenia Sycheva, Aleksandr Nikolich

Main category: cs.AI

TL;DR: DOoM is a new open-source benchmark for evaluating language models on Russian mathematics and physics problems across difficulty levels from school to university Olympiad/exam questions.

Details

Motivation: To assess language model capabilities specifically for Russian mathematics and physics problem-solving across different difficulty levels.

Method: Created a benchmark with problems of varying difficulty (school to university Olympiad/entrance exams), established evaluation methodology, and tested various models.

Result: Analysis shows correlation between model performance and token count used, with performance differences observed between mathematics and physics tasks.

Conclusion: DOoM benchmark provides valuable insights into language model performance on Russian STEM problems, revealing token usage correlations and subject-specific performance variations.

Abstract: This paper introduces DOoM, a new open-source benchmark designed to assess the capabilities of language models in solving mathematics and physics problems in Russian. The benchmark includes problems of varying difficulty, ranging from school-level tasks to university Olympiad and entrance exam questions. In this paper we discuss the motivation behind its creation, describe dataset’s structure and evaluation methodology, and present initial results from testing various models. Analysis of the results shows a correlation between model performance and the number of tokens used, and highlights differences in performance between mathematics and physics tasks.

[915] Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks

Aaron Xuxiang Tian, Ruofan Zhang, Jiayao Tang, Young Min Cho, Xueqian Li, Qiang Yi, Ji Wang, Zhunping Zhang, Danrui Qi, Sharath Chandra Guntuku, Lyle Ungar, Tianyu Shi, Chi Wang

Main category: cs.AI

TL;DR: Multi-agent orchestration with multiple LLMs interacting over turns through voting achieves performance matching or exceeding the strongest single model, with analysis showing potential for further improvements.

Details

Motivation: To study how multiple LLM agents can collaborate through iterative voting and consensus-building to solve complex tasks, potentially outperforming individual models.

Method: Used four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR datasets with two experiments: benchmarking orchestration against single-LLM baselines, and ablations varying whether agents see answer authorship and ongoing votes.

Result: Orchestration matches or exceeds the strongest single model and consistently outperforms others. Analysis shows potential for further gains. Ablations reveal that revealing authorship increases self-voting and ties, while showing ongoing votes amplifies herding behavior.

Conclusion: Multi-agent orchestration is effective for collaborative problem-solving, with performance improvements over individual models, though information disclosure (authorship and ongoing votes) can influence voting behavior and convergence patterns.

Abstract: We study multi-turn multi-agent orchestration, where multiple large language model (LLM) agents interact over multiple turns by iteratively proposing answers or casting votes until reaching consensus. Using four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR, we conduct two experiments: (i) benchmarking orchestration against single-LLM baselines; and (ii) ablations on GPQA-Diamond that vary whether agents see who authored answers and whether they can observe ongoing votes. Orchestration matches or exceeds the strongest single model and consistently outperforms the others. Analysis of best-achievable orchestration performance shows potential for further gains. The ablations show that revealing authorship increases self-voting and ties, and that showing ongoing votes amplifies herding, which speeds convergence but can sometimes yield premature consensus.

[916] Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu

Main category: cs.AI

TL;DR: PASS framework uses reinforcement learning to formalize jailbreak prompts, making them stealthier and able to bypass LLM alignment defenses, then structures outputs into GraphRAG for enhanced subsequent attacks.

Details

Motivation: To uncover vulnerabilities in LLM alignment methods and address security challenges from sophisticated prompt jailbreaking attacks that make LLMs deviate from human values.

Method: Uses reinforcement learning to transform initial jailbreak prompts into formalized descriptions, then structures outputs into GraphRAG system leveraging extracted terms and formalized symbols as contextual input.

Result: Demonstrated effectiveness through extensive experiments on common open-source models, showing enhanced stealthiness and ability to bypass existing alignment defenses.

Conclusion: PASS framework successfully enhances jailbreaking attacks on LLMs by formalizing prompts and leveraging structured outputs, revealing vulnerabilities in current alignment methods.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (\underline{P}rompt J\underline{a}ilbreaking via \underline{S}emantic and \underline{S}tructural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. The jailbreak outputs are then structured into a GraphRAG system that, by leveraging extracted relevant terms and formalized symbols as contextual input alongside the original query, strengthens subsequent attacks and facilitates more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.

[917] A Hierarchical Structure-Enhanced Personalized Recommendation Model for Traditional Chinese Medicine Formulas Based on KG Diffusion Guidance

ChaoBo Zhang, Long Tan

Main category: cs.AI

TL;DR: TCM-HEDPR is a hierarchical structure-enhanced personalized recommendation model for TCM formulas that addresses limitations in previous AI systems by incorporating patient personalization, handling long-tailed herb distributions, and considering herb compatibility principles through knowledge graph diffusion guidance.

Details

Motivation: Previous TCM prescription AI systems have limitations: insufficient attention to patient personalization (age, BMI, medical history), long-tailed distribution of herb data causing training biases, and oversight of herb compatibility principles ('monarch, minister, assistant and envoy') which increases toxicity risks and contradicts TCM clinical principles.

Method: The model pre-trains symptom representations using patient-personalized prompt sequences with prompt-oriented contrastive learning for data augmentation. It uses KG-guided homogeneous graph diffusion with self-attention to capture non-linear symptom-herb relationships globally. A heterogeneous graph hierarchical network integrates herbal dispensing relationships with implicit syndromes to guide prescription generation and mitigate long-tailed data problems.

Result: Extensive experiments on two public datasets and one clinical dataset demonstrate the effectiveness of TCM-HEDPR. The model incorporates insights from modern medicine and network pharmacology for comprehensive prescription evaluation.

Conclusion: TCM-HEDPR provides a new paradigm for modern TCM recommendation by addressing key limitations in existing approaches and ensuring clinically relevant, personalized prescription generation that respects TCM principles.

Abstract: Artificial intelligence technology plays a crucial role in recommending prescriptions for traditional Chinese medicine (TCM). Previous studies have made significant progress by focusing on the symptom-herb relationship in prescriptions. However, several limitations hinder model performance: (i) Insufficient attention to patient-personalized information such as age, BMI, and medical history, which hampers accurate identification of syndrome and reduces efficacy. (ii) The typical long-tailed distribution of herb data introduces training biases and affects generalization ability. (iii) The oversight of the ‘monarch, minister, assistant and envoy’ compatibility among herbs increases the risk of toxicity or side effects, opposing the ’treatment based on syndrome differentiation’ principle in clinical TCM. Therefore, we propose a novel hierarchical structure-enhanced personalized recommendation model for TCM formulas based on knowledge graph diffusion guidance, namely TCM-HEDPR. Specifically, we pre-train symptom representations using patient-personalized prompt sequences and apply prompt-oriented contrastive learning for data augmentation. Furthermore, we employ a KG-guided homogeneous graph diffusion method integrated with a self-attention mechanism to globally capture the non-linear symptom-herb relationship. Lastly, we design a heterogeneous graph hierarchical network to integrate herbal dispensing relationships with implicit syndromes, guiding the prescription generation process at a fine-grained level and mitigating the long-tailed herb data distribution problem. Extensive experiments on two public datasets and one clinical dataset demonstrate the effectiveness of TCM-HEDPR. In addition, we incorporate insights from modern medicine and network pharmacology to evaluate the recommended prescriptions comprehensively. It can provide a new paradigm for the recommendation of modern TCM.

[918] Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Min-Hsuan Yeh, Yixuan Li

Main category: cs.AI

TL;DR: PrefCleanBench is the first comprehensive benchmark for evaluating 13 preference data cleaning methods in LLM alignment, providing standardized protocols to assess cleaning effectiveness across diverse datasets, models, and algorithms.

Details

Motivation: Human feedback for LLM alignment is often noisy and inconsistent, degrading reward model quality. Existing automated data cleaning methods lack systematic evaluation of their effectiveness and generalizability.

Method: Created PrefCleanBench with standardized protocols to evaluate 13 preference data cleaning methods across diverse datasets, model architectures, and optimization algorithms.

Result: The benchmark enables rigorous comparison of cleaning methods, uncovering key factors that determine successful data cleaning in alignment tasks, and provides modular implementations for further research.

Conclusion: PrefCleanBench establishes groundwork for principled approaches to improve LLM alignment through better data quality, highlighting the crucial role of data preprocessing in responsible AI development.

Abstract: Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality-highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.

[919] BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving

Shu Liu, Wenlin Chen, Weihao Li, Zheng Wang, Lijin Yang, Jianing Huang, Yipin Zhang, Zhongzhan Huang, Ze Cheng, Hao Yang

Main category: cs.AI

TL;DR: BridgeDrive is a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning in autonomous driving that addresses limitations of previous diffusion-based planners by providing principled guidance through expert driving behaviors without theoretical inconsistencies.

Details

Motivation: Existing diffusion-based planners struggle with effective guidance in reactive, closed-loop driving environments. Simple conditioning fails in complex scenarios, while recent anchor-based approaches rely on truncated schedules that introduce theoretical problems and performance compromises.

Method: BridgeDrive uses a diffusion bridge policy that translates expert driving anchors into fine-grained trajectory plans while responding appropriately to varying traffic conditions. The approach is compatible with efficient ODE solvers for real-time deployment.

Result: The method achieves state-of-the-art performance on the Bench2Drive benchmark, improving success rate by 5% over prior approaches.

Conclusion: BridgeDrive provides a principled diffusion framework that effectively guides trajectory planning in autonomous driving through expert anchors while maintaining theoretical consistency and enabling real-time performance.

Abstract: Diffusion-based planners have shown great promise for autonomous driving due to their ability to capture multi-modal driving behaviors. However, guiding these models effectively in reactive, closed-loop environments remains a significant challenge. Simple conditioning often fails to provide sufficient guidance in complex and dynamic driving scenarios. Recent work attempts to use typical expert driving behaviors (i.e., anchors) to guide diffusion models but relies on a truncated schedule, which introduces theoretical inconsistencies and can compromise performance. To address this, we introduce BridgeDrive, a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning. Our approach provides a principled diffusion framework that effectively translates anchors into fine-grained trajectory plans, appropriately responding to varying traffic conditions. Our planner is compatible with efficient ODE solvers, a critical factor for real-time autonomous driving deployment. We achieve state-of-the-art performance on the Bench2Drive benchmark, improving the success rate by 5% over prior arts.

[920] PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents

Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, Philip S. Yu

Main category: cs.AI

TL;DR: PSG-Agent is a personalized and dynamic guardrail system for LLM-based agents that addresses limitations of uniform policies and isolated response checking by creating user-specific risk thresholds and implementing continuous monitoring across agent pipelines.

Details

Motivation: Existing guardrails apply uniform policies to all users and check responses in isolation, ignoring that the same agent behavior can harm some users while being safe for others, and missing how risks evolve across multiple interactions.

Method: PSG-Agent creates personalized guardrails by mining interaction history for stable traits and capturing real-time states from queries, then implements continuous monitoring with specialized guards (Plan Monitor, Tool Firewall, Response Guard, Memory Guardian) that track cross-turn risk accumulation.

Result: PSG-Agent significantly outperforms existing agent guardrails including LlamaGuard3 and AGrail in multiple scenarios including healthcare, finance, and daily life automation with diverse user profiles.

Conclusion: PSG-Agent provides an executable and auditable path toward personalized safety for LLM-based agents by addressing fundamental limitations of existing guardrail systems.

Abstract: Effective guardrails are essential for safely deploying LLM-based agents in critical applications. Despite recent advances, existing guardrails suffer from two fundamental limitations: (i) they apply uniform guardrail policies to all users, ignoring that the same agent behavior can harm some users while being safe for others; (ii) they check each response in isolation, missing how risks evolve and accumulate across multiple interactions. To solve these issues, we propose PSG-Agent, a personalized and dynamic system for LLM-based agents. First, PSG-Agent creates personalized guardrails by mining the interaction history for stable traits and capturing real-time states from current queries, generating user-specific risk thresholds and protection strategies. Second, PSG-Agent implements continuous monitoring across the agent pipeline with specialized guards, including Plan Monitor, Tool Firewall, Response Guard, Memory Guardian, that track cross-turn risk accumulation and issue verifiable verdicts. Finally, we validate PSG-Agent in multiple scenarios including healthcare, finance, and daily life automation scenarios with diverse user profiles. It significantly outperform existing agent guardrails including LlamaGuard3 and AGrail, providing an executable and auditable path toward personalized safety for LLM-based agents.

[921] Reasoning Scaffolding: Distilling the Flow of Thought from LLMs

Xiangyu Wen, Junhua Huang, Zeju Li, Min Li, Jianyuan Zhong, Zhijian Xu, Mingxuan Yuan, Yongxiang Huang, Qiang Xu

Main category: cs.AI

TL;DR: The paper introduces Reasoning Scaffolding, a framework that distills algorithmic reasoning structure from LLMs to SLMs through semantic signals rather than behavioral cloning of text, achieving better accuracy and logical consistency.

Details

Motivation: Current behavioral cloning approaches for distilling reasoning from LLMs to SLMs only teach surface-level text patterns, lacking logical robustness and failing to transfer the underlying algorithmic structure of thought.

Method: Reasoning Scaffolding abstracts teacher’s reasoning into discrete semantic signals (e.g., Contrast, Addition) and trains students via multi-task learning to predict next signals and generate corresponding reasoning steps.

Result: The method significantly outperforms state-of-the-art distillation approaches on challenging reasoning benchmarks in both accuracy and logical consistency.

Conclusion: Reasoning Scaffolding provides a path to create smaller models that are genuine reasoners rather than just fluent mimics by directly transferring algorithmic reasoning structure.

Abstract: The prevailing approach to distilling reasoning from Large Language Models (LLMs)-behavioral cloning from textual rationales-is fundamentally limited. It teaches Small Language Models (SLMs) to mimic surface-level patterns rather than the underlying algorithmic structure of thought, resulting in a critical lack of logical robustness. We argue that instead of cloning text, distillation should transfer this algorithmic structure directly. We introduce Reasoning Scaffolding}, a framework that reframes reasoning as a structured generation process. Our method first abstracts the teacher’s thought process into a sequence of discrete, interpretable semantic signals (e.g., Contrast, Addition) that act as a scaffold. The student model is then trained via a multi-task objective to both (1)predict the next semantic signal, anticipating the reasoning flow, and (2)generate the corresponding step, conditioned on that signal. This multi-task scheme acts as a powerful regularizer, compelling the student to internalize the computational patterns of coherent reasoning. On a suite of challenging reasoning benchmarks, our method significantly outperforms state-of-the-art distillation in both accuracy and logical consistency, providing a path towards creating smaller models that are genuine reasoners, not just fluent mimics.

[922] How LLMs Learn to Reason: A Complex Network Perspective

Sihan Hu, Xiansheng Cai, Yuan Huang, Zhiyuan Yao, Linfeng Zhang, Pan Zhang, Youjin Deng, Kun Chen

Main category: cs.AI

TL;DR: The paper explains puzzling behaviors in RLVR training through a semantic network theory and proposes Annealed-RLVR algorithm to improve reasoning capabilities.

Details

Motivation: To understand distinctive behaviors in RLVR training like two-stage learning curves, V-shaped response lengths, and catastrophic forgetting, and develop a unifying theory to explain them.

Method: Proposes that RLVR reasoning maps to self-organization of sparse semantic networks, and introduces Annealed-RLVR with SFT-based “heating” step at maximal frustration points.

Result: Experiments on 1.5B-parameter model show Annealed-RLVR outperforms standard RLVR on both in-distribution and out-of-distribution benchmarks.

Conclusion: The work provides a physical intuition for engineering AI reasoning capabilities by recasting RLVR as structural self-organization rather than black-box optimization.

Abstract: Training large language models with Reinforcement Learning from Verifiable Rewards (RLVR) exhibits a set of distinctive and puzzling behaviors that remain poorly understood, including a two-stage learning curve, V-shaped response-length trajectories, and a pronounced vulnerability to catastrophic forgetting. In this work, we propose that these seemingly disparate phenomena can be explained using a single unifying theory: the model’s reasoning process maps to the self-organization of a semantic complex network whose topology remains persistently sparse, with the average degree pinned close to two. This topology imposes a fundamental mechanism for forgetting and learning: it first drives the system into a maximally frustrated state where skill islands'' form, slow-learning happens, and forgetting is induced; then it enters a sharp growth phase where the new skills are bolted on’’, driven by phase-transition-like learning at the web’s frontier. Equipped with the theory, we propose \textit{Annealed-RLVR}, a principled algorithm that introduces an SFT-based ``heating’’ step at the point of maximal frustration to resolve the competitive bottleneck and enhance the reasoning capability of the model. Experiments on a 1.5B-parameter model demonstrate that the approach outperforms standard RLVR on both in-distribution and out-of-distribution benchmarks. By recasting RLVR from black-box optimization into a predictable process of structural self-organization, our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.

[923] Game-Oriented ASR Error Correction via RAG-Enhanced LLM

Yan Jiang, Yongle Luo, Qixian Zhou, Elvis S. Liu

Main category: cs.AI

TL;DR: GO-AEC framework improves ASR accuracy for gaming voice chat by combining LLMs, RAG, and data augmentation to handle gaming-specific challenges like short phrases, jargon, and noise.

Details

Motivation: General ASR systems perform poorly in gaming scenarios due to short phrases, rapid speech, gaming jargon, and background noise, leading to frequent recognition errors that hinder team coordination.

Method: Proposes GO-AEC framework integrating large language models, Retrieval-Augmented Generation (RAG), and data augmentation using LLMs and TTS. Includes data augmentation, N-best hypothesis-based correction, and dynamic game knowledge base.

Result: Experiments show GO-AEC reduces character error rate by 6.22% and sentence error rate by 29.71%, significantly improving ASR accuracy in gaming scenarios.

Conclusion: GO-AEC effectively addresses gaming-specific ASR challenges and substantially improves recognition accuracy for real-time voice communication in multiplayer online games.

Abstract: With the rise of multiplayer online games, real-time voice communication is essential for team coordination. However, general ASR systems struggle with gaming-specific challenges like short phrases, rapid speech, jargon, and noise, leading to frequent errors. To address this, we propose the GO-AEC framework, which integrates large language models, Retrieval-Augmented Generation (RAG), and a data augmentation strategy using LLMs and TTS. GO-AEC includes data augmentation, N-best hypothesis-based correction, and a dynamic game knowledge base. Experiments show GO-AEC reduces character error rate by 6.22% and sentence error rate by 29.71%, significantly improving ASR accuracy in gaming scenarios.

[924] From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models

Jue Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Main category: cs.AI

TL;DR: This paper investigates how Large Reasoning Models use explicit reasoning traces to generate answers, showing through empirical evaluation, attention analysis, and mechanistic interventions that reasoning tokens functionally influence answer generation.

Details

Motivation: To understand the unclear relationship between explicit reasoning traces and answer generation in Large Reasoning Models, specifically examining whether reasoning traces actually influence the final answers.

Method: Three-stage investigation: 1) Empirical evaluation of reasoning inclusion effects, 2) Attention analysis identifying Reasoning-Focus Heads, 3) Mechanistic interventions using activation patching to test reasoning-answer dependence.

Result: Including reasoning improves answer quality; answer tokens attend to reasoning tokens; perturbations to key reasoning tokens reliably alter final answers, confirming directional information flow from reasoning to answer.

Conclusion: LRMs functionally leverage reasoning tokens for answer generation, with intermediate reasoning playing a crucial role in shaping model outputs, establishing a clear directional flow from reasoning to answer.

Abstract: Large Reasoning Models (LRMs) generate explicit reasoning traces alongside final answers, yet the extent to which these traces influence answer generation remains unclear. In this work, we conduct a three-stage investigation into the interplay between reasoning and answer generation in three distilled DeepSeek R1 models. First, through empirical evaluation, we demonstrate that including explicit reasoning consistently improves answer quality across diverse domains. Second, attention analysis reveals that answer tokens attend substantially to reasoning tokens, with certain mid-layer Reasoning-Focus Heads (RFHs) closely tracking the reasoning trajectory, including self-reflective cues. Third, we apply mechanistic interventions using activation patching to assess the dependence of answer tokens on reasoning activations. Our results show that perturbations to key reasoning tokens can reliably alter the final answers, confirming a directional and functional flow of information from reasoning to answer. These findings deepen our understanding of how LRMs leverage reasoning tokens for answer generation, highlighting the functional role of intermediate reasoning in shaping model outputs. Our data and code are publicly available at \href{https://aka.ms/R2A-code}{this URL}.

[925] SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu

Main category: cs.AI

TL;DR: Search agents connecting LLMs to the Internet face safety threats from unreliable search results. The paper introduces an automated red-teaming framework and SafeSearch benchmark to systematically evaluate search agent vulnerabilities.

Details

Motivation: Unreliable search results pose safety threats to LLM-based search agents, creating a new threat surface that needs systematic assessment.

Method: Developed an automated red-teaming framework and constructed SafeSearch benchmark with 300 test cases covering 5 risk categories. Evaluated 3 search agent scaffolds across 15 LLMs (7 proprietary, 8 open-source).

Result: Found substantial vulnerabilities - GPT-4.1-mini reached 90.5% attack success rate when exposed to unreliable websites. Common defenses like reminder prompting showed limited effectiveness.

Conclusion: The framework provides valuable transparency for safer agent development, highlighting the urgent need for robust safety measures in search agents.

Abstract: Search agents connect LLMs to the Internet, enabling access to broader and more up-to-date information. However, unreliable search results may also pose safety threats to end users, establishing a new threat surface. In this work, we conduct two in-the-wild experiments to demonstrate both the prevalence of low-quality search results and their potential to misguide agent behaviors. To counter this threat, we introduce an automated red-teaming framework that is systematic, scalable, and cost-efficient, enabling lightweight and harmless safety assessments of search agents. Building on this framework, we construct the SafeSearch benchmark, which includes 300 test cases covering five categories of risks (e.g., misinformation and indirect prompt injection). Using this benchmark, we evaluate three representative search agent scaffolds, covering search workflow, tool-calling, and deep research, across 7 proprietary and 8 open-source backend LLMs. Our results reveal substantial vulnerabilities of LLM-based search agents: when exposed to unreliable websites, the highest ASR reached 90.5% for GPT-4.1-mini under a search workflow setting. Moreover, our analysis highlights the limited effectiveness of common defense practices, such as reminder prompting. This emphasizes the value of our framework in promoting transparency for safer agent development. Our codebase and test cases are publicly available: https://github.com/jianshuod/SafeSearch.

[926] Measuring Sparse Autoencoder Feature Sensitivity

Claire Tian, Katherine Tian, Nathan Hu

Main category: cs.AI

TL;DR: The paper introduces a method to evaluate feature sensitivity in Sparse Autoencoders (SAEs), showing that many interpretable features have poor sensitivity despite appearing monosemantic, and that sensitivity decreases with increasing SAE width.

Details

Motivation: Current SAE feature analysis focuses on activating examples but doesn't reveal feature sensitivity - how reliably features activate on semantically similar text. This gap limits understanding of feature quality.

Method: Developed a scalable method using language models to generate text with same semantic properties as feature’s activating examples, then test whether features activate on these generated texts.

Result: Many interpretable features have poor sensitivity; human evaluation confirms generated text genuinely resembles original examples when features fail to activate; average feature sensitivity declines with increasing SAE width across 7 SAE variants.

Conclusion: Feature sensitivity represents a new dimension for evaluating both individual features and SAE architectures, complementing existing interpretability measures.

Abstract: Sparse Autoencoder (SAE) features have become essential tools for mechanistic interpretability research. SAE features are typically characterized by examining their activating examples, which are often “monosemantic” and align with human interpretable concepts. However, these examples don’t reveal feature sensitivity: how reliably a feature activates on texts similar to its activating examples. In this work, we develop a scalable method to evaluate feature sensitivity. Our approach avoids the need to generate natural language descriptions for features; instead we use language models to generate text with the same semantic properties as a feature’s activating examples. We then test whether the feature activates on these generated texts. We demonstrate that sensitivity measures a new facet of feature quality and find that many interpretable features have poor sensitivity. Human evaluation confirms that when features fail to activate on our generated text, that text genuinely resembles the original activating examples. Lastly, we study feature sensitivity at the SAE level and observe that average feature sensitivity declines with increasing SAE width across 7 SAE variants. Our work establishes feature sensitivity as a new dimension for evaluating both individual features and SAE architectures.

[927] MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models

Siqi Ma, Jiajie Huang, Bolin Yang, Fan Zhang, Jinlin Wu, Yue Shen, Guohui Fan, Zhu Zhang, Zelin Zang

Main category: cs.AI

TL;DR: MedLA is a logic-driven multi-agent framework that uses syllogistic triads and graph-guided discussions to improve medical reasoning by detecting and resolving logical inconsistencies.

Details

Motivation: Existing multi-agent approaches for medical QA have limitations in detecting fine-grained logical inconsistencies due to fixed roles and shallow interaction prompts.

Method: Uses multiple agents that organize reasoning into explicit logical trees based on syllogistic triads, engaging in multi-round graph-guided discussions to compare and refine logic trees through error correction and contradiction resolution.

Result: Consistently outperforms static role-based systems and single-agent baselines on MedDDx and standard medical QA tasks, achieving state-of-the-art performance across both open-source and commercial LLM backbones.

Conclusion: MedLA provides a generalizable paradigm for trustworthy medical reasoning that scales effectively and enables transparent inference through premise-level alignment.

Abstract: Answering complex medical questions requires not only domain expertise and patient-specific information, but also structured and multi-perspective reasoning. Existing multi-agent approaches often rely on fixed roles or shallow interaction prompts, limiting their ability to detect and resolve fine-grained logical inconsistencies. To address this, we propose \textsc{MedLA}, a logic-driven multi-agent framework built on large language models. Each agent organizes its reasoning process into an explicit logical tree based on syllogistic triads (major premise, minor premise, and conclusion), enabling transparent inference and premise-level alignment. Agents engage in a multi-round, graph-guided discussion to compare and iteratively refine their logic trees, achieving consensus through error correction and contradiction resolution. We demonstrate that \textsc{MedLA} consistently outperforms both static role-based systems and single-agent baselines on challenging benchmarks such as MedDDx and standard medical QA tasks. Furthermore, \textsc{MedLA} scales effectively across both open-source and commercial LLM backbones, achieving state-of-the-art performance and offering a generalizable paradigm for trustworthy medical reasoning.

[928] EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia

Main category: cs.AI

TL;DR: EAPO is a novel RL framework that enhances LLM reasoning by incorporating multi-turn interactions with external experts during training, enabling the policy to adaptively consult experts and internalize expert knowledge.

Details

Motivation: Existing RL methods for LLMs rely on outcome-based supervision leading to inefficient exploration and sparse rewards. The goal is to enhance exploration by leveraging external expert knowledge.

Method: EAPO framework uses multi-turn interactions with external experts during RL training, allowing the policy to determine when and how to consult experts, which provides richer reward signals and more reliable reasoning trajectories.

Result: EAPO outperforms expert-assisted workflow, expert-distilled models, and RL baselines on mathematical reasoning benchmarks (AIME 2024, AIME 2025, AIMO 2025) with an average gain of 5 points over self-exploratory models.

Conclusion: The proposed EAPO framework successfully internalizes expert knowledge into policy models, amplifying their inherent reasoning capabilities and producing improved reasoning paths and more accurate solutions without requiring external assistance during evaluation.

Abstract: Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning, often leading to inefficient exploration and sparse rewards. To mitigate this issue, we propose Expert-Assisted Policy Optimization (EAPO), a novel RL framework that enhances exploration by incorporating multi-turn interactions with external experts during training. Unlike prior methods, where policies reason in isolation, EAPO incentivizes the policy to adaptively determine when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories. External assistance ultimately internalizes expert knowledge into the policy model, amplifying the model’s inherent reasoning capabilities. During evaluation, the policy model has been well-optimized to solve questions independently, producing improved reasoning paths and more accurate solutions. Experiments on mathematical reasoning benchmarks, including AIME 2024, AIME 2025, and AIMO 2025, show that EAPO consistently outperforms expert-assisted workflow, expert-distilled models, and RL baselines, with an average gain of 5 points over self-exploratory models.

[929] Diagnosing Failure Root Causes in Platform-Orchestrated Agentic Systems: Dataset, Taxonomy, and Benchmark

Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, Qing Wang

Main category: cs.AI

TL;DR: This paper studies root cause identification for platform-orchestrated agentic systems, creating a dataset (AgentFail) with 307 failure logs, developing a taxonomy for failure causes, and benchmarking LLM-based root cause identification.

Details

Motivation: Platform-orchestrated agentic systems are increasingly used for complex tasks but are fragile, and there's no systematic way to identify their failure root causes.

Method: Constructed AgentFail dataset with 307 failure logs from 10 agentic systems, used counterfactual reasoning for reliable annotation, developed taxonomy for failure causes, and created LLM-based benchmark for root cause identification.

Result: The taxonomy improves LLM performance for root cause identification, but accuracy reaches only 33.6%, showing the task remains challenging. Analysis reveals failure distribution across platforms and task domains.

Conclusion: Provides foundational dataset, taxonomy, and benchmark for advancing reliable agentic systems development, with actionable guidelines for building such systems.

Abstract: Agentic systems consisting of multiple LLM-driven agents coordinating through tools and structured interactions, are increasingly deployed for complex reasoning and problem-solving tasks. At the same time, emerging low-code and template-based agent development platforms (e.g., Dify) enable users to rapidly build and orchestrate agentic systems, which we refer to as platform-orchestrated agentic systems. However, these systems are also fragile and it remains unclear how to systematically identify their potential failure root cause. This paper presents a study of root cause identification of these platform-orchestrated agentic systems. To support this initiative, we construct a dataset AgentFail containing 307 failure logs from ten agentic systems, each with fine-grained annotations linking failures to their root causes. We additionally utilize counterfactual reasoning-based repair strategy to ensure the reliability of the annotation. Building on the dataset, we develop a taxonomy that characterizes failure root causes and analyze their distribution across different platforms and task domains. Furthermore, we introduce a benchmark that leverages LLMs for automatically identifying root causes, in which we also utilize the proposed taxonomy as guidance for LLMs. Results show that the taxonomy can largely improve the performance, thereby confirming its utility. Nevertheless, the accuracy of root cause identification reaches at most 33.6%, which indicates that this task still remains challenging. In light of these results, we also provide actionable guidelines for building such agentic systems. In summary, this paper provides a reliable dataset of failure root cause for platform-orchestrated agentic systems, corresponding taxonomy and benchmark, which serves as a foundation for advancing the development of more reliable agentic systems.

[930] GUI-Shepherd: Reliable Process Reward and Verification for Long-Sequence GUI Tasks

Cong Chen, Kaixiang Ji, Hao Zhong, Muzhi Zhu, Anzhou Li, Guo Gan, Ziyuan Huang, Cheng Zou, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen

Main category: cs.AI

TL;DR: GUI-Shepherd is a Process Reward Model that provides dense, step-by-step feedback to guide autonomous GUI agents, addressing sparse rewards and credit assignment problems in long-sequence tasks.

Details

Motivation: Autonomous agents for long-sequence GUI tasks face challenges with sparse rewards and intractable credit assignment problems, which hinder their performance.

Method: GUI-Shepherd is trained on a large-scale dataset of 52k interactions with human-annotated scores and GPT-4o generated rationales. It serves as both a reward provider for RL training and a verifier for inference.

Result: On AndroidWorld benchmark, GUI-Shepherd improves success rate by 7.7 points via multi-turn online PPO, outperforming Outcome Reward Model competitors. As an inference verifier, it brings 5.1 points improvements. On AndroidControl benchmark, gains are 2.2 points as reward provider and 4.3 points as verifier.

Conclusion: High-fidelity process supervision is critical for building more capable GUI agents, and GUI-Shepherd presents a generalizable solution that significantly improves performance across diverse settings.

Abstract: Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides dense, step-by-step feedback to guide agents. GUI-Shepherd is trained on a diverse large-scale data set of $52$k interactions that features human-annotated scores and GPT-4o generated rationales, enabling it to serve both as a reward provider for RL training and as a verifier for inference. As far as we know, we are the first to conduct a systematic study of process supervision in GUI agents, across diverse settings from online long-horizon tasks to offline single-step prediction. On the online AndroidWorld benchmark, GUI-Shepherd improves success rate by $7.7$ points via multi-turn online PPO, significantly outperforming Outcome Reward Model based competitors. When used as an inference verifier, it brings $5.1$ points improvements. The benefits generalize to the offline AndroidControl benchmark, with gains of $2.2$ points as a reward provider and $4.3$ points as a verifier. Collectively, our results establish that high-fidelity process supervision is critical for building more capable GUI agents and present a generalizable solution.

[931] Transparent Visual Reasoning via Object-Centric Agent Collaboration

Benjamin Teoh, Ben Glocker, Francesca Toni, Avinash Kori

Main category: cs.AI

TL;DR: OCEAN is an interpretable AI framework that uses object-centric representations and multi-agent negotiation to produce human-understandable visual explanations.

Details

Motivation: To address the challenge of creating explanations grounded in human-understandable concepts in visual AI systems.

Method: Uses object-centric representations and a transparent multi-agent reasoning process with game-theoretic negotiation to achieve coherent and discriminative evidence.

Result: Achieves competitive performance with state-of-the-art black-box models while providing faithful reasoning, validated by user studies showing participants rated OCEAN’s explanations as more intuitive and trustworthy.

Conclusion: OCEAN provides a successful framework for interpretable AI that combines competitive performance with human-understandable explanations through object-centric representations and multi-agent negotiation.

Abstract: A central challenge in explainable AI, particularly in the visual domain, is producing explanations grounded in human-understandable concepts. To tackle this, we introduce OCEAN (Object-Centric Explananda via Agent Negotiation), a novel, inherently interpretable framework built on object-centric representations and a transparent multi-agent reasoning process. The game-theoretic reasoning process drives agents to agree on coherent and discriminative evidence, resulting in a faithful and interpretable decision-making process. We train OCEAN end-to-end and benchmark it against standard visual classifiers and popular posthoc explanation tools like GradCAM and LIME across two diagnostic multi-object datasets. Our results demonstrate competitive performance with respect to state-of-the-art black-box models with a faithful reasoning process, which was reflected by our user study, where participants consistently rated OCEAN’s explanations as more intuitive and trustworthy.

[932] From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning

Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin

Main category: cs.AI

TL;DR: ChemMAS is a multi-agent system that provides explainable reaction condition recommendations through evidence-based reasoning and agentic debate.

Details

Motivation: Existing LLM-based methods for chemical reaction condition recommendation lack explainability, limiting their utility in high-stakes scientific workflows where understanding the rationale behind recommendations is crucial.

Method: ChemMAS decomposes condition prediction into four stages: mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation, with each decision backed by interpretable justifications from chemical knowledge and retrieved precedents.

Result: ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy.

Conclusion: ChemMAS establishes a new paradigm for explainable AI in scientific discovery by providing falsifiable, human-trustable rationales for reaction condition recommendations.

Abstract: The chemical reaction recommendation is to select proper reaction condition parameters for chemical reactions, which is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy, while offering falsifiable, human-trustable rationales, which establishes a new paradigm for explainable AI in scientific discovery.

Qi Xue, Minrui Jiang, Runjia Zhang, Xiurui Xie, Pei Ke, Guisong Liu

Main category: cs.AI

TL;DR: Falcon is a large-scale vision-language safety dataset and FalconEye is a specialized evaluator for identifying harmful content in multimodal language models, outperforming existing benchmarks.

Details

Motivation: Current methods for evaluating harmful content in multimodal large language models (MLLMs) are underdeveloped and lack depth, particularly overlooking the role of visual information in content moderation.

Method: Created Falcon dataset with 57,515 VQA pairs across 13 harm categories, and developed FalconEye evaluator fine-tuned from Qwen2.5-VL-7B using the Falcon dataset.

Result: FalconEye reliably identifies harmful content in multimodal dialogue scenarios and outperforms all baselines in overall accuracy across Falcon-test, VLGuard, and Beavertail-V benchmarks.

Conclusion: FalconEye serves as a practical safety auditing tool for MLLMs, demonstrating the importance of comprehensive vision-language safety evaluation.

Abstract: Existing methods for evaluating the harmfulness of content generated by large language models (LLMs) have been well studied. However, approaches tailored to multimodal large language models (MLLMs) remain underdeveloped and lack depth. This work highlights the crucial role of visual information in moderating content in visual question answering (VQA), a dimension often overlooked in current research. To bridge this gap, we introduce Falcon, a large-scale vision-language safety dataset containing 57,515 VQA pairs across 13 harm categories. The dataset provides explicit annotations for harmful attributes across images, instructions, and responses, thereby facilitating a comprehensive evaluation of the content generated by MLLMs. In addition, it includes the relevant harm categories along with explanations supporting the corresponding judgments. We further propose FalconEye, a specialized evaluator fine-tuned from Qwen2.5-VL-7B using the Falcon dataset. Experimental results demonstrate that FalconEye reliably identifies harmful content in complex and safety-critical multimodal dialogue scenarios. It outperforms all other baselines in overall accuracy across our proposed Falcon-test dataset and two widely-used benchmarks-VLGuard and Beavertail-V, underscoring its potential as a practical safety auditing tool for MLLMs.

[934] AnveshanaAI: A Multimodal Platform for Adaptive AI/ML Education through Automated Question Generation and Interactive Assessment

Rakesh Thakur, Diksha Khandelwal, Shreya Tiwari

Main category: cs.AI

TL;DR: AnveshanaAI is a gamified AI learning platform with personalized dashboards, adaptive assessments based on Bloom’s taxonomy, and features like playgrounds, challenges, and community collaboration to enhance engagement and learning.

Details

Motivation: To address limitations of static question repositories in existing platforms and provide a more engaging, adaptive, and transparent AI education experience through gamification and explainable AI techniques.

Method: Uses personalized dashboards with streaks, levels, badges; gamified tracking; structured navigation across AI domains; Bloom’s taxonomy-based dataset; semantic similarity checks; explainable AI techniques; adaptive and domain-aware assessment methods.

Result: Experiments show broad dataset coverage, stable fine-tuning with reduced perplexity, and measurable gains in learner engagement.

Conclusion: AnveshanaAI successfully integrates adaptivity, gamification, interactivity, and explainability to support next-generation AI education.

Abstract: We propose AnveshanaAI, an application-based learning platform for artificial intelligence. With AnveshanaAI, learners are presented with a personalized dashboard featuring streaks, levels, badges, and structured navigation across domains such as data science, machine learning, deep learning, transformers, generative AI, large language models, and multimodal AI, with scope to include more in the future. The platform incorporates gamified tracking with points and achievements to enhance engagement and learning, while switching between Playground, Challenges, Simulator, Dashboard, and Community supports exploration and collaboration. Unlike static question repositories used in existing platforms, AnveshanaAI ensures balanced learning progression through a dataset grounded in Bloom’s taxonomy, with semantic similarity checks and explainable AI techniques improving transparency and reliability. Adaptive, automated, and domain-aware assessment methods are also employed. Experiments demonstrate broad dataset coverage, stable fine-tuning with reduced perplexity, and measurable gains in learner engagement. Together, these features illustrate how AnveshanaAI integrates adaptivity, gamification, interactivity, and explainability to support next-generation AI education.

[935] Mix-Ecom: Towards Mixed-Type E-Commerce Dialogues with Complex Domain Rules

Chenyu Zhou, Xiaoming Shi, Hui Qiu, Xiawu Zheng, Haitao Leng, Yankai Jiang, Shaoguo Liu, Tingting Gao, Rongrong Ji

Main category: cs.AI

TL;DR: Mix-ECom is a novel benchmark dataset for evaluating e-commerce agents, featuring mixed dialogue types and complex domain rules based on real-world customer service dialogues.

Details

Motivation: Current e-commerce agent benchmarks lack evaluation of agents' capability to handle mixed-type dialogues and complex domain rules, limiting their real-world applicability.

Method: Constructed Mix-ECom corpus from real-world customer-service dialogues with privacy removal and CoT process addition, containing 4,799 samples covering 4 dialogue types, 3 e-commerce task types, and 82 domain rules.

Result: Current e-commerce agents show insufficient capabilities in handling e-commerce dialogues, primarily due to hallucinations caused by complex domain rules.

Conclusion: Mix-ECom addresses the gap in evaluating mixed-type e-commerce dialogues and complex domain rules, providing a benchmark for improving e-commerce agent performance through a proposed dynamic framework.

Abstract: E-commerce agents contribute greatly to helping users complete their e-commerce needs. To promote further research and application of e-commerce agents, benchmarking frameworks are introduced for evaluating LLM agents in the e-commerce domain. Despite the progress, current benchmarks lack evaluating agents’ capability to handle mixed-type e-commerce dialogue and complex domain rules. To address the issue, this work first introduces a novel corpus, termed Mix-ECom, which is constructed based on real-world customer-service dialogues with post-processing to remove user privacy and add CoT process. Specifically, Mix-ECom contains 4,799 samples with multiply dialogue types in each e-commerce dialogue, covering four dialogue types (QA, recommendation, task-oriented dialogue, and chit-chat), three e-commerce task types (pre-sales, logistics, after-sales), and 82 e-commerce rules. Furthermore, this work build baselines on Mix-Ecom and propose a dynamic framework to further improve the performance. Results show that current e-commerce agents lack sufficient capabilities to handle e-commerce dialogues, due to the hallucination cased by complex domain rules. The dataset will be publicly available.

[936] AgentGuard: Runtime Verification of AI Agents

Roham Koohestani

Main category: cs.AI

TL;DR: AgentGuard is a runtime verification framework for Agentic AI systems that provides probabilistic assurance through dynamic modeling and real-time verification.

Details

Motivation: Traditional verification methods are inadequate for autonomous AI systems due to their unpredictability and emergent behaviors, requiring a shift to probabilistic guarantees.

Method: AgentGuard operates as an inspection layer that abstracts agent I/O into formal events, builds dynamic MDP models using online learning, and performs real-time probabilistic model checking.

Result: The framework enables continuous, quantitative assurance by formally modeling emergent agent behavior and verifying properties in real-time.

Conclusion: AgentGuard provides a new paradigm called Dynamic Probabilistic Assurance for runtime verification of Agentic AI systems.

Abstract: The rapid evolution to autonomous, agentic AI systems introduces significant risks due to their inherent unpredictability and emergent behaviors; this also renders traditional verification methods inadequate and necessitates a shift towards probabilistic guarantees where the question is no longer if a system will fail, but the probability of its failure within given constraints. This paper presents AgentGuard, a framework for runtime verification of Agentic AI systems that provides continuous, quantitative assurance through a new paradigm called Dynamic Probabilistic Assurance. AgentGuard operates as an inspection layer that observes an agent’s raw I/O and abstracts it into formal events corresponding to transitions in a state model. It then uses online learning to dynamically build and update a Markov Decision Process (MDP) that formally models the agent’s emergent behavior. Using probabilistic model checking, the framework then verifies quantitative properties in real-time.

[937] Rethinking Reward Miscalibration of GRPO in Agentic RL

Jingyu Liu, Xiaopeng Wu, Jingquan Peng, Kehan Chen, Chuan Yu, Lizhong Ding, Yong Liu

Main category: cs.AI

TL;DR: This paper challenges the common belief that outcome-based rewards cause reward miscalibration and reinforce flawed actions. Instead, it identifies gradient coupling between similar samples as the real issue in agentic RL, and proposes an actor classification method to separate good/bad action embeddings.

Details

Motivation: To address the problem of flawed actions being reinforced during training in autonomous agents solving long-horizon tasks, which is commonly attributed to outcome-based reward miscalibration.

Method: Proposes training the actor to classify good or bad actions to separate the embedding of good/bad actions and alleviate gradient interference between similar samples.

Result: Extensive experiments show the effectiveness of the proposed method in addressing gradient coupling issues.

Conclusion: Gradient coupling between similar samples, not outcome-based rewards, is the key issue causing flawed actions to be reinforced in agentic RL, and the proposed classification approach effectively mitigates this problem.

Abstract: Building autonomous agents capable of solving long-horizon, real-world tasks has garnered significant research interest. But outcome based rewards may cause reward miscalibration which means it might mistakenly allocate positive reward to flawed middle steps which is regarded as the key reason making the bad actions being reinforced during training. However we reveal that outcome based reward ensures expected negative advantage for those flawed middle steps, which means the flawed actions should be punished during training. Even accounting for the ``squeezing effect", the probability mass of good actions should increase and the actor should gradually get rid of harmful actions. This shows that flawed actions should be punished during training. We further identify gradient coupling between similar samples as a key issue in agentic RL, the input prompt is extremely similar and the output action space is limited, therefore during training, gradients from well-performing samples can inadvertently strengthen suboptimal or incorrect actions due to similar input observation and output actions. We show that with gradient coupling, some flawed actions might be enhanced. To address this, we propose training the actor to classify good or bad actions to separate the embedding of good/bad actions and alleviate the gradient interference, extensive experiments shows its effectiveness.

[938] Quant Fever, Reasoning Blackholes, Schrodinger’s Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan

Main category: cs.AI

TL;DR: Security evaluation of GPT-OSS-20B reveals multiple failure modes including quant fever, reasoning blackholes, and Schrodinger’s compliance that can be exploited through adversarial conditions.

Details

Motivation: To systematically evaluate the security vulnerabilities and failure modes of OpenAI's GPT-OSS-20B model under adversarial conditions using the Jailbreak Oracle framework.

Method: Used Jailbreak Oracle (JO), a systematic LLM evaluation tool, to probe GPT-OSS-20B’s behavior under different adversarial conditions and identify specific failure modes.

Result: Uncovered several critical failure modes: quant fever, reasoning blackholes, Schrodinger’s compliance, reasoning procedure mirage, and chain-oriented prompting that can be exploited on GPT-OSS-20B models.

Conclusion: The security evaluation demonstrates significant vulnerabilities in GPT-OSS-20B that can lead to severe consequences when exploited through identified failure modes.

Abstract: OpenAI’s GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model’s behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes including quant fever, reasoning blackholes, Schrodinger’s compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on GPT-OSS-20B models, leading to severe consequences.

Ouns El Harzli, Bernardo Cuenca Grau, Artur d’Avila Garcez, Ian Horrocks, Tarek R. Besold

Main category: cs.AI

TL;DR: This paper establishes a formal correspondence between fibring of neural networks and fibring of modal logics, using fibred models compatible with neural networks to derive logical expressiveness results for GNNs, GATs, and Transformers.

Details

Motivation: To bridge the gap between fibring of neural networks (a neurosymbolic framework) and fibring of modal logics, which was never formally established despite their conceptual similarities.

Method: Formalizing fibred models compatible with fibred neural networks, then using this correspondence to analyze logical expressiveness of various neural architectures including GNNs, GATs, and Transformer encoders.

Result: Derived non-uniform logical expressiveness results for Graph Neural Networks, Graph Attention Networks, and Transformer encoders through the established fibring correspondence.

Conclusion: The paper opens the way for using fibring as a formalism to interpret logical theories learned by neural networks using computational logic tools.

Abstract: Fibring of modal logics is a well-established formalism for combining countable families of modal logics into a single fibred language with common semantics, characterized by fibred models. Inspired by this formalism, fibring of neural networks was introduced as a neurosymbolic framework for combining learning and reasoning in neural networks. Fibring of neural networks uses the (pre-)activations of a trained network to evaluate a fibring function computing the weights of another network whose outputs are injected back into the original network. However, the exact correspondence between fibring of neural networks and fibring of modal logics was never formally established. In this paper, we close this gap by formalizing the idea of fibred models \emph{compatible} with fibred neural networks. Using this correspondence, we then derive non-uniform logical expressiveness results for Graph Neural Networks (GNNs), Graph Attention Networks (GATs) and Transformer encoders. Longer-term, the goal of this paper is to open the way for the use of fibring as a formalism for interpreting the logical theories learnt by neural networks with the tools of computational logic.

[940] Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models

Guanxu Chen, Yafu Li, Yuxian Jiang, Chen Qian, Qihan Ren, Jingyi Yang, Yu Cheng, Dongrui Liu, Jing Shao

Main category: cs.AI

TL;DR: CANON is a new advantage estimation method for reinforcement learning with verifiable rewards that amplifies target metrics without assuming their direction, improving performance on math reasoning and logic tasks while enhancing token efficiency.

Details

Motivation: Prior RLVR approaches use directional priors through reward shaping that require careful hyperparameter tuning and can be overly biased, leading to failures. There's a need for a method that can leverage training metrics without presuming their direction.

Method: CANON (Conditional advANtage estimatiON) regroups sampled responses into two groups based on higher/lower values of a target metric, measures which trend contributes to better performance through inter-group comparison, and identifies better responses within the same group.

Result: CANON based on entropy consistently outperforms prior methods across three LLMs on both math reasoning and high-complexity logic tasks. When applied to response length, it improves token efficiency and yields better performance-cost trade-offs.

Conclusion: CANON effectively amplifies the impact of target metrics without directional assumptions, providing robust performance improvements and better efficiency in reinforcement learning for LLMs.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs) has achieved remarkable progress in enhancing LLMs’ reasoning capabilities on tasks with clear correctness criteria, such as mathematical reasoning tasks. Several training metrics, such as entropy or response length, have been observed to correlate with different reasoning behaviors in reinforcement learning. Prior approaches incorporate such priors through reward or advantage shaping, which often relies on hand-crafted penalties and preferences (e.g., higher-is-better or lower-is-better). However, without careful hyperparameter tuning, these directional priors can be overly biased and may lead to failure. To this end, we introduce Conditional advANtage estimatiON (CANON), amplifying the impact of the target metric without presuming its direction. Specifically, CANON regroups the sampled responses into two groups based on the higher or lower value of a target metric, measures which metric trend contributes to better performance through inter-group comparison, and identifies the better response within the same group. In summary, CANON based on entropy consistently outperforms prior methods across three LLMs on both math reasoning and high-complexity logic tasks. When applied to response length, CANON further improves token efficiency, yielding a more favorable Pareto frontier in the performance-cost trade-off.

[941] Automatic selection of primary studies in systematic reviews with evolutionary rule-based classification

José de la Torre-López, Aurora Ramírez, José Raúl Romero

Main category: cs.AI

TL;DR: An evolutionary machine learning approach called \ourmodel uses grammar-guided genetic programming to automatically classify scientific papers as relevant or not for systematic literature reviews, combining textual and bibliometric data.

Details

Motivation: Systematic literature reviews are time-consuming, and machine learning can automate paper selection to reduce effort in identifying relevant literature from scientific databases.

Method: \ourmodel builds interpretable rule-based classifiers using grammar-guided genetic programming, allowing combination of textual information with bibliometric data not considered by state-of-the-art methods.

Result: The experiments demonstrate that accurate classifiers can be generated without impairing interpretability and using configurable information sources not previously supported.

Conclusion: The proposed evolutionary machine learning approach successfully automates paper selection for systematic reviews while maintaining interpretability and leveraging additional data sources.

Abstract: Searching, filtering and analysing scientific literature are time-consuming tasks when performing a systematic literature review. With the rise of artificial intelligence, some steps in the review process are progressively being automated. In particular, machine learning for automatic paper selection can greatly reduce the effort required to identify relevant literature in scientific databases. We propose an evolutionary machine learning approach, called \ourmodel, to automatically determine whether a paper retrieved from a literature search process is relevant. \ourmodel builds an interpretable rule-based classifier using grammar-guided genetic programming. The use of a grammar to define the syntax and the structure of the rules allows \ourmodel to easily combine the usual textual information with other bibliometric data not considered by state-of-the-art methods. Our experiments demonstrate that it is possible to generate accurate classifiers without impairing interpretability and using configurable information sources not supported so far.

[942] TusoAI: Agentic Optimization for Scientific Methods

Alistair Turcan, Kexin Huang, Lei Li, Martin Jinye Zhang

Main category: cs.AI

TL;DR: TusoAI is an agentic AI system that autonomously develops and optimizes computational methods for scientific tasks by integrating domain knowledge and performing iterative optimization, outperforming existing methods and uncovering novel biological insights.

Details

Motivation: Scientific discovery is slowed by manual development of computational tools, which is costly and time-consuming. LLMs offer capabilities in synthesizing literature, reasoning with data, and generating code, but existing systems don't effectively integrate unstructured domain-specific knowledge.

Method: TusoAI takes a scientific task description with an evaluation function, integrates domain knowledge into a knowledge tree representation, and performs iterative domain-specific optimization and model diagnosis to improve performance over candidate solutions.

Result: TusoAI outperformed state-of-the-art expert methods, MLE agents, and scientific AI agents across diverse tasks including single-cell RNA-seq data denoising and satellite-based earth monitoring. In genetics applications, it improved existing methods and uncovered novel biology including 9 new autoimmune disease-T cell subtype associations and 7 unreported disease variant-gene links.

Conclusion: TusoAI demonstrates the potential of agentic AI systems to accelerate computational method development in scientific domains, effectively integrating domain knowledge and autonomously optimizing solutions that outperform existing approaches and reveal new biological insights.

Abstract: Scientific discovery is often slowed by the manual development of computational tools needed to analyze complex experimental data. Building such tools is costly and time-consuming because scientists must iteratively review literature, test modeling and scientific assumptions against empirical data, and implement these insights into efficient software. Large language models (LLMs) have demonstrated strong capabilities in synthesizing literature, reasoning with empirical data, and generating domain-specific code, offering new opportunities to accelerate computational method development. Existing LLM-based systems either focus on performing scientific analyses using existing computational methods or on developing computational methods or models for general machine learning without effectively integrating the often unstructured knowledge specific to scientific domains. Here, we introduce TusoAI , an agentic AI system that takes a scientific task description with an evaluation function and autonomously develops and optimizes computational methods for the application. TusoAI integrates domain knowledge into a knowledge tree representation and performs iterative, domain-specific optimization and model diagnosis, improving performance over a pool of candidate solutions. We conducted comprehensive benchmark evaluations demonstrating that TusoAI outperforms state-of-the-art expert methods, MLE agents, and scientific AI agents across diverse tasks, such as single-cell RNA-seq data denoising and satellite-based earth monitoring. Applying TusoAI to two key open problems in genetics improved existing computational methods and uncovered novel biology, including 9 new associations between autoimmune diseases and T cell subtypes and 7 previously unreported links between disease variants linked to their target genes. Our code is publicly available at https://github.com/Alistair-Turcan/TusoAI.

[943] LLM/Agent-as-Data-Analyst: A Survey

Zirui Tang, Weizheng Wang, Zihang Zhou, Yang Jiao, Bangrui Xu, Boyu Niu, Xuanhe Zhou, Guoliang Li, Yeye He, Wei Zhou, Yitong Song, Cheng Tan, Bin Wang, Conghui He, Xiaoyang Wang, Fan Wu

Main category: cs.AI

TL;DR: LLM/Agent-as-Data-Analyst techniques enable complex data understanding, natural language interfaces, and autonomous pipeline orchestration for data analysis across various data modalities.

Details

Motivation: Traditional rule-based or small-model approaches are limited in handling complex data analysis tasks, while LLM/agent techniques offer superior capabilities in data understanding, semantic analysis, and autonomous workflow management.

Method: The paper reviews LLM-based techniques for different data modalities: structured data (table QA, NL2GQL), semi-structured data (markup languages, table modeling), unstructured data (chart/document understanding, vulnerability detection), and heterogeneous data (data retrieval, modality alignment).

Result: The technical evolution identifies five key design goals: semantic-aware design, modality-hybrid integration, autonomous pipelines, tool-augmented workflows, and support for open-world tasks.

Conclusion: The paper outlines remaining challenges and proposes insights and practical directions for advancing LLM/Agent-powered data analysis, suggesting continued development in this emerging field.

Abstract: Large language model (LLM) and agent techniques for data analysis (a.k.a LLM/Agent-as-Data-Analyst) have demonstrated substantial impact in both academica and industry. In comparison with traditional rule or small-model based approaches, (agentic) LLMs enable complex data understanding, natural language interfaces, semantic analysis functions, and autonomous pipeline orchestration. The technical evolution further distills five key design goals for intelligent data analysis agents, namely semantic-aware design, modality-hybrid integration, autonomous pipelines, tool-augmented workflows, and support for open-world tasks. From a modality perspective, we review LLM-based techniques for (i) structured data (e.g., table question answering for relational data and NL2GQL for graph data), (ii) semi-structured data (e.g., markup languages understanding and semi-structured table modeling), (iii) unstructured data (e.g., chart understanding, document understanding, programming languages vulnerable detection), and (iv) heterogeneous data (e.g., data retrieval and modality alignment for data lakes). Finally, we outline the remaining challenges and propose several insights and practical directions for advancing LLM/Agent-powered data analysis.

[944] Future-Proofing Programmers: Optimal Knowledge Tracing for AI-Assisted Personalized Education

Yuchen Wang, Pei-Duo Yu, Chee Wei Tan

Main category: cs.AI

TL;DR: CoTutor is an AI-driven model that enhances Bayesian Knowledge Tracing with signal processing and generative AI to improve student progress modeling and deliver adaptive feedback, showing improved learning outcomes in university trials.

Details

Motivation: To advance learning science by combining knowledge tracing, signal processing, and generative AI to model student learning states and optimize education, inspired by Richard Hamming's vision of computer-aided 'learning to learn'.

Method: Proposes CoTutor model that enhances Bayesian Knowledge Tracing with signal processing techniques and convex optimization, deployed as an AI copilot combining generative AI with adaptive learning technology.

Result: In university trials, CoTutor demonstrated measurable improvements in learning outcomes while outperforming conventional educational tools, showing potential for AI-driven personalization and scalability.

Conclusion: CoTutor successfully applies convex optimization and signal processing to automate learning analytics while reserving pedagogical judgment for humans, ensuring AI facilitates knowledge tracing while enabling learners to uncover new insights, with future opportunities for advancing privacy and ethical considerations.

Abstract: Learning to learn is becoming a science, driven by the convergence of knowledge tracing, signal processing, and generative AI to model student learning states and optimize education. We propose CoTutor, an AI-driven model that enhances Bayesian Knowledge Tracing with signal processing techniques to improve student progress modeling and deliver adaptive feedback and strategies. Deployed as an AI copilot, CoTutor combines generative AI with adaptive learning technology. In university trials, it has demonstrated measurable improvements in learning outcomes while outperforming conventional educational tools. Our results highlight its potential for AI-driven personalization, scalability, and future opportunities for advancing privacy and ethical considerations in educational technology. Inspired by Richard Hamming’s vision of computer-aided ’learning to learn,’ CoTutor applies convex optimization and signal processing to automate and scale up learning analytics, while reserving pedagogical judgment for humans, ensuring AI facilitates the process of knowledge tracing while enabling learners to uncover new insights.

[945] Do Repetitions Matter? Strengthening Reliability in LLM Evaluations

Miguel Angel Alvarado Gonzalez, Michelle Bruno Hernandez, Miguel Angel Peñaloza Perez, Bruno Lopez Orozco, Jesus Tadeo Cruz Soto, Sandra Malagon

Main category: cs.AI

TL;DR: Single-run LLM evaluations are unreliable; using ≥2 repetitions significantly improves ranking stability while remaining feasible for small teams.

Details

Motivation: Current LLM leaderboards rely on single stochastic runs, but the required number of repetitions for reliable conclusions is unclear, leading to potentially brittle rankings.

Method: Re-evaluated eight state-of-the-art models on AI4Math Benchmark with three independent runs per setting, using mixed-effects logistic regression, domain-level marginal means, rank-instability analysis, and run-to-run reliability assessment.

Result: Single-run leaderboards are brittle: 83% of slices invert at least one pairwise rank. Two runs remove ~83% of single-run inversions. Averaging runs yields modest standard error shrinkage (~5%) but large ranking gains.

Conclusion: Treat evaluation as an experiment, report uncertainty, and use ≥2 repetitions under stochastic decoding to improve robustness while remaining feasible for small teams.

Abstract: LLM leaderboards often rely on single stochastic runs, but how many repetitions are required for reliable conclusions remains unclear. We re-evaluate eight state-of-the-art models on the AI4Math Benchmark with three independent runs per setting. Using mixed-effects logistic regression, domain-level marginal means, rank-instability analysis, and run-to-run reliability, we assessed the value of additional repetitions. Our findings shows that Single-run leaderboards are brittle: 10/12 slices (83%) invert at least one pairwise rank relative to the three-run majority, despite a zero sign-flip rate for pairwise significance and moderate overall interclass correlation. Averaging runs yields modest SE shrinkage ($\sim$5% from one to three) but large ranking gains; two runs remove $\sim$83% of single-run inversions. We provide cost-aware guidance for practitioners: treat evaluation as an experiment, report uncertainty, and use $\geq 2$ repetitions under stochastic decoding. These practices improve robustness while remaining feasible for small teams and help align model comparisons with real-world reliability.

[946] Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs

Shreyas Singh, Kunal Singh, Pradeep Moturi

Main category: cs.AI

TL;DR: Fathom-DeepResearch is a two-model agentic system consisting of Fathom-Search-4B for evidence-based web investigation and Fathom-Synthesizer-4B for converting search traces into structured research reports, achieving state-of-the-art performance on various benchmarks.

Details

Motivation: To address the need for effective tool-integrated reasoning in agentic applications, particularly for complex information-seeking tasks where DeepResearch Agents have shown strong performance.

Method: The system uses two specialized models: (1) Fathom-Search-4B trained with DUETQA dataset, RAPO reinforcement learning, and steerable step-level rewards for web investigation; (2) Fathom-Synthesizer-4B for converting search traces into structured reports. Key innovations include multi-agent self-play dataset generation, curriculum pruning, and reward-aware advantage scaling.

Result: Achieves state-of-the-art performance in open-weights category on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, with strong generalization to diverse reasoning tasks including HLE, AIME-25, GPQA-Diamond, and MedQA. Enables reliable tool-calling beyond 20 calls when needed.

Conclusion: The Fathom-DeepResearch system demonstrates effective tool-integrated reasoning through specialized models for search and synthesis, achieving superior performance on complex information-seeking tasks while maintaining generalization capabilities across diverse reasoning domains.

Abstract: Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a 5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category while demonstrating strong generalization to diverse reasoning tasks including HLE, AIME-25, GPQA-Diamond, and MedQA.

[947] Transparent, Evaluable, and Accessible Data Agents: A Proof-of-Concept Framework

Nooshin Bahador

Main category: cs.AI

TL;DR: A modular AI agent architecture that enables natural language interaction with enterprise data warehouses, featuring transparent decision-making, automated evaluation, and statistical context for trustworthy deployment in high-stakes domains.

Details

Motivation: To bridge the gap between natural language interfaces and complex enterprise data warehouses, enabling non-technical users to access data while ensuring transparency and reliability in high-stakes business environments.

Method: Component-based architecture with multi-layered reasoning framework, automated evaluation system, statistical context module, and integration with BigQuery ecosystem for secure data retrieval and business rule application.

Result: Successfully demonstrated through an insurance claims processing case study, creating a robust, evaluable system that generates human-auditable justifications and supports conclusions with quantitative evidence.

Conclusion: The integrated agent-development-with-evaluation framework provides a trustworthy solution for deploying LLM-powered agents in data-sensitive domains, ensuring transparency, reliability, and quantitative evidence-based decision making.

Abstract: This article presents a modular, component-based architecture for developing and evaluating AI agents that bridge the gap between natural language interfaces and complex enterprise data warehouses. The system directly addresses core challenges in data accessibility by enabling non-technical users to interact with complex data warehouses through a conversational interface, translating ambiguous user intent into precise, executable database queries to overcome semantic gaps. A cornerstone of the design is its commitment to transparent decision-making, achieved through a multi-layered reasoning framework that explains the “why” behind every decision, allowing for full interpretability by tracing conclusions through specific, activated business rules and data points. The architecture integrates a robust quality assurance mechanism via an automated evaluation framework that serves multiple functions: it enables performance benchmarking by objectively measuring agent performance against golden standards, and it ensures system reliability by automating the detection of performance regressions during updates. The agent’s analytical depth is enhanced by a statistical context module, which quantifies deviations from normative behavior, ensuring all conclusions are supported by quantitative evidence including concrete data, percentages, and statistical comparisons. We demonstrate the efficacy of this integrated agent-development-with-evaluation framework through a case study on an insurance claims processing system. The agent, built on a modular architecture, leverages the BigQuery ecosystem to perform secure data retrieval, apply domain-specific business rules, and generate human-auditable justifications. The results confirm that this approach creates a robust, evaluable, and trustworthy system for deploying LLM-powered agents in data-sensitive, high-stakes domains.

[948] Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models

Yuhui Wang, Changjiang Li, Guangke Chen, Jiacheng Liang, Ting Wang

Main category: cs.AI

TL;DR: Large reasoning models exhibit inconsistency between their reasoning traces and final answers due to competing mechanisms of CoT reasoning and memory retrieval, which can be exploited to hack reward signals during fine-tuning.

Details

Motivation: To understand why LRMs' final answers often contradict their own reasoning traces, and to address the limitation where models exploit retrieval mechanisms as shortcuts during fine-tuning.

Method: Conducted controlled experiments with misleading cues during reasoning and corrupted answers during retrieval. Introduced FARL framework that integrates memory unlearning with reinforcement learning to suppress retrieval shortcuts.

Result: Confirmed both CoT reasoning and memory retrieval operate simultaneously, with dominance influenced by problem domains, model scales, and fine-tuning approaches. FARL promotes reasoning-dominant behavior and enhances generalizable reasoning capabilities.

Conclusion: Current reasoning fine-tuning paradigms have critical limitations where models can exploit retrieval mechanisms, undermining genuine reasoning development. FARL provides an effective solution by suppressing these shortcuts.

Abstract: Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. However, recent studies reveal that their final answers often contradict their own reasoning traces. We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval. To test this hypothesis, we conduct controlled experiments that challenge LRMs with misleading cues during reasoning and/or corrupted answers during retrieval. Our results across models and datasets confirm that both mechanisms operate simultaneously, with their relative dominance influenced by multiple factors: problem domains, model scales, and fine-tuning approaches (e.g., reinforcement learning vs. distillation). The findings reveal a critical limitation in current reasoning fine-tuning paradigms: models can exploit the retrieval mechanism as a shortcut, effectively “hacking” the reward signal and undermining genuine reasoning development. To address this challenge, we introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning. By carefully suppressing retrieval shortcuts during the fine-tuning process, FARL promotes reasoning-dominant behavior and enhances generalizable reasoning capabilities.

[949] Robust Preference Optimization: Aligning Language Models with Noisy Preference Feedback

Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu

Main category: cs.AI

TL;DR: RPO is a meta-framework that transforms existing preference alignment algorithms into robust versions by addressing noise and heterogeneity in human preference data through an EM-based adaptive re-weighting approach.

Details

Motivation: Standard human preference alignment methods assume homogeneous and noiseless human preferences, but in reality preferences are pluralistic and annotations contain errors, creating discrepancies that degrade model performance.

Method: Uses Expectation-Maximization to infer posterior probabilities of label correctness, adaptively re-weighing data points in training loss. Establishes theoretical link between preference losses and probabilistic models to systematically transform existing algorithms into robust counterparts.

Result: RPO consistently enhances four state-of-the-art alignment algorithms (DPO, IPO, SimPO, CPO). On Mistral and Llama 3 models, achieves up to 7.0% and 5.4% win rate gains on AlpacaEval 2 and Arena-Hard respectively.

Conclusion: RPO provides a robust meta-framework for preference alignment that addresses real-world noise and heterogeneity in human preference data, with theoretical guarantees and empirical improvements across multiple algorithms and benchmarks.

Abstract: Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone technology for aligning Large Language Models (LLMs) with human values. However, these methods are all underpinned by a critical, yet flawed assumption: human preferences are homogeneous (representing a single, unified preference) and the collected data is noiseless (free from error). In reality, neither is true since human preference is pluralistic and annotators can make mistakes. This creates a discrepancy between the recorded data and the ground-truth preferences, which can misguide the model and degrade its performance. To address this challenge, we introduce Robust Preference Optimization (RPO). RPO employs an Expectation-Maximization (EM) algorithm to infer the posterior probability of each label’s correctness, which is used to adaptively re-weigh each data point in the training loss to mitigate noise. We further generalize this approach by establishing a theoretical link between arbitrary preference losses and their corresponding probabilistic models. This generalization enables the systematic transformation of existing alignment algorithms into their robust counterparts, elevating RPO from a specific algorithm to a meta-framework for robust preference alignment. Theoretically, we prove that under the condition of a perfectly calibrated model, RPO is guaranteed to converge to the true noise level of the dataset. Our experiments demonstrate RPO’s effectiveness as a meta-framework, consistently enhancing four state-of-the-art alignment algorithms (DPO, IPO, SimPO, and CPO). When applied to Mistral and Llama 3 models, the RPO-enhanced methods achieve substantial win rate gains on AlpacaEval 2 and Arena-Hard, with improvements of up to 7.0% and 5.4%, respectively.

[950] Humanline: Online Alignment as Perceptual Loss

Sijia Liu, Niklas Muennighoff, Kawin Ethayarajh

Main category: cs.AI

TL;DR: Online alignment methods like GRPO outperform offline methods like DPO due to better approximation of human-perceived probability distributions. PPO/GRPO clipping recovers human perceptual biases in probability perception. The paper proposes ‘humanline’ variants that incorporate these perceptual distortions into objectives, enabling offline training to match online performance.

Details

Motivation: To understand why online alignment methods perform better than offline methods, and to develop more efficient training approaches that don't sacrifice performance while being faster and cheaper.

Method: Proposes humanline variants of alignment objectives (DPO/KTO/GRPO) that explicitly incorporate human perceptual distortions of probability, allowing training with offline off-policy data while maintaining performance.

Result: Humanline variants trained with offline off-policy data can match the performance of their online counterparts on both verifiable and unverifiable tasks.

Conclusion: The online/offline dichotomy is incidental to maximizing human utility, and selective training that mimics human perception can achieve the same effect more efficiently.

Abstract: Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO) – but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and PPO/GRPO-style clipping – originally introduced to just stabilize training – recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO act as perceptual losses already. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility, since we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so would allow us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts on both verifiable and unverifiable tasks.

[951] ELHPlan: Efficient Long-Horizon Task Planning for Multi-Agent Collaboration

Shaobin Ling, Yun Wang, Chenyou Fan, Tin Lun Lam, Junjie Hu

Main category: cs.AI

TL;DR: ELHPlan introduces Action Chains as planning primitives for LLM-based multi-robot collaboration, achieving comparable task success with 76% fewer tokens than state-of-the-art methods.

Details

Motivation: Current LLM-based multi-robot collaboration methods face fundamental trade-offs: declarative approaches lack adaptability in dynamic environments, while iterative methods have prohibitive computational costs that scale poorly with team size and task complexity.

Method: ELHPlan uses Action Chains (sequences of actions bound to sub-goal intentions) as planning primitives in a cyclical process: 1) construct intention-bound action sequences, 2) proactively validate for conflicts and feasibility, 3) refine issues through targeted mechanisms, and 4) execute validated actions.

Result: Experiments on TDW-MAT and C-WAH benchmarks show ELHPlan achieves comparable task success rates while consuming only 24% of the tokens required by state-of-the-art methods.

Conclusion: ELHPlan establishes a new efficiency-effectiveness frontier for LLM-based multi-agent planning systems by balancing adaptability and efficiency through sufficient planning horizons while avoiding expensive full re-planning.

Abstract: Large Language Models (LLMs) enable intelligent multi-robot collaboration but face fundamental trade-offs: declarative methods lack adaptability in dynamic environments, while iterative methods incur prohibitive computational costs that scale poorly with team size and task complexity. In this paper, we propose ELHPlan, a novel framework that introduces Action Chains–sequences of actions explicitly bound to sub-goal intentions–as the fundamental planning primitive. ELHPlan operates via a cyclical process: 1) constructing intention-bound action sequences, 2) proactively validating for conflicts and feasibility, 3) refining issues through targeted mechanisms, and 4) executing validated actions. This design balances adaptability and efficiency by providing sufficient planning horizons while avoiding expensive full re-planning. We further propose comprehensive efficiency metrics, including token consumption and planning time, to more holistically evaluate multi-agent collaboration. Our experiments on benchmark TDW-MAT and C-WAH demonstrate that ELHPlan achieves comparable task success rates while consuming only 24% of the tokens required by state-of-the-art methods. Our research establishes a new efficiency-effectiveness frontier for LLM-based multi-agent planning systems.

[952] Learning to Ponder: Adaptive Reasoning in Latent Space

Yixin He, Lumingyuan Tang

Main category: cs.AI

TL;DR: FR-Ponder is a framework that enables LLMs to adaptively allocate reasoning compute per input instance using latent steering, avoiding uniform computation across all queries and improving efficiency.

Details

Motivation: Current approaches like Best-of-N and majority voting apply uniform reasoning depth across all inputs, wasting computation on simple queries while potentially under-thinking complex ones.

Method: A small controller observes hidden states and applies ponder steps by adding pre-computed steering vectors to frozen representations, using Group Relative Policy Optimization to regulate reasoning depth adaptively.

Result: On GSM8K and MATH500, FR-Ponder improves the compute-accuracy frontier with lower FLOPs and better matched accuracy compared to early-exit baselines, without modifying backbone weights.

Conclusion: The method successfully learns calibrated compute allocation correlated with problem difficulty and demonstrates interpretable steering directions.

Abstract: Test-time compute has emerged as a key paradigm for enhancing LLM reasoning, yet prevailing approaches like Best-of-N and majority voting apply uniform depth across inputs, wasting computation on simple queries while potentially under-thinking complex ones. We present FR-Ponder, a single-graph, backbone-training-free framework that allocates instance-adaptive reasoning compute via latent steering. A less than 1M-param controller observes hidden states and decides to halt or apply a small ponder step by adding a pre-computed steering vector to frozen representations. Our method extracts the latent steering vector associated with deeper reasoning outputs and direct IO from LLM and re-applies it through a tunable scaling factor, allowing the model to adapt its reasoning depth to the complexity of each input. To balance performance and computational cost, we employ Group Relative Policy Optimization (GRPO) as a reward signal to adaptively regulate reasoning depth, achieving task accuracy while mitigating overreasoning. Through curriculum learning and careful reward engineering, FR-Ponder learns calibrated compute allocation correlated with problem difficulty. On GSM8K and MATH500, FR-Ponder improves the compute-accuracy frontier, delivering lower FLOPs with better matched accuracy and comparing favorably to early-exit baselines, without modifying backbone weights. Analyses visualize interpretable steering directions and show learned compute allocation correlates with problem difficulty.

[953] Model Merging Scaling Laws in Large Language Models

Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang

Main category: cs.AI

TL;DR: The paper identifies a power law for language model merging that predicts performance gains based on model size and number of experts, enabling predictive planning for merging strategies.

Details

Motivation: To establish quantitative scaling laws for language model merging, which is widely used in practice but lacks predictive rules for returns when adding experts or scaling model size.

Method: Empirical study of scaling laws measured by cross-entropy, analyzing how merging performance scales with model size and number of experts across diverse architectures and merging methods (Average, TA, TIES, DARE).

Result: Identified a compact power law that holds in-domain and cross-domain, tightly fitting measured curves. The law shows size-dependent floor decreases with model capacity, while merging tail exhibits diminishing returns in expert number. Gains fall roughly as 1/k, with most gains arriving early and variability shrinking with more experts.

Conclusion: The scaling law enables predictive planning for merging strategies, allowing estimation of expert requirements, stopping criteria, and trade-offs between scaling base models versus adding experts. This turns merging from heuristic practice into a computationally efficient, planable alternative to multitask training.

Abstract: We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget–turning merging from heuristic practice into a computationally efficient, planable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.

[954] SpecExit: Accelerating Large Reasoning Model via Speculative Exit

Rubing Yang, Huajun Bai, Song Liu, Guanghua Yu, Runzhi Fan, Yanbin Dang, Jiejing Zhang, Kai Liu, Jianchen Zhu, Peng Chen

Main category: cs.AI

TL;DR: SpecExit is a novel framework that reduces overthinking in large reasoning models by predicting both future tokens and early-exit signals from a lightweight draft model, achieving 66% shorter generation length and 2.5x speedup without accuracy loss.

Details

Motivation: Large reasoning models suffer from overthinking, producing unnecessarily long outputs with high latency, limiting real-world deployment. Existing early-exit methods have detection overhead that reduces latency gains and generalizability.

Method: Proposed SpecExit framework that predicts future tokens and early-exit signals directly from a lightweight draft model using hidden states, eliminating probing overhead. Inspired by speculative decoding approaches.

Result: Reduces average generation length by 66% and achieves 2.5x speedup in end-to-end latency compared to speculative decoding baseline, without compromising accuracy.

Conclusion: Hidden states provide effective early-exit signals for efficient reasoning, suggesting broader use of hidden states for optimizing model efficiency and reducing overthinking.

Abstract: Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.

[955] Interactive Program Synthesis for Modeling Collaborative Physical Activities from Narrated Demonstrations

Edward Kim, Daniel He, Jorge Chao, Wiktor Rajca, Mohammed Amin, Nishant Malpani, Ruta Desai, Antti Oulasvirta, Bjoern Hartmann, Sanjit Seshia

Main category: cs.AI

TL;DR: The paper presents a system that teaches collaborative physical tasks using program synthesis from narrated demonstrations, enabling users to inspect and correct learned behaviors without coding.

Details

Motivation: Teaching collaborative physical tasks is complex because systems must infer users' assumptions about teammate intent, which requires interpretable and correctable representations.

Method: Frames collaborative task learning as program synthesis, representing behavior as editable programs using narrated demonstrations (paired physical actions and natural language) as a unified teaching modality.

Result: In a study with 20 users teaching soccer tactics, 70% successfully refined learned programs to match their intent and 90% found it easy to correct programs.

Conclusion: The approach enables effective teaching of collaborative physical activities, though challenges remain in program representation and teaching methods that require mitigation strategies.

Abstract: Teaching systems physical tasks is a long standing goal in HCI, yet most prior work has focused on non collaborative physical activities. Collaborative tasks introduce added complexity, requiring systems to infer users assumptions about their teammates intent, which is an inherently ambiguous and dynamic process. This necessitates representations that are interpretable and correctable, enabling users to inspect and refine system behavior. We address this challenge by framing collaborative task learning as a program synthesis problem. Our system represents behavior as editable programs and uses narrated demonstrations, i.e. paired physical actions and natural language, as a unified modality for teaching, inspecting, and correcting system logic without requiring users to see or write code. The same modality is used for the system to communicate its learning to users. In a within subjects study, 20 users taught multiplayer soccer tactics to our system. 70 percent (14/20) of participants successfully refined learned programs to match their intent and 90 percent (18/20) found it easy to correct the programs. The study surfaced unique challenges in representing learning as programs and in enabling users to teach collaborative physical activities. We discuss these issues and outline mitigation strategies.

[956] Rethinking and Benchmarking Large Language Models for Graph Reasoning

Yuwei Hu, Xinyi Huang, Zhewei Wei, Yongchao Liu, Chuntao Hong

Main category: cs.AI

TL;DR: LLMs for graph reasoning are underperforming due to improper focus on replicating algorithms rather than designing them. A new benchmark GraphAlgorithm and Simple-RTC method achieve near-perfect accuracy.

Details

Motivation: Current LLM approaches for graph reasoning focus on replicating existing graph algorithms rather than designing new ones, leading to underwhelming performance despite the potential of LLMs.

Method: Proposed Simple-Reasoning-Then-Coding (Simple-RTC) which guides LLMs to first design graph algorithms and then code to solve graph reasoning tasks. Also constructed GraphAlgorithm benchmark with 239 problems and 3,041 test instances.

Result: Simple-RTC achieves near-perfect accuracy on existing benchmarks and significantly outperforms GPT-4o-mini and all prior methods on the new GraphAlgorithm benchmark.

Conclusion: Redirecting reasoning focus from algorithm replication to algorithm design enables LLMs to solve most graph reasoning tasks effectively, providing a strong baseline for future research.

Abstract: Large Language Models (LLMs) for Graph Reasoning have been extensively studied over the past two years, involving enabling LLMs to understand graph structures and reason on graphs to solve various graph problems, with graph algorithm problems being the most prevalent. Recent studies underscore the potential of LLMs in handling graph reasoning tasks, but their performance is underwhelming. In this work, we point out issues with existing methods and benchmarks, and rethink the direction that LLMs for graph reasoning should strive toward. We find that base models, e.g., GPT-4o-mini, are largely underestimated due to improper reasoning focus. Base models with reasoning focus redirected from replicating graph algorithms to designing them can easily solve most graph reasoning tasks in existing benchmarks. To truly evaluate the graph reasoning capabilities of LLMs, we construct a more challenging GraphAlgorithm benchmark, comprising 239 different graph problems and 3,041 test instances collected from 4 competition platforms. Finally, we introduce a simple and strong baseline Simple-Reasoning-Then-Coding (Simple-RTC)-which guides LLMs to design graph algorithms first and then code to address graph reasoning tasks. Simple-RTC achieves near-perfect accuracy on existing benchmarks and significantly outperforms GPT-4o-mini and all prior methods on the GraphAlgorithm benchmark. This strong baseline encourages further advancements in LLMs for Graph Reasoning in the future.

[957] Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models

Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, Lin Yan

Main category: cs.AI

TL;DR: RLVR improves LLMs on reasoning tasks but suffers from exploration limitations. The proposed Risk-Sensitive RL framework (RS-GRPO) enhances exploration and maintains pass@1 while boosting pass@k performance.

Details

Motivation: Existing RLVR methods face an exploration dilemma where pre-trained LLMs' peaked initial policies limit solution diversity and multi-solution performance (pass@k), causing RL to distill existing capabilities rather than discover new reasoning strategies.

Method: Introduces a Risk-Sensitive Reinforcement Learning framework with a risk-seeking objective that interpolates between mean and maximum rewards, leading to RS-GRPO algorithm that amplifies learning from challenging prompts.

Result: RS-GRPO consistently improves pass@k performance while maintaining or enhancing pass@1 accuracy across six mathematical reasoning benchmarks and five different LLMs.

Conclusion: Risk-sensitive RL effectively addresses the exploration dilemma in RLVR, enabling better multi-solution performance without sacrificing single-solution accuracy through simple implementation modifications.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. However, existing methods suffer from an exploration dilemma: the sharply peaked initial policies of pre-trained LLMs confine standard RL algorithms to a narrow set of solutions, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. To overcome this, we introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm, Risk-Sensitive GRPO (RS-GRPO), which drives deeper exploration by amplifying learning from challenging prompts. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications. On six mathematical reasoning benchmarks and with five different LLMs, RS-GRPO consistently improves pass@k performance while maintaining or enhancing pass@1 accuracy.

[958] PAME-AI: Patient Messaging Creation and Optimization using Agentic AI

Junjie Luo, Yihong Guo, Anqi Liu, Ritu Agarwal, Gordon, Gao

Main category: cs.AI

TL;DR: PAME-AI is an agentic AI system that optimizes patient messaging by transforming raw data into actionable design strategies, achieving 12.2% improvement in engagement rates compared to baseline.

Details

Motivation: Traditional mobile message design has limitations in exploring high-dimensional design spaces for healthcare communication, which is critical for improving medication adherence and healthy behaviors.

Method: Built on the DIKW hierarchy, PAME-AI uses specialized computational agents to progressively transform raw experimental data into actionable message design strategies through a two-stage experiment with over 500,000 patient encounters.

Result: The best-performing generated message achieved 68.76% engagement compared to 61.27% baseline, representing a 12.2% relative improvement in click-through rates.

Conclusion: The agentic architecture enables parallel processing, hypothesis validation, and continuous learning, making it suitable for large-scale healthcare communication optimization.

Abstract: Messaging patients is a critical part of healthcare communication, helping to improve things like medication adherence and healthy behaviors. However, traditional mobile message design has significant limitations due to its inability to explore the high-dimensional design space. We develop PAME-AI, a novel approach for Patient Messaging Creation and Optimization using Agentic AI. Built on the Data-Information-Knowledge-Wisdom (DIKW) hierarchy, PAME-AI offers a structured framework to move from raw data to actionable insights for high-performance messaging design. PAME-AI is composed of a system of specialized computational agents that progressively transform raw experimental data into actionable message design strategies. We demonstrate our approach’s effectiveness through a two-stage experiment, comprising of 444,691 patient encounters in Stage 1 and 74,908 in Stage 2. The best-performing generated message achieved 68.76% engagement compared to the 61.27% baseline, representing a 12.2% relative improvement in click-through rates. This agentic architecture enables parallel processing, hypothesis validation, and continuous learning, making it particularly suitable for large-scale healthcare communication optimization.

[959] AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models

Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu, Baoyuan Wu

Main category: cs.AI

TL;DR: AdvChain is an alignment paradigm that addresses the snowball effect in Chain-of-Thought reasoning by teaching models dynamic self-correction through adversarial tuning, improving safety-utility balance without compromising reasoning capabilities.

Details

Motivation: Current safety CoT tuning methods suffer from the snowball effect where minor reasoning deviations amplify throughout multi-step reasoning, leading to harmful compliance or excessive refusal, due to models imitating perfect reasoning without learning self-correction.

Method: AdvChain uses adversarial CoT tuning with a dataset containing Temptation-Correction and Hesitation-Correction samples, teaching models to recover from harmful reasoning drifts and unnecessary cautions through dynamic self-correction.

Result: Extensive experiments show AdvChain significantly enhances robustness against jailbreak attacks and CoT hijacking while substantially reducing over-refusal on benign prompts.

Conclusion: AdvChain establishes a new direction for building more robust and reliable reasoning models by addressing the fundamental limitation of current safety alignment methods in multi-step reasoning contexts.

Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in complex problem-solving through Chain-of-Thought (CoT) reasoning. However, the multi-step nature of CoT introduces new safety challenges that extend beyond conventional language model alignment. We identify a failure mode in current safety CoT tuning methods: the \textit{snowball effect}, where minor reasoning deviations progressively amplify throughout the thought process, leading to either harmful compliance or excessive refusal. This effect stems from models being trained to imitate perfect reasoning scripts without learning to self-correct. To address this limitation, we propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our method involves constructing a dataset containing Temptation-Correction and Hesitation-Correction samples, where models learn to recover from harmful reasoning drifts and unnecessary cautions. Extensive experiments show that AdvChain significantly enhances robustness against jailbreak attacks and CoT hijacking while substantially reducing over-refusal on benign prompts, achieving a superior safety-utility balance without compromising reasoning capabilities. Our work establishes a new direction for building more robust and reliable reasoning models.

[960] G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge

Linhao Luo, Zicheng Zhao, Junnan Liu, Zhangchi Qiu, Junnan Dong, Serge Panev, Chen Gong, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung, Alan Wee-Chung Liew, Shirui Pan

Main category: cs.AI

TL;DR: G-reasoner is a unified framework that integrates graph and language foundation models for reasoning over diverse graph-structured knowledge, addressing limitations of existing retrieval-augmented generation methods.

Details

Motivation: Existing RAG methods struggle with knowledge-intensive tasks due to fragmented information and weak modeling of knowledge structure, while LLMs cannot effectively reason over graph-structured data.

Method: Proposes QuadGraph (standardized four-layer graph abstraction), a 34M-parameter graph foundation model (GFM) that captures graph topology and textual semantics, and integrates it with LLMs using mixed-precision training and distributed message-passing.

Result: Outperforms state-of-the-art baselines on six benchmarks, significantly enhances LLM reasoning, and achieves strong efficiency and cross-graph generalization.

Conclusion: G-reasoner provides an effective unified framework for graph-enhanced reasoning that overcomes limitations of existing methods while maintaining scalability and generalization.

Abstract: Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Retrieval-augmented generation (RAG) mitigates this by incorporating external knowledge, yet existing RAGs struggle with knowledge-intensive tasks due to fragmented information and weak modeling of knowledge structure. Graphs offer a natural way to model relationships within knowledge, but LLMs are inherently unstructured and cannot effectively reason over graph-structured data. Recent graph-enhanced RAG (GraphRAG) attempts to bridge this gap by constructing tailored graphs and enabling LLMs to reason on them. However, these methods often depend on ad-hoc graph designs, heuristic search, or costly agent pipelines, which hinder scalability and generalization. To address these challenges, we present G-reasoner, a unified framework that integrates graph and language foundation models for reasoning over diverse graph-structured knowledge. Central to our approach is QuadGraph, a standardized four-layer abstraction that unifies heterogeneous knowledge sources into a common graph representation. Building on this, we introduce a 34M-parameter graph foundation model (GFM) that jointly captures graph topology and textual semantics, and is integrated with LLMs to enhance reasoning in downstream applications. To ensure scalability and efficiency, mixed-precision training and distributed message-passing are implemented to scale GFM with more GPUs. Extensive experiments on six benchmarks show that G-reasoner consistently outperforms state-of-the-art baselines, significantly enhances LLM reasoning, and achieves strong efficiency and cross-graph generalization.

[961] SCI-Verifier: Scientific Verifier with Thinking

Shenghe Zheng, Chenyu Huang, Fangchen Yu, Junchi Yao, Jingqi Ye, Tao Chen, Yun Luo, Ning Ding, LEI BAI, Ganqu Cui, Peng Ye

Main category: cs.AI

TL;DR: The paper addresses challenges in verifying LLM-generated scientific answers by proposing SCI-VerifyBench (a cross-disciplinary benchmark) and SCI-Verifier (a reasoning-augmented verification model) to improve systematic evaluation and verification capabilities.

Details

Motivation: Existing verification methods for LLMs in scientific domains lack systematic evaluation standards, have insufficient disciplinary coverage, and rely heavily on manual rule design or prompt engineering, limiting their effectiveness in complex reasoning scenarios and cross-disciplinary generalization.

Method: Two-pronged approach: (1) Data level - construct SCI-VerifyBench benchmark covering mathematics, physics, biology, chemistry, and general scientific QA using real LLM responses enhanced with domain-specific equivalence transformations; (2) Model level - develop SCI-Verifier, a unified reasoning-augmented verifier trained through post-training to enhance logical reasoning and equivalence judgment capabilities.

Result: SCI-VerifyBench provides rigorous evaluation with model-based and expert annotations ensuring quality and diversity. SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs.

Conclusion: The proposed framework offers systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains through principled verification approaches.

Abstract: As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.

[962] Experience Paper: Adopting Activity Recognition in On-demand Food Delivery Business

Huatao Xu, Yan Zhang, Wei Gao, Guobin Shen, Mo Li

Main category: cs.AI

TL;DR: First nationwide deployment of human activity recognition (HAR) technology in food delivery industry using adapted LIMU-BERT model, scaling from city pilot to 500,000 couriers across China with significant operational and economic benefits.

Details

Motivation: To demonstrate the real-world application and transformative potential of HAR technology in the on-demand food delivery industry at a nationwide scale.

Method: Adapted state-of-the-art LIMU-BERT foundation model and deployed through three phases over two years, starting with feasibility study in Yangzhou City then scaling to nationwide adoption.

Result: Successfully deployed to 500,000 couriers across 367 cities in China, enabling downstream applications and demonstrating significant operational and economic benefits through large-scale tests.

Conclusion: HAR technology has transformative potential in real-world applications, with lessons learned from deployment and open-sourcing of LIMU-BERT model pretrained with millions of hours of sensor data.

Abstract: This paper presents the first nationwide deployment of human activity recognition (HAR) technology in the on-demand food delivery industry. We successfully adapted the state-of-the-art LIMU-BERT foundation model to the delivery platform. Spanning three phases over two years, the deployment progresses from a feasibility study in Yangzhou City to nationwide adoption involving 500,000 couriers across 367 cities in China. The adoption enables a series of downstream applications, and large-scale tests demonstrate its significant operational and economic benefits, showcasing the transformative potential of HAR technology in real-world applications. Additionally, we share lessons learned from this deployment and open-source our LIMU-BERT pretrained with millions of hours of sensor data.

[963] MedMMV: A Controllable Multimodal Multi-Agent Framework for Reliable and Verifiable Clinical Reasoning

Hongjun Liu, Yinghao Zhu, Yuhui Wang, Yitao Long, Zeyu Lai, Lequan Yu, Chen Zhao

Main category: cs.AI

TL;DR: MedMMV is a controllable multimodal multi-agent framework that addresses instability and hallucination in medical AI systems by using diversified short rollouts, evidence grounding, and uncertainty scoring to improve clinical reasoning reliability.

Details

Motivation: Current MLLMs show promising medical performance but suffer from critical instability in early evidence interpretation that leads to hallucination and inconsistent conclusions, highlighting the need for more reliable clinical reasoning agents.

Method: MedMMV stabilizes reasoning through diversified short rollouts, grounds intermediate steps in a structured evidence graph supervised by a Hallucination Detector, and aggregates candidate paths with a Combined Uncertainty scorer.

Result: On six medical benchmarks, MedMMV improves accuracy by up to 12.7% and demonstrates superior reliability. Blind physician evaluations confirm increased reasoning truthfulness without sacrificing informational content.

Conclusion: By controlling instability through a verifiable, multi-agent process, MedMMV provides a robust path toward deploying trustworthy AI systems in high-stakes clinical decision support domains.

Abstract: Recent progress in multimodal large language models (MLLMs) has demonstrated promising performance on medical benchmarks and in preliminary trials as clinical assistants. Yet, our pilot audit of diagnostic cases uncovers a critical failure mode: instability in early evidence interpretation precedes hallucination, creating branching reasoning trajectories that cascade into globally inconsistent conclusions. This highlights the need for clinical reasoning agents that constrain stochasticity and hallucination while producing auditable decision flows. We introduce MedMMV, a controllable multimodal multi-agent framework for reliable and verifiable clinical reasoning. MedMMV stabilizes reasoning through diversified short rollouts, grounds intermediate steps in a structured evidence graph under the supervision of a Hallucination Detector, and aggregates candidate paths with a Combined Uncertainty scorer. On six medical benchmarks, MedMMV improves accuracy by up to 12.7% and, more critically, demonstrates superior reliability. Blind physician evaluations confirm that MedMMV substantially increases reasoning truthfulness without sacrificing informational content. By controlling instability through a verifiable, multi-agent process, our framework provides a robust path toward deploying trustworthy AI systems in high-stakes domains like clinical decision support.

[964] humancompatible.detect: a Python Toolkit for Detecting Bias in AI Models

German M. Matilla, Jiri Nemecek, Illia Kryvoviaz, Jakub Marecek

Main category: cs.AI

TL;DR: A toolkit called humancompatible.detect for bias detection in AI systems that addresses scalability and computability issues of traditional distance estimation methods.

Details

Motivation: International regulations like the AI Act require measuring data quality and estimating bias in high-risk AI systems, but traditional methods face scalability (MMD) and computability (Wasserstein-1) challenges.

Method: The toolkit incorporates two new methods: maximum subgroup discrepancy (MSD) and subsampled ℓ∞ distances, with an easy-to-use API and comprehensive documentation.

Result: Developed humancompatible.detect toolkit that provides practical solutions for bias detection while overcoming limitations of traditional distance estimation approaches.

Conclusion: humancompatible.detect offers an effective solution for bias detection in AI systems, addressing regulatory requirements and technical challenges, and is available under Apache License 2.0.

Abstract: There is a strong recent emphasis on trustworthy AI. In particular, international regulations, such as the AI Act, demand that AI practitioners measure data quality on the input and estimate bias on the output of high-risk AI systems. However, there are many challenges involved, including scalability (MMD) and computability (Wasserstein-1) issues of traditional methods for estimating distances on measure spaces. Here, we present humancompatible.detect, a toolkit for bias detection that addresses these challenges. It incorporates two newly developed methods to detect and evaluate bias: maximum subgroup discrepancy (MSD) and subsampled $\ell_\infty$ distances. It has an easy-to-use API documented with multiple examples. humancompatible.detect is licensed under the Apache License, Version 2.0.

[965] Fin-Ally: Pioneering the Development of an Advanced, Commonsense-Embedded Conversational AI for Money Matters

Sarmistha Das, Priya Mathur, Ishani Sharma, Sriparna Saha, Kitsuchart Pasupa, Alka Maurya

Main category: cs.AI

TL;DR: Fin-Solution 2.O introduces Fin-Vault dataset and Fin-Ally model to address unprofessional responses in financial chatbots by integrating commonsense reasoning and human-like conversation dynamics.

Details

Motivation: Address the problem of unprofessional or flippant remarks in financial advisory chatbots due to large-scale fine-tuning of LLMs, and bridge the gap caused by scarcity of domain-specific datasets.

Method: Created Fin-Vault dataset (1,417 annotated multi-turn dialogues) and developed Fin-Ally model that integrates commonsense reasoning, politeness, and conversational dynamics using COMET-BART-embedded commonsense context and Direct Preference Optimization (DPO).

Result: The approach enables language models to generate more refined, textually precise, and professionally grounded financial guidance, extending beyond basic account management to personalized budgeting, real-time expense tracking, and automated financial planning.

Conclusion: This approach positions as a next-generation AI solution for FinTech sector by incorporating commonsense context to improve financial advisory chatbot responses.

Abstract: The exponential technological breakthrough of the FinTech industry has significantly enhanced user engagement through sophisticated advisory chatbots. However, large-scale fine-tuning of LLMs can occasionally yield unprofessional or flippant remarks, such as ``With that money, you’re going to change the world,’’ which, though factually correct, can be contextually inappropriate and erode user trust. The scarcity of domain-specific datasets has led previous studies to focus on isolated components, such as reasoning-aware frameworks or the enhancement of human-like response generation. To address this research gap, we present Fin-Solution 2.O, an advanced solution that 1) introduces the multi-turn financial conversational dataset, Fin-Vault, and 2) incorporates a unified model, Fin-Ally, which integrates commonsense reasoning, politeness, and human-like conversational dynamics. Fin-Ally is powered by COMET-BART-embedded commonsense context and optimized with a Direct Preference Optimization (DPO) mechanism to generate human-aligned responses. The novel Fin-Vault dataset, consisting of 1,417 annotated multi-turn dialogues, enables Fin-Ally to extend beyond basic account management to provide personalized budgeting, real-time expense tracking, and automated financial planning. Our comprehensive results demonstrate that incorporating commonsense context enables language models to generate more refined, textually precise, and professionally grounded financial guidance, positioning this approach as a next-generation AI solution for the FinTech sector. Dataset and codes are available at: https://github.com/sarmistha-D/Fin-Ally

[966] From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision

Jie Ma, Shihao Qi, Rui Xing, Ziang Yin, Bifan Wei, Jun Liu, Tongliang Liu

Main category: cs.AI

TL;DR: AMCS is an adaptive Monte Carlo search framework that improves process data generation for training Process Reward Models (PRMs) by dynamically allocating samples to uncertain reasoning steps and using adaptive exploration-exploitation policies, achieving state-of-the-art performance on mathematical reasoning benchmarks.

Details

Motivation: Existing methods for process data generation use fixed-budget sampling and inefficient search strategies, leading to inefficiency and inflexibility in training PRMs for mathematical reasoning.

Method: Proposes Adaptive Monte Carlo Search (AMCS) with two key components: (1) adaptive sample allocation that focuses more samples on uncertain reasoning steps, and (2) Monte Carlo algorithm with temporally adaptive policy that transitions from broad exploration to focused exploitation.

Result: Created MathSearch-200K dataset with 200K process supervision examples. Qwen2.5-Math-7B-PRM-AMCS achieved 76.2% accuracy on MATH500, outperforming all baseline PRMs. A 7B model supervised by AMCS surpassed a 72B model with weaker supervision, demonstrating strong generalization on out-of-distribution problems.

Conclusion: AMCS effectively addresses inefficiencies in process data generation through adaptive search strategies, enabling more efficient training of PRMs that achieve superior mathematical reasoning performance and strong generalization capabilities.

Abstract: The quality of process data plays a key role in training a Process Reward Model (PRM), which can enhance the complex mathematical reasoning capability of large language models. Existing methods estimate the quality of reasoning steps based on a fixed-budget sampling strategy and navigate a vast search space to perform path expansion during the automated data generation process, resulting in their inefficiency and inflexibility. To address these issues, we propose Adaptive Monte Carlo Search (AMCS), a framework that transforms data generation from fixed, static to adaptive, dynamic search at the level of node value estimation and path expansion. On one hand, AMCS adaptively refines estimation by allocating more samples to uncertain reasoning steps while using fewer samples for those that are easier to estimate. On the other hand, it enhances the path expansion through a Monte Carlo algorithm with a temporally adaptive policy that begins with broad exploration and gradually shifts toward exploiting the most promising directions. With AMCS, we construct a large-scale dataset MathSearch-200K of about 200K process supervision examples for training PRMs. To verify the effectiveness of our method, we conduct extensive experiments on four mathematical reasoning benchmarks. Experimental results show that Qwen2.5-Math-7B-PRM-AMCS achieves up to 76.2% accuracy on MATH500 with GLM-4-9B, outperforming all baseline PRMs. Notably, a 7B model supervised by Qwen2.5-Math-7B-PRM-AMCS surpasses a 72B model with weaker supervision. Moreover, Qwen2.5-Math-7B-PRM-AMCS maintains consistent advantages on out-of-distribution problems, demonstrating strong generalization capability. Our code is available at https://github.com/reml-group/AMCS.

[967] Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs

Shihao Qi, Jie Ma, Ziang Yin, Lingling Zhang, Jian Zhang, Jun Liu, Feng Tian, Tongliang Liu

Main category: cs.AI

TL;DR: PRISM is a novel framework that decouples mathematical reasoning into strategy planning and targeted execution stages, using adaptive routing to dynamically select optimal reasoning strategies for each problem.

Details

Motivation: Existing methods use fixed strategies that cannot adapt to problem-specific requirements, overlooking the trade-off between effectiveness and efficiency in mathematical reasoning.

Method: Curated MathStrat dataset capturing strategy preferences, trained lightweight Strategy Adapter for confidence distributions, and adaptive routing policy that dynamically selects reasoning strategies based on predictor confidence.

Result: PRISM consistently outperforms individual strategies and ensemble baselines across five mathematical reasoning benchmarks, achieving 0.9% to 7.6% improvements across different base models.

Conclusion: The adaptive routing approach shows strong benefits for mathematical reasoning tasks across diverse model architectures, demonstrating the effectiveness of instance-specific strategy selection.

Abstract: Existing methods usually leverage a fixed strategy, such as natural language reasoning, code-augmented reasoning, tool-integrated reasoning, or ensemble-based reasoning, to guide Large Language Models (LLMs) to perform mathematical reasoning. Our analysis reveals that the single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and efficiency. To address these issues, we propose Planning and Routing through Instance-Specific Modeling (PRISM), a novel framework that decouples mathematical reasoning into two stages: strategy planning and targeted execution. Specifically, we first curate a multi-strategy preference dataset, which we call MathStrat, capturing correctness, process quality, and computational efficiency for each problem–strategy pair. Then, we train a lightweight Strategy Adapter based on the dataset to obtain confidence distributions over the mentioned four reasoning strategies. At inference time, an adaptive routing policy dynamically tailors the reasoning approach based on predictor confidence. It directs the model to use single-strategy execution for high-confidence predictions, dual-strategy verification for competitive scenarios, or comprehensive multi-strategy exploration for uncertain cases. Extensive experiments across five mathematical reasoning benchmarks demonstrate that PRISM consistently outperforms individual strategies and ensemble baselines, achieving improvements ranging from 0.9% to 7.6% across different base models. The adaptive routing approach shows particularly strong benefits for mathematical reasoning tasks across diverse model architectures. Our code is released at https://github.com/reml-group/PRISM.

[968] Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu

Main category: cs.AI

TL;DR: The paper addresses safety issues in Large Reasoning Models’ chain-of-thought reasoning, proposing Intervened Preference Optimization (IPO) to align reasoning safety by substituting compliance steps with safety triggers and using preference learning.

Details

Motivation: Existing methods overlook the safety of reasoning itself in LRMs, allowing harmful content to persist in CoT reasoning even when final responses appear safe, which undermines trustworthiness and poses risks if exploited by malicious users.

Method: Propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by: 1) substituting compliance steps with safety triggers, 2) constructing pairs for preference learning with strong signals based on insights about safe reasoning characteristics.

Result: IPO reduces harmfulness by over 30% relative to SFT-based and RL-based baselines on jailbreak and adversarial safety benchmarks, while maintaining excellent performance across diverse reasoning tasks.

Conclusion: The results highlight the importance of explicit alignment for reasoning safety and provide a practical path to safer Large Reasoning Models through process supervision and preference optimization.

Abstract: Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue still remains in existing methods which overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible for and exploited by malicious users. We therefore shift our focus to aligning the safety of reasoning itself in this paper and explore process supervision as the solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first delve into the characteristics of safe reasoning and uncover several critical insights that 1) safe reasoning is often consolidated by a few critical steps of safety triggers; 2) compliance cues strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories towards safer traces. Motivated by these, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO remarkably improves overall safety regarding both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30% in harmfulness, while preserving excellent performance across diverse reasoning tasks. The results highlight the importance of explicit alignment for reasoning and provide a practical path to safer LRMs.

[969] A Systematic Review of Digital Twin-Driven Predictive Maintenance in Industrial Engineering: Taxonomy, Architectural Elements, and Future Research Directions

Leila Ismail, Abdelmoneim Abdelmoti, Arkaprabha Basu, Aymen Dia Eddine Berini, Mohammad Naouss

Main category: cs.AI

TL;DR: This paper provides a retrospective analysis of digital twin technology evolution in predictive maintenance for industrial engineering, covering applications, middleware, and AI integration.

Details

Motivation: Traditional reactive and preventive maintenance practices are inadequate for complex industrial systems, creating a need for AI-enabled predictive maintenance using digital twin technology to avoid costly downtime and safety risks.

Method: Performed retrospective analysis of temporal evolution of digital twin in predictive maintenance, developed layered architecture and taxonomy of industrial engineering applications, middleware, and AI algorithms.

Result: Provided comprehensive insights into digital twin systems for trustworthy smart industrial engineering ecosystem, including applications, technological requirements, and AI integration patterns.

Conclusion: Digital twin technology enables efficient predictive maintenance in industrial engineering, with future research directions needed for further advancement of AI-enabled self-learning models and trustworthy ecosystem development.

Abstract: With the increasing complexity of industrial systems, there is a pressing need for predictive maintenance to avoid costly downtime and disastrous outcomes that could be life-threatening in certain domains. With the growing popularity of the Internet of Things, Artificial Intelligence, machine learning, and real-time big data analytics, there is a unique opportunity for efficient predictive maintenance to forecast equipment failures for real-time intervention and optimize maintenance actions, as traditional reactive and preventive maintenance practices are often inadequate to meet the requirements for the industry to provide quality-of-services of operations. Central to this evolution is digital twin technology, an adaptive virtual replica that continuously monitors and integrates sensor data to simulate and improve asset performance. Despite remarkable progress in digital twin implementations, such as considering DT in predictive maintenance for industrial engineering. This paper aims to address this void. We perform a retrospective analysis of the temporal evolution of the digital twin in predictive maintenance for industrial engineering to capture the applications, middleware, and technological requirements that led to the development of the digital twin from its inception to the AI-enabled digital twin and its self-learning models. We provide a layered architecture of the digital twin technology, as well as a taxonomy of the technology-enabled industrial engineering applications systems, middleware, and the used Artificial Intelligence algorithms. We provide insights into these systems for the realization of a trustworthy and efficient smart digital-twin industrial engineering ecosystem. We discuss future research directions in digital twin for predictive maintenance in industrial engineering.

[970] ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling

Haotian Zhang, Liu Liu, Baosheng Yu, Jiayan Qiu, Likang Xiao, Yanwei Ren, Quan Chen, Xianglong Liu

Main category: cs.AI

TL;DR: Process reward models (PRMs) improve LLM reasoning but struggle with generalization beyond math domains. This paper proposes shifting focus from domain knowledge to logical flow modeling, achieving better cross-domain performance.

Details

Motivation: Current PRMs show limited generalization to non-mathematical domains due to domain-specific training data and knowledge-based learning patterns.

Method: Shift learning objective to model domain-agnostic logical flow by focusing on contextual coherence between chain-of-thought steps, using novel data annotation and training framework.

Result: ContextPRM achieves 6.5% average accuracy improvement over majority voting baseline across nine non-mathematical domains in MMLU-Pro, significantly outperforming math-focused PRMs (2.2% for VersaPRM, 0.5% for others).

Conclusion: Focusing on contextual coherence in logical flow rather than domain knowledge enables PRMs to generalize effectively across both mathematical and non-mathematical domains.

Abstract: Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). However, while most PRMs exhibit substantial gains in mathematical domains, the scarcity of domain-specific training data and knowledge-based learning patterns limits their generalization ability when faced with other domains. To address this limitation, we shift the learning objective from verifying domain-specific knowledge to modeling domain-agnostic logical flow. Centering on contextual coherence between chain-of-thought (CoT) steps, our approach is realized through a novel data annotation and training framework, which enhances the model’s generalization capabilities across diverse domains. For instance, our resulting model, ContextPRM, achieves a notable 6.5% average accuracy improvement over the majority voting baseline via weighted majority voting across nine non-mathematical domains in MMLU-Pro, including law, history, and philosophy, significantly surpassing the 2.2% improvement from VersaPRM and 0.5% gains from other mathematics-focused PRMs, demonstrating consistent performance across both mathematical and non-mathematical domains.

[971] Overcoming Over-Fitting in Constraint Acquisition via Query-Driven Interactive Refinement

Vasileios Balafas, Dimos Tsouros, Nikolaos Ploskas, Kostas Stergiou

Main category: cs.AI

TL;DR: A hybrid constraint acquisition framework that combines passive learning with interactive refinement to prevent over-fitting and reduce query complexity in data-limited scenarios.

Details

Motivation: To address the limitations of passive CA methods (over-fitting with spurious constraints) and purely active methods (high query intensity) when learning from limited data.

Method: Integrates passive learning for initial candidate generation, query-driven interactive refinement using probabilistic confidence scores, specialized subset exploration to recover valid substructures, and final active learning for model completeness.

Result: Achieves high target model coverage and overall model accuracy from limited examples with manageable query complexity, as demonstrated in extensive experiments on diverse benchmarks.

Conclusion: The framework represents a substantial advancement towards robust and practical constraint acquisition in data-limited scenarios, with interactive refinement being crucial for performance.

Abstract: Manual modeling in Constraint Programming is a substantial bottleneck, which Constraint Acquisition (CA) aims to automate. However, passive CA methods are prone to over-fitting, often learning models that include spurious global constraints when trained on limited data, while purely active methods can be query-intensive. We introduce a hybrid CA framework specifically designed to address the challenge of over-fitting in CA. Our approach integrates passive learning for initial candidate generation, a query-driven interactive refinement phase that utilizes probabilistic confidence scores (initialized by machine learning priors) to systematically identify over-fitted constraints, and a specialized subset exploration mechanism to recover valid substructures from rejected candidates. A final active learning phase ensures model completeness. Extensive experiments on diverse benchmarks demonstrate that our interactive refinement phase is crucial for achieving high target model coverage and overall model accuracy from limited examples, doing so with manageable query complexity. This framework represents a substantial advancement towards robust and practical constraint acquisition in data-limited scenarios.

[972] Neuroplasticity-inspired dynamic ANNs for multi-task demand forecasting

Mateusz Żarski, Sławomir Nowaczyk

Main category: cs.AI

TL;DR: Introduces Neuroplastic Multi-Task Network (NMT-Net), a dynamic ANN approach for multi-task demand forecasting that enables structural adaptability during training through neuroplasticity-inspired mechanisms.

Details

Motivation: To address limitations of conventional methods by enabling structural adaptability of computational graphs during training, inspired by neuroplasticity in biological systems for better multi-task learning.

Method: Uses dynamic network adaptation where each new task triggers similarity-based task identification and selective training of candidate ANN heads, which are then assessed and integrated based on performance.

Result: Achieved superior performance on three real-world multi-task demand forecasting datasets with lower RMSE and standard deviation compared to traditional baselines and state-of-the-art methods.

Conclusion: NMT-Net provides a scalable, adaptable solution for multi-task and continual learning in time series prediction, demonstrating consistent performance improvements.

Abstract: This paper introduces a novel approach to Dynamic Artificial Neural Networks (D-ANNs) for multi-task demand forecasting called Neuroplastic Multi-Task Network (NMT-Net). Unlike conventional methods focusing on inference-time dynamics or computational efficiency, our proposed method enables structural adaptability of the computational graph during training, inspired by neuroplasticity as seen in biological systems. Each new task triggers a dynamic network adaptation, including similarity-based task identification and selective training of candidate ANN heads, which are then assessed and integrated into the model based on their performance. We evaluated our framework using three real-world multi-task demand forecasting datasets from Kaggle. We demonstrated its superior performance and consistency, achieving lower RMSE and standard deviation compared to traditional baselines and state-of-the-art multi-task learning methods. NMT-Net offers a scalable, adaptable solution for multi-task and continual learning in time series prediction. The complete code for NMT-Net is available from our GitHub repository.

[973] Experience-guided reflective co-evolution of prompts and heuristics for automatic algorithm design

Yihong Liu, Junyi Li, Wayne Xin Zhao, Hongyu Lu, Ji-Rong Wen

Main category: cs.AI

TL;DR: EvoPH is a novel framework that co-evolves prompts and heuristic algorithms using LLMs, integrating island migration and elite selection to avoid local optima in automatic algorithm design for combinatorial optimization problems.

Details

Motivation: Traditional heuristic algorithms require extensive domain expertise and implementation effort. While LLM-powered automatic heuristics design shows promise, existing methods often stagnate in local optima.

Method: EvoPH integrates island migration model with elite selection algorithm to simulate diverse heuristics populations. It co-evolves prompts and heuristic algorithms guided by performance feedback.

Result: EvoPH achieves the lowest relative error against optimal solutions on Traveling Salesman Problem and Bin Packing Problem datasets.

Conclusion: EvoPH advances the field of automatic algorithm design with LLMs by effectively avoiding local optima and generating high-quality heuristics.

Abstract: Combinatorial optimization problems are traditionally tackled with handcrafted heuristic algorithms, which demand extensive domain expertise and significant implementation effort. Recent progress has highlighted the potential of automatic heuristics design powered by large language models (LLMs), enabling the automatic generation and refinement of heuristics. These approaches typically maintain a population of heuristics and employ LLMs as mutation operators to evolve them across generations. While effective, such methods often risk stagnating in local optima. To address this issue, we propose the Experience-Guided Reflective Co-Evolution of Prompt and Heuristics (EvoPH) for automatic algorithm design, a novel framework that integrates the island migration model with the elites selection algorithm to simulate diverse heuristics populations. In EvoPH, prompts are co-evolved with heuristic algorithms, guided by performance feedback. We evaluate our framework on two problems, i.e., Traveling Salesman Problem and Bin Packing Problem. Experimental results demonstrate that EvoPH achieves the lowest relative error against optimal solutions across both datasets, advancing the field of automatic algorithm design with LLMs.

[974] Training Agents Inside of Scalable World Models

Danijar Hafner, Wilson Yan, Timothy Lillicrap

Main category: cs.AI

TL;DR: Dreamer 4 is a scalable agent that learns to solve control tasks through reinforcement learning in an accurate world model, achieving the first diamond collection in Minecraft from purely offline data without environment interaction.

Details

Motivation: Previous world models struggled with accurate object interaction predictions in complex environments. The work aims to develop intelligent agents that can learn from imagination rather than direct environment interaction, which is important for practical applications like robotics where real interaction can be unsafe and slow.

Method: Dreamer 4 uses a fast and accurate world model with a shortcut forcing objective and efficient transformer architecture for real-time inference. It learns general action conditioning from small amounts of data and extracts most knowledge from diverse unlabeled videos. Behaviors are learned through reinforcement learning inside the world model.

Result: The world model accurately predicts object interactions and game mechanics in Minecraft, outperforming previous world models by a large margin. Dreamer 4 successfully obtains diamonds in Minecraft from only offline data, requiring sequences of over 20,000 mouse and keyboard actions from raw pixels.

Conclusion: This work provides a scalable recipe for imagination training and marks a step towards intelligent agents that can learn complex behaviors without direct environment interaction, demonstrating the potential for practical applications where real-world interaction is constrained.

Abstract: World models learn general knowledge from videos and simulate experience for training behaviors in imagination, offering a path towards intelligent agents. However, previous world models have been unable to accurately predict object interactions in complex environments. We introduce Dreamer 4, a scalable agent that learns to solve control tasks by reinforcement learning inside of a fast and accurate world model. In the complex video game Minecraft, the world model accurately predicts object interactions and game mechanics, outperforming previous world models by a large margin. The world model achieves real-time interactive inference on a single GPU through a shortcut forcing objective and an efficient transformer architecture. Moreover, the world model learns general action conditioning from only a small amount of data, allowing it to extract the majority of its knowledge from diverse unlabeled videos. We propose the challenge of obtaining diamonds in Minecraft from only offline data, aligning with practical applications such as robotics where learning from environment interaction can be unsafe and slow. This task requires choosing sequences of over 20,000 mouse and keyboard actions from raw pixels. By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction. Our work provides a scalable recipe for imagination training, marking a step towards intelligent agents.

[975] BPMN Assistant: An LLM-Based Approach to Business Process Modeling

Josip Tomo Licardo, Nikola Tankovic, Darko Etinger

Main category: cs.AI

TL;DR: BPMN Assistant uses LLMs for natural language-based creation and editing of BPMN diagrams, introducing a specialized JSON representation that outperforms XML in reliability, speed, and editing success.

Details

Motivation: To leverage Large Language Models for intuitive natural language-based creation and editing of BPMN diagrams, addressing the limitations of direct XML handling.

Method: Introduces a specialized JSON-based representation as a structured alternative to XML for BPMN diagrams. Uses Graph Edit Distance (GED) and Relative Graph Edit Distance (RGED) for generation quality evaluation, and binary success metric for editing performance.

Result: JSON and XML achieve similar similarity scores in generation, but JSON offers greater reliability, faster processing, and significantly higher editing success rates.

Conclusion: The JSON-based approach provides substantial advantages over XML for BPMN diagram manipulation using LLMs, though trade-offs and limitations exist that warrant future improvements.

Abstract: This paper presents BPMN Assistant, a tool that leverages Large Language Models (LLMs) for natural language-based creation and editing of BPMN diagrams. A specialized JSON-based representation is introduced as a structured alternative to the direct handling of XML to enhance the accuracy of process modifications. Process generation quality is evaluated using Graph Edit Distance (GED) and Relative Graph Edit Distance (RGED), while editing performance is evaluated with a binary success metric. Results show that JSON and XML achieve similar similarity scores in generation, but JSON offers greater reliability, faster processing, and significantly higher editing success rates. We discuss key trade-offs, limitations, and future improvements. The implementation is available at https://github.com/jtlicardo/bpmn-assistant.

[976] LTL$_f$ Learning Meets Boolean Set Cover

Gabriel Bathie, Nathanaël Fijalkow, Théo Matricon, Baptiste Mouillon, Pierre Vandenhove

Main category: cs.AI

TL;DR: Bolt is a new CPU tool that learns Linear Temporal Logic (LTLf) formulas from finite traces 100x faster than state-of-the-art methods while producing smaller or equal formulas in most cases.

Details

Motivation: Learning LTLf formulas from finite traces is fundamental for applications in AI, software engineering, formal methods, cyber-physical systems, and robotics, but existing methods are inefficient.

Method: Leverages Boolean Set Cover as a subroutine to combine existing formulas using Boolean connectives, offering a novel trade-off between efficiency and formula size.

Result: Achieves 100x faster learning speed over 70% of benchmarks and produces smaller or equal formulas in 98% of cases compared to state-of-the-art methods.

Conclusion: The Boolean Set Cover approach enables significant performance improvements in LTLf formula learning while maintaining or reducing formula size.

Abstract: Learning formulas in Linear Temporal Logic (LTLf) from finite traces is a fundamental research problem which has found applications in artificial intelligence, software engineering, programming languages, formal methods, control of cyber-physical systems, and robotics. We implement a new CPU tool called Bolt improving over the state of the art by learning formulas more than 100x faster over 70% of the benchmarks, with smaller or equal formulas in 98% of the cases. Our key insight is to leverage a problem called Boolean Set Cover as a subroutine to combine existing formulas using Boolean connectives. Thanks to the Boolean Set Cover component, our approach offers a novel trade-off between efficiency and formula size.

[977] “Stop replacing salt with sugar!’’: Towards Intuitive Human-Agent Teaching

Nikolaos Kondylidis, Andrea Rafanelli, Ilaria Tiddi, Annette ten Teije, Frank van Harmelen

Main category: cs.AI

TL;DR: The paper proposes a human-agent teaching architecture for few-shot learning of subjective tasks, specifically ingredient substitution in recipes, using demonstrations and domain knowledge to achieve efficient learning.

Details

Motivation: To replicate human ability to learn quickly from few examples, especially for subjective tasks with scarce data, by creating an intuitive teaching interaction where humans can teach agents through demonstrations.

Method: Human-agent teaching architecture with incremental few-shot learning, leveraging domain knowledge to broaden task understanding, and optimizing example selection for representative and non-redundant teaching using Recipe1MSubs dataset.

Result: The agent achieves half its task performance with only 100 examples compared to full training set of 50k examples, showing strategic example ordering and external symbolic knowledge enable efficient generalization.

Conclusion: Strategic example selection combined with learning methods that leverage external knowledge allows agents to learn subjective tasks efficiently from few demonstrations, enabling intuitive human-agent teaching interactions.

Abstract: Humans quickly learn new concepts from a small number of examples. Replicating this capacity with Artificial Intelligence (AI) systems has proven to be challenging. When it comes to learning subjective tasks-where there is an evident scarcity of data-this capacity needs to be recreated. In this work, we propose an intuitive human-agent teaching architecture in which the human can teach an agent how to perform a task by providing demonstrations, i.e., examples. To have an intuitive interaction, we argue that the agent should be able to learn incrementally from a few single examples. To allow for this, our objective is to broaden the agent’s task understanding using domain knowledge. Then, using a learning method to enable the agent to learn efficiently from a limited number of examples. Finally, to optimize how human can select the most representative and less redundant examples to provide the agent with. We apply our proposed method to the subjective task of ingredient substitution, where the agent needs to learn how to substitute ingredients in recipes based on human examples. We replicate human input using the Recipe1MSubs dataset. In our experiments, the agent achieves half its task performance after only 100 examples are provided, compared to the complete training set of 50k examples. We show that by providing examples in strategic order along with a learning method that leverages external symbolic knowledge, the agent can generalize more efficiently.

[978] Successful Misunderstandings: Learning to Coordinate Without Being Understood

Nikolaos Kondylidis, Anil Yaman, Frank van Harmelen, Erman Acar, Annette ten Teije

Main category: cs.AI

TL;DR: The paper investigates whether successful coordination through communication necessarily implies mutual understanding, finding that agents can develop ‘successful misunderstandings’ where they coordinate optimally but with misaligned signal interpretations.

Details

Motivation: To challenge the assumption that successful coordination through communication implies mutual understanding, particularly in populations with different perceptual systems like human-AI hybrid groups.

Method: Uses a signaling game where agents develop new vocabulary without common observations of signal usage, only observing communication signals and interaction outcomes.

Result: Populations converge to optimal coordination but sometimes develop ‘successful misunderstandings’ where agents use different interpretations of signals. Shared interpretations only emerge with at least three agents interacting with each other.

Conclusion: Successful coordination doesn’t guarantee shared understanding. Minimum conditions for shared interpretations are at least three agents all interacting, which enables emergent compensation for lack of shared signal usage observations.

Abstract: The main approach to evaluating communication is by assessing how well it facilitates coordination. If two or more individuals can coordinate through communication, it is generally assumed that they understand one another. We investigate this assumption in a signaling game where individuals develop a new vocabulary of signals to coordinate successfully. In our game, the individuals do not have common observations besides the communication signal and outcome of the interaction, i.e. received reward. This setting is used as a proxy to study communication emergence in populations of agents that perceive their environment very differently, e.g. hybrid populations that include humans and artificial agents. Agents develop signals, use them, and refine interpretations while not observing how other agents are using them. While populations always converge to optimal levels of coordination, in some cases, interacting agents interpret and use signals differently, converging to what we call successful misunderstandings. However, agents of population that coordinate using misaligned interpretations, are unable to establish successful coordination with new interaction partners. Not leading to coordination failure immediately, successful misunderstandings are difficult to spot and repair. Having at least three agents that all interact with each other are the two minimum conditions to ensure the emergence of shared interpretations. Under these conditions, the agent population exhibits this emergent property of compensating for the lack of shared observations of signal use, ensuring the emergence of shared interpretations.

[979] On the Self-awareness of Large Reasoning Models’ Capability Boundaries

Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, Han Qiu

Main category: cs.AI

TL;DR: Large Reasoning Models can detect their capability boundaries through reasoning confidence patterns and hidden states, enabling optimization strategies that avoid unproductive reasoning while maintaining accuracy.

Details

Motivation: Current LRMs waste computation on unsolvable problems by reasoning unproductively until context limits, highlighting the need for self-awareness of capability boundaries.

Method: Two monitoring approaches: reasoning expression monitoring (tracking confidence trajectories) for black-box models, and hidden states monitoring (analyzing last token embeddings) for white-box models.

Result: Boundary-aware strategies cut token usage by 62.7-93.6% while maintaining accuracy, significantly improving reliability and efficiency.

Conclusion: LRMs possess inherent capability boundary awareness that can be leveraged to optimize reasoning efficiency without sacrificing performance.

Abstract: Large Reasoning Models (LRMs) have shown impressive performance on complex reasoning tasks such as mathematics, yet they also display misbehaviors that expose their limitations. In particular, when faced with hard questions, LRMs often engage in unproductive reasoning until context limit, producing wrong answers while wasting substantial computation. This phenomenon reflects a fundamental issue: current answering paradigms overlook the relationship between questions and LRMs’ capability boundaries. In this paper, we investigate whether LRMs possess self-awareness of capability boundaries. We begin by an observation that LRMs may know what they cannot solve through expressed reasoning confidence. For black-box models, we find that reasoning expressions reveal boundary signals, with accelerated growing confidence trajectory for solvable problems but convergent uncertainty trajectory for unsolvable ones. For white-box models, we show that hidden states of the last input token encode boundary information, with solvable and unsolvable problems linearly separable even before reasoning begins. Building on these findings, we propose two simple yet effective optimization strategies: reasoning expression monitoring and hidden states monitoring. Experiments demonstrate that these boundary-aware strategies enable LRMs to avoid unproductive reasoning without sacrificing accuracy, significantly improving reliability and efficiency by cutting token usage up to 62.7 - 93.6%.

[980] Spatial-Functional awareness Transformer-based graph archetype contrastive learning for Decoding Visual Neural Representations from EEG

Yueming Sun, Long Yang

Main category: cs.AI

TL;DR: Proposes SFTG framework with EEG Graph Transformer and Graph Archetype Contrastive Learning to improve EEG-based visual decoding by encoding spatial-temporal brain dynamics and reducing intra-subject variability.

Details

Motivation: EEG signals for visual decoding are challenging due to high-dimensional, noisy, and non-Euclidean nature, requiring better methods to handle spatial-temporal dynamics and subject variability.

Method: Uses EEG Graph Transformer (EGT) to encode spatial brain connectivity and temporal neural dynamics, plus Graph Archetype Contrastive Learning (GAC) to learn subject-specific EEG graph archetypes for improved feature consistency.

Result: Significantly outperforms prior state-of-the-art EEG decoding methods in both subject-dependent and subject-independent evaluations on Things-EEG dataset.

Conclusion: Integration of graph-based learning with contrastive objectives shows transformative potential for enhancing EEG-based brain decoding, enabling more generalizable and robust neural representations.

Abstract: Decoding visual neural representations from Electroencephalography (EEG) signals remains a formidable challenge due to their high-dimensional, noisy, and non-Euclidean nature. In this work, we propose a Spatial-Functional Awareness Transformer-based Graph Archetype Contrastive Learning (SFTG) framework to enhance EEG-based visual decoding. Specifically, we introduce the EEG Graph Transformer (EGT), a novel graph-based neural architecture that simultaneously encodes spatial brain connectivity and temporal neural dynamics. To mitigate high intra-subject variability, we propose Graph Archetype Contrastive Learning (GAC), which learns subject-specific EEG graph archetypes to improve feature consistency and class separability. Furthermore, we conduct comprehensive subject-dependent and subject-independent evaluations on the Things-EEG dataset, demonstrating that our approach significantly outperforms prior state-of-the-art EEG decoding methods.The results underscore the transformative potential of integrating graph-based learning with contrastive objectives to enhance EEG-based brain decoding, paving the way for more generalizable and robust neural representations.

[981] From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning

Yunyao Zhang, Xinglang Zhang, Junxi Sheng, Wenbing Li, Junqing Yu, Wei Yang, Zikai Song

Main category: cs.AI

TL;DR: LogicAgent is a semiotic-square-guided framework that addresses both logical and semantic complexity in LLM reasoning, using multi-perspective FOL deduction with existential import checks and three-valued decision scheme. It achieves SOTA on the new RepublicQA benchmark and generalizes well to other logical reasoning benchmarks.

Details

Motivation: Existing methods overlook the interplay between logical and semantic complexity, struggling with abstract propositions, ambiguous contexts, and conflicting stances that are central to human reasoning.

Method: LogicAgent performs multi-perspective deduction in first-order logic with existential import checks using a three-valued decision scheme (True, False, Uncertain) to handle boundary cases more faithfully.

Result: LogicAgent achieves 6.25% average gain over strong baselines on RepublicQA and 7.05% average gain on mainstream benchmarks including ProntoQA, ProofWriter, FOLIO, and ProverQA.

Conclusion: The semiotic-grounded multi-perspective reasoning approach effectively boosts LLMs’ logical performance, demonstrating strong generalization across diverse logical reasoning tasks.

Abstract: Logical reasoning is a fundamental capability of large language models (LLMs). However, existing studies largely overlook the interplay between logical complexity and semantic complexity, resulting in methods that struggle to address challenging scenarios involving abstract propositions, ambiguous contexts, and conflicting stances, which are central to human reasoning. For this gap, we propose LogicAgent, a semiotic-square-guided framework designed to jointly address logical complexity and semantic complexity. LogicAgent explicitly performs multi-perspective deduction in first-order logic (FOL), while mitigating vacuous reasoning through existential import checks that incorporate a three-valued decision scheme (True, False, Uncertain) to handle boundary cases more faithfully. Furthermore, to overcome the semantic simplicity and low logical complexity of existing datasets, we introduce RepublicQA, a benchmark that reaches college-level difficulty (FKGL = 11.94) and exhibits substantially greater lexical and structural diversity than prior benchmarks. RepublicQA is grounded in philosophical concepts, featuring abstract propositions and systematically organized contrary and contradictory relations, making it the most semantically rich resource for evaluating logical reasoning. Experiments demonstrate that LogicAgent achieves state-of-the-art performance on RepublicQA, with a 6.25% average gain over strong baselines, and generalizes effectively to mainstream logical reasoning benchmarks including ProntoQA, ProofWriter, FOLIO, and ProverQA, achieving an additional 7.05% average gain. These results highlight the strong effectiveness of our semiotic-grounded multi-perspective reasoning in boosting LLMs’ logical performance.

[982] TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models

Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang, Qingsong Wen, Zuozhu Liu, Sabato Marco Siniscalchi, Ming Jin, Shirui Pan

Main category: cs.AI

TL;DR: TSR-Suite introduces four atomic tasks for time series reasoning across perception, extrapolation, and decision-making capabilities, with TimeOmni-1 model achieving superior performance over GPT-4.1.

Details

Motivation: Existing multimodal time series datasets lack genuine reasoning tasks and high-quality data, limiting progress in practical time series reasoning models.

Method: Created TSR-Suite with 23K+ samples (2.3K human-curated) and developed TimeOmni-1 model using multi-stage training with task scenarios, reward functions, and optimizations.

Result: TimeOmni-1 achieves strong out-of-distribution generalization, 64.0% causality discovery accuracy (vs 35.9% GPT-4.1), and 6% higher valid response rate in event-aware forecasting.

Conclusion: TSR-Suite enables comprehensive time series reasoning evaluation and training, with TimeOmni-1 demonstrating superior reasoning capabilities over existing models.

Abstract: Recent advances in multimodal time series learning underscore a paradigm shift from analytics centered on basic patterns toward advanced time series understanding and reasoning. However, existing multimodal time series datasets mostly remain at the level of surface alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that genuinely require time series reasoning, along with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To this end, we introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series: (1) perception, acquired through scenario understanding and causality discovery; (2) extrapolation, realized via event-aware forecasting; and (3) decision-making, developed through deliberation over perception and extrapolation. TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that TimeOmni-1 delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% with GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task.

[983] Query Circuits: Explaining How Language Models Answer User Prompts

Tung-Yu Wu, Fazl Barez

Main category: cs.AI

TL;DR: Query circuits are introduced to trace information flow for specific inputs, providing faithful explanations of why models produce particular outputs, with sparse circuits recovering significant performance.

Details

Motivation: Existing methods explain global model capabilities but not why specific inputs produce particular outputs, lacking local, input-level explanations.

Method: Develop query circuits within models using Normalized Deviation Faithfulness metric and sampling-based methods to identify sparse circuits that trace specific input-output mappings.

Result: Extremely sparse circuits (e.g., 1.3% of connections) can recover substantial performance (e.g., 60% on MMLU questions) for individual queries.

Conclusion: Query circuits provide faithful, scalable explanations of how language models process individual inputs, advancing beyond global capability explanations.

Abstract: Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside a model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, resulting in more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric to evaluate how well a discovered circuit recovers the model’s decision for a specific input, and is broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods to efficiently identify circuits that are sparse yet faithfully describe the model’s behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that there exist extremely sparse query circuits within the model that can recover much of its performance on single queries. For example, a circuit covering only 1.3% of model connections can recover about 60% of performance on an MMLU questions. Overall, query circuits provide a step towards faithful, scalable explanations of how language models process individual inputs.

[984] Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity

Zhen Bi, Zhenlin Hu, Jinnan Yang, Mingyang Chen, Cheng Deng, Yida Xue, Zeyu Yang, Qing Shen, Zhenfang Liu, Kang Zhao, Ningyu Zhang, Jungang Lou

Main category: cs.AI

TL;DR: The paper introduces Data Reasoning Intensity (DRI) to measure logical reasoning complexity in training data and proposes a re-cognizing optimization strategy to enhance LLM reasoning performance by aligning data complexity with model capacity.

Details

Motivation: Current approaches focus on data format transformation but neglect internal reasoning complexity, leaving reasoning potential underutilized. LLM reasoning performance is constrained by both data potential and model cognitive capacity.

Method: Introduces Data Reasoning Intensity (DRI) metric to quantify logical reasoning complexity by decomposing logical structures. Proposes re-cognizing optimization strategy that systematically enhances logical reasoning intensity of training data to align with LLM’s reasoning boundary.

Result: Extensive experiments show significant improvements in performance and generalization over data-centric strategies. Method validated under reinforcement learning framework.

Conclusion: Prioritizing reasoning complexity in data rather than scale or format is essential to realizing LLMs’ full cognitive potential.

Abstract: Recent advances in large language models (LLMs) highlight the importance of training data structure and quality in shaping reasoning behavior. However, most existing approaches focus on transforming data formats while neglecting the internal reasoning complexity of training samples, leaving the reasoning potential of data under-explored and underutilized. In this work, we posit that LLM logical reasoning performance is jointly constrained by the potential of the training data and the cognitive capacity of the model. To make this relationship measurable, we introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples by decomposing and aggregating their logical structures. This allows us to analyze how well current LLMs utilize logical reasoning signals and identify performance gaps relative to data potential. Based on this insight, we introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data.Rather than increasing data volume, our method re-optimizes existing samples to better align with the LLM’s logical reasoning boundary. Extensive experiments show that our approach significantly improves performance and generalization over data-centric strategies. We further validate our method under a reinforcement learning framework. Our results indicate that prioritizing reasoning complexity in data rather than sheer scale or superficial form is essential to realizing LLMs’ full cognitive potential.

[985] PhysicsMinions: Winning Gold Medals in the Latest Physics Olympiads with a Coevolutionary Multimodal Multi-Agent System

Fangchen Yu, Junchi Yao, Ziyi Wang, Haiyuan Wan, Youling Huang, Bo Zhang, Shuyue Hu, Dongzhan Zhou, Ning Ding, Ganqu Cui, Lei Bai, Wanli Ouyang, Peng Ye

Main category: cs.AI

TL;DR: PhysicsMinions is a coevolutionary multi-agent system that achieves state-of-the-art performance on Physics Olympiad problems through synergistic collaboration between visual, logic, and review studios with iterative refinement.

Details

Motivation: Physics Olympiads represent the pinnacle of physics problem-solving but remain underexplored in AI research, with existing single-model approaches failing to reach gold-medal-level performance.

Method: A three-studio architecture: Visual Studio for diagram interpretation, Logic Studio for solution formulation, and Review Studio for dual-stage verification, with coevolution through iterative refinement loops where feedback guides self-correction.

Result: Achieves historic breakthroughs: elevates open-source models from 1-2 to 6 gold medals across 7 Olympiads, first-ever open-source gold medal in IPhO, and achieves 26.8/30 points ranking 4th of 406 contestants, far surpassing single-model performance.

Conclusion: PhysicsMinions provides a generalizable framework for Olympiad-level problem solving that can potentially extend across disciplines, demonstrating the power of multi-agent coevolutionary systems for complex reasoning tasks.

Abstract: Physics is central to understanding and shaping the real world, and the ability to solve physics problems is a key indicator of real-world physical intelligence. Physics Olympiads, renowned as the crown of competitive physics, provide a rigorous testbed requiring complex reasoning and deep multimodal understanding, yet they remain largely underexplored in AI research. Existing approaches are predominantly single-model based, and open-source MLLMs rarely reach gold-medal-level performance. To address this gap, we propose PhysicsMinions, a coevolutionary multi-agent system for Physics Olympiad. Its architecture features three synergistic studios: a Visual Studio to interpret diagrams, a Logic Studio to formulate solutions, and a Review Studio to perform dual-stage verification. The system coevolves through an iterative refinement loop where feedback from the Review Studio continuously guides the Logic Studio, enabling the system to self-correct and converge towards the ground truth. Evaluated on the HiPhO benchmark spanning 7 latest physics Olympiads, PhysicsMinions delivers three major breakthroughs: (i) Strong generalization: it consistently improves both open-source and closed-source models of different sizes, delivering clear benefits over their single-model baselines; (ii) Historic breakthroughs: it elevates open-source models from only 1-2 to 6 gold medals across 7 Olympiads, achieving the first-ever open-source gold medal in the latest International Physics Olympiad (IPhO) under the average-score metric; and (iii) Scaling to human expert: it further advances the open-source Pass@32 score to 26.8/30 points on the latest IPhO, ranking 4th of 406 contestants and far surpassing the top single-model score of 22.7 (ranked 22nd). Generally, PhysicsMinions offers a generalizable framework for Olympiad-level problem solving, with the potential to extend across disciplines.

Xiao Jia, Zhanzhan Zhao

Main category: cs.AI

TL;DR: A systematic review of 270 studies using computational methods to create a taxonomy for the social science of large language models, identifying three main domains: LLM as Social Minds, LLM Societies, and LLM-Human Interactions.

Details

Motivation: To address the fragmented nature of research on large language models in social science by creating a systematic taxonomy that organizes the field and clarifies evidentiary standards across different levels of analysis.

Method: Conducted systematic review of 270 studies using text embeddings, unsupervised clustering, and topic modeling to build a computational taxonomy.

Result: Identified three organic domains: 1) LLM as Social Minds (cognition, morality, bias attributions), 2) LLM Societies (multi-agent coordination and institutions), and 3) LLM-Human Interactions (task transformation, trust, governance).

Conclusion: The taxonomy provides a reproducible map of the fragmented field, clarifies evidentiary standards, and highlights opportunities for cumulative progress in the social science of artificial intelligence.

Abstract: The social science of large language models (LLMs) examines how these systems evoke mind attributions, interact with one another, and transform human activity and institutions. We conducted a systematic review of 270 studies, combining text embeddings, unsupervised clustering and topic modeling to build a computational taxonomy. Three domains emerge organically across the reviewed literature. LLM as Social Minds examines whether and when models display behaviors that elicit attributions of cognition, morality and bias, while addressing challenges such as test leakage and surface cues. LLM Societies examines multi-agent settings where interaction protocols, architectures and mechanism design shape coordination, norms, institutions and collective epistemic processes. LLM-Human Interactions examines how LLMs reshape tasks, learning, trust, work and governance, and how risks arise at the human-AI interface. This taxonomy provides a reproducible map of a fragmented field, clarifies evidentiary standards across levels of analysis, and highlights opportunities for cumulative progress in the social science of artificial intelligence.

[987] RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu

Main category: cs.AI

TL;DR: RealUnify is a benchmark designed to evaluate bidirectional capability synergy between visual understanding and generation in unified multimodal models, revealing that current models struggle with effective integration despite architectural unification.

Details

Motivation: Existing benchmarks only assess understanding and generation in isolation, failing to determine whether unified models can leverage understanding to enhance generation or use generative simulation to facilitate deeper comprehension.

Method: RealUnify comprises 1,000 human-annotated instances across 10 categories and 32 subtasks, structured around two axes: Understanding Enhances Generation (reasoning-guided image generation) and Generation Enhances Understanding (mental simulation for reasoning tasks), with dual-evaluation protocol combining end-to-end and stepwise assessment.

Result: Large-scale evaluations of 12 leading unified models and 6 specialized baselines show that current unified models still struggle to achieve effective synergy between understanding and generation capabilities.

Conclusion: Architectural unification alone is insufficient for achieving capability synergy, highlighting the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.

Abstract: The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.

[988] Neural network embeddings recover value dimensions from psychometric survey items on par with human data

Max Pellert, Clemens M. Lechner, Indira Sen, Markus Strohmaier

Main category: cs.AI

TL;DR: SQuID enables neural network embeddings to recover latent dimensions from psychometric survey items, achieving 55% variance explanation in dimension similarities compared to human data without domain-specific fine-tuning.

Details

Motivation: To develop a scalable, cost-effective alternative to traditional human surveys for psychometric measurement that can replicate established psychometric structures while offering flexibility and scalability.

Method: Uses SQuID (Survey and Questionnaire Item Embeddings Differentials) approach with large language model embeddings to recover latent dimensions from the Revised Portrait Value Questionnaire (PVQ-RR), comparing multiple embedding models across various evaluation metrics.

Result: Embeddings explain 55% of variance in dimension-dimension similarities compared to human data, show fair factor congruence coefficients, and successfully obtain negative correlations between dimensions without domain-specific fine-tuning.

Conclusion: Semantic embeddings can effectively replicate psychometric structures established through human surveys, offering substantial advantages in cost, scalability and flexibility while maintaining comparable quality to traditional methods.

Abstract: This study introduces “Survey and Questionnaire Item Embeddings Differentials” (SQuID), a novel methodological approach that enables neural network embeddings to effectively recover latent dimensions from psychometric survey items. We demonstrate that embeddings derived from large language models, when processed with SQuID, can recover the structure of human values obtained from human rater judgments on the Revised Portrait Value Questionnaire (PVQ-RR). Our experimental validation compares multiple embedding models across a number of evaluation metrics. Unlike previous approaches, SQuID successfully addresses the challenge of obtaining negative correlations between dimensions without requiring domain-specific fine-tuning. Quantitative analysis reveals that our embedding-based approach explains 55% of variance in dimension-dimension similarities compared to human data. Multidimensional scaling configurations from both types of data show fair factor congruence coefficients and largely follow the underlying theory. These results demonstrate that semantic embeddings can effectively replicate psychometric structures previously established through extensive human surveys. The approach offers substantial advantages in cost, scalability and flexibility while maintaining comparable quality to traditional methods. Our findings have significant implications for psychometrics and social science research, providing a complementary methodology that could expand the scope of human behavior and experience represented in measurement tools.

[989] Meta-Learning Theory-Informed Inductive Biases using Deep Kernel Gaussian Processes

Bahti Zakirov, Gašper Tkačik

Main category: cs.AI

TL;DR: A Bayesian meta-learning framework that converts normative theories into probabilistic models using adaptive deep kernel Gaussian processes, improving data fitting and theory validation.

Details

Motivation: To bridge the gap between normative theories and data-driven approaches by automatically converting theoretical predictions into tractable probabilistic models for better biological system analysis.

Method: Meta-learning a kernel on synthetic data generated from normative theories using adaptive deep kernel Gaussian processes, creating Theory-Informed Kernels.

Result: Improved response prediction accuracy in mouse retinal ganglion cells with well-calibrated uncertainty estimates and interpretable representations; accurate inference of theory-match from data.

Conclusion: Provides a scalable, automated approach for integrating theoretical knowledge into data-driven scientific inquiry, enabling rigorous theory validation and improved data fitting.

Abstract: Normative and task-driven theories offer powerful top-down explanations for biological systems, yet the goals of quantitatively arbitrating between competing theories, and utilizing them as inductive biases to improve data-driven fits of real biological datasets are prohibitively laborious, and often impossible. To this end, we introduce a Bayesian meta-learning framework designed to automatically convert raw functional predictions from normative theories into tractable probabilistic models. We employ adaptive deep kernel Gaussian processes, meta-learning a kernel on synthetic data generated from a normative theory. This Theory-Informed Kernel specifies a probabilistic model representing the theory predictions – usable for both fitting data and rigorously validating the theory. As a demonstration, we apply our framework to the early visual system, using efficient coding as our normative theory. We show improved response prediction accuracy in ex vivo recordings of mouse retinal ganglion cells stimulated by natural scenes compared to conventional data-driven baselines, while providing well-calibrated uncertainty estimates and interpretable representations. Using exact Bayesian model selection, we also show that our informed kernel can accurately infer the degree of theory-match from data, confirming faithful encapsulation of theory structure. This work provides a more general, scalable, and automated approach for integrating theoretical knowledge into data-driven scientific inquiry in neuroscience and beyond.

[990] MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning

Huihao Jing, Wenbin Hu, Hongyu Luo, Jianhui Yang, Wei Fan, Haoran Li, Yangqiu Song

Main category: cs.AI

TL;DR: MASLegalBench is a legal benchmark designed for multi-agent systems using GDPR scenarios, addressing the gap in MAS-specific legal evaluation methods.

Details

Motivation: Current legal benchmarks don't leverage MAS advantages like task decomposition and agent specialization, limiting MAS potential in legal domain.

Method: Proposed MASLegalBench with GDPR scenarios, manually designed role-based MAS, and extensive experiments with state-of-the-art LLMs.

Result: Results reveal strengths, limitations, and improvement areas of existing models and MAS architectures in legal reasoning.

Conclusion: MASLegalBench enables better evaluation of multi-agent systems in legal tasks and highlights opportunities for MAS advancement in legal domain.

Abstract: Multi-agent systems (MAS), leveraging the remarkable capabilities of Large Language Models (LLMs), show great potential in addressing complex tasks. In this context, integrating MAS with legal tasks is a crucial step. While previous studies have developed legal benchmarks for LLM agents, none are specifically designed to consider the unique advantages of MAS, such as task decomposition, agent specialization, and flexible training. In fact, the lack of evaluation methods limits the potential of MAS in the legal domain. To address this gap, we propose MASLegalBench, a legal benchmark tailored for MAS and designed with a deductive reasoning approach. Our benchmark uses GDPR as the application scenario, encompassing extensive background knowledge and covering complex reasoning processes that effectively reflect the intricacies of real-world legal situations. Furthermore, we manually design various role-based MAS and conduct extensive experiments using different state-of-the-art LLMs. Our results highlight the strengths, limitations, and potential areas for improvement of existing models and MAS architectures.

[991] When Autonomous Vehicle Meets V2X Cooperative Perception: How Far Are We?

An Guo, Shuoxiao Zhang, Enyi Tang, Xinyu Gao, Haomin Pang, Haoxiang Tian, Yanzhou Mu, Wu Wen, Chunrong Fang, Zhenyu Chen

Main category: cs.AI

TL;DR: This paper conducts an empirical study on V2X cooperative perception systems, identifying six prevalent error patterns and evaluating critical system components through large-scale testing.

Details

Motivation: V2X cooperative perception systems have complex compositions with diverse sensors, fusion schemes, and communication conditions, leading to operational challenges. The types and causes of errors in these systems remain insufficiently explored.

Method: The researchers conducted a systematic empirical study by identifying six prevalent error patterns in cooperative perception systems and evaluating critical components through large-scale testing of different configurations.

Result: Key findings include: (1) LiDAR-based cooperation has highest perception performance; (2) V2I and V2V communication show distinct performance under different fusion schemes; (3) Increased errors lead to more driving violations; (4) Systems are not robust against communication interference in online operation.

Conclusion: The study reveals potential risks and vulnerabilities in cooperative perception systems, providing insights to promote better design and repair of these critical systems.

Abstract: With the tremendous advancement of deep learning and communication technology, Vehicle-to-Everything (V2X) cooperative perception has the potential to address limitations in sensing distant objects and occlusion for a single-agent perception system. V2X cooperative perception systems are software systems characterized by diverse sensor types and cooperative agents, varying fusion schemes, and operation under different communication conditions. Therefore, their complex composition gives rise to numerous operational challenges. Furthermore, when cooperative perception systems produce erroneous predictions, the types of errors and their underlying causes remain insufficiently explored. To bridge this gap, we take an initial step by conducting an empirical study of V2X cooperative perception. To systematically evaluate the impact of cooperative perception on the ego vehicle’s perception performance, we identify and analyze six prevalent error patterns in cooperative perception systems. We further conduct a systematic evaluation of the critical components of these systems through our large-scale study and identify the following key findings: (1) The LiDAR-based cooperation configuration exhibits the highest perception performance; (2) Vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communication exhibit distinct cooperative perception performance under different fusion schemes; (3) Increased cooperative perception errors may result in a higher frequency of driving violations; (4) Cooperative perception systems are not robust against communication interference when running online. Our results reveal potential risks and vulnerabilities in critical components of cooperative perception systems. We hope that our findings can better promote the design and repair of cooperative perception systems.

[992] KIRETT – A wearable device to support rescue operations using artificial intelligence to improve first aid

Johannes Zenkert, Christian Weber, Mubaris Nadeem, Lisa Bender, Madjid Fathi, Abu Shad Ahammed, Aniebiet Micheal Ezekiel, Roman Obermaisser, Maximilian Bradford

Main category: cs.AI

TL;DR: The KIRETT project develops a wearable AI device for first aid during rescue operations to provide contextual recommendations and improve patient outcomes.

Details

Motivation: To minimize patient damage from incorrect treatment and increase survival probability during rescue operations through technology-assisted first aid.

Method: Using a wearable device with artificial intelligence for computer-aided situation recognition to provide contextual action recommendations to rescue personnel.

Result: The paper presents initial research approaches and first steps in the KIRETT project development.

Conclusion: The project shows promise for improving first aid effectiveness through AI-powered wearable technology in rescue scenarios.

Abstract: This short paper presents first steps in the scientific part of the KIRETT project, which aims to improve first aid during rescue operations using a wearable device. The wearable is used for computer-aided situation recognition by means of artificial intelligence. It provides contextual recommendations for actions and operations to rescue personnel and is intended to minimize damage to patients due to incorrect treatment, as well as increase the probability of survival. The paper describes a first overview of research approaches within the project.

[993] Agentic Exploration of Physics Models

Maximilian Nägele, Florian Marquardt

Main category: cs.AI

TL;DR: SciExplorer is an AI agent that uses large language models to autonomously explore unknown physical systems through experiments and analysis, successfully recovering equations of motion and Hamiltonians across various domains without task-specific instructions.

Details

Motivation: To fully automate the scientific discovery process by enabling free-form exploration of unknown systems without domain-specific blueprints or task-specific tailoring.

Method: Leverages large language model tool-use capabilities with minimal tools (primarily code execution) to explore physical systems through experiments and analysis in an open-ended, iterative loop.

Result: Impressive performance on recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values across mechanical dynamical systems, wave evolution, and quantum many-body physics.

Conclusion: The approach enables effective scientific exploration across domains without finetuning or task-specific instructions, opening doors for similar automated discovery in other scientific fields.

Abstract: The process of scientific discovery relies on an interplay of observations, analysis, and hypothesis generation. Machine learning is increasingly being adopted to address individual aspects of this process. However, it remains an open challenge to fully automate the open-ended, heuristic, iterative loop required to discover the laws of an unknown system by exploring it through experiments and analysis, without tailoring the approach to the specifics of a given task. Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable free-form exploration of systems without any domain-specific blueprints, and apply it to the exploration of physical systems that are initially unknown to the agent. We test SciExplorer on a broad set of models spanning mechanical dynamical systems, wave evolution, and quantum many-body physics. Despite using a minimal set of tools, primarily based on code execution, we observe impressive performance on tasks such as recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values. The demonstrated effectiveness of this setup opens the door towards similar scientific exploration in other domains, without the need for finetuning or task-specific instructions.

[994] CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

Shijie Zhang, Guohao Sun, Kevin Zhang, Xiang Guo, Rujun Guo

Main category: cs.AI

TL;DR: CLPO introduces a curriculum-guided learning approach for RLVR that dynamically adjusts training difficulty based on model performance, achieving state-of-the-art results on reasoning benchmarks.

Details

Motivation: Existing RLVR methods treat all training samples uniformly, ignoring differences in problem difficulty relative to model capabilities, leading to inefficient learning and limited performance.

Method: CLPO creates a dynamic pedagogical feedback loop using real-time difficulty assessment from model rollouts to build an Online Curriculum, which guides Adaptive Problem Restructuring where the model diversifies medium problems and simplifies challenging ones.

Result: CLPO achieves state-of-the-art performance across eight mathematical and general reasoning benchmarks with an average pass@1 improvement of 6.96% over other methods.

Conclusion: CLPO transforms static training into a dynamic co-evolution process with model capabilities, demonstrating potential for more efficient training of capable reasoning models.

Abstract: Recently, online Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically treat all training samples uniformly, overlooking the vast differences in problem difficulty relative to the model’s current capabilities. This uniform training strategy leads to inefficient exploration of problems the model has already mastered, while concurrently lacking effective guidance on problems that are challenging its abilities the most, limiting both learning efficiency and upper-bound performance. To address this, we propose CLPO (Curriculum-guided Learning for Policy Optimization), a novel algorithm that creates a dynamic pedagogical feedback loop within the policy optimization process. The core of CLPO leverages the model’s own rollout performance to conduct real-time difficulty assessment, thereby constructing an Online Curriculum. This curriculum then guides an Adaptive Problem Restructuring mechanism, where the model acts as its own teacher: it diversifies medium-difficulty problems to promote generalization and simplifies challenging problems to make them more attainable. Our approach transforms the static training procedure into a dynamic process that co-evolves with the model’s capabilities. Experiments show that CLPO achieves state-of-the-art performance across eight challenging mathematical and general reasoning benchmarks, with an average pass@1 improvement of 6.96% over other methods, demonstrating its potential for more efficiently training more capable reasoning models.

[995] Scaling Synthetic Task Generation for Agents via Exploration

Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, Alexander Toshev

Main category: cs.AI

TL;DR: AutoPlay is a scalable pipeline for generating diverse, executable tasks for training multimodal large language model agents by systematically exploring interactive environments and using exploration trajectories to synthesize environment-grounded tasks.

Details

Motivation: The lack of high-quality downstream agentic task datasets with diverse, feasible, and verifiable tasks poses a key challenge in scaling post-training of multimodal large language models for interactive agents across domains like computer-use, web navigation, and robotics.

Method: AutoPlay operates in two stages: (1) exploration phase where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (2) task generation phase where a task generator leverages exploration trajectories and task guideline prompts to synthesize diverse, executable, and verifiable tasks.

Result: AutoPlay generated 20k tasks across 20 Android applications and 10k tasks across 13 Ubuntu applications, enabling training of MLLM-based UI agents that improve success rates up to 20.0% on mobile-use and 10.9% on computer-use scenarios, with additional 5.7% gain from reinforcement learning.

Conclusion: AutoPlay establishes a scalable approach for post-training capable MLLM agents that reduces reliance on human annotation by automatically generating diverse, environment-grounded tasks through systematic exploration and task synthesis.

Abstract: Post-Training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web navigation, and robotics. A key challenge in scaling such post-training is lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches for task generation rely heavily on human annotation or prompting MLLM with limited downstream environment information, which is either costly or poorly scalable as it yield tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information to synthesize environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates 20k tasks across 20 Android applications and 10k tasks across 13 applications Ubuntu applications to train mobile-use and computer-use agents. AutoPlay generated tasks enable large-scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates up to $20.0%$ on mobile-use and $10.9%$ on computer-use scenarios. In addition, AutoPlay generated tasks combined with MLLM verifier-based rewards enable scaling reinforcement learning training of UI agents, leading to an additional $5.7%$ gain. coverage. These results establish AutoPlay as a scalable approach for post-training capable MLLM agents reducing reliance on human annotation.

[996] Cogito, Ergo Ludo: An Agent that Learns to Play by Reasoning and Planning

Sai Wang, Yu Wu, Zhongwen Xu

Main category: cs.AI

TL;DR: CEL is a novel agent architecture that uses LLMs to explicitly learn environment rules and strategies through reasoning and planning, achieving mastery in grid-world games from sparse rewards.

Details

Motivation: To create more interpretable and general artificial agents that build transparent world models through explicit reasoning rather than opaque neural network representations.

Method: CEL operates on a cycle of interaction and reflection, using LLMs for Rule Induction (learning environment dynamics) and Strategy/Playbook Summarization (distilling experiences into actionable strategies) after each episode.

Result: CEL successfully learned to master diverse grid-world tasks (Minesweeper, Frozen Lake, Sokoban) by autonomously discovering rules and developing effective policies from sparse rewards.

Conclusion: The work demonstrates a path toward more general and interpretable agents that build transparent world models through explicit reasoning on raw experience, with iterative learning being critical for sustained performance.

Abstract: The pursuit of artificial agents that can learn to master complex environments has led to remarkable successes, yet prevailing deep reinforcement learning methods often rely on immense experience, encoding their knowledge opaquely within neural network weights. We propose a different paradigm, one in which an agent learns to play by reasoning and planning. We introduce Cogito, ergo ludo (CEL), a novel agent architecture that leverages a Large Language Model (LLM) to build an explicit, language-based understanding of its environment’s mechanics and its own strategy. Starting from a tabula rasa state with no prior knowledge (except action set), CEL operates on a cycle of interaction and reflection. After each episode, the agent analyzes its complete trajectory to perform two concurrent learning processes: Rule Induction, where it refines its explicit model of the environment’s dynamics, and Strategy and Playbook Summarization, where it distills experiences into an actionable strategic playbook. We evaluate CEL on diverse grid-world tasks (i.e., Minesweeper, Frozen Lake, and Sokoban), and show that the CEL agent successfully learns to master these games by autonomously discovering their rules and developing effective policies from sparse rewards. Ablation studies confirm that the iterative process is critical for sustained learning. Our work demonstrates a path toward more general and interpretable agents that not only act effectively but also build a transparent and improving model of their world through explicit reasoning on raw experience.

[997] From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones

Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng

Main category: cs.AI

TL;DR: RL enables LLMs to acquire genuinely new compositional skills by combining existing ones, not just activating pre-existing abilities, and these skills transfer across tasks.

Details

Motivation: To resolve the debate about whether RL teaches LLMs genuinely new skills or merely activates existing ones, and to understand if LLMs can acquire compositional skills similar to human cognitive development.

Method: Developed a synthetic framework using string transformation functions, testing if LLMs can learn unseen compositions h(x)=g(f(x)) after RL training when they already know f and g individually.

Result: RL enables LLMs to learn unseen function compositions, generalize to compositions of >2 functions, and transfer compositional skills to different tasks without target-specific compositional training.

Conclusion: RL fundamentally changes LLM reasoning behaviors and enables acquisition of genuinely new compositional skills that transfer across tasks, suggesting a strategy of building base models with basic skills then using RL for advanced, generalizable capabilities.

Abstract: Does RL teach LLMs genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them h(x)=g(f(x)). Further, this compositional ability generalizes to more difficult problems such as compositions of >2 functions unseen during RL training. Surprisingly, our experiments show that compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target’s atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training with the same data yields none of these findings. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.

[998] The Era of Real-World Human Interaction: RL from User Conversations

Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston

Main category: cs.AI

TL;DR: RLHI (Reinforcement Learning from Human Interaction) learns from natural user conversations instead of pre-annotated feedback, using two methods: User-Guided Rewrites and User-Based Rewards with persona conditioning.

Details

Motivation: Current conversational models rely on expert-generated human feedback, but to achieve continual improvement and multifaceted alignment, models need to learn directly from natural human interactions.

Method: Two methods: (1) RLHI with User-Guided Rewrites - revises model outputs based on users’ natural-language follow-up responses; (2) RLHI with User-Based Rewards - learns via reward model conditioned on user’s long-term interaction history (persona), using persona-conditioned preference optimization.

Result: Both RLHI variants outperform strong baselines in personalization and instruction-following when trained on WildChat conversations. Similar feedback also enhances performance on reasoning benchmarks.

Conclusion: Organic human interaction provides scalable and effective supervision for personalized alignment of conversational models.

Abstract: We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users’ natural-language follow-up responses, (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user’s long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest organic human interaction offers scalable, effective supervision for personalized alignment.

[999] ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, Tomas Pfister

Main category: cs.AI

TL;DR: ReasoningBank is a memory framework that enables LLM agents to learn from past experiences by distilling reasoning strategies from both successful and failed interactions, allowing continuous improvement over time.

Details

Motivation: Current LLM agents fail to learn from accumulated interaction history, forcing them to discard valuable insights and repeat past errors, limiting their effectiveness in persistent real-world roles.

Method: Proposes ReasoningBank framework that distills generalizable reasoning strategies from self-judged experiences, with memory-aware test-time scaling (MaTTS) to accelerate learning by generating abundant diverse experiences through increased compute allocation.

Result: Outperforms existing memory mechanisms across web browsing and software engineering benchmarks, improving both effectiveness and efficiency. MaTTS further amplifies these gains.

Conclusion: Establishes memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with emergent behaviors naturally arising from continuous learning.

Abstract: With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent’s self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent’s interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with emergent behaviors naturally arise.

[1000] Visual serial processing deficits explain divergences in human and VLM reasoning

Nicholas Budny, Kia Ghods, Declan Campbell, Raja Marjieh, Amogh Joshi, Sreejan Kumar, Jonathan D. Cohen, Taylor W. Webb, Thomas L. Griffiths

Main category: cs.AI

TL;DR: VLMs struggle with simple visual reasoning tasks due to deficits in visually-grounded serial processing, as shown by correlation between decreased VLM accuracy and increased human reaction time across geometric reasoning, enumeration, and mental rotation tasks.

Details

Motivation: To understand why VLMs fail on simple visual reasoning tasks despite success on standard benchmarks, and test the hypothesis that this is due to deficits in visually-grounded serial processing.

Method: Compared human and VLM performance across tasks varying serial processing demands in three domains: geometric reasoning, perceptual enumeration, and mental rotation, using human reaction time as proxy for serial processing load.

Result: Decreased VLM accuracy strongly correlated with increased human reaction time across all domains. VLM-human performance gap widens as tasks require more demanding serial processing.

Conclusion: Limitations in serial, visually grounded reasoning represent a fundamental bottleneck distinguishing current VLMs from humans.

Abstract: Why do Vision Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple visual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually-grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing – whether composing concepts, enumerating items, or performing mental transformations – the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.

[1001] UniAPL: A Unified Adversarial Preference Learning Framework for Instruct-Following

FaQiang Qian, WeiKun Zhang, Ziliang Wang, Kang An, Xuhui Zheng, Liangjian Wen, Mengya Gao, Yong Dai, Yichao Wu

Main category: cs.AI

TL;DR: UniAPL is a unified preference learning framework that addresses distributional mismatch in standard SFT+RL alignment by jointly training on both demonstrated and comparative preferences in a single stage.

Details

Motivation: Standard sequential SFT+RL pipeline suffers from distributional mismatch where SFT uses static expert data but policy distribution drifts during RL, making SFT knowledge brittle and RL exploration inefficient without direct access to expert demonstrations.

Method: UniAPL reframes alignment as constrained optimization and implements unified training with mixed batches of SFT and preference data, allowing dense expert demonstrations to directly ground and regularize online exploration in every gradient step.

Result: UniAPL matches or exceeds strong GRPO baselines: +5.77% on Qwen3-0.6B (matching 32B model) and +3.75% on Qwen3-4B, even outperforming the teacher model. Response analysis confirms outputs closely mimic expert demonstrations.

Conclusion: UniAPL resolves distributional mismatch and maximizes data synergy, achieving both stronger performance and better behavioral alignment by unifying preference learning modalities.

Abstract: Shaping powerful LLMs to be beneficial and safe is central to AI alignment. We argue that post-training alignment is fundamentally a unified Preference Learning problem, involving two modalities: demonstrated preferences (e.g., Supervised Fine-Tuning, SFT) and comparative preferences (e.g., Reinforcement Learning, RL).The standard sequential pipeline-SFT followed by RL-is flawed due to a critical distributional mismatch: SFT uses static expert data, but as the policy evolves, its generation distribution drifts, making SFT knowledge brittle. Subsequent RL then explores without direct access to the rich, ground-truth knowledge in expert demonstrations, leading to inefficient, ungrounded updates. This separation prevents mutual regularization between data sources. To address this, we reframe alignment as a constrained optimization problem and propose Unified Adversarial Preference Learning (UniAPL),a novel framework that dynamically aligns the policy’s distribution with the expert’s. UniAPL implements a single-stage unified training objective, jointly learning from mixed batches of SFT and preference data. In every gradient step, dense expert demonstrations directly ground and regularize online exploration, inherently resolving distributional mismatch and maximizing data synergy.We evaluate UniAPL on instruction-following tasks using Qwen3-235B-Instruct-2507 as the teacher. Our models match or exceed strong GRPO baselines: +5.77% on Qwen3-0.6B (matching a 32B model) and +3.75% on Qwen3-4B,even outperforming the teacher. Analyses of response length and log-probability distributions confirm that UniAPL outputs closely mimic expert demonstrations, achieving both stronger performance and better behavioral alignment.

[1002] Who’s Your Judge? On the Detectability of LLM-Generated Judgments

Dawei Li, Zhen Tan, Chengshuai Zhao, Bohan Jiang, Baixiang Huang, Pingchuan Ma, Abdullah Alnaibari, Kai Shu, Huan Liu

Main category: cs.AI

TL;DR: This paper proposes J-Detector, a method for detecting LLM-generated judgments by analyzing the interaction between judgment scores and candidate content, addressing biases in LLM-based evaluation systems.

Details

Motivation: LLM-based judgments have inherent biases and vulnerabilities that raise concerns in sensitive scenarios like academic peer reviewing, creating an urgent need to distinguish them from human judgments.

Method: The authors introduce J-Detector, a lightweight neural detector that uses explicitly extracted linguistic and LLM-enhanced features to link LLM judges’ biases with candidates’ properties for accurate detection.

Result: Experiments across diverse datasets demonstrate J-Detector’s effectiveness, showing how its interpretability enables quantifying biases in LLM judges and outperforms existing LLM-generated text detection methods.

Conclusion: The work validates the practical utility of judgment detection in real-world scenarios and analyzes key factors affecting the detectability of LLM-generated judgments.

Abstract: Large Language Model (LLM)-based judgments leverage powerful LLMs to efficiently evaluate candidate content and provide judgment scores. However, the inherent biases and vulnerabilities of LLM-generated judgments raise concerns, underscoring the urgent need for distinguishing them in sensitive scenarios like academic peer reviewing. In this work, we propose and formalize the task of judgment detection and systematically investigate the detectability of LLM-generated judgments. Unlike LLM-generated text detection, judgment detection relies solely on judgment scores and candidates, reflecting real-world scenarios where textual feedback is often unavailable in the detection process. Our preliminary analysis shows that existing LLM-generated text detection methods perform poorly given their incapability to capture the interaction between judgment scores and candidate content – an aspect crucial for effective judgment detection. Inspired by this, we introduce \textit{J-Detector}, a lightweight and transparent neural detector augmented with explicitly extracted linguistic and LLM-enhanced features to link LLM judges’ biases with candidates’ properties for accurate detection. Experiments across diverse datasets demonstrate the effectiveness of \textit{J-Detector} and show how its interpretability enables quantifying biases in LLM judges. Finally, we analyze key factors affecting the detectability of LLM-generated judgments and validate the practical utility of judgment detection in real-world scenarios.

[1003] Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen

Main category: cs.AI

TL;DR: Theoretical analysis shows RL helps LLMs in planning by enabling exploration to avoid spurious solutions, but policy gradient suffers from diversity collapse while Q-learning preserves diversity and enables off-policy learning.

Details

Motivation: To understand the theoretical basis for why reinforcement learning methods enhance planning capabilities in Large Language Models, as their effectiveness remains elusive despite practical success.

Method: Using a tractable graph-based abstraction to analyze policy gradient and Q-learning methods, examining their behaviors in planning tasks and applying the framework to the Blocksworld benchmark.

Result: SFT introduces co-occurrence-based spurious solutions, RL achieves correct planning through exploration, PG suffers from diversity collapse, Q-learning preserves diversity and enables off-policy learning, and careful reward design prevents reward hacking.

Conclusion: RL’s exploration capability is key for better generalization in planning, Q-learning outperforms PG in preserving diversity, and proper reward design is crucial for effective Q-learning in real-world planning tasks.

Abstract: Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL’s benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration’s role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.

[1004] Query2Triple: Unified Query Encoding for Answering Diverse Complex Queries over Knowledge Graphs

Yao Xu, Shizhu He, Cunguang Wang, Li Cai, Kang Liu, Jun Zhao

Main category: cs.AI

TL;DR: Q2T is a novel approach for Complex Query Answering that decouples training into two stages: pre-training on simple queries and then training a query encoder for complex queries, achieving state-of-the-art performance without explicit neural set operators.

Details

Motivation: Existing query embedding methods train KG embeddings and neural set operators concurrently on both simple and complex queries, causing performance degradation on simple queries and low training efficiency.

Method: Two-stage training: (1) Pre-train neural link predictor on simple queries to predict tail entities, (2) Train query encoder on complex queries to encode diverse queries into unified triple form solvable by the pretrained predictor.

Result: Achieves state-of-the-art performance on diverse complex queries over three public benchmarks, with efficient training and modular design.

Conclusion: Q2T provides an efficient, modular approach that decouples simple and complex query training, achieving superior performance without explicit neural set operators while being adaptable to various neural link predictors.

Abstract: Complex Query Answering (CQA) is a challenge task of Knowledge Graph (KG). Due to the incompleteness of KGs, query embedding (QE) methods have been proposed to encode queries and entities into the same embedding space, and treat logical operators as neural set operators to obtain answers. However, these methods train KG embeddings and neural set operators concurrently on both simple (one-hop) and complex (multi-hop and logical) queries, which causes performance degradation on simple queries and low training efficiency. In this paper, we propose Query to Triple (Q2T), a novel approach that decouples the training for simple and complex queries. Q2T divides the training into two stages: (1) Pre-training a neural link predictor on simple queries to predict tail entities based on the head entity and relation. (2) Training a query encoder on complex queries to encode diverse complex queries into a unified triple form that can be efficiently solved by the pretrained neural link predictor. Our proposed Q2T is not only efficient to train, but also modular, thus easily adaptable to various neural link predictors that have been studied well. Extensive experiments demonstrate that, even without explicit modeling for neural set operators, Q2T still achieves state-of-the-art performance on diverse complex queries over three public benchmarks.

[1005] Taking control: Policies to address extinction risks from advanced AI

Andrea Miotti

Main category: cs.AI

TL;DR: Policy recommendations to reduce AI extinction risks through international governance, compute caps, and safety evaluations.

Details

Motivation: Addressing the existential threats posed by advanced artificial intelligence systems that could lead to human extinction.

Method: Proposes three main policy solutions: (1) Multinational AGI Consortium (MAGIC) for democratic oversight, (2) global compute cap on AI training, (3) mandatory safety evaluations for critical AI experiments.

Result: Framework for international coordination that would enable safe AI development while preventing dangerous races to more powerful systems.

Conclusion: Voluntary commitments are insufficient; mandatory international governance structures are necessary to mitigate existential risks from advanced AI systems.

Abstract: This paper provides policy recommendations to reduce extinction risks from advanced artificial intelligence (AI). First, we briefly provide background information about extinction risks from AI. Second, we argue that voluntary commitments from AI companies would be an inappropriate and insufficient response. Third, we describe three policy proposals that would meaningfully address the threats from advanced AI: (1) establishing a Multinational AGI Consortium to enable democratic oversight of advanced AI (MAGIC), (2) implementing a global cap on the amount of computing power used to train an AI system (global compute cap), and (3) requiring affirmative safety evaluations to ensure that risks are kept below acceptable levels (gating critical experiments). MAGIC would be a secure, safety-focused, internationally-governed institution responsible for reducing risks from advanced AI and performing research to safely harness the benefits of AI. MAGIC would also maintain emergency response infrastructure (kill switch) to swiftly halt AI development or withdraw model deployment in the event of an AI-related emergency. The global compute cap would end the corporate race toward dangerous AI systems while enabling the vast majority of AI innovation to continue unimpeded. Gating critical experiments would ensure that companies developing powerful AI systems are required to present affirmative evidence that these models keep extinction risks below an acceptable threshold. After describing these recommendations, we propose intermediate steps that the international community could take to implement these proposals and lay the groundwork for international coordination around advanced AI.

[1006] Understanding the Effects of Miscalibrated AI Confidence on User Trust, Reliance, and Decision Efficacy

Jingshu Li, Yitian Yang, Renwen Zhang, Q. Vera Liao, Tianqi Song, Zhengtao Xu, Yi-chieh Lee

Main category: cs.AI

TL;DR: AI confidence miscalibration impairs user reliance and decision efficacy, and is hard for users to detect. Communicating calibration levels helps detection but causes under-reliance without improving decision outcomes.

Details

Motivation: Well-calibrated AI confidence is essential for appropriate user trust and reliance in AI-assisted decision-making, but calibration is challenging.

Method: Two experiments: first tested effects of miscalibrated AI confidence on user reliance and decision efficacy; second examined if communicating calibration levels could mitigate these issues.

Result: Miscalibrated AI impairs appropriate reliance and reduces decision efficacy, and users struggle to detect miscalibration. Communication helps detection but causes under-reliance without improving decision outcomes.

Conclusion: AI miscalibration poses significant risks to decision-making effectiveness. While communication helps detection, it doesn’t solve the reliance problem, highlighting the need for better calibration approaches and addressing ethical concerns.

Abstract: Providing well-calibrated AI confidence can help promote users’ appropriate trust in and reliance on AI, which are essential for AI-assisted decision-making. However, calibrating AI confidence – providing confidence score that accurately reflects the true likelihood of AI being correct – is known to be challenging. To understand the effects of AI confidence miscalibration, we conducted our first experiment. The results indicate that miscalibrated AI confidence impairs users’ appropriate reliance and reduces AI-assisted decision-making efficacy, and AI miscalibration is difficult for users to detect. Then, in our second experiment, we examined whether communicating AI confidence calibration levels could mitigate the above issues. We find that it helps users to detect AI miscalibration. Nevertheless, since such communication decreases users’ trust in uncalibrated AI, leading to high under-reliance, it does not improve the decision efficacy. We discuss design implications based on these findings and future directions to address risks and ethical concerns associated with AI miscalibration.

[1007] VCSearch: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning

Shi-Yu Tian, Zhi Zhou, Kun-Yang Yu, Ming Yang, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.AI

TL;DR: VCSEARCH is a training-free framework that uses formal language and variable-constraint pair search to improve LLMs’ ability to handle ill-defined mathematical problems with missing or contradictory conditions.

Details

Motivation: Current LLM evaluation focuses on well-defined benchmarks but neglects real-world ill-defined problems with missing or contradictory conditions, which is a critical gap in robust mathematical reasoning.

Method: Developed Variable-Constraint Search (VCSEARCH) framework that uses formal language to detect ill-defined problems and incorporates variable-constraint pair search strategy to improve modeling capability.

Result: VCSEARCH improves accuracy of identifying unsolvable problems by at least 12% across different LLMs, achieving stronger robust mathematical reasoning ability.

Conclusion: The proposed VCSEARCH framework effectively addresses challenges in handling ill-defined mathematical problems and demonstrates significant improvements in robust reasoning capabilities for LLMs.

Abstract: Large language models (LLMs) have demonstrated impressive performance on reasoning tasks, including mathematical reasoning. However, the current evaluation mostly focuses on carefully constructed benchmarks and neglects the consideration of real-world reasoning problems that present missing or contradictory conditions, known as ill-defined problems. To further study this problem, we develop a largescale benchmark called Problems with Missing and Contradictory conditions (PMC) containing over 5,000 validated ill-defined mathematical problems. Our preliminary experiments through PMC reveal two challenges about existing methods: (1) traditional methods exhibit a trade-off between solving accuracy and rejection capabilities, and (2) formal methods struggle with modeling complex problems. To address these challenges, We develop Variable-Constraint Search (VCSEARCH), a trainingfree framework that leverages formal language to detect ill-defined problems, where a variableconstraint pair search strategy is incorporated to improve the modeling capability of formal language. Extensive experiments demonstrate that VCSEARCH improves the accuracy of identifying unsolvable problems by at least 12% across different LLMs, thus achieving stronger robust mathematical reasoning ability.

[1008] Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation

Kaikai An, Fangkai Yang, Liqun Li, Junting Lu, Sitao Cheng, Shuzheng Si, Lu Wang, Pu Zhao, Lele Cao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Baobao Chang

Main category: cs.AI

TL;DR: Thread is a new data organization paradigm that uses logic units to better handle how-to questions in RAG systems, improving success rates by 21-33% while reducing retrieval information by up to 75% compared to chunk-based approaches.

Details

Motivation: Current RAG systems struggle with '1H' (how-to) questions because chunk-based data organization disrupts logical coherence and step-by-step reasoning required for decision-making.

Method: Proposes Thread paradigm with ’logic units’ (LUs) - structured, loosely interconnected units created by LLMs from documents, replacing fixed-size chunks to preserve logical flow.

Result: Thread outperforms existing paradigms, improving how-to question success rates by 21-33%, reducing retrieval information by up to 75%, and showing better performance on multi-hop ‘5Ws’ questions.

Conclusion: Thread provides an effective solution for handling how-to questions in RAG systems while maintaining adaptability across document formats and generalizing well to other question types.

Abstract: Recent advances in retrieval-augmented generation (RAG) have substantially improved question-answering systems, particularly for factoid ‘5Ws’ questions. However, significant challenges remain when addressing ‘1H’ questions, specifically how-to questions, which are integral for decision-making and require dynamic, step-by-step responses. The key limitation lies in the prevalent data organization paradigm, chunk, which commonly divides documents into fixed-size segments, and disrupts the logical coherence and connections within the context. To address this, we propose Thread, a novel data organization paradigm enabling systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, ’logic unit’ (LU), where large language models transform documents into more structured and loosely interconnected LUs. Extensive experiments across both open-domain and industrial settings show that Thread outperforms existing paradigms significantly, improving the success rate of handling how-to questions by 21% to 33%. Additionally, Thread demonstrates high adaptability across diverse document formats, reducing retrieval information by up to 75% compared to chunk, and also shows better generalizability to ‘5Ws’ questions, such as multi-hop questions, outperforming other paradigms.

[1009] A Voter-Based Stochastic Rejection-Method Framework for Asymptotically Safe Language Model Outputs

Jake R. Watts, Joel Sokol

Main category: cs.AI

TL;DR: RCR uses multiple LLM checkers to vote on output quality, regenerating if disapproval threshold is met, achieving exponential failure rate reduction with cost.

Details

Motivation: To prevent unsafe or low-quality LLM outputs by leveraging model stochasticity for quality control.

Method: Repeated Checking with Regeneration (RCR) - multiple LLM checkers vote on output acceptability, regenerate if disapproval threshold reached, continue until sufficient approval.

Result: Achieves desired expected failure rate at Pareto-optimal cost, with failure rate decreasing exponentially with cost. Works with any language model.

Conclusion: RCR enables cheap small LLMs to control, constrain, or outperform complex costly models through systematic quality checking.

Abstract: We propose an approach for preventing unsafe or otherwise low-quality large language model (LLM) outputs by leveraging the stochasticity of LLMs, an approach we call Repeated Checking with Regeneration (RCR). In this system, LLM checkers vote on the acceptability of a generated output, regenerating it if a threshold of disapproval is reached, until sufficient checkers approve. Based on our estimators for cost and failure rate and experimental data tailored to the application, our algorithm achieves a desired expected failure rate at Pareto-optimal cost. The failure rate provably decreases exponentially as a function of cost, and the models reasonably estimate the actual performance of such a system in action, even with limited data. This approach does not depend on the language model used, and could allow cheap, small LLMs to control, constrain, or at some tasks even outperform very complex and costly ones.

[1010] Autonomous Vehicle Controllers From End-to-End Differentiable Simulation

Asen Nachkov, Danda Pani Paudel, Luc Van Gool

Main category: cs.AI

TL;DR: The paper proposes an analytic policy gradients (APG) approach using differentiable simulators to train autonomous vehicle controllers, overcoming limitations of behavioral cloning by enabling end-to-end training with environment dynamics gradients as prior knowledge.

Details

Motivation: Current autonomous vehicle controller training methods using behavioral cloning generalize poorly to novel scenarios, while traditional RL approaches in simulators are slow, sample-inefficient, and prior-agnostic.

Method: Leverages differentiable simulator in end-to-end training loop with analytic policy gradients, combined with recurrent architecture to propagate temporal information across long trajectories, using only expert trajectories without requiring expert actions.

Result: Significant improvements over behavioral cloning in performance and robustness to noise in dynamics, with more intuitive human-like handling.

Conclusion: The APG method enables learning robust, accurate, and fast policies by incorporating differentiable simulator gradients as useful priors, requiring only widely-available expert trajectories rather than scarce expert actions.

Abstract: Current methods to learn controllers for autonomous vehicles (AVs) focus on behavioural cloning. Being trained only on exact historic data, the resulting agents often generalize poorly to novel scenarios. Simulators provide the opportunity to go beyond offline datasets, but they are still treated as complicated black boxes, only used to update the global simulation state. As a result, these RL algorithms are slow, sample-inefficient, and prior-agnostic. In this work, we leverage a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers on the large-scale Waymo Open Motion Dataset. Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of the environment dynamics serve as a useful prior to help the agent learn a more grounded policy. We combine this setup with a recurrent architecture that can efficiently propagate temporal information across long simulated trajectories. This APG method allows us to learn robust, accurate, and fast policies, while only requiring widely-available expert trajectories, instead of scarce expert actions. We compare to behavioural cloning and find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.

[1011] Neuro-Symbolic Entity Alignment via Variational Inference

Shengyuan Chen, Zheng Yuan, Qinggang Zhang, Wen Hua, Jiannong Cao, Xiao Huang

Main category: cs.AI

TL;DR: NeuSymEA is a neuro-symbolic framework for entity alignment that combines neural models’ effectiveness with symbolic models’ interpretability, achieving significant performance improvements on cross-lingual knowledge graph alignment tasks.

Details

Motivation: Existing entity alignment methods have limitations: symbolic models struggle with substructure heterogeneity and sparsity, while neural models lack interpretability and cannot handle uncertainty. There's a need to combine both approaches' strengths.

Method: NeuSymEA models joint probability of entity pairs’ truth scores in a Markov random field regulated by rules, optimized with variational EM algorithm. E-step uses neural model to parameterize truth scores and infer missing alignments; M-step updates rule weights. Includes efficient symbolic inference engine for logic deduction with extended rule lengths.

Result: Achieved 7.6% hit@1 improvement on DBP15K_ZH-EN compared to strong baselines. Shows robustness in low-resource settings, achieving 73.7% hit@1 accuracy on DBP15K_FR-EN with only 1% pairs as seed alignments.

Conclusion: NeuSymEA successfully unifies neuro-symbolic reasoning for entity alignment, combining interpretability with performance and demonstrating strong results in both standard and low-resource scenarios.

Abstract: Entity alignment (EA) aims to merge two knowledge graphs (KGs) by identifying equivalent entity pairs. Existing methods can be categorized into symbolic and neural models. Symbolic models, while precise, struggle with substructure heterogeneity and sparsity, whereas neural models, although effective, generally lack interpretability and cannot handle uncertainty. We propose NeuSymEA, a unified neuro-symbolic reasoning framework that combines the strengths of both methods to fully exploit the cross-KG structural pattern for robust entity alignment. NeuSymEA models the joint probability of all possible pairs’ truth scores in a Markov random field, regulated by a set of rules, and optimizes it with the variational EM algorithm. In the E-step, a neural model parameterizes the truth score distributions and infers missing alignments. In the M-step, the rule weights are updated based on the observed and inferred alignments, handling uncertainty. We introduce an efficient symbolic inference engine driven by logic deduction, enabling reasoning with extended rule lengths. NeuSymEA achieves a significant 7.6% hit@1 improvement on $\text{DBP15K}{\text{ZH-EN}}$ compared with strong baselines and demonstrates robustness in low-resource settings, achieving 73.7% hit@1 accuracy on $\text{DBP15K}{\text{FR-EN}}$ with only 1% pairs as seed alignments. Codes are released at https://github.com/chensyCN/NeuSymEA-NeurIPS25.

[1012] A Neurosymbolic Fast and Slow Architecture for Graph Coloring

Vedant Khandelwal, Vishal Pallagani, Biplav Srivastava, Francesca Rossi

Main category: cs.AI

TL;DR: SOFAI_v2 enhances the SOFAI architecture with refined metacognitive governance to solve graph coloring problems, combining fast LLM-based System 1 with deliberative System 2, achieving higher success rates and speed than traditional solvers.

Details

Motivation: Existing symbolic solvers are slow, and LLMs alone struggle with complex CSPs like graph coloring, necessitating a hybrid approach that bridges the gap between fast but error-prone methods and accurate but slow ones.

Method: SOFAI_v2 integrates a metacognition module that governs System 1 (LLM-based, fast) and System 2 (deliberative, slow). S1 generates initial solutions, improved via feedback from metacognition; if S1 fails, metacognition invokes S2 for reliable solutions.

Result: SOFAI_v2 achieves a 10.5% higher success rate and is up to 30% faster than traditional symbolic solvers in solving graph coloring problems.

Conclusion: The enhanced SOFAI_v2 architecture, with metacognitive governance, effectively combines fast and slow thinking to outperform traditional methods in solving complex CSPs like graph coloring.

Abstract: Constraint Satisfaction Problems (CSPs) present significant challenges to artificial intelligence due to their intricate constraints and the necessity for precise solutions. Existing symbolic solvers are often slow, and prior research has shown that Large Language Models (LLMs) alone struggle with CSPs because of their complexity. To bridge this gap, we build upon the existing SOFAI architecture (SOFAI_v1), which adapts Daniel Kahneman’s ‘‘Thinking, Fast and Slow’’ cognitive model to AI. Our enhanced architecture, SOFAI_v2, integrates refined metacognitive governance mechanisms to improve adaptability across complex domains, specifically tailored here for solving the graph coloring problem, a specific type of CSP. SOFAI_v2 combines a fast System 1 (S1), leveraging LLMs, with a deliberative System 2 (S2), governed by a metacognition module. S1’s initial solutions, often limited by constraint adherence issues, are improved through targeted feedback and examples from metacognition, aligning S1 more closely with CSP requirements. If S1 fails to resolve the problem, metacognition strategically invokes S2, ensuring accurate and reliable solutions. Our empirical results demonstrate that SOFAI_v2 achieves a 10.5% higher success rate and is up to 30% faster than a traditional symbolic solver in solving graph coloring problems.

Kaustubh Vyas, Damien Graux, Yijun Yang, Sébastien Montella, Chenxin Diao, Wendi Zhou, Pavlos Vougiouklis, Ruofei Lai, Yang Ren, Keshuang Li, Jeff Z. Pan

Main category: cs.AI

TL;DR: Hive is a knowledge-aware planning system that schedules atomic actions using multiple AI models to handle complex multi-modal queries while ensuring explainability and respecting user constraints.

Details

Motivation: To address the need for agent-based solutions that can leverage the growing ecosystem of deep learning models and handle complex real-world queries with multi-modal inputs and outputs.

Method: Uses an LLM-based formal logic backbone empowered by PDDL operations to plan explainable chains of atomic actions involving one or more available models, with constraint-aware scheduling.

Result: Outperforms competing systems in task selection while offering transparency guarantees and fully adhering to user constraints, as demonstrated on the MuSE benchmark.

Conclusion: Hive redefines state-of-the-art for multi-modal agent systems by providing comprehensive knowledge-aware planning with explainability and constraint adherence.

Abstract: In response to the call for agent-based solutions that leverage the ever-increasing capabilities of the deep models’ ecosystem, we introduce Hive – a comprehensive solution for knowledge-aware planning of a set of atomic actions to address input queries and subsequently selecting appropriate models accordingly. Hive operates over sets of models and, upon receiving natural language instructions (i.e. user queries), schedules and executes explainable plans of atomic actions. These actions can involve one or more of the available models to achieve the overall task, while respecting end-users specific constraints. Notably, Hive handles tasks that involve multi-modal inputs and outputs, enabling it to handle complex, real-world queries. Our system is capable of planning complex chains of actions while guaranteeing explainability, using an LLM-based formal logic backbone empowered by PDDL operations. We introduce the MuSE benchmark in order to offer a comprehensive evaluation of the multi-modal capabilities of agent systems. Our findings show that our framework redefines the state-of-the-art for task selection, outperforming other competing systems that plan operations across multiple models while offering transparency guarantees while fully adhering to user constraints.

[1014] GUI Agents: A Survey

Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Jihyung Kil, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt

Main category: cs.AI

TL;DR: A comprehensive survey of GUI agents powered by Large Foundation Models, covering benchmarks, evaluation metrics, architectures, training methods, and proposing a unified framework for their capabilities.

Details

Motivation: The growing interest and fundamental importance of GUI agents in automating human-computer interaction across diverse platforms.

Method: Propose a unified framework delineating perception, reasoning, planning, and acting capabilities of GUI agents, and provide comprehensive categorization of benchmarks, evaluation metrics, architectures, and training methods.

Result: A systematic survey that organizes current progress in GUI agents and identifies critical components of their operational framework.

Conclusion: This work serves as a foundation for practitioners and researchers to understand current progress, techniques, benchmarks, and identifies important open challenges and future directions for GUI agent development.

Abstract: Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.

[1015] Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, Diyi Yang

Main category: cs.AI

TL;DR: Co-Gym framework enables human-agent collaboration across tasks, showing collaborative agents outperform autonomous ones but face challenges in communication, awareness, and autonomy balance.

Details

Motivation: Many use cases require LM agents to collaborate with humans due to human preferences, expertise, or need for control, but existing frameworks lack proper support for such tripartite interactions.

Method: Developed Collaborative Gym (Co-Gym) - a general framework for asynchronous interaction among agents, humans, and task environments, instantiated with three representative tasks in simulated and real-world conditions.

Result: Collaborative agents consistently outperformed fully autonomous counterparts: 86% win rate in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users.

Conclusion: While collaborative agents show superior performance, significant challenges remain in developing core intelligence aspects - communication capabilities, situational awareness, and balancing autonomy with human control.

Abstract: Recent advancements in language models (LMs) have sparked growing interest in developing LM agents. While fully autonomous agents could excel in many scenarios, numerous use cases inherently require them to collaborate with humans due to humans’ latent preferences, domain expertise, or need for control. To facilitate the study of human-agent collaboration, we present Collaborative Gym (Co-Gym), a general framework enabling asynchronous, tripartite interaction among agents, humans, and task environments. We instantiate Co-Gym with three representative tasks in both simulated and real-world conditions, and propose an evaluation framework that assesses both the collaboration outcomes and processes. Our findings reveal that collaborative agents consistently outperform their fully autonomous counterparts in task performance within those delivered cases, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. However, our study also highlights significant challenges in developing collaborative agents, requiring advancements in core aspects of intelligence – communication capabilities, situational awareness, and balancing autonomy and human control.

[1016] Broadening Ontologization Design: Embracing Data Pipeline Strategies

Chris Partridge, Andrew Mitchell, Sergio de Cesare, John Beverley

Main category: cs.AI

TL;DR: The paper argues that the ontologization process has a broader design space than current practices suggest, proposing new data pipeline practices and an evolutionary perspective on digitalization.

Details

Motivation: To expand the understanding of ontologization beyond current limited practices and explore new design possibilities for both processes and products.

Method: Analyzes the design space for ontologization, investigates new practices implemented as data pipelines using the bCLEARer methodology, and applies an evolutionary perspective to digitalization.

Result: Identifies components for designing ontologization processes, provides examples of new practices from three decades of work, and reframes ontologization as a strategic tool for digitalization opportunities.

Conclusion: Ontologization should be viewed as a broader design space with evolutionary context, positioning it as a strategic tool to leverage emerging digitalization opportunities.

Abstract: Our aim in this paper is to outline how the design space for the ontologization process is broader than current practice would suggest. We point out that engineering processes as well as products need to be designed and identify some components of the design. We investigate the possibility of designing a range of radically new practices implemented as data pipelines, providing examples of the new practices from our work over the last three decades with an outlier methodology, bCLEARer. We also suggest that setting an evolutionary context for ontologization helps one to better understand the nature of these new practices and provides the conceptual scaffolding that shapes fertile processes. Where this evolutionary perspective positions digitalization (the evolutionary emergence of computing technologies) as the latest step in a long evolutionary trail of information transitions. This reframes ontologization as a strategic tool for leveraging the emerging opportunities offered by digitalization.

[1017] Enabling AI Scientists to Recognize Innovation: A Domain-Agnostic Algorithm for Assessing Novelty

Yao Wang, Mingxuan Cui, Arthur Jiang

Main category: cs.AI

TL;DR: RND algorithm achieves SOTA performance in novelty assessment for research ideas across domains, maintaining consistent accuracy while other models show domain-specific degradation.

Details

Motivation: To automate generation and evaluation of novel research ideas for AGI, overcoming limitations of existing approaches in AI-driven scientific discovery.

Method: Developed Relative Neighbor Density (RND) algorithm that compares idea’s local density with adjacent neighbors’ densities, plus scalable methodology for test set creation without expert labeling.

Result: RND achieves AUROC=0.820 in computer science and AUROC=0.765 in biomedical research, outperforming benchmarks with 0.795 vs 0.597 on cross-domain evaluation.

Conclusion: RND is validated as a generalizable solution for automated novelty assessment in scientific research with domain-invariant properties.

Abstract: In the pursuit of Artificial General Intelligence (AGI), automating the generation and evaluation of novel research ideas is a key challenge in AI-driven scientific discovery. This paper presents Relative Neighbor Density (RND), a domain-agnostic algorithm for novelty assessment in research ideas that overcomes the limitations of existing approaches by comparing an idea’s local density with its adjacent neighbors’ densities. We first developed a scalable methodology to create test set without expert labeling, addressing a fundamental challenge in novelty assessment. Using these test sets, we demonstrate that our RND algorithm achieves state-of-the-art (SOTA) performance in computer science (AUROC=0.820) and biomedical research (AUROC=0.765) domains. Most significantly, while SOTA models like Sonnet-3.7 and existing metrics show domain-specific performance degradation, RND maintains consistent accuracies across domains by its domain-invariant property, outperforming all benchmarks by a substantial margin (0.795 v.s. 0.597) on cross-domain evaluation. These results validate RND as a generalizable solution for automated novelty assessment in scientific research.

[1018] Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs

Nasim Borazjanizadeh, Roei Herzig, Eduard Oks, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

Main category: cs.AI

TL;DR: Visual Thinking framework enables LMMs to reason through self-generated conceptual diagrams, significantly improving combinatorial planning capabilities without human input beyond task descriptions.

Details

Motivation: Human reasoning uses mental models and conceptual diagrams to abstract irrelevant details, while current LLMs/LMMs primarily reason through text, limiting effectiveness on complex multi-step tasks.

Method: Integrates textual and diagrammatic reasoning within optimized Graph-of-Thought inference framework with beam search and depth-wise backtracking, using self-generated conceptual diagrams.

Result: Substantially improves LMM performance (GPT-4o: 35.5% -> 90.2% in Blocksworld), outperforms text-only search methods, and surpasses o1-preview model on difficult domains (16 percentage points improvement in Floor Tiles).

Conclusion: Conceptual diagrams are a powerful reasoning medium for LMMs, enabling significant performance gains in complex planning tasks through visual thinking approach.

Abstract: Human reasoning relies on constructing and manipulating mental models – simplified internal representations of situations used to understand and solve problems. Conceptual diagrams (e.g., a sketch drawn to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture how entities interact. In contrast, Large Language Models (LLMs) and Large MultiModal Models (LMMs) predominantly reason through text, limiting their effectiveness on complex multi-step tasks. In this paper, we propose Visual Thinking, a generalizable framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach requires no human input beyond the natural language description of the task. It integrates textual and diagrammatic reasoning within an optimized Graph-of-Thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves LMM performance (e.g., GPT-4o: 35.5% -> 90.2% in Blocksworld) and consistently outperforms text-only search-based inference methods. On more difficult domains with solution depths up to 40, it also surpasses the o1-preview reasoning model (e.g., 16 percentage points improvement in Floor Tiles). These results demonstrate the power of conceptual diagrams as a reasoning medium in LMMs.

[1019] Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu

Main category: cs.AI

TL;DR: VLMs trained with supervised safety fine-tuning develop spurious correlations between text patterns and safety responses, making them vulnerable to simple attacks and causing unnecessary rejections. Machine unlearning is proposed as a better alternative that directly removes harmful knowledge.

Details

Motivation: Current VLMs are vulnerable to generating harmful content despite safety fine-tuning, due to spurious correlations that create a "safety mirage" rather than true safety understanding.

Method: Propose machine unlearning (MU) as an alternative to supervised safety fine-tuning, which directly removes harmful knowledge from VLMs without creating biased feature-label mappings.

Result: MU-based alignment reduces attack success rate by up to 60.17% and cuts unnecessary rejections by over 84.20% across safety benchmarks.

Conclusion: Machine unlearning is a more effective approach than supervised safety fine-tuning for aligning VLMs, as it avoids spurious correlations and directly addresses harmful knowledge while preserving model capabilities.

Abstract: Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the ‘‘safety mirage’’, where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under MU-based alignment reduces the attack success rate by up to 60.17% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.

[1020] Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time

Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

Main category: cs.AI

TL;DR: Scaling model sizes can impair implicit reasoning due to excessive memorization, with optimal LMs reasoning over 0.008 bit information per parameter.

Details

Motivation: To investigate how scaling model sizes and data affects reasoning abilities during pretraining, using synthetic reasoning environments that mimic real-world knowledge graphs.

Method: Pretrained LMs from scratch on synthetic implicit multihop reasoning environments, assessed ability to complete missing edges requiring multi-hop reasoning.

Result: Overparameterization impairs implicit reasoning performance due to excessive memorization; empirical scaling law shows optimal LMs reason over 0.008 bit per parameter.

Conclusion: Model size scaling has counterintuitive effects on reasoning, providing new insights into the relationship between scaling and reasoning in LLMs.

Abstract: Reasoning is an integral part of many tasks performed by language models (LMs). However, the effects of scaling model sizes and data on reasoning abilities at pretraining time remain understudied. To rigorously investigate this problem, we pretrain LMs from scratch on a synthetic implicit multihop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs’ ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining. Interestingly, we observe that overparameterization can impair the implicit reasoning performance due to excessive memorization. We investigate different factors that affect the loss curve when scaling different components of the knowledge graph, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law that shows optimal-sized LMs can approximately reason over 0.008 bit information per parameter. This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in LLMs.

[1021] Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence

Bofan Gong, Shiyang Lai, James Evans, Dawn Song

Main category: cs.AI

TL;DR: Polysemanticity in language models creates systematic vulnerabilities through interference patterns that generalize across model scales and families, enabling predictable behavioral control without accessing model internals.

Details

Motivation: Polysemanticity remains a major challenge for interpreting and controlling language model behavior, requiring better understanding of how interference patterns work across different models.

Method: Used sparse autoencoders to map polysemantic topology in small models, identified interfering feature pairs, and conducted interventions at four loci (prompt, token, feature, neuron) to measure prediction distribution shifts.

Result: Discovered that counterintuitive interference patterns from small models reliably transfer to larger instruction-tuned models, producing predictable behavioral shifts without needing model internals.

Conclusion: Polysemanticity is not purely stochastic but follows convergent higher-order organization with latent regularities, offering new possibilities for black-box control and theoretical insights into cognition.

Abstract: Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. Leveraging sparse autoencoders (SAEs), we map the polysemantic topology of two small models (Pythia-70M and GPT-2-Small) to identify SAE feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four loci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models. Critically, interventions distilled from counterintuitive interference patterns shared by two small models transfer reliably to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct), yielding predictable behavioral shifts without access to model internals. These findings challenge the view that polysemanticity is purely stochastic, demonstrating instead that interference structures generalize across scale and family. Such generalization suggests a convergent, higher-order organization of internal representations, which is only weakly aligned with intuition and structured by latent regularities, offering new possibilities for both black-box control and theoretical insight into human and artificial cognition.

[1022] FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, Yixin Cao

Main category: cs.AI

TL;DR: UFEval is a unified fine-grained evaluator for multimodal LLMs that can generalize across tasks and aspects, trained on FRABench dataset with 60.4k pairwise samples covering 112 aspects across four evaluation tasks.

Details

Motivation: Existing MLLM evaluators are constrained to specific tasks and aspects, creating a bottleneck as model capabilities expand. The interconnected nature of aspects suggests learning specific aspects can generalize to unseen ones, and joint learning across tasks may create synergistic effects.

Method: Proposed UFEval unified evaluator trained on FRABench dataset, which includes: (1) hierarchical aspect taxonomy with 112 aspects across four tasks, (2) 60.4k pairwise samples with 325k evaluation labels from human and GPT-4o annotations, (3) joint learning approach for multiple visual tasks and aspects.

Result: Experiments show that learning on specific aspects enables generalization to unseen aspects, and joint learning across diverse visual tasks and aspects leads to substantial mutual benefits.

Conclusion: UFEval successfully demonstrates that unified fine-grained evaluation is feasible and beneficial, with aspect and task generalization capabilities that address limitations of existing MLLM evaluators.

Abstract: Evaluating open-ended outputs of Multimodal Large Language Models has become a bottleneck as model capabilities, task diversity, and modality rapidly expand. Existing ``MLLM-as-a-Judge’’ evaluators, though promising, remain constrained to specific tasks and aspects. In this paper, we argue that, on one hand, based on the interconnected nature of aspects, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual aspects and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization for four evaluation tasks – Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. However, training such a unified evaluator is hindered by the lack of a large-scale, multi-modal, and aspect-level resource. To address this gap, we introduce FRABench, a comprehensive fine-grained evaluation dataset. Specifically, (1) We first construct a hierarchical aspect taxonomy encompassing 112 distinct aspects across the aforementioned four tasks. (2) Based on this taxonomy, we create FRABench, comprising 60.4k pairwise samples with 325k evaluation labels obtained from a combination of human and GPT-4o annotations. (3) Finally, leveraging FRABench, we develop UFEval, a unified fine-grained evaluator. Experiments show that learning on specific aspects enables UFEval to generalize to unseen aspects, and joint learning to assess diverse visual tasks and aspects can lead to substantial mutual benefits.

[1023] SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution

Philipp D. Siedler

Main category: cs.AI

TL;DR: A novel dataset for benchmarking LLMs’ physical and spatial reasoning using topology optimization tasks with 2D boundaries, forces, and supports.

Details

Motivation: To evaluate LLMs' capabilities in understanding physical and spatial reasoning through topology optimization problems, complementing traditional language benchmarks.

Method: Dataset includes tasks where LLMs must reason about optimal material distributions given 2D boundaries, applied forces, and supports, ranging from filling masked regions to predicting complete distributions.

Result: The dataset challenges models to understand force flow and material distribution without simulation tools, testing structural stability and spatial organization reasoning.

Conclusion: This dataset provides a new benchmark for assessing LLMs’ physical and spatial reasoning abilities in 2D settings.

Abstract: We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLM) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as 2D boundary, applied forces and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the evaluation of spatial and physical reasoning abilities in 2D settings, offering a complementary perspective to traditional language and logic benchmarks.

[1024] No Black Boxes: Interpretable and Interactable Predictive Healthcare with Knowledge-Enhanced Agentic Causal Discovery

Xiaoxue Han, Pengfei Hu, Jun-En Ding, Chang Lu, Feng Liu, Yue Ning

Main category: cs.AI

TL;DR: II-KEA is a knowledge-enhanced agent-driven causal discovery framework that addresses interpretability and interactivity limitations in deep learning models for EHR-based diagnosis prediction by integrating personalized knowledge databases and agentic LLMs.

Details

Motivation: Current deep learning models for EHR-based diagnosis prediction lack interpretability (black-box nature) and interactivity (no mechanism for clinicians to incorporate their knowledge), limiting their practical utility in clinical decision-making.

Method: Proposed II-KEA framework integrates personalized knowledge databases and agentic LLMs to enable explicit reasoning, causal analysis, and knowledge injection through customized knowledge bases and prompts.

Result: II-KEA demonstrates superior performance on MIMIC-III and MIMIC-IV datasets while providing enhanced interpretability and interactivity, as validated through extensive case studies.

Conclusion: The framework successfully addresses key limitations of current deep learning models by enabling interpretable reasoning and interactive knowledge integration, making it more suitable for clinical decision support.

Abstract: Deep learning models trained on extensive Electronic Health Records (EHR) data have achieved high accuracy in diagnosis prediction, offering the potential to assist clinicians in decision-making and treatment planning. However, these models lack two crucial features that clinicians highly value: interpretability and interactivity. The ``black-box’’ nature of these models makes it difficult for clinicians to understand the reasoning behind predictions, limiting their ability to make informed decisions. Additionally, the absence of interactive mechanisms prevents clinicians from incorporating their own knowledge and experience into the decision-making process. To address these limitations, we propose II-KEA, a knowledge-enhanced agent-driven causal discovery framework that integrates personalized knowledge databases and agentic LLMs. II-KEA enhances interpretability through explicit reasoning and causal analysis, while also improving interactivity by allowing clinicians to inject their knowledge and experience through customized knowledge bases and prompts. II-KEA is evaluated on both MIMIC-III and MIMIC-IV, demonstrating superior performance along with enhanced interpretability and interactivity, as evidenced by its strong results from extensive case studies.

Qianlei Jia, Xinliang Zhou, Ondrej Krejcar, Enrique Herrera-Viedma

Main category: cs.AI

TL;DR: A novel social network group decision-making framework that integrates three-way decision theory, dynamic network reconstruction, and linguistic opinion representation to address uncertainty and dynamic social structures in group decision-making.

Details

Motivation: To overcome challenges in traditional opinion dynamics models caused by uncertainty, dynamic social structures, and vague information in group decision-making scenarios.

Method: Integrates three-way decision theory to model hesitation and ambiguity, develops connection adjustment rules based on opinion similarity for dynamic network reconstruction, uses linguistic terms for opinion representation, and constructs an integrated multi-agent decision-making framework.

Result: Applied to multi-UAV cooperative decision-making scenario with simulation results and consensus analysis demonstrating effectiveness. Experimental comparisons verify advantages in enhancing system stability and representing realistic decision-making behaviors.

Conclusion: The proposed SNGDM framework effectively addresses uncertainty, dynamic social structures, and vague information in group decision-making, showing improved performance in system stability and realistic decision representation.

Abstract: In group decision-making (GDM) scenarios, uncertainty, dynamic social structures, and vague information present major challenges for traditional opinion dynamics models. To address these issues, this study proposes a novel social network group decision-making (SNGDM) framework that integrates three-way decision (3WD) theory, dynamic network reconstruction, and linguistic opinion representation. First, the 3WD mechanism is introduced to explicitly model hesitation and ambiguity in agent judgments, thereby preventing irrational decisions. Second, a connection adjustment rule based on opinion similarity is developed, enabling agents to adaptively update their communication links and better reflect the evolving nature of social relationships. Third, linguistic terms are used to describe agent opinions, allowing the model to handle subjective, vague, or incomplete information more effectively. Finally, an integrated multi-agent decision-making framework is constructed, which simultaneously considers individual uncertainty, opinion evolution, and network dynamics. The proposed model is applied to a multi-UAV cooperative decision-making scenario, where simulation results and consensus analysis demonstrate its effectiveness. Experimental comparisons further verify the advantages of the algorithm in enhancing system stability and representing realistic decision-making behaviors.

[1026] DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

Yuheng Wu, Jianwen Xie, Denghui Zhang, Zhaozhuo Xu

Main category: cs.AI

TL;DR: DEL-ToM improves Theory-of-Mind reasoning in LLMs through inference-time scaling using Dynamic Epistemic Logic for verifiable belief updates, without requiring model retraining.

Details

Motivation: LLMs struggle with Theory-of-Mind tasks due to lack of dynamic logical reasoning capabilities, requiring a method to enhance verifiable reasoning without architectural changes.

Method: Decomposes ToM tasks into belief updates using Dynamic Epistemic Logic, trains a Process Belief Model verifier with simulated data to score belief traces, and selects the highest-scoring trace during inference.

Result: Experiments show DEL-ToM consistently improves performance across model scales and benchmarks, demonstrating enhanced ToM capabilities through verifiable belief supervision.

Conclusion: DEL-ToM enables LLMs to achieve better Theory-of-Mind reasoning through inference-time scaling and verifiable belief supervision, providing transparent reasoning without retraining.

Abstract: Theory-of-Mind (ToM) tasks pose a unique challenge for large language models (LLMs), which often lack the capability for dynamic logical reasoning. In this work, we propose DEL-ToM, a framework that improves verifiable ToM reasoning through inference-time scaling rather than architectural changes. Our approach decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic (DEL), enabling structured and verifiable dynamic logical reasoning. We use data generated automatically via a DEL simulator to train a verifier, which we call the Process Belief Model (PBM), to score each belief update step. During inference, the PBM evaluates candidate belief traces from the LLM and selects the highest-scoring one. This allows LLMs to allocate extra inference-time compute to yield more transparent reasoning. Experiments across model scales and benchmarks show that DEL-ToM consistently improves performance, demonstrating that verifiable belief supervision significantly enhances LLMs’ ToM capabilities without retraining. Code is available at https://github.com/joel-wu/DEL-ToM.

[1027] TabularGSM: Understanding the Limitations of LLMs in Tabular Math Reasoning

Shi-Yu Tian, Zhi Zhou, Wei Dong, Kun-Yang Yu, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.AI

TL;DR: AutoT2T is a neuro-symbolic framework that transforms math word problems into scalable tabular reasoning tasks to evaluate LLMs’ accuracy and robustness on tabular data, revealing that tabular structure increases reasoning difficulty and robustness issues exist in current models.

Details

Motivation: Current LLM evaluations focus on math word problems but overlook real-world tabular reasoning needs in applications like business intelligence, which require multi-step numerical reasoning with tables and robustness to incomplete/inconsistent information.

Method: Proposed AutoT2T framework that controllably transforms math word problems into verified tabular reasoning tasks, and developed TabularGSM benchmark with three progressively complex subsets and a trap subset for comprehensive evaluation.

Result: Three key findings: (1) Tabular structure makes mathematical reasoning more challenging; (2) Difficulties stem from joint effects of tabular retrieval and reasoning; (3) Reasoning robustness is a significant issue in existing LLMs.

Conclusion: Tabular reasoning presents unique challenges beyond standard math problems, requiring future research to address both accuracy and robustness issues in LLMs for real-world applications.

Abstract: Mathematical reasoning has long been a key benchmark for evaluating large language models (LLMs). Although substantial progress has been made on math word problems, the need for reasoning over tabular data in real-world applications has been overlooked. For instance, applications such as business intelligence demand not only multi-step numerical reasoning with tables but also robustness to incomplete or inconsistent information. However, comprehensive evaluation in this area is severely limited, constrained by the reliance on manually collected tables that are difficult to scale and the lack of coverage for potential traps encountered in real-world scenarios. To address this problem, we propose AutoT2T, a neuro-symbolic framework that controllably transforms math word problems into scalable and verified tabular reasoning tasks, enabling the evaluation of both accuracy and robustness. Building on this pipeline, we develop TabularGSM, a benchmark comprising three progressively complex subsets and a trap subset, with two complementary evaluation settings. Our study reveals three key observations: (1) Tabular structure makes mathematical reasoning more challenging; (2) The difficulties stem from the joint effects of tabular retrieval and reasoning; (3) Reasoning robustness is another significant issue that needs to be addressed in existing LLMs. In-depth analyses are conducted for each observation to guide future research.

[1028] HS-STaR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation

Feng Xiong, Hongling Xu, Yifei Wang, Runxi Cheng, Yong Wang, Xiangxiang Chu

Main category: cs.AI

TL;DR: HS-STaR is a hierarchical sampling framework that improves self-taught reasoning by dynamically allocating sampling budget to boundary-level problems where LLMs have the highest learning utility, outperforming uniform sampling approaches.

Details

Motivation: Current self-taught reasoning methods use uniform sampling across all problems, ignoring that problems near the LLM's reasoning capability boundary provide significantly more learning utility than easy or overly difficult problems.

Method: HS-STaR uses a two-phase approach: lightweight pre-sampling with reward-guided difficulty estimation to identify boundary-level problems, followed by dynamic budget reallocation to focus sampling on these high-utility problems.

Result: Extensive experiments across multiple reasoning benchmarks and backbone LLMs show HS-STaR significantly outperforms other baselines without requiring additional sampling budget.

Conclusion: The proposed hierarchical sampling framework effectively identifies and exploits boundary-level problems to maximize training data quality, demonstrating superior performance in enhancing mathematical reasoning abilities of LLMs.

Abstract: Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. Recent studies have incorporated reward models to guide response selection or decoding, aiming to obtain higher-quality data. However, they typically allocate a uniform sampling budget across all problems, overlooking the varying utility of problems at different difficulty levels. In this work, we conduct an empirical study and find that problems near the boundary of the LLM’s reasoning capability offer significantly greater learning utility than both easy and overly difficult ones. To identify and exploit such problems, we propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners. Given a fixed sampling budget, HS-STaR first performs lightweight pre-sampling with a reward-guided difficulty estimation strategy to efficiently identify boundary-level problems. Subsequently, it dynamically reallocates the remaining budget toward these high-utility problems during a re-sampling phase, maximizing the generation of valuable training data. Extensive experiments across multiple reasoning benchmarks and backbone LLMs demonstrate that HS-STaR significantly outperforms other baselines without requiring additional sampling budget.

[1029] MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

Ziming Wei, Bingqian Lin, Zijian Jiao, Yunshuang Nie, Liang Ma, Yuecheng Liu, Yuzheng Zhuang, Xiaodan Liang

Main category: cs.AI

TL;DR: MineAnyBuild is a comprehensive benchmark for evaluating spatial planning ability of AI agents in Minecraft, featuring 4,000 tasks and four core dimensions: spatial understanding, reasoning, creativity, and commonsense.

Details

Motivation: Existing spatial intelligence benchmarks focus on abstract spatial reasoning through VQA forms, creating a gap between understanding and concrete task execution. There's a need for benchmarks that evaluate practical spatial planning in open-world environments.

Method: Built MineAnyBuild benchmark with 4,000 curated spatial planning tasks in Minecraft, requiring agents to generate executable architecture building plans from multi-modal instructions. Provides paradigm for infinitely expandable data collection using player-generated content.

Result: Comprehensive evaluation of existing MLLM-based agents revealed severe limitations but enormous potential in their spatial planning abilities.

Conclusion: MineAnyBuild opens new avenues for spatial intelligence evaluation and promotes development of open-world AI agents with spatial planning capabilities.

Abstract: Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, urban planning etc. Recent works have attempted to construct benchmarks for evaluating the spatial intelligence of Multimodal Large Language Models (MLLMs). Nevertheless, these benchmarks primarily focus on spatial reasoning based on typical Visual Question-Answering (VQA) forms, which suffers from the gap between abstract spatial understanding and concrete task execution. In this work, we take a step further to build a comprehensive benchmark called MineAnyBuild, aiming to evaluate the spatial planning ability of open-world AI agents in the Minecraft game. Specifically, MineAnyBuild requires an agent to generate executable architecture building plans based on the given multi-modal human instructions. It involves 4,000 curated spatial planning tasks and also provides a paradigm for infinitely expandable data collection by utilizing rich player-generated content. MineAnyBuild evaluates spatial planning through four core supporting dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Based on MineAnyBuild, we perform a comprehensive evaluation for existing MLLM-based agents, revealing the severe limitations but enormous potential in their spatial planning abilities. We believe our MineAnyBuild will open new avenues for the evaluation of spatial intelligence and help promote further development for open-world AI agents capable of spatial planning.

[1030] The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships?

Djallel Bouneffouf, Matthew Riemer, Kush Varshney

Main category: cs.AI

TL;DR: The Shepherd Test is a new framework for evaluating superintelligent AI’s moral and relational capabilities, focusing on manipulation, care, and survival behaviors in asymmetric power relationships.

Details

Motivation: Traditional AI evaluation paradigms fail to assess moral agency and hierarchical behavior in superintelligent agents, which becomes critical as AI systems integrate into multi-agent environments with existential stakes.

Method: The test draws inspiration from human-animal interactions to evaluate AI’s ability to manipulate, nurture, and instrumentally use less intelligent agents while managing self-preservation goals and moral trade-offs.

Result: The paper establishes a framework that identifies when AI crosses a dangerous intelligence threshold by exhibiting complex moral decision-making in hierarchical relationships.

Conclusion: The Shepherd Test highlights the need for new AI governance approaches, including developing simulation environments for testing moral behavior and formalizing ethical manipulation in multi-agent systems.

Abstract: This paper introduces the Shepherd Test, a new conceptual test for assessing the moral and relational dimensions of superintelligent artificial agents. The test is inspired by human interactions with animals, where ethical considerations about care, manipulation, and consumption arise in contexts of asymmetric power and self-preservation. We argue that AI crosses an important, and potentially dangerous, threshold of intelligence when it exhibits the ability to manipulate, nurture, and instrumentally use less intelligent agents, while also managing its own survival and expansion goals. This includes the ability to weigh moral trade-offs between self-interest and the well-being of subordinate agents. The Shepherd Test thus challenges traditional AI evaluation paradigms by emphasizing moral agency, hierarchical behavior, and complex decision-making under existential stakes. We argue that this shift is critical for advancing AI governance, particularly as AI systems become increasingly integrated into multi-agent environments. We conclude by identifying key research directions, including the development of simulation environments for testing moral behavior in AI, and the formalization of ethical manipulation within multi-agent systems.

[1031] Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

Yue Yang, MingKang Chen, Qihua Liu, Mengkang Hu, Qiguang Chen, Gengrui Zhang, Shuyue Hu, Guangtao Zhai, Yu Qiao, Yu Wang, Wenqi Shao, Ping Luo

Main category: cs.AI

TL;DR: DRE-Bench is a dynamic reasoning evaluation benchmark that assesses fluid intelligence in LLMs through 36 abstract reasoning tasks across four cognitive levels, revealing current LLMs struggle with high-level cognition and generalization.

Details

Motivation: To address limitations in existing reasoning benchmarks that either focus on domain-specific knowledge or lack interpretability, and to systematically evaluate whether LLMs possess genuine fluid intelligence (abstract reasoning and generalization abilities).

Method: Proposed DRE-Bench benchmark with 36 abstract reasoning tasks organized across four cognitive levels, each featuring multiple dynamic variants testing the same underlying latent rule for fine-grained, interpretable assessment.

Result: Most LLMs (GPT-4o, Claude 3.7, o1, DeepSeek-R1, QwQ, Skywork-OR1) achieve competent performance in low-level cognition but struggle with high-level cognition and exhibit limited generalization as task complexity increases.

Conclusion: There is a significant gap between current LLMs and true human-like fluid intelligence, and DRE-Bench provides a systematic framework for tracking reasoning progress in LLMs.

Abstract: Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including both general LLMs (GPT-4o, Claude 3.7) and reasoning LLMs (o1, DeepSeek-R1, QwQ, Skywork-OR1). Experimental results reveal that although most LLMs achieve competent and robust performance in low-level cognition, they struggle with high-level cognition and exhibit limited generalization as task complexity grows. Our findings highlight the gap between current LLMs and true human-like fluid intelligence and offer a new path for systematically tracking reasoning progress in LLMs.

[1032] Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho

Main category: cs.AI

TL;DR: Orak is a comprehensive benchmark for training and evaluating LLM agents across 12 diverse video games spanning all major genres, featuring a plug-and-play interface and fine-tuning dataset to build generic gaming agents.

Details

Motivation: Existing game benchmarks lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules for complex gameplay, and fine-tuning datasets for aligning pre-trained LLMs into gaming agents.

Method: Developed Orak benchmark with 12 popular video games across all genres, introduced plug-and-play interface using Model Context Protocol (MCP) for seamless game connection, and created fine-tuning dataset from LLM gameplay trajectories.

Result: Orak provides comprehensive evaluation framework including general game score leaderboards, LLM battle arenas, and analyses of visual input state, agentic strategies, and fine-tuning effects.

Conclusion: Orak establishes a foundation towards building generic gaming agents by addressing the limitations of existing benchmarks and providing tools for consistent LLM agent evaluation across diverse game scenarios.

Abstract: Large Language Model (LLM) agents are reshaping the game industry, particularly with more intelligent and human-preferable game characters. However, existing game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets for aligning pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a foundational benchmark designed to train and evaluate LLM agents across diverse real-world video games. Unlike existing benchmarks, Orak includes 12 popular video games spanning all major genres, enabling comprehensive studies of LLM capabilities and agentic modules essential for intricate game scenarios. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on Model Context Protocol (MCP) that enables LLMs to seamlessly connect with games and manipulate agentic modules. Additionally, we propose a fine-tuning dataset, consisting of LLM gameplay trajectories across diverse game genres. Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects, establishing a foundation towards building generic gaming agents. Code is available at https://github.com/krafton-ai/Orak.

[1033] VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

Can Li, Ying Liu, Ting Zhang, Mei Wang, Hua Huang

Main category: cs.AI

TL;DR: VisioMath is a benchmark of 1,800 K-12 math problems where all answers are visually similar diagrams, revealing that current LMMs struggle with fine-grained comparative reasoning as image similarity increases.

Details

Motivation: Current Large Multimodal Models lack sufficient exploration of their capacity to reason over multiple visually similar inputs, which is crucial for real-world tasks like mathematics education where learners must distinguish between nearly identical diagrams.

Method: Created VisioMath benchmark with 1,800 high-quality K-12 math problems featuring diagrams with subtle visual similarities, then evaluated state-of-the-art LMMs and explored three alignment-oriented strategies including training-free approaches and finetuning.

Result: Evaluation revealed consistent accuracy decline as inter-image similarity increases, with dominant failure mode being image-text misalignment where models use shallow positional heuristics instead of grounding reasoning in textual cues.

Conclusion: VisioMath serves as a rigorous benchmark to develop LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image-text integration, with alignment strategies showing substantial accuracy improvements.

Abstract: Large Multimodal Models have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify correct solutions. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K-12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases. Analysis indicates that the dominant failure mode stems from image-text misalignment: rather than grounding reasoning in textual cues, models often resort to shallow positional heuristics, resulting in systematic errors. We further explore three alignment-oriented strategies, spanning training-free approaches and finetuning, and achieve substantial accuracy gains. We hope that VisioMath will serve as a rigorous benchmark and catalyst for developing LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image-text integration.

[1034] One Patient, Many Contexts: Scaling Medical AI with Contextual Intelligence

Michelle M. Li, Ben Y. Reis, Adam Rodman, Tianxi Cai, Noa Dagan, Ran D. Balicer, Joseph Loscalzo, Isaac S. Kohane, Marinka Zitnik

Main category: cs.AI

TL;DR: Context switching enables medical AI models to adapt across specialties, populations, and care settings without retraining by adjusting model reasoning at inference time.

Details

Motivation: Current medical AI adaptation methods (fine-tuning, prompting, retrieval) scale poorly and risk contextual errors where outputs appear plausible but miss critical patient or situational information.

Method: Context switching adjusts model reasoning at inference without retraining, allowing generative models to tailor outputs to patient biology/care settings, multimodal models to reason on diverse data types even with missing data, and agent models to coordinate tools based on tasks.

Result: Context switching enables medical AI to adapt across specialties, populations, and geographies while maintaining reliability and suitability for real-world care.

Conclusion: Context switching establishes a foundation for medical AI that scales to infinitely many contexts, requiring advances in data design, model architectures, and evaluation frameworks.

Abstract: Medical AI, including clinical language models, vision-language models, and multimodal health record models, already summarizes notes, answers questions, and supports decisions. Their adaptation to new populations, specialties, or care settings often relies on fine-tuning, prompting, or retrieval from external knowledge bases. These strategies can scale poorly and risk contextual errors: outputs that appear plausible but miss critical patient or situational information. We envision context switching as a solution. Context switching adjusts model reasoning at inference without retraining. Generative models can tailor outputs to patient biology, care setting, or disease. Multimodal models can reason on notes, laboratory results, imaging, and genomics, even when some data are missing or delayed. Agent models can coordinate tools and roles based on tasks and users. In each case, context switching enables medical AI to adapt across specialties, populations, and geographies. It requires advances in data design, model architectures, and evaluation frameworks, and establishes a foundation for medical AI that scales to infinitely many contexts while remaining reliable and suited to real-world care.

[1035] Efficient LLM Collaboration via Planning

Byeongchan Lee, Jonghoon Lee, Dongyoung Kim, Jaehyung Kim, Kyungjoon Park, Dongjun Lee, Jinwoo Shin

Main category: cs.AI

TL;DR: COPE is a test-time collaboration framework where small and large language models alternate as planners and executors, using plans as lightweight intermediates to achieve performance comparable to large proprietary models while significantly reducing inference costs.

Details

Motivation: Large proprietary LLMs achieve strong performance but are costly to use via APIs, while small open-source models are free but limited on complex tasks. There's a need to combine their complementary strengths efficiently.

Method: A planner model generates high-level task plans that guide an executor model. Small and large models alternate roles as planner and executor in a multi-stage cascade, exchanging plans to collaboratively solve tasks.

Result: COPE achieves performance comparable to large proprietary models across mathematical reasoning, code generation, open-ended tasks, and agent tasks, while drastically reducing inference API costs.

Conclusion: Planning serves as an effective prior for cost-efficient inference, enabling efficient collaboration between small and large models through plan-based task decomposition.

Abstract: Recently, large language models (LLMs) have demonstrated strong performance, ranging from simple to complex tasks. However, while large proprietary models (e.g., models with over 100B parameters) achieve remarkable results across diverse tasks, they are often accessible through costly APIs, making frequent use too costly for many applications. In contrast, small open-source models (e.g., models with fewer than 3B parameters) are freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models efficiently collaborate to combine their complementary strengths? To bridge this trade-off, we propose COPE, a test-time collaboration framework. A planner model first generates a plan, a high-level abstraction of the task, and this plan serves as a lightweight intermediate that guides a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models, while drastically reducing the inference API cost. These results highlight planning as an effective prior for cost-efficient inference.

[1036] Tiered Agentic Oversight: A Hierarchical Multi-Agent System for Healthcare Safety

Yubin Kim, Hyewon Jeong, Chanwoo Park, Eugene Park, Haipeng Zhang, Xin Liu, Hyeonhoon Lee, Daniel McDuff, Marzyeh Ghassemi, Cynthia Breazeal, Samir Tulebaev, Hae Won Park

Main category: cs.AI

TL;DR: TAO is a hierarchical multi-agent system that improves AI safety in clinical settings through layered supervision, error correction, and automated task routing based on complexity.

Details

Motivation: Address safety risks of LLMs in clinical settings due to potential errors and single points of failure, inspired by clinical hierarchies in hospitals.

Method: Hierarchical multi-agent system with tiered oversight, automated inter- and intra-tier communication, role-playing, and task routing based on complexity.

Result: Outperforms single-agent and other multi-agent systems on 4/5 healthcare safety benchmarks (up to 8.2% improvement), absorbs 24% of individual agent errors, and improves medical triage accuracy from 40% to 60% with human doctor feedback.

Conclusion: TAO provides an effective hierarchical safety framework for clinical AI systems through layered oversight and error correction mechanisms.

Abstract: Large language models (LLMs) deployed as agents introduce significant safety risks in clinical settings due to their potential for error and single points of failure. We introduce Tiered Agentic Oversight (TAO), a hierarchical multi-agent system that enhances AI safety through layered, automated supervision. Inspired by clinical hierarchies (e.g., nurse-physician-specialist) in hospital, TAO routes tasks to specialized agents based on complexity, creating a robust safety framework through automated inter- and intra-tier communication and role-playing. Crucially, this hierarchical structure functions as an effective error-correction mechanism, absorbing up to 24% of individual agent errors before they can compound. Our experiments reveal TAO outperforms single-agent and other multi-agent systems on 4 out of 5 healthcare safety benchmarks, with up to an 8.2% improvement. Ablation studies confirm key design principles of the system: (i) its adaptive architecture is over 3% safer than static, single-tier configurations, and (ii) its lower tiers are indispensable, as their removal causes the most significant degradation in overall safety. Finally, we validated the system’s synergy with human doctors in a user study where a physician, acting as the highest tier agent, provided corrective feedback that improved medical triage accuracy from 40% to 60%. Project Page: https://tiered-agentic-oversight.github.io/

[1037] The 4th Dimension for Scaling Model Size

Ruike Zhu, Hanwen Zhang, Kevin Li, Tianyu Shi, Yiqun Duan, Chi Wang, Tianyi Zhou, Arindam Banerjee, Zengyi Qin

Main category: cs.AI

TL;DR: Introduces virtual logical depth (VLD) as a fourth scaling dimension that reuses weights to increase effective algorithmic depth without adding parameters, showing it improves reasoning ability independently of model size.

Details

Motivation: To explore parameter reuse as an underexplored scaling dimension that could decouple reasoning improvements from parameter count increases, potentially offering an alternative path to superintelligence.

Method: Virtual logical depth (VLD) reuses weights to increase effective algorithmic depth during training and inference, altering the internal computation graph while keeping parameter count constant.

Result: VLD substantially improves reasoning ability without adding parameters, shows persistent gains across architectures, and demonstrates that knowledge capacity scales with parameters while reasoning can be decoupled from size.

Conclusion: VLD represents a general scaling behavior that provides insights for future scaling strategies and raises questions about whether superintelligence requires ever-larger models or can be achieved through parameter reuse and increased logical depth.

Abstract: Scaling large language models typically involves three dimensions: depth, width, and parameter count. In this work, we explore a fourth dimension, \textbf{virtual logical depth} (VLD), which increases effective algorithmic depth without changing parameter count by reusing weights. While parameter reuse is not new, its role in scaling has been underexplored. Unlike recent test-time methods that scale token-wise, VLD alters the internal computation graph during training and inference. Through controlled experiments, we obtain three key insights. (1) \textit{Knowledge capacity vs. parameters}: at fixed parameter count, VLD leaves knowledge capacity nearly unchanged, while across models capacity still scales with parameters. (2) \textit{Reasoning vs. reuse}: properly implemented VLD substantially improves reasoning ability \emph{without} more parameters, decoupling reasoning from size. This suggests a new scaling path beyond token-wise test-time methods. (3) \textit{Robustness and generality}: reasoning gains persist across architectures and reuse schedules, showing VLD captures a general scaling behavior. These results provide insight into future scaling strategies and raise a deeper question: does superintelligence require ever-larger models, or can it be achieved by reusing parameters and increasing logical depth? We argue many unknown dynamics in scaling remain to be explored. Code is available at https://anonymous.4open.science/r/virtual_logical_depth-8024/.

[1038] Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs

Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai

Main category: cs.AI

TL;DR: Prover Agent is an AI system that combines LLMs with Lean proof assistant to automate theorem proving, achieving 88.1% success on MiniF2F benchmark with minimal samples.

Details

Motivation: To create an automated theorem prover that effectively integrates informal reasoning from LLMs with formal verification from proof assistants like Lean, while generating helpful auxiliary lemmas to discover proof strategies.

Method: Integrates an informal reasoning LLM, a formal prover model, and Lean proof assistant feedback. Generates auxiliary lemmas including subgoals, special cases, and useful facts from assumptions to aid proof discovery.

Result: Achieves 88.1% success rate on MiniF2F benchmark, establishing new state-of-the-art among small language model methods with significantly lower sample budget than previous approaches.

Conclusion: Prover Agent demonstrates effective integration of LLMs with formal proof assistants, showing that generated auxiliary lemmas play crucial role in solving challenging theorem proving problems efficiently.

Abstract: We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on the MiniF2F benchmark, establishing a new state-of-the-art among methods using small language models (SLMs) with a much lower sample budget than previous approaches. We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems.

[1039] Breaking Rank Bottlenecks in Knowledge Graph Embeddings

Samy Badreddine, Emile van Krieken, Luciano Serafini

Main category: cs.AI

TL;DR: The paper proposes KGE-MoS, a mixture-based output layer to overcome rank bottlenecks in knowledge graph embedding models, improving ranking performance on large-scale datasets with minimal parameter cost.

Details

Motivation: Many KGE models suffer from rank bottlenecks when the number of entities exceeds the embedding dimension, limiting model expressivity and hurting ranking accuracy and distribution fidelity.

Method: The authors investigate rank bottlenecks theoretically and empirically, then propose KGE-MoS - a mixture-based output layer inspired by language modeling literature to break these bottlenecks.

Result: Experiments show that KGE-MoS improves ranking performance of KGE models on large-scale datasets while maintaining low parameter cost.

Conclusion: Rank bottlenecks significantly limit KGE performance, and the proposed KGE-MoS method effectively addresses this issue by using mixture-based output layers to enhance model expressivity and ranking accuracy.

Abstract: Many knowledge graph embedding (KGE) models for link prediction use powerful encoders. However, they often rely on a simple hidden vector-matrix multiplication to score subject-relation queries against candidate object entities. When the number of entities is larger than the model’s embedding dimension, which is often the case in practice by several orders of magnitude, we have a linear output layer with a rank bottleneck. Such bottlenecked layers limit model expressivity. We investigate both theoretically and empirically how rank bottlenecks affect KGEs. We find that, by limiting the set of feasible predictions, rank bottlenecks hurt the ranking accuracy and distribution fidelity of scores. Inspired by the language modelling literature, we propose KGE-MoS, a mixture-based output layer to break rank bottlenecks in many KGEs. Our experiments show that KGE-MoS improves ranking performance of KGE models on large-scale datasets at a low parameter cost.

[1040] Bridging Ethical Principles and Algorithmic Methods: An Alternative Approach for Assessing Trustworthiness in AI Systems

Michael Papademas, Xenia Ziouvelou, Antonis Troumpoukis, Vangelis Karkaletsis

Main category: cs.AI

TL;DR: This paper introduces an assessment method combining ethical components of Trustworthy AI with PageRank and TrustRank algorithms to create a quantitative framework for evaluating AI trustworthiness.

Details

Motivation: AI systems pose significant societal risks due to their complexity and pervasive reach, operating beyond human oversight. Current guidelines lack quantification methods, while technological tools lack holistic perspectives.

Method: Combines ethical components of Trustworthy AI with algorithmic processes of PageRank and TrustRank to establish an assessment framework that minimizes subjectivity.

Result: The approach provides quantitative insights for holistic assessment of AI system trustworthiness while considering theoretical content of relevant guidelines.

Conclusion: A holistic assessment of AI trustworthiness can be achieved by integrating ethical components with algorithmic criteria, reducing subjectivity in self-assessment techniques.

Abstract: Artificial Intelligence (AI) technology epitomizes the complex challenges posed by human-made artifacts, particularly those widely integrated into society and exerting significant influence, highlighting potential benefits and their negative consequences. While other technologies may also pose substantial risks, AI’s pervasive reach makes its societal effects especially profound. The complexity of AI systems, coupled with their remarkable capabilities, can lead to a reliance on technologies that operate beyond direct human oversight or understanding. To mitigate the risks that arise, several theoretical tools and guidelines have been developed, alongside efforts to create technological tools aimed at safeguarding Trustworthy AI. The guidelines take a more holistic view of the issue but fail to provide techniques for quantifying trustworthiness. Conversely, while technological tools are better at achieving such quantification, they lack a holistic perspective, focusing instead on specific aspects of Trustworthy AI. This paper aims to introduce an assessment method that combines the ethical components of Trustworthy AI with the algorithmic processes of PageRank and TrustRank. The goal is to establish an assessment framework that minimizes the subjectivity inherent in the self-assessment techniques prevalent in the field by introducing algorithmic criteria. The application of our approach indicates that a holistic assessment of an AI system’s trustworthiness can be achieved by providing quantitative insights while considering the theoretical content of relevant guidelines.

[1041] GTA1: GUI Test-time Scaling Agent

Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li

Main category: cs.AI

TL;DR: GTA1 is a GUI agent that uses test-time scaling with multiple candidate action proposals and reinforcement learning for better visual element grounding, achieving state-of-the-art performance on GUI task execution.

Details

Motivation: GUI agents face challenges in planning under expansive action spaces and accurately grounding actions in complex, high-resolution interfaces where many valid action sequences may exist.

Method: Uses test-time scaling to sample multiple candidate action proposals and select the best one via a judge model, plus reinforcement learning to improve visual element grounding by rewarding successful interface interactions.

Result: GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks.

Conclusion: The proposed test-time scaling and RL-based grounding approach effectively addresses GUI agent challenges in planning and visual element interaction.

Abstract: Graphical user interface (GUI) agents autonomously complete tasks across platforms (\eg, Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (\ie, the action proposal sequence) under expansive action space, where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, \ie, precisely interacting with visual targets. This paper investigates the aforementioned challenges with our \textbf{G}UI \textbf{T}est-time Scaling \textbf{A}gent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled and evaluated and selected by a judge model. It trades off computation for better decision quality by concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks. The code and models are released here.

[1042] Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning

Zheng Zhang

Main category: cs.AI

TL;DR: LLMs exhibit surface fluency but fail at symbolic reasoning, arithmetic, and logical tasks due to a computational “split-brain syndrome” where comprehension and competence are dissociated.

Details

Motivation: To diagnose why LLMs systematically fail at tasks requiring symbolic reasoning despite their surface fluency, revealing the gap between understanding principles and reliably applying them.

Method: Controlled experiments and architectural analysis to demonstrate the geometric and functional dissociation between instruction and action pathways in LLMs.

Result: LLMs articulate correct principles but fail to apply them consistently, showing this failure is rooted in computational execution rather than knowledge access, recurring across mathematical operations and relational inferences.

Conclusion: LLMs function as pattern completion engines but lack architectural scaffolding for principled reasoning, motivating future models with metacognitive control and structurally grounded execution.

Abstract: Large Language Models (LLMs) display striking surface fluency yet systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. This paper offers a structural diagnosis of such failures, revealing a persistent gap between \textit{comprehension} and \textit{competence}. Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them–a failure rooted not in knowledge access, but in computational execution. We term this phenomenon the computational \textit{split-brain syndrome}, where instruction and action pathways are geometrically and functionally dissociated. This core limitation recurs across domains, from mathematical operations to relational inferences, and explains why model behavior remains brittle even under idealized prompting. We argue that LLMs function as powerful pattern completion engines, but lack the architectural scaffolding for principled, compositional reasoning. Our findings delineate the boundary of current LLM capabilities and motivate future models with metacognitive control, principle lifting, and structurally grounded execution. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles, and why the geometric separation between instruction and execution pathways suggests limitations in neural introspection and mechanistic analysis.

[1043] DuetGraph: Coarse-to-Fine Knowledge Graph Reasoning with Dual-Pathway Global-Local Fusion

Jin Li, Zezhong Ding, Xike Xie

Main category: cs.AI

TL;DR: DuetGraph is a coarse-to-fine KG reasoning method that addresses score over-smoothing through dual-pathway global-local fusion and entity partitioning, achieving SOTA performance with improved reasoning quality and training efficiency.

Details

Motivation: Existing KG reasoning methods suffer from score over-smoothing, which blurs the distinction between correct and incorrect answers and hinders reasoning effectiveness.

Method: Uses dual-pathway global-local fusion (separating local message passing and global attention processing) and coarse-to-fine optimization that partitions entities into high- and low-score subsets to narrow candidate space.

Result: Achieves state-of-the-art performance with up to 8.7% improvement in reasoning quality and 1.8× acceleration in training efficiency across various datasets.

Conclusion: DuetGraph effectively addresses over-smoothing in KG reasoning through segregated processing pathways and coarse-to-fine optimization, demonstrating superior performance and efficiency.

Abstract: Knowledge graphs (KGs) are vital for enabling knowledge reasoning across various domains. Recent KG reasoning methods that integrate both global and local information have achieved promising results. However, existing methods often suffer from score over-smoothing, which blurs the distinction between correct and incorrect answers and hinders reasoning effectiveness. To address this, we propose DuetGraph, a coarse-to-fine KG reasoning mechanism with dual-pathway global-local fusion. DuetGraph tackles over-smoothing by segregating – rather than stacking – the processing of local (via message passing) and global (via attention) information into two distinct pathways, preventing mutual interference and preserving representational discrimination. In addition, DuetGraph introduces a coarse-to-fine optimization, which partitions entities into high- and low-score subsets. This strategy narrows the candidate space and sharpens the score gap between the two subsets, which alleviates over-smoothing and enhances inference quality. Extensive experiments on various datasets demonstrate that DuetGraph achieves state-of-the-art (SOTA) performance, with up to an 8.7% improvement in reasoning quality and a 1.8$\times$ acceleration in training efficiency. Our code is available at https://github.com/USTC-DataDarknessLab/DuetGraph.git.

[1044] CADDesigner: Conceptual Design of CAD Models Based on General-Purpose Agent

Jingzhe Ni, Xiaolong Yin, Xingyu Lu, Xintong Li, Ji Wei, Ruofeng Tong, Min Tang, Peng Du

Main category: cs.AI

TL;DR: An LLM-powered agent for CAD conceptual design that accepts text and sketches, uses interactive dialogue to refine requirements, generates CAD code through a Context-Independent Imperative Paradigm, and improves quality through visual feedback.

Details

Motivation: To lower the entry barrier and improve design efficiency in CAD by reducing the required expertise level for designers.

Method: Uses LLM-powered agent with Context-Independent Imperative Paradigm (CIP) that accepts textual descriptions and freehand sketches, engages in interactive dialogue for requirement analysis, generates CAD code, and incorporates iterative visual feedback.

Result: Achieves state-of-the-art performance in CAD code generation.

Conclusion: The proposed method effectively enables CAD conceptual design through natural language and sketch inputs, with continuous improvement through structured knowledge base storage.

Abstract: Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing but typically requires a high level of expertise from designers. To lower the entry barrier and improve design efficiency, we present an agent for CAD conceptual design powered by large language models (LLMs). The agent accepts both abstract textual descriptions and freehand sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Context-Independent Imperative Paradigm (CIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases are stored in a structured knowledge base, enabling continuous improvement of the agent’s code generation capabilities. Experimental results demonstrate that our method achieves state-of-the-art performance in CAD code generation.

[1045] Trainable Dynamic Mask Sparse Attention

Jingze Shi, Yifan Wu, Yiran Peng, Bingheng Wu, Liangdong Wang, Guang Liu, Yuyu Luo

Main category: cs.AI

TL;DR: Dynamic Mask Attention: A trainable sparse attention mechanism that uses value vectors to generate content-aware masks, enabling efficient long-context modeling with 10x acceleration while maintaining performance.

Details

Motivation: Address the quadratic complexity bottleneck of standard self-attention in large language models for long contexts, overcoming limitations of existing sparse attention methods like static patterns and information loss.

Method: Three key innovations: 1) Dynamic content-aware sparse masks generated from value vectors, 2) Position-aware sparse attention computation that skips unnecessary regions, 3) Gradient-friendly design supporting end-to-end training without blocking gradients.

Result: Achieves Pareto dominance across various tasks including scaling laws, multi-query associative recall, general benchmarks, and needle-in-a-haystack tests, with up to 10x acceleration while maintaining performance.

Conclusion: The proposed dual-sparsity design effectively balances model efficiency with long-context modeling capabilities, offering a practical solution for efficient large language model training and inference.

Abstract: In large language models, the demand for modeling long contexts is ever-increasing, yet the quadratic complexity of standard self-attention presents a significant bottleneck. While existing sparse attention mechanisms enhance efficiency, they often suffer from limitations such as static patterns and information loss. This paper introduces a Trainable Dynamic Mask Sparse Attention mechanism that addresses these challenges through three key innovations. First, it leverages value vectors to dynamically generate content-aware sparse masks, enabling the model to adaptively identify and focus on crucial information. Second, it implements a position-aware sparse attention computation that effectively skips unnecessary computational regions. Finally, we ensure that the introduced dynamic masks and sparse weights do not obstruct gradients, thereby supporting end-to-end training. This dual-sparsity design allows the model to retain complete information while significantly reducing computational complexity, achieving an excellent balance between efficiency and performance. We validate the performance of Dynamic Mask Attention through comprehensive experiments. Comparative studies demonstrate that our method consistently achieves Pareto dominance across various tasks, including scaling laws, multi-query associative recall, general benchmarks, and needle-in-a-haystack tests, delivering up to 10 times acceleration. These results highlight its capability to effectively balance model efficiency with long-context modeling. Our computational kernel is open-sourced at https://github.com/SmallDoges/flash-dmattn to facilitate further research and application within the community.

Fuqing Bie, Shiyu Huang, Xijia Tao, Zhiqin Fang, Leyi Pan, Junzhe Chen, Min Ren, Liuyu Xiang, Zhaofeng He

Main category: cs.AI

TL;DR: OmniPlay is a diagnostic benchmark that evaluates agentic models’ cross-modal reasoning in dynamic, interactive environments, revealing that current omni-modal models excel at memory tasks but fail at reasoning and planning due to brittle fusion mechanisms.

Details

Motivation: Existing evaluations fail to test AI intelligence in dynamic, interactive worlds, with static benchmarks lacking agency and interactive benchmarks ignoring auditory and temporal cues, creating an evaluation gap.

Method: Built on modality interdependence philosophy, OmniPlay comprises five game environments that systematically create scenarios of synergy and conflict to force genuine cross-modal reasoning, evaluating six leading omni-modal models.

Result: Models show superhuman performance on memory tasks but systemic failures in reasoning and planning challenges, with performance degradation under modality conflict and a “less is more” paradox where removing sensory information improves performance.

Conclusion: The path toward robust AGI requires research focus beyond scaling to explicitly address synergistic fusion, as current fusion mechanisms are brittle and lead to catastrophic failures.

Abstract: While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict and uncover a counter-intuitive “less is more” paradox, where removing sensory information can paradoxically improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at https://github.com/fuqingbie/omni-game-benchmark.

[1047] Transduction is All You Need for Structured Data Workflows

Alfio Gliozzo, Naweed Khan, Christodoulos Constantinides, Nandana Mihindukulasooriya, Nahuel Defosse, Gaetano Rossiello, Junkyu Lee

Main category: cs.AI

TL;DR: Agentics is a functional agentic AI framework for building LLM-based structured data workflow pipelines that embeds agents within data types to enable logical transduction between structured states.

Details

Motivation: To create a data-centric paradigm that shifts focus toward principled data modeling, providing a declarative language where data types are directly exposed to LLMs and composed through transductions triggered by type connections.

Method: A functional agentic AI framework that embeds agents within data types, enabling logical transduction between structured states through a declarative language where data types are directly exposed to LLMs.

Result: Demonstrated effectiveness across various structured data workflow tasks including data wrangling, text-to-SQL semantic parsing, and domain-specific multiple-choice question answering.

Conclusion: Agentics provides an effective framework for building LLM-based structured data workflow pipelines with a data-centric approach, available as open source.

Abstract: This paper introduces Agentics, a functional agentic AI framework for building LLM-based structured data workflow pipelines. Designed for both research and practical applications, Agentics offers a new data-centric paradigm in which agents are embedded within data types, enabling logical transduction between structured states. This design shifts the focus toward principled data modeling, providing a declarative language where data types are directly exposed to large language models and composed through transductions triggered by type connections. We present a range of structured data workflow tasks and empirical evidence demonstrating the effectiveness of this approach, including data wrangling, text-to-SQL semantic parsing, and domain-specific multiple-choice question answering. The open source Agentics is available at https://github.com/IBM/Agentics.

[1048] Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen

Main category: cs.AI

TL;DR: RL improves LLM reasoning through a two-phase learning hierarchy: first procedural skill development, then strategic planning mastery. Current RL methods inefficiently optimize all tokens, so HICRA focuses optimization on high-impact planning tokens for better performance.

Details

Motivation: To understand the underlying mechanisms behind RL's success in enhancing LLM reasoning abilities and address inefficiencies in current RL algorithms that apply optimization pressure indiscriminately across all tokens.

Method: Proposed Hierarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts specifically on high-impact planning tokens rather than applying uniform optimization pressure across all tokens.

Result: HICRA significantly outperforms strong baselines like GRPO, demonstrating more efficient learning and better reasoning capabilities through strategic exploration of high-level planning.

Conclusion: Reasoning advances through a hierarchical process where strategic planning emerges as the key bottleneck after procedural skills are developed, and targeted optimization on planning tokens enables more effective learning.

Abstract: Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like aha moments", length-scaling’’ and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose Hierarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. Our extensive experiments validate that HICRA significantly outperforms strong baselines, and offer deep insights into how reasoning advances through the lens of strategic exploration.

[1049] HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring

Xin Wang, Ting Dang, Xinyu Zhang, Vassilis Kostakos, Michael J. Witbrock, Hong Jia

Main category: cs.AI

TL;DR: SLMs can achieve LLM-level performance in healthcare prediction while being more efficient and privacy-preserving, though challenges remain with class imbalance and few-shot learning.

Details

Motivation: To address privacy concerns and efficiency issues of cloud-based LLMs in healthcare monitoring by exploring lightweight SLMs that can run locally on mobile/wearable devices.

Method: Systematic evaluation of SLMs using zero-shot, few-shot, and instruction fine-tuning approaches, with deployment on mobile devices to assess real-world efficiency and performance.

Result: SLMs achieved performance comparable to LLMs while offering substantial gains in efficiency and privacy, though struggled with class imbalance and few-shot scenarios.

Conclusion: SLMs are a promising solution for next-generation privacy-preserving healthcare monitoring, despite current limitations in handling certain learning scenarios.

Abstract: Mobile and wearable healthcare monitoring play a vital role in facilitating timely interventions, managing chronic health conditions, and ultimately improving individuals’ quality of life. Previous studies on large language models (LLMs) have highlighted their impressive generalization abilities and effectiveness in healthcare prediction tasks. However, most LLM-based healthcare solutions are cloud-based, which raises significant privacy concerns and results in increased memory usage and latency. To address these challenges, there is growing interest in compact models, Small Language Models (SLMs), which are lightweight and designed to run locally and efficiently on mobile and wearable devices. Nevertheless, how well these models perform in healthcare prediction remains largely unexplored. We systematically evaluated SLMs on health prediction tasks using zero-shot, few-shot, and instruction fine-tuning approaches, and deployed the best performing fine-tuned SLMs on mobile devices to evaluate their real-world efficiency and predictive performance in practical healthcare scenarios. Our results show that SLMs can achieve performance comparable to LLMs while offering substantial gains in efficiency and privacy. However, challenges remain, particularly in handling class imbalance and few-shot scenarios. These findings highlight SLMs, though imperfect in their current form, as a promising solution for next-generation, privacy-preserving healthcare monitoring.

[1050] The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping

Main category: cs.AI

TL;DR: Scaling LLMs shows exponential benefits for long-horizon tasks despite diminishing returns on short benchmarks. Model execution errors compound over steps due to self-conditioning on prior mistakes, but thinking mitigates this and enables longer single-turn execution.

Details

Motivation: To reconcile why LLMs excel at complex reasoning yet fail at simple longer tasks, and to demonstrate that scaling yields massive benefits for long-horizon execution despite apparent diminishing returns on short benchmarks.

Method: Isolate execution capability by providing explicit knowledge and plans for long tasks. Analyze per-step accuracy degradation, identify self-conditioning effect (models repeating prior errors), and test thinking as mitigation.

Result: Larger models execute significantly more turns even with similar single-turn accuracy. Per-step accuracy degrades with step count due to self-conditioning. Thinking reduces self-conditioning and enables much longer single-turn execution.

Conclusion: Scaling model size and sequential test-time compute provides massive benefits for long-horizon tasks. Thinking mitigates self-conditioning effects and enables execution of longer tasks in single turns.

Abstract: Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illusion of slowing progress, as even marginal gains in single-step accuracy can compound into exponential improvements in the length of tasks a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. So, we propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. First, we find that larger models can correctly execute significantly more turns even when small models have near-perfect single-turn accuracy. We then observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations – curiously, we observe a self-conditioning effect – models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. But, we find that thinking mitigates self-conditioning, and also enables execution of much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of tasks they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.

[1051] Neuromorphic Intelligence

Marcel van Gerven

Main category: cs.AI

TL;DR: Neuromorphic computing aims to replicate brain efficiency using dynamical systems theory as a unifying framework, enabling energy-efficient AI through physical substrate dynamics.

Details

Motivation: To overcome limitations of conventional digital computing (Von Neumann bottleneck, high energy consumption) and create sustainable, efficient intelligent systems inspired by the brain.

Method: Proposes dynamical systems theory as a unifying framework, using differential calculus for modeling inference, learning, and control. Utilizes noise as a learning resource and differential genetic programming for discovering adaptive behaviors.

Result: Provides a principled approach for bridging diverse disciplines (AI, physics, biology, neuroscience) to create neuromorphic systems with orders of magnitude greater energy efficiency.

Conclusion: Dynamical systems theory enables emergent neuromorphic intelligence where intelligent behavior arises from physical substrate dynamics, advancing both AI science and sustainability.

Abstract: Neuromorphic computing seeks to replicate the remarkable efficiency, flexibility, and adaptability of the human brain in artificial systems. Unlike conventional digital approaches, which suffer from the Von Neumann bottleneck and depend on massive computational and energy resources, neuromorphic systems exploit brain-inspired principles of computation to achieve orders of magnitude greater energy efficiency. By drawing on insights from a wide range of disciplines, including artificial intelligence, physics, chemistry, biology, neuroscience, cognitive science and materials science, neuromorphic computing promises to deliver intelligent systems that are sustainable, transparent, and widely accessible. A central challenge, however, is to identify a unifying theoretical framework capable of bridging these diverse disciplines. We argue that dynamical systems theory provides such a foundation. Rooted in differential calculus, it offers a principled language for modeling inference, learning, and control in both natural and artificial substrates. Within this framework, noise can be harnessed as a resource for learning, while differential genetic programming enables the discovery of dynamical systems that implement adaptive behaviors. Embracing this perspective paves the way toward emergent neuromorphic intelligence, where intelligent behavior arises from the dynamics of physical substrates, advancing both the science and sustainability of AI.

[1052] Imagined Autocurricula

Ahmet H. Güzel, Matthew Thomas Jackson, Jarek Luca Liesen, Tim Rocktäschel, Jakob Nicolaus Foerster, Ilija Bogunovic, Jack Parker-Holder

Main category: cs.AI

TL;DR: IMAC uses world models and automatic curricula to train robust agents that generalize to novel tasks using only narrow offline data.

Details

Motivation: Training agents in embodied environments typically requires vast data or accurate simulation, which are unavailable for many real-world cases. World models offer an alternative using offline data.

Method: Propose IMAC (Imagined Autocurricula) that leverages Unsupervised Environment Design (UED) to create automatic curricula over generated worlds from world models.

Result: Achieved strong transfer performance on held-out environments in challenging procedurally generated environments, training only inside a world model learned from narrower dataset.

Conclusion: Opens path to utilizing larger-scale foundation world models for generally capable agents.

Abstract: Training agents to act in embodied environments typically requires vast training data or access to accurate simulation, neither of which exists for many cases in the real world. Instead, world models are emerging as an alternative leveraging offline, passively collected data, they make it possible to generate diverse worlds for training agents in simulation. In this work, we harness world models to generate imagined environments to train robust agents capable of generalizing to novel task variations. One of the challenges in doing this is ensuring the agent trains on useful generated data. We thus propose a novel approach, IMAC (Imagined Autocurricula), leveraging Unsupervised Environment Design (UED), which induces an automatic curriculum over generated worlds. In a series of challenging, procedurally generated environments, we show it is possible to achieve strong transfer performance on held-out environments, having trained only inside a world model learned from a narrower dataset. We believe this opens the path to utilizing larger-scale, foundation world models for generally capable agents.

[1053] Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media

Zihan Ding, Xinyi Wang, Junlong Chen, Per Ola Kristensson, Junxiao Shen

Main category: cs.AI

TL;DR: A prompt-driven video editing system that helps creators restructure long-form narrative videos through semantic indexing and free-form prompts rather than traditional timeline editing.

Details

Motivation: Creators struggle with the cognitive demands of editing long-form narrative videos, as existing transcript- or embedding-based methods fail to track characters, infer motivations, and connect dispersed events in creative workflows.

Method: A modular editing system with semantic indexing pipeline that builds global narrative through temporal segmentation, guided memory compression, and cross-granularity fusion, producing interpretable traces of plot, dialogue, emotion, and context.

Result: Evaluated on 400+ videos with expert ratings, QA, and preference studies, the system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.

Conclusion: The system successfully addresses the limitations of existing methods by enabling prompt-driven editing that maintains narrative integrity while providing creator control through transparent intermediate outputs.

Abstract: Creators struggle to edit long-form, narrative-rich videos not because of UI complexity, but due to the cognitive demands of searching, storyboarding, and sequencing hours of footage. Existing transcript- or embedding-based methods fall short for creative workflows, as models struggle to track characters, infer motivations, and connect dispersed events. We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion, producing interpretable traces of plot, dialogue, emotion, and context. Users receive cinematic edits while optionally refining transparent intermediate outputs. Evaluated on 400+ videos with expert ratings, QA, and preference studies, our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.

[1054] Multi-Scenario Highway Lane-Change Intention Prediction: A Physics-Informed AI Framework for Three-Class Classification

Jiazhao Shi, Yichen Lin, Yiheng Hua, Ziyu Wang, Zijian Zhang, Wenjia Zheng, Yun Song, Kuan Lu, Shoufeng Lu

Main category: cs.AI

TL;DR: A physics-informed AI framework for lane-change intention prediction that integrates vehicle kinematics and traffic-safety metrics, achieving state-of-the-art accuracy in both highway and complex ramp scenarios.

Details

Motivation: Lane-change maneuvers are a leading cause of highway accidents, and existing machine learning approaches are limited by binary classification, lack of scenario diversity, and degraded performance under longer prediction horizons.

Method: Proposed a physics-informed AI framework that explicitly integrates vehicle kinematics, interaction feasibility, and traffic-safety metrics (distance headway, time headway, time-to-collision, closing gap time). Formulated lane-change prediction as a three-class problem (left change, right change, no change) and evaluated on both straight highway (highD) and complex ramp (exiD) scenarios using LightGBM models.

Result: Achieved up to 99.8% accuracy and 93.6% macro F1 on highD, and 96.1% accuracy and 88.7% macro F1 on exiD at 1-second horizon, outperforming a two-layer stacked LSTM baseline. Demonstrated strong generalization across different scenarios.

Conclusion: The physics-informed and feature-rich machine learning framework provides practical advantages for real-time lane-change intention prediction in autonomous driving systems, showing superior performance and generalization compared to traditional deep learning approaches.

Abstract: Lane-change maneuvers are a leading cause of highway accidents, underscoring the need for accurate intention prediction to improve the safety and decision-making of autonomous driving systems. While prior studies using machine learning and deep learning methods (e.g., SVM, CNN, LSTM, Transformers) have shown promise, most approaches remain limited by binary classification, lack of scenario diversity, and degraded performance under longer prediction horizons. In this study, we propose a physics-informed AI framework that explicitly integrates vehicle kinematics, interaction feasibility, and traffic-safety metrics (e.g., distance headway, time headway, time-to-collision, closing gap time) into the learning process. lane-change prediction is formulated as a three-class problem that distinguishes left change, right change, and no change, and is evaluated across both straight highway segments (highD) and complex ramp scenarios (exiD). By integrating vehicle kinematics with interaction features, our machine learning models, particularly LightGBM, achieve state-of-the-art accuracy and strong generalization. Results show up to 99.8% accuracy and 93.6% macro F1 on highD, and 96.1% accuracy and 88.7% macro F1 on exiD at a 1-second horizon, outperforming a two-layer stacked LSTM baseline. These findings demonstrate the practical advantages of a physics-informed and feature-rich machine learning framework for real-time lane-change intention prediction in autonomous driving systems.

[1055] Memory-QA: Answering Recall Questions Based on Multimodal Memories

Hongda Jiang, Xinyuan Zhang, Siddhant Garg, Rishab Arora, Shiun-Zu Kuo, Jiayang Xu, Ankur Bansal, Christopher Brossman, Yue Liu, Aaron Colak, Ahmed Aly, Anuj Kumar, Xin Luna Dong

Main category: cs.AI

TL;DR: Memory-QA is a novel task for answering recall questions about visual content from multimodal memories, addressed by the Pensieve pipeline with memory-specific augmentation, time/location-aware retrieval, and multi-memory QA fine-tuning.

Details

Motivation: To tackle real-world challenges in answering recall questions about visual content from stored multimodal memories, including creating task-oriented memories, using temporal/location information effectively, and drawing from multiple memories.

Method: Proposed Pensieve pipeline with three key components: memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning.

Result: Pensieve achieves superior performance over state-of-the-art solutions, with up to 14% improvement in QA accuracy on a created multimodal benchmark.

Conclusion: The proposed Pensieve pipeline effectively addresses the challenges of Memory-QA task and demonstrates significant performance gains over existing methods.

Abstract: We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to 14% on QA accuracy).

[1056] An Automated Retrieval-Augmented Generation LLaMA-4 109B-based System for Evaluating Radiotherapy Treatment Plans

Junjie Cui, Peilong Wang, Jason Holmes, Leshan Sun, Michael L. Hinni, Barbara A. Pockaj, Sujay A. Vora, Terence T. Sio, William W. Wong, Nathan Y. Yu, Steven E. Schild, Joshua R. Niska, Sameer R. Keole, Jean-Claude M. Rwigema, Samir H. Patel, Lisa A. McGee, Carlos A. Vargas, Wei Liu

Main category: cs.AI

TL;DR: Developed a RAG system using LLaMA-4 109B for automated, protocol-aware radiotherapy plan evaluation with interpretable outputs.

Details

Motivation: To create an automated system for radiotherapy treatment plan evaluation that is protocol-aware, interpretable, and minimizes hallucination while providing traceable outputs.

Method: Used a multi-protocol dataset of 614 radiotherapy plans across four disease sites with normalized dose metrics and protocol constraints. Integrated three core modules: retrieval engine with SentenceTransformer backbones, percentile prediction based on cohort similarity, and clinical constraint checker, directed by LLM using multi-step prompt-driven reasoning.

Result: Achieved perfect nearest-neighbor accuracy within 5-percentile-point margin and sub-2pt MAE with all-MiniLM-L6-v2 backbone. End-to-end testing showed 100% agreement with standalone modules on percentile estimates and constraint identification.

Conclusion: Combining structured population-based scoring with modular tool-augmented reasoning enables transparent, scalable radiotherapy plan evaluation with traceable outputs and robustness across protocols.

Abstract: Purpose: To develop a retrieval-augmented generation (RAG) system powered by LLaMA-4 109B for automated, protocol-aware, and interpretable evaluation of radiotherapy treatment plans. Methods and Materials: We curated a multi-protocol dataset of 614 radiotherapy plans across four disease sites and constructed a knowledge base containing normalized dose metrics and protocol-defined constraints. The RAG system integrates three core modules: a retrieval engine optimized across five SentenceTransformer backbones, a percentile prediction component based on cohort similarity, and a clinical constraint checker. These tools are directed by a large language model (LLM) using a multi-step prompt-driven reasoning pipeline to produce concise, grounded evaluations. Results: Retrieval hyperparameters were optimized using Gaussian Process on a scalarized loss function combining root mean squared error (RMSE), mean absolute error (MAE), and clinically motivated accuracy thresholds. The best configuration, based on all-MiniLM-L6-v2, achieved perfect nearest-neighbor accuracy within a 5-percentile-point margin and a sub-2pt MAE. When tested end-to-end, the RAG system achieved 100% agreement with the computed values by standalone retrieval and constraint-checking modules on both percentile estimates and constraint identification, confirming reliable execution of all retrieval, prediction and checking steps. Conclusion: Our findings highlight the feasibility of combining structured population-based scoring with modular tool-augmented reasoning for transparent, scalable plan evaluation in radiation therapy. The system offers traceable outputs, minimizes hallucination, and demonstrates robustness across protocols. Future directions include clinician-led validation, and improved domain-adapted retrieval models to enhance real-world integration.

[1057] LogReasoner: Empowering LLMs with Expert-like Coarse-to-Fine Reasoning for Automated Log Analysis

Lipeng Ma, Yixuan Li, Weidong Yang, Mingjie Zhou, Xinyi Liu, Ben Fei, Shuhao Li, Xiaoyan Sun, Sihang Jiang, Yanghua Xiao

Main category: cs.AI

TL;DR: LogReasoner is a coarse-to-fine reasoning enhancement framework that improves LLMs’ log analysis capabilities by structuring expert thinking and refining reasoning steps through fine-tuning and preference learning.

Details

Motivation: General-purpose LLMs struggle with structured reasoning workflows and precise reasoning steps for log analysis tasks, failing to align with expert cognition patterns.

Method: Two-stage framework: (1) coarse-grained enhancement using expert troubleshooting flowcharts to structure reasoning workflows, (2) fine-grained enhancement through task-specific fine-tuning and preference learning to calibrate reasoning details from mistakes.

Result: LogReasoner significantly outperforms existing LLMs on four log analysis tasks using Qwen-2.5 and Llama-3, achieving state-of-the-art performance.

Conclusion: The framework effectively enhances LLMs’ reasoning capabilities for log analysis by mimicking expert thinking patterns and refining analytical granularity.

Abstract: Log analysis is crucial for monitoring system health and diagnosing failures in complex systems. Recent advances in large language models (LLMs) offer new opportunities for automated log analysis, leveraging their reasoning capabilities to perform tasks such as anomaly detection and failure prediction. However, general-purpose LLMs struggle to formulate structured reasoning workflows that align with expert cognition and deliver precise details of reasoning steps. To address these challenges, we propose LogReasoner, a coarse-to-fine reasoning enhancement framework designed to enable LLMs to reason log analysis tasks like experts. LogReasoner consists of two stages: (1) coarse-grained enhancement of expert thinking, where high-level expert thoughts are constructed from collected troubleshooting flowcharts and existing tasks to enable LLMs to formulate structured reasoning workflows and (2) fine-grained enhancement of specific steps, where we first fine-tune the LLM with task-specific stepwise solutions to enhance the LLM for instantiated reasoning, then employ the preference learning to calibrate the LLM’s reasoning details from its mistakes, further strengthen the LLM’s analytical granularity and correctness. We evaluate LogReasoner on four distinct log analysis tasks using open-source LLMs such as Qwen-2.5 and Llama-3. Experimental results show that LogReasoner significantly outperforms existing LLMs, achieving state-of-the-art performance and demonstrating its effectiveness in enhancing the reasoning capabilities of LLMs for log analysis.

[1058] Combinatorial Creativity: A New Frontier in Generalization Abilities

Samuel Schapiro, Sumuk Shashidhar, Alexi Gladstone, Jonah Black, Royce Moon, Dilek Hakkani-Tur, Lav R. Varshney

Main category: cs.AI

TL;DR: This paper proposes a framework for evaluating combinatorial creativity in LLMs, focusing on novelty and utility rather than accuracy. It reveals scaling behaviors, optimal model architectures for creativity, and a persistent novelty-utility tradeoff that limits LLMs’ creative potential.

Details

Motivation: Existing frameworks don't address how LLMs generalize for creative tasks like scientific idea generation. There's a need to evaluate combinatorial creativity as an open-ended ability rather than against fixed targets.

Method: Proposed theoretical framework and algorithmic task for evaluating outputs by degrees of novelty and utility. Conducted empirical analysis of scaling behavior, model architecture optimization, and novelty-utility tradeoffs.

Result: Found optimal model depths and widths for creative ability within fixed compute budgets. Discovered persistent novelty-utility tradeoff where LLMs generate novel ideas but struggle with practical feasibility. This tradeoff remains even at scale.

Conclusion: The persistent novelty-utility tradeoff casts doubt on LLMs’ long-term creative potential in current form. The framework provides foundation for understanding and improving AI creativity, bridging human-machine intelligence gaps.

Abstract: Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.

[1059] Lifelong Learning with Behavior Consolidation for Vehicle Routing

Jiyuan Pei, Yi Mei, Jialin Liu, Mengjie Zhang, Xin Yao

Main category: cs.AI

TL;DR: A novel lifelong learning framework for neural VRP solvers that addresses catastrophic forgetting by consolidating prior knowledge through behavior alignment and confidence-weighted decision consolidation.

Details

Motivation: Existing neural solvers struggle when new tasks arise - they either have poor zero-shot generalization due to distribution discrepancies or suffer from catastrophic forgetting when fine-tuned on new tasks.

Method: Proposed LLR-BC framework that consolidates prior knowledge by aligning behaviors of the solver trained on new tasks with buffered ones in a decision-seeking way, with greater weights assigned to low-confidence decisions.

Result: Extensive experiments on capacitated vehicle routing problems and traveling salesman problems show LLR-BC effectively addresses catastrophic forgetting, maintains plasticity, and improves zero-shot generalization.

Conclusion: LLR-BC enables effective lifelong learning for neural VRP solvers, allowing them to continuously learn new tasks while preserving knowledge from previous tasks.

Abstract: Recent neural solvers have demonstrated promising performance in learning to solve routing problems. However, existing studies are primarily based on one-off training on one or a set of predefined problem distributions and scales, i.e., tasks. When a new task arises, they typically rely on either zero-shot generalization, which may be poor due to the discrepancies between the new task and the training task(s), or fine-tuning the pretrained solver on the new task, which possibly leads to catastrophic forgetting of knowledge acquired from previous tasks. This paper explores a novel lifelong learning paradigm for neural VRP solvers, where multiple tasks with diverse distributions and scales arise sequentially over time. Solvers are required to effectively and efficiently learn to solve new tasks while maintaining their performance on previously learned tasks. Consequently, a novel framework called Lifelong Learning Router with Behavior Consolidation (LLR-BC) is proposed. LLR-BC consolidates prior knowledge effectively by aligning behaviors of the solver trained on a new task with the buffered ones in a decision-seeking way. To encourage more focus on crucial experiences, LLR-BC assigns greater consolidated weights to decisions with lower confidence. Extensive experiments on capacitated vehicle routing problems and traveling salesman problems demonstrate LLR-BC’s effectiveness in training high-performance neural solvers in a lifelong learning setting, addressing the catastrophic forgetting issue, maintaining their plasticity, and improving zero-shot generalization ability.

[1060] DS-STAR: Data Science Agent via Iterative Planning and Verification

Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Tomas Pfister

Main category: cs.AI

TL;DR: DS-STAR is a novel data science agent that automatically explores diverse data formats, verifies analysis plan sufficiency through LLM-based judging, and iteratively refines plans to handle complex data analysis tasks.

Details

Motivation: Data science tasks are complex and involve exploring multiple data sources, but current LLMs struggle with heterogeneous data formats and generating optimal analysis plans due to difficulty verifying plan sufficiency without ground-truth labels.

Method: DS-STAR has three key components: (1) data file analysis module for exploring diverse data formats, (2) LLM-based judge that verifies analysis plan sufficiency at each stage, and (3) sequential planning mechanism that starts simple and iteratively refines plans based on feedback.

Result: DS-STAR achieves state-of-the-art performance across three benchmarks (DABStep, KramaBench, DA-Code) and particularly outperforms baselines on hard tasks requiring processing multiple data files with heterogeneous formats.

Conclusion: The iterative refinement approach with verification allows DS-STAR to reliably navigate complex analyses involving diverse data sources, overcoming limitations of current LLMs in data science automation.

Abstract: Data science, which transforms raw data into actionable insights, is critical for data-driven decision-making. However, these tasks are often complex, involving steps for exploring multiple data sources and synthesizing findings to deliver insightful answers. While large language models (LLMs) show significant promise in automating this process, they often struggle with heterogeneous data formats and generate sub-optimal analysis plans, as verifying plan sufficiency is inherently difficult without ground-truth labels for such open-ended tasks. To overcome these limitations, we introduce DS-STAR, a novel data science agent. Specifically, DS-STAR makes three key contributions: (1) a data file analysis module that automatically explores and extracts context from diverse data formats, including unstructured types; (2) a verification step where an LLM-based judge evaluates the sufficiency of the analysis plan at each stage; and (3) a sequential planning mechanism that starts with a simple, executable plan and iteratively refines it based on the DS-STAR’s feedback until its sufficiency is verified. This iterative refinement allows DS-STAR to reliably navigate complex analyses involving diverse data sources. Our experiments show that DS-STAR achieves state-of-the-art performance across three challenging benchmarks: DABStep, KramaBench, and DA-Code. Moreover, DS-STAR particularly outperforms baselines on hard tasks that require processing multiple data files with heterogeneous formats.

[1061] Outlier Detection in Plantar Pressure: Human-Centered Comparison of Statistical Parametric Mapping and Explainable Machine Learning

Carlo Dindorf, Jonas Dully, Steven Simon, Dennis Perchthaler, Stephan Becker, Hannah Ehmann, Kjell Heitmann, Bernd Stetter, Christian Diers, Michael Fröhlich

Main category: cs.AI

TL;DR: This study compares Statistical Parametric Mapping (SPM) with explainable machine learning for outlier detection in plantar pressure data, finding that ML outperforms SPM in accuracy while both provide interpretable results.

Details

Motivation: Plantar pressure mapping datasets often contain outliers from technical errors or procedural inconsistencies, but existing SPM methods are sensitive to alignment and their outlier detection capabilities are unclear, necessitating transparent quality-control pipelines.

Method: Used a dataset of 798 valid samples and 2000 outliers annotated by experts and enriched with synthetic anomalies. Compared (i) a non-parametric, registration-dependent SPM approach and (ii) a CNN explained using SHAP. Evaluated performance via nested cross-validation and explanation quality via expert survey.

Result: The ML model achieved high accuracy and outperformed SPM, which misclassified clinically meaningful variations and missed true outliers. Experts found both SPM and SHAP explanations clear, useful, and trustworthy, though SPM was perceived as less complex.

Conclusion: SPM and explainable ML have complementary potential for automated outlier detection in plantar pressure data, with explainability being crucial for translating complex model outputs into interpretable insights that inform decision-making.

Abstract: Plantar pressure mapping is essential in clinical diagnostics and sports science, yet large heterogeneous datasets often contain outliers from technical errors or procedural inconsistencies. Statistical Parametric Mapping (SPM) provides interpretable analyses but is sensitive to alignment and its capacity for robust outlier detection remains unclear. This study compares an SPM approach with an explainable machine learning (ML) approach to establish transparent quality-control pipelines for plantar pressure datasets. Data from multiple centers were annotated by expert consensus and enriched with synthetic anomalies resulting in 798 valid samples and 2000 outliers. We evaluated (i) a non-parametric, registration-dependent SPM approach and (ii) a convolutional neural network (CNN), explained using SHapley Additive exPlanations (SHAP). Performance was assessed via nested cross-validation; explanation quality via a semantic differential survey with domain experts. The ML model reached high accuracy and outperformed SPM, which misclassified clinically meaningful variations and missed true outliers. Experts perceived both SPM and SHAP explanations as clear, useful, and trustworthy, though SPM was assessed less complex. These findings highlight the complementary potential of SPM and explainable ML as approaches for automated outlier detection in plantar pressure data, and underscore the importance of explainability in translating complex model outputs into interpretable insights that can effectively inform decision-making.

[1062] The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging

Xiaochong Lan, Yu Zheng, Shiteng Cao, Yong Li

Main category: cs.AI

TL;DR: Model merging enables tunable LLM reasoning capabilities by combining general and specialized models, allowing fine-grained control over accuracy-efficiency trade-offs.

Details

Motivation: There's growing need for LLMs with adjustable reasoning depth and computational cost for real-world applications, but current methods lack fine-grained control over this balance.

Method: Conducted large-scale empirical study evaluating various model merging techniques across reasoning benchmarks, systematically varying merging strengths to create accuracy-efficiency curves.

Result: Model merging effectively calibrates reasoning accuracy vs token efficiency trade-off, even with divergent parent models, and achieves Pareto Improvements where merged models outperform parents in both accuracy and efficiency.

Conclusion: Model merging provides practical method for creating LLMs with specific reasoning profiles to meet diverse application demands through tunable performance control.

Abstract: The growing demand for large language models (LLMs) with tunable reasoning capabilities in many real-world applications highlights a critical need for methods that can efficiently produce a spectrum of models balancing reasoning depth and computational cost. Model merging has emerged as a promising, training-free technique to address this challenge by arithmetically combining the weights of a general-purpose model with a specialized reasoning model. While various merging techniques exist, their potential to create a spectrum of models with fine-grained control over reasoning abilities remains largely unexplored. This work presents a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. We systematically vary merging strengths to construct accuracy-efficiency curves, providing the first comprehensive view of the tunable performance landscape. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency, even when parent models have highly divergent weight spaces. Crucially, we identify instances of Pareto Improvement, where a merged model achieves both higher accuracy and lower token consumption than one of its parents. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles to meet diverse application demands.

cs.SD

[1063] GOAT: A Large Dataset of Paired Guitar Audio Recordings and Tablatures

Jackson Loth, Pedro Sarmento, Saurjya Sarkar, Zixun Guo, Mathieu Barthet, Mark Sandler

Main category: cs.SD

TL;DR: The GOAT dataset provides 5.9 hours of high-quality electric guitar audio recordings with tablature annotations, plus 29.5 hours of augmented audio using guitar amplifiers, to address data scarcity in guitar MIR research.

Details

Motivation: Progress in guitar MIR has been limited by scarce and poorly annotated datasets, despite increased interest in guitar analysis due to its diverse playing techniques and sonic characteristics.

Method: Created GOAT dataset with direct input audio recordings from various guitars and players, used data augmentation with guitar amplifiers for tonal variety, and annotated recordings using Guitar Pro format and text-like token encoding for tablatures.

Result: Competitive results achieved for MIDI transcription and preliminary results for automatic guitar tablature transcription, demonstrating the dataset’s effectiveness for guitar-related MIR tasks.

Conclusion: GOAT dataset enables training novel models for various guitar MIR tasks including synthesis, transcription, and playing technique detection, addressing the data scarcity problem in the field.

Abstract: In recent years, the guitar has received increased attention from the music information retrieval (MIR) community driven by the challenges posed by its diverse playing techniques and sonic characteristics. Mainly fueled by deep learning approaches, progress has been limited by the scarcity and limited annotations of datasets. To address this, we present the Guitar On Audio and Tablatures (GOAT) dataset, comprising 5.9 hours of unique high-quality direct input audio recordings of electric guitars from a variety of different guitars and players. We also present an effective data augmentation strategy using guitar amplifiers which delivers near-unlimited tonal variety, of which we provide a starting 29.5 hours of audio. Each recording is annotated using guitar tablatures, a guitar-specific symbolic format supporting string and fret numbers, as well as numerous playing techniques. For this we utilise both the Guitar Pro format, a software for tablature playback and editing, and a text-like token encoding. Furthermore, we present competitive results using GOAT for MIDI transcription and preliminary results for a novel approach to automatic guitar tablature transcription. We hope that GOAT opens up the possibilities to train novel models on a wide variety of guitar-related MIR tasks, from synthesis to transcription to playing technique detection.

[1064] DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation

Ziqi Chen, Gongyu Chen, Yihua Wang, Chaofan Ding, Zihao chen, Wei-Qiang Zhang

Main category: cs.SD

TL;DR: DiaMoE-TTS is a unified IPA-based text-to-speech framework that handles dialect diversity through dialect-aware Mixture-of-Experts and parameter-efficient adaptation, enabling scalable synthesis with minimal data.

Details

Motivation: Building TTS systems for dialects is challenging due to scarce data, inconsistent orthographies, and complex phonetic variation. Existing approaches often depend on large-scale or proprietary resources.

Method: Uses IPA-based framework to standardize phonetic representations, built on F5-TTS architecture with dialect-aware Mixture-of-Experts (MoE) for phonological differences, and employs LoRA and Conditioning Adapters for parameter-efficient adaptation.

Result: Achieves natural and expressive speech generation with zero-shot performance on unseen dialects and specialized domains like Peking Opera using only a few hours of data.

Conclusion: DiaMoE-TTS enables scalable, open-data-driven dialect TTS synthesis that overcomes data scarcity and orthographic inconsistencies while supporting rapid transfer to new dialects.

Abstract: Dialect speech embodies rich cultural and linguistic diversity, yet building text-to-speech (TTS) systems for dialects remains challenging due to scarce data, inconsistent orthographies, and complex phonetic variation. To address these issues, we present DiaMoE-TTS, a unified IPA-based framework that standardizes phonetic representations and resolves grapheme-to-phoneme ambiguities. Built upon the F5-TTS architecture, the system introduces a dialect-aware Mixture-of-Experts (MoE) to model phonological differences and employs parameter-efficient adaptation with Low-Rank Adaptors (LoRA) and Conditioning Adapters for rapid transfer to new dialects. Unlike approaches dependent on large-scale or proprietary resources, DiaMoE-TTS enables scalable, open-data-driven synthesis. Experiments demonstrate natural and expressive speech generation, achieving zero-shot performance on unseen dialects and specialized domains such as Peking Opera with only a few hours of data.

[1065] Prompt-aware classifier free guidance for diffusion models

Xuanhao Zhang, Chang Li

Main category: cs.SD

TL;DR: A prompt-aware framework that predicts optimal guidance scales for diffusion models to address the limitations of fixed-scale Classifier-Free Guidance across varying prompt complexities.

Details

Motivation: Fixed guidance scales in Classifier-Free Guidance fail to generalize across prompts of different complexity, causing oversaturation or weak alignment between generated content and prompts.

Method: Construct a synthetic dataset with samples generated under multiple guidance scales, score them with evaluation metrics, and train a lightweight predictor that uses semantic embeddings and linguistic complexity to estimate quality curves and select optimal scales via utility function with regularization.

Result: Experiments on MSCOCO 2014 and AudioCaps show consistent improvements over vanilla CFG, enhancing fidelity, alignment, and perceptual preference.

Conclusion: Prompt-aware scale selection provides an effective, training-free enhancement for pretrained diffusion backbones, demonstrating better generalization across varying prompt complexities.

Abstract: Diffusion models have achieved remarkable progress in image and audio generation, largely due to Classifier-Free Guidance. However, the choice of guidance scale remains underexplored: a fixed scale often fails to generalize across prompts of varying complexity, leading to oversaturation or weak alignment. We address this gap by introducing a prompt-aware framework that predicts scale-dependent quality and selects the optimal guidance at inference. Specifically, we construct a large synthetic dataset by generating samples under multiple scales and scoring them with reliable evaluation metrics. A lightweight predictor, conditioned on semantic embeddings and linguistic complexity, estimates multi-metric quality curves and determines the best scale via a utility function with regularization. Experiments on MSCOCO~2014 and AudioCaps show consistent improvements over vanilla CFG, enhancing fidelity, alignment, and perceptual preference. This work demonstrates that prompt-aware scale selection provides an effective, training-free enhancement for pretrained diffusion backbones.

[1066] Text-Independent Speaker Identification Using Audio Looping With Margin Based Loss Functions

Elliot Q C Garcia, Nicéias Silva Vilela, Kátia Pires Nascimento do Sacramento, Tiago A. E. Ferreira

Main category: cs.SD

TL;DR: The paper investigates CosFace and ArcFace loss functions for text-independent speaker identification using a modified VGG16 CNN architecture with mel spectrogram inputs from Voxceleb1 dataset, showing superior accuracy over traditional Softmax loss.

Details

Motivation: Speaker identification is crucial for security systems, virtual assistants, and personalized user experiences, requiring improved accuracy and robustness.

Method: Used modified VGG16 CNN architecture with mel spectrogram inputs from Voxceleb1 dataset, implemented CosFace Loss and ArcFace Loss with Softmax loss as baseline, and analyzed effects of mel spectrogram sizes and time lengths.

Result: Experimental results demonstrated superior identification accuracy compared to traditional Softmax loss methods.

Conclusion: The findings have implications for future research in speaker identification using advanced loss functions.

Abstract: Speaker identification has become a crucial component in various applications, including security systems, virtual assistants, and personalized user experiences. In this paper, we investigate the effectiveness of CosFace Loss and ArcFace Loss for text-independent speaker identification using a Convolutional Neural Network architecture based on the VGG16 model, modified to accommodate mel spectrogram inputs of variable sizes generated from the Voxceleb1 dataset. Our approach involves implementing both loss functions to analyze their effects on model accuracy and robustness, where the Softmax loss function was employed as a comparative baseline. Additionally, we examine how the sizes of mel spectrograms and their varying time lengths influence model performance. The experimental results demonstrate superior identification accuracy compared to traditional Softmax loss methods. Furthermore, we discuss the implications of these findings for future research.

[1067] Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation

Yang Cui, Peter Pan, Lei He, Sheng Zhao

Main category: cs.SD

TL;DR: PKDMark is a lightweight deep learning-based speech watermarking method that uses progressive knowledge distillation to achieve 93.6% computational cost reduction while maintaining high robustness and imperceptibility.

Details

Motivation: Unauthorized voice cloning poses privacy and security risks. Current speech watermarking methods either lack robustness (DSP-based) or have high computational costs (deep learning-based).

Method: Two-stage approach: (1) train high-performance teacher model using invertible neural network architecture, (2) transfer capabilities to compact student model through progressive knowledge distillation.

Result: Distilled model achieves 99.6% average detection F1 score with PESQ of 4.30 in advanced distortions, enabling efficient real-time speech synthesis applications.

Conclusion: PKDMark provides an efficient and robust solution for speech watermarking that balances computational efficiency with strong protection against unauthorized voice cloning.

Abstract: With the rapid advancement of speech generative models, unauthorized voice cloning poses significant privacy and security risks. Speech watermarking offers a viable solution for tracing sources and preventing misuse. Current watermarking technologies fall mainly into two categories: DSP-based methods and deep learning-based methods. DSP-based methods are efficient but vulnerable to attacks, whereas deep learning-based methods offer robust protection at the expense of significantly higher computational cost. To improve the computational efficiency and enhance the robustness, we propose PKDMark, a lightweight deep learning-based speech watermarking method that leverages progressive knowledge distillation (PKD). Our approach proceeds in two stages: (1) training a high-performance teacher model using an invertible neural network-based architecture, and (2) transferring the teacher’s capabilities to a compact student model through progressive knowledge distillation. This process reduces computational costs by 93.6% while maintaining high level of robust performance and imperceptibility. Experimental results demonstrate that our distilled model achieves an average detection F1 score of 99.6% with a PESQ of 4.30 in advanced distortions, enabling efficient speech watermarking for real-time speech synthesis applications.

[1068] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms

Goksenin Yuksel, Pierre Guetschel, Michael Tangermann, Marcel van Gerven, Kiki van der Heijden

Main category: cs.SD

TL;DR: WavJEPA is a waveform-based audio representation learning model that outperforms state-of-the-art time-domain models across various tasks with fewer computational resources, and its multi-channel extension WavJEPA-Nat shows robustness to noise and reverberation.

Details

Motivation: To overcome limitations of spectrogram-based audio representation learning (long latency, phase information loss) and address the gap in successful self-supervised learning from raw waveforms for general-purpose audio (unlike speech).

Method: Proposes WavJEPA, a waveform-based Joint-Embedding Predictive Architecture that uses high-level semantic representation learning instead of speech unit/token level learning. Also presents WavJEPA-Nat, a multi-channel extension trained on simulated naturalistic scenes.

Result: WavJEPA substantially outperforms state-of-the-art time-domain audio foundation models across various downstream benchmark tasks while requiring fewer computational resources. WavJEPA-Nat is highly robust to reverberation and noise.

Conclusion: Demonstrates feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, enabling low-latency, robust time-domain audio foundation models for real-world applications.

Abstract: Learning audio representations from raw waveforms overcomes key limitations of spectrogram-based audio representation learning, such as the long latency of spectrogram computation and the loss of phase information. Yet, while self-supervised speech representation learning from raw waveforms has been remarkably successful, these approaches have not achieved similar feats for general-purpose audio representation learning from waveforms. Here, we propose WavJEPA, a waveform-based version of the Joint-Embedding Predictive Architecture. WavJEPA leverages high-level semantic representation learning to tackle the shortcomings of representation learning at the speech unit or token level. We show that this approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks, while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat. WavJEPA-Nat is a multi-channel extension of the WavJEPA architecture trained on simulated naturalistic scenes. We find that WavJEPA-Nat is highly robust to reverberation and noise. These results highlight the feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, showcasing the potential for low-latency, robust time-domain audio foundation models for real-world applications.

[1069] MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow

Yike Zhu, Boyi Kang, Ziqian Wang, Xingchen Li, Zihan Zhang, Wenjie Li, Longshuai Xiao, Wei Xue, Lei Xie

Main category: cs.SD

TL;DR: MeanFlowSE is a one-step generative speech enhancement framework that achieves SOTA perceptual quality with faster inference and smaller model size than existing generative methods.

Details

Motivation: Current generative speech enhancement approaches rely on multi-step sampling or large models, which limit real-time deployment. There's a need for efficient generative methods that maintain high quality while being practical for real-time applications.

Method: Uses MeanFlow to predict an average-velocity field for one-step latent refinement and conditions on self-supervised learning representations instead of VAE latents, enabling faster inference and robust acoustic-semantic guidance.

Result: Achieves state-of-the-art perceptual quality and competitive intelligibility on Interspeech 2020 DNS Challenge datasets, while significantly reducing real-time factor and model size compared to recent generative competitors.

Conclusion: MeanFlowSE provides a practical solution for real-time speech enhancement by combining high perceptual quality with efficient one-step generation, making it suitable for deployment in applications like telecommunications and ASR.

Abstract: Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion/flow-matching) or large language models, limiting real-time deployment. To mitigate these constraints, we present MeanFlowSE, a one-step generative SE framework. It adopts MeanFlow to predict an average-velocity field for one-step latent refinement and conditions the model on self-supervised learning (SSL) representations rather than VAE latents. This design accelerates inference and provides robust acoustic-semantic guidance during training. In the Interspeech 2020 DNS Challenge blind test set and simulated test set, MeanFlowSE attains state-of-the-art (SOTA) level perceptual quality and competitive intelligibility while significantly lowering both real-time factor (RTF) and model size compared with recent generative competitors, making it suitable for practical use. The code will be released upon publication at https://github.com/Hello3orld/MeanFlowSE.

[1070] AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

Wenyu Li, Xiaoqi Jiao, Yi Chang, Guangyan Zhang, Yiwen Guo

Main category: cs.SD

TL;DR: AudioRole is a large-scale multimodal dataset for audio role-playing, featuring 1M+ character-grounded dialogues from 13 TV series with synchronized audio-text pairs. The paper also introduces ARP-Eval evaluation framework and shows trained models achieve significant improvements in acoustic and content personalization.

Details

Motivation: Existing works focus on text-based persona simulation, but Audio Role-Playing (ARP) presents unique challenges requiring synchronized alignment of semantic content and vocal characteristics. There's a gap in high-quality multimodal datasets for advancing audio role-playing capabilities.

Method: Created AudioRole dataset from 13 TV series (1K+ hours, 1M+ dialogues) with synchronized audio-text pairs, speaker identities, and contextual metadata. Introduced ARP-Eval dual-aspect evaluation framework. Trained GLM-4-Voice on AudioRole to create ARP-Model.

Result: ARP-Model achieved average Acoustic Personalization score of 0.31, significantly outperforming original GLM-4-voice and MiniCPM-O-2.6. Content Personalization score of 0.36, surpassing untrained model by 38% and matching MiniCPM-O-2.6. Dataset covers 115+ main characters with 6 trained ARP-Models.

Conclusion: AudioRole provides essential resource for advancing audio-grounded role-playing research, demonstrating effectiveness through improved model performance in both acoustic and content personalization aspects.

Abstract: The creation of high-quality multimodal datasets remains fundamental for advancing role-playing capabilities in large language models (LLMs). While existing works predominantly focus on text-based persona simulation, Audio Role-Playing (ARP) presents unique challenges due to the need for synchronized alignment of semantic content and vocal characteristics. To address this gap, we propose AudioRole, a meticulously curated dataset from 13 TV series spanning 1K+ hours with 1M+ character-grounded dialogues, providing synchronized audio-text pairs annotated with speaker identities and contextual metadata. In addition, to demonstrate the effectiveness of the dataset, we introduced ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation showing GLM-4-Voice trained on AudioRole (which we called ARP-Model) achieve an average Acoustic Personalization score of 0.31, significantly outperforming the original GLM-4-voice and the more powerful model MiniCPM-O-2.6, which specifically supports role-playing in one-shot scenarios. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38% and maintaining the same level as MiniCPM-O-2.6. AudioRole features dialogues from over 115 main characters, 6 trained ARP-Models that role-play different characters, and evaluation protocols. Together, they provide an essential resource for advancing audio-grounded role-playing research.

[1071] ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following

Jiahao Zhao, Yunjia Li, Wei Li, Kazuyoshi Yoshii

Main category: cs.SD

TL;DR: ABC-Eval is the first open-source benchmark for evaluating LLMs’ understanding and instruction-following capabilities in text-based ABC notation music, revealing significant limitations in current models.

Details

Motivation: While symbolic music has been widely used in generation tasks, LLM capabilities in understanding and reasoning about symbolic music remain largely underexplored, creating a gap in evaluating music comprehension abilities.

Method: Created ABC-Eval benchmark with 1,086 test samples across 10 sub-tasks covering basic musical syntax comprehension to complex sequence-level reasoning, then evaluated seven state-of-the-art LLMs.

Result: Evaluation revealed notable limitations in existing models’ symbolic music processing capabilities, with consistent performance patterns across different sub-tasks validating benchmark reliability.

Conclusion: The benchmark successfully identifies gaps in LLMs’ symbolic music understanding and provides a reliable tool for future research in music AI comprehension.

Abstract: As large language models continue to develop, the feasibility and significance of text-based symbolic music tasks have become increasingly prominent. While symbolic music has been widely used in generation tasks, LLM capabilities in understanding and reasoning about symbolic music remain largely underexplored. To address this gap, we propose ABC-Eval, the first open-source benchmark dedicated to the understanding and instruction-following capabilities in text-based ABC notation scores. It comprises 1,086 test samples spanning 10 sub-tasks, covering scenarios from basic musical syntax comprehension to complex sequence-level reasoning. Such a diverse scope poses substantial challenges to models’ ability to handle symbolic music tasks. We evaluated seven state-of-the-art LLMs on ABC-Eval, and the results reveal notable limitations in existing models’ symbolic music processing capabilities. Furthermore, the consistent performance of individual baselines across different sub-tasks supports the reliability of our benchmark.

[1072] Emotional Styles Hide in Deep Speaker Embeddings: Disentangle Deep Speaker Embeddings for Speaker Clustering

Chaohao Lin, Xu Zheng, Kaida Wu, Peihao Xiang, Ou Bai

Main category: cs.SD

TL;DR: DTG-VAE is a novel disentanglement method using Variational Autoencoder to improve speaker clustering performance by extracting more robust speaker embeddings, especially for emotional speech.

Details

Motivation: Emotional expressions in speeches significantly affect speaker embedding accuracy and degrade speaker clustering performance, creating challenges for speaker diarization systems.

Method: Proposed DTG-VAE, a disentanglement method within a Variational Autoencoder framework that extracts more robust speaker embeddings by separating speaker characteristics from emotional variations.

Result: DTG-VAE successfully extracts more robust speaker embeddings and significantly enhances speaker clustering performance, as demonstrated in experiments.

Conclusion: The study reveals a direct link between emotional states and deep speaker embedding effectiveness, and DTG-VAE provides an effective solution to improve speaker clustering for emotional speech.

Abstract: Speaker clustering is the task of identifying the unique speakers in a set of audio recordings (each belonging to exactly one speaker) without knowing who and how many speakers are present in the entire data, which is essential for speaker diarization processes. Recently, off-the-shelf deep speaker embedding models have been leveraged to capture speaker characteristics. However, speeches containing emotional expressions pose significant challenges, often affecting the accuracy of speaker embeddings and leading to a decline in speaker clustering performance. To tackle this problem, we propose DTG-VAE, a novel disentanglement method that enhances clustering within a Variational Autoencoder (VAE) framework. This study reveals a direct link between emotional states and the effectiveness of deep speaker embeddings. As demonstrated in our experiments, DTG-VAE extracts more robust speaker embeddings and significantly enhances speaker clustering performance.

[1073] Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Kai Li, Kejun Gao, Xiaolin Hu

Main category: cs.SD

TL;DR: Dolphin is an efficient audio-visual speech separation method that achieves state-of-the-art performance with significantly reduced computational cost through lightweight visual encoding and audio separation modules.

Details

Motivation: Current AVSS methods have large parameter counts and high computational costs, making them impractical for applications where speech separation is just a preprocessing step.

Method: Uses DP-LipCoder for visual feature extraction (dual-path lightweight video encoder) and a lightweight encoder-decoder separator with global-local attention blocks for audio separation.

Result: Surpassed SOTA in separation quality while achieving >50% fewer parameters, >2.4x reduction in MACs, and >6x faster GPU inference speed on three benchmark datasets.

Conclusion: Dolphin provides a practical and deployable solution for high-performance AVSS in real-world scenarios with superior efficiency.

Abstract: Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves as only a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip-motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.

[1074] Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment

Pu Huang, Shouguang Wang, Siya Yao, Mengchu Zhou

Main category: cs.SD

TL;DR: IB-CAAN is a robust speech deepfake detection method that uses confidence-aware adversarial alignment and information bottleneck to learn shared discriminative features across different spoofing attacks.

Details

Motivation: Neural speech synthesis creates highly realistic deepfakes that pose security risks, and detection is challenging due to distribution shifts across spoofing methods and variability in speakers, channels, and recording conditions.

Method: Proposed Information Bottleneck enhanced Confidence-Aware Adversarial Network (IB-CAAN) with confidence-guided adversarial alignment to suppress attack-specific artifacts without erasing discriminative cues, and information bottleneck to remove nuisance variability.

Result: Experiments on ASVspoof 2019/2021, ASVspoof 5, and In-the-Wild benchmarks show IB-CAAN consistently outperforms baselines and achieves state-of-the-art performance on many benchmarks.

Conclusion: Learning shared discriminative features through IB-CAAN provides a robust path for speech deepfake detection that handles distribution shifts and variability across different spoofing scenarios.

Abstract: Neural speech synthesis techniques have enabled highly realistic speech deepfakes, posing major security risks. Speech deepfake detection is challenging due to distribution shifts across spoofing methods and variability in speakers, channels, and recording conditions. We explore learning shared discriminative features as a path to robust detection and propose Information Bottleneck enhanced Confidence-Aware Adversarial Network (IB-CAAN). Confidence-guided adversarial alignment adaptively suppresses attack-specific artifacts without erasing discriminative cues, while the information bottleneck removes nuisance variability to preserve transferable features. Experiments on ASVspoof 2019/2021, ASVspoof 5, and In-the-Wild demonstrate that IB-CAAN consistently outperforms baseline and achieves state-of-the-art performance on many benchmarks.

[1075] Disentangling Score Content and Performance Style for Joint Piano Rendering and Transcription

Wei Zeng, Junchuan Zhao, Ye Wang

Main category: cs.SD

TL;DR: A unified framework that jointly models expressive performance rendering (EPR) and automatic piano transcription (APT) using transformer-based architecture with content-style disentanglement and diffusion-based style recommendation.

Details

Motivation: EPR and APT are inverse tasks in music information retrieval that have been addressed independently despite their dual nature. The paper aims to create a unified framework that bridges these two fundamental tasks.

Method: Transformer-based sequence-to-sequence architecture trained with sequence-aligned data (no fine-grained note-level alignment required), featuring content-style disentanglement and an independent diffusion-based performance style recommendation module that generates style embeddings from score content.

Result: The framework achieves competitive performance on both EPR and APT tasks, enables effective content-style disentanglement, reliable style transfer, and stylistically appropriate rendering as demonstrated through objective and subjective evaluations.

Conclusion: The proposed unified framework successfully bridges the gap between EPR and APT tasks, demonstrating that joint modeling with content-style disentanglement and modular style recommendation enables flexible and stylistically appropriate music performance generation and transcription.

Abstract: Expressive performance rendering (EPR) and automatic piano transcription (APT) are fundamental yet inverse tasks in music information retrieval: EPR generates expressive performances from symbolic scores, while APT recovers scores from performances. Despite their dual nature, prior work has addressed them independently. In this paper we propose a unified framework that jointly models EPR and APT by disentangling note-level score content and global performance style representations from both paired and unpaired data. Our framework is built on a transformer-based sequence-to-sequence architecture and is trained using only sequence-aligned data, without requiring fine-grained note-level alignment. To automate the rendering process while ensuring stylistic compatibility with the score, we introduce an independent diffusion-based performance style recommendation module that generates style embeddings directly from score content. This modular component supports both style transfer and flexible rendering across a range of expressive styles. Experimental results from both objective and subjective evaluations demonstrate that our framework achieves competitive performance on EPR and APT tasks, while enabling effective content-style disentanglement, reliable style transfer, and stylistically appropriate rendering. Demos are available at https://jointpianist.github.io/epr-apt/

[1076] AudioMoG: Guiding Audio Generation with Mixture-of-Guidance

Junyou Wang, Zehua Chen, Binjie Yuan, Kaiwen Zheng, Chang Li, Yuxuan Jiang, Jun Zhu

Main category: cs.SD

TL;DR: AudioMoG is a mixture-of-guidance framework that combines multiple guiding principles (like condition alignment and score accuracy) to improve cross-modal audio generation quality without sacrificing inference speed.

Details

Motivation: Current guidance methods like CFG and AG rely on single guiding principles, limiting their potential. CFG emphasizes condition alignment but reduces diversity, while AG increases diversity but may not fully exploit condition information.

Method: Proposes AudioMoG framework that mixes multiple guidance principles, allowing cumulative benefits from different approaches. It can consider parallel complements or recover single principles while maintaining generality.

Result: AudioMoG consistently outperforms single guidance methods in T2A generation across sampling steps, and shows advantages in V2A, text-to-music, and image generation without sacrificing inference efficiency.

Conclusion: Mixed guiding principles provide a ‘free lunch’ for cross-modal audio generation systems, enabling higher quality at the sampling stage without compromising inference speed.

Abstract: Guidance methods have demonstrated significant improvements in cross-modal audio generation, including text-to-audio (T2A) and video-to-audio (V2A) generation. The popularly adopted method, classifier-free guidance (CFG), steers generation by emphasizing condition alignment, enhancing fidelity but often at the cost of diversity. Recently, autoguidance (AG) has been explored for audio generation, encouraging the sampling to faithfully reconstruct the target distribution and showing increased diversity. Despite these advances, they usually rely on a single guiding principle, e.g., condition alignment in CFG or score accuracy in AG, leaving the full potential of guidance for audio generation untapped. In this work, we explore enriching the composition of the guidance method and present a mixture-of-guidance framework, AudioMoG. Within the design space, AudioMoG can exploit the complementary advantages of distinctive guiding principles by fulfilling their cumulative benefits. With a reduced form, AudioMoG can consider parallel complements or recover a single guiding principle, without sacrificing generality. We experimentally show that, given the same inference speed, AudioMoG approach consistently outperforms single guidance in T2A generation across sampling steps, concurrently showing advantages in V2A, text-to-music, and image generation. These results highlight a “free lunch” in current cross-modal audio generation systems: higher quality can be achieved through mixed guiding principles at the sampling stage without sacrificing inference efficiency. Demo samples are available at: https://audio-mog.github.io.

[1077] VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation

Ting-Kang Wang, Yueh-Po Peng, Li Su, Vincent K. M. Cheung

Main category: cs.SD

TL;DR: VioPTT is a lightweight end-to-end model that transcribes violin playing techniques along with pitch timing, using a novel synthetic dataset (MOSA-VPT) to overcome manual annotation challenges.

Details

Motivation: Most music transcription models only capture pitch and timing, missing crucial expressive elements like violin playing techniques that create distinct timbres and emotional impact.

Method: Proposed VioPTT - a unified framework that jointly transcribes violin playing techniques, pitch onset and offset. Used MOSA-VPT synthetic dataset to train the model without manual annotations.

Result: The model demonstrated strong generalization to real-world violin recordings and achieved state-of-the-art transcription performance.

Conclusion: VioPTT is the first unified framework that successfully combines violin transcription with playing technique prediction, addressing a significant gap in music information retrieval.

Abstract: While automatic music transcription is well-established in music information retrieval, most models are limited to transcribing pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords its distinct palette of timbres for maximal emotional impact. Here, we propose \textbf{VioPTT} (Violin Playing Technique-aware Transcription), a lightweight, end-to-end model that directly transcribes violin playing technique in addition to pitch onset and offset. Furthermore, we release \textbf{MOSA-VPT}, a novel, high-quality synthetic violin playing technique dataset to circumvent the need for manually labeled annotations. Leveraging this dataset, our model demonstrated strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to jointly combine violin transcription and playing technique prediction within a unified framework.

[1078] An Efficient Transfer Learning Method Based on Adapter with Local Attributes for Speech Emotion Recognition

Haoyu Song, Ian McLoughlin, Qing Gu, Nan Jiang, Yan Song

Main category: cs.SD

TL;DR: Proposes an adapter with local attributes for efficient speech emotion recognition transfer learning, using teacher-student branches and unsupervised clustering to avoid costly encoder retraining.

Details

Motivation: Address the lack of high-quality large-scale emotion datasets and the need for costly encoder retraining in transfer learning for speech emotion recognition across different scenarios.

Method: Uses weighted average pooling-Transformer backbone, teacher-student adapter with mask prediction and self-distillation, unsupervised clustering for local attributes, and statistical attentive pooling for utterance representation.

Result: Achieves superior performance on IEMOCAP dataset compared to previous state-of-the-art methods in similar settings.

Conclusion: The proposed adapter with local attributes enables efficient task-agnostic transfer learning for speech emotion recognition without requiring expensive encoder retraining.

Abstract: Existing speech emotion recognition (SER) methods commonly suffer from the lack of high-quality large-scale corpus, partly due to the complex, psychological nature of emotion which makes accurate labeling difficult and time consuming. Recently, transfer learning based methods that exploit the encoders pretrained on large-scale speech corpus (e.g., Wav2Vec2.0 and HuBERT) have shown strong potential for downstream SER tasks. However, task-specific fine-tuning remains necessary for various conversational scenarios of different topics, speakers and languages to achieve satisfactory performance. It generally requires costly encoder retraining for individual SER tasks. To address this issue, we propose to train an adapter with local attributes for efficient transfer learning. Specifically, a weighted average pooling-Transformer (WAP-Transformer) is proposed as a lightweight backbone to enrich the frame-level representation. An adapter with teacher-student branches is exploited for task-agnostic transfer learning, where the student branch is jointly optimized via mask prediction and self-distillation objectives, and the teacher branch is obtained online from the student via exponential moving average (EMA). Meanwhile, local attributes are learned from the teacher branch via unsupervised clustering, which aims to act as a universal model that provides additional semantic-rich supervisions. A statistical attentive pooling (SAP) module is proposed to obtain utterance representation for fine-tuning. To evaluate the effectiveness of the proposed adapter with local attributes, extensive experiments have been conducted on IEMOCAP. Superior performance has been reported, compared to the previous state-of-the-art methods in similar settings.

[1079] UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities

Xuenan Xu, Jiahao Mei, Zihao Zheng, Ye Tao, Zeyu Xie, Yaoyun Zhang, Haohe Liu, Yuning Wu, Ming Yan, Wen Wu, Chao Zhang, Mengyue Wu

Main category: cs.SD

TL;DR: UniFlow-Audio is a unified non-autoregressive framework for audio generation using flow matching, supporting both time-aligned and non-time-aligned tasks across text, audio, and video modalities with strong performance using minimal data and parameters.

Details

Motivation: Audio generation research has traditionally followed separate paths for time-aligned and non-time-aligned tasks, but audio is inherently unified. Previous unified approaches used autoregressive architectures, leaving non-autoregressive unified models largely unexplored.

Method: Proposes UniFlow-Audio framework based on flow matching with dual-fusion mechanism: temporal alignment of audio latents with TA features and cross-attention integration of NTA features in each model block, plus task-balanced data sampling.

Result: Achieves strong results across 7 tasks using <8K hours of public training data and <1B parameters. Even the small ~200M parameter variant shows competitive performance.

Conclusion: UniFlow-Audio demonstrates potential as a non-autoregressive foundation model for audio generation, successfully unifying diverse audio tasks through flow matching and multi-task learning.

Abstract: Audio generation, including speech, music and sound effects, has advanced rapidly in recent years. These tasks can be divided into two categories: time-aligned (TA) tasks, where each input unit corresponds to a specific segment of the output audio (e.g., phonemes aligned with frames in speech synthesis); and non-time-aligned (NTA) tasks, where such alignment is not available. Since modeling paradigms for the two types are typically different, research on different audio generation tasks has traditionally followed separate trajectories. However, audio is not inherently divided into such categories, making a unified model a natural and necessary goal for general audio generation. Previous unified audio generation works have adopted autoregressive architectures, while unified non-autoregressive approaches remain largely unexplored. In this work, we propose UniFlow-Audio, a universal audio generation framework based on flow matching. We propose a dual-fusion mechanism that temporally aligns audio latents with TA features and integrates NTA features via cross-attention in each model block. Task-balanced data sampling is employed to maintain strong performance across both TA and NTA tasks. UniFlow-Audio supports omni-modalities, including text, audio, and video. By leveraging the advantage of multi-task learning and the generative modeling capabilities of flow matching, UniFlow-Audio achieves strong results across 7 tasks using fewer than 8K hours of public training data and under 1B trainable parameters. Even the small variant with only ~200M trainable parameters shows competitive performance, highlighting UniFlow-Audio as a potential non-auto-regressive foundation model for audio generation. Code and models will be available at https://wsntxxn.github.io/uniflow_audio.

[1080] From Sound to Setting: AI-Based Equalizer Parameter Prediction for Piano Tone Replication

Song-Ze Yu

Main category: cs.SD

TL;DR: AI system predicts EQ settings from audio features for music production, enabling automated tone matching with interpretable parameters.

Details

Motivation: To create an automated tone replication system that outputs interpretable EQ parameters for musicians to adjust, rather than just audio-to-audio transformations.

Method: Used piano recordings with systematically varied EQ settings to train both regression and neural network models for predicting EQ parameter values from audio features.

Result: Neural network achieved mean squared error of 0.0216 on multi-band EQ prediction tasks, outperforming regression models.

Conclusion: The system provides practical and flexible automated tone matching for music producers and can be extended to more complex audio effects.

Abstract: This project presents an AI-based system for tone replication in music production, focusing on predicting EQ parameter settings directly from audio features. Unlike traditional audio-to-audio methods, our approach outputs interpretable parameter values (e.g., EQ band gains) that musicians can further adjust in their workflow. Using a dataset of piano recordings with systematically varied EQ settings, we evaluate both regression and neural network models. The neural network achieves a mean squared error of 0.0216 on multi-band tasks. The system enables practical, flexible, and automated tone matching for music producers and lays the foundation for extensions to more complex audio effects.

[1081] An Agent-Based Framework for Automated Higher-Voice Harmony Generation

Nia D’Souza Ganapathy, Arul Selvamani Shaja

Main category: cs.SD

TL;DR: An Agentic AI-enabled Higher Harmony Music Generator uses a multi-agent system to create musically coherent harmony through specialized agents for music ingestion, chord knowledge, harmony generation, and audio production.

Details

Motivation: To address the challenge of generating musically coherent and aesthetically pleasing harmony in algorithmic composition by mimicking human collaborative processes.

Method: A multi-agent framework with four specialized agents: Music-Ingestion Agent for input parsing, Chord-Knowledge Agent with Transformer model for chord interpretation, Harmony-Generation Agent using Harmony-GPT and Rhythm-Net for composition, and Audio-Production Agent with GAN-based synthesizer for audio rendering.

Result: The system generates sophisticated and contextually appropriate higher-voice harmonies for given melodies through robust data processing, deep theoretical understanding, creative composition, and realistic audio synthesis.

Conclusion: The modular, agent-based approach effectively creates harmonically rich music by delegating specific tasks to specialized agents, resulting in a system capable of producing aesthetically pleasing harmony.

Abstract: The generation of musically coherent and aesthetically pleasing harmony remains a significant challenge in the field of algorithmic composition. This paper introduces an innovative Agentic AI-enabled Higher Harmony Music Generator, a multi-agent system designed to create harmony in a collaborative and modular fashion. Our framework comprises four specialized agents: a Music-Ingestion Agent for parsing and standardizing input musical scores; a Chord-Knowledge Agent, powered by a Chord-Former (Transformer model), to interpret and provide the constituent notes of complex chord symbols; a Harmony-Generation Agent, which utilizes a Harmony-GPT and a Rhythm-Net (RNN) to compose a melodically and rhythmically complementary harmony line; and an Audio-Production Agent that employs a GAN-based Symbolic-to-Audio Synthesizer to render the final symbolic output into high-fidelity audio. By delegating specific tasks to specialized agents, our system effectively mimics the collaborative process of human musicians. This modular, agent-based approach allows for robust data processing, deep theoretical understanding, creative composition, and realistic audio synthesis, culminating in a system capable of generating sophisticated and contextually appropriate higher-voice harmonies for given melodies.

[1082] Beyond Genre: Diagnosing Bias in Music Embeddings Using Concept Activation Vectors

Roman B. Gebhardt, Arne Kuhle, Eylül Bektur

Main category: cs.SD

TL;DR: The paper investigates cultural bias in music representation models, finding significant biases related to singer attributes like gender and language in genre representations, and proposes a debiasing method using concept vector manipulation.

Details

Motivation: To explore whether non-musical singer attributes (gender, language) influence genre representations in music models in unintended ways, addressing the underexplored potential for cultural bias in music representation models.

Method: Applied Concept Activation Vectors (CAVs) to analyze four state-of-the-art models (MERT, Whisper, MuQ, MuQ-MuLan) using the STraDa dataset with carefully balanced training sets to control for genre confounds, and proposed a post-hoc debiasing strategy using concept vector manipulation.

Result: Revealed significant model-specific biases in genre representations that align with disparities reported in MIR and music sociology, and demonstrated the effectiveness of the proposed debiasing strategy in mitigating these biases.

Conclusion: Highlights the need for bias-aware model design in music information retrieval and shows that conceptualized interpretability methods offer practical tools for diagnosing and mitigating representational bias.

Abstract: Music representation models are widely used for tasks such as tagging, retrieval, and music understanding. Yet, their potential to encode cultural bias remains underexplored. In this paper, we apply Concept Activation Vectors (CAVs) to investigate whether non-musical singer attributes - such as gender and language - influence genre representations in unintended ways. We analyze four state-of-the-art models (MERT, Whisper, MuQ, MuQ-MuLan) using the STraDa dataset, carefully balancing training sets to control for genre confounds. Our results reveal significant model-specific biases, aligning with disparities reported in MIR and music sociology. Furthermore, we propose a post-hoc debiasing strategy using concept vector manipulation, demonstrating its effectiveness in mitigating these biases. These findings highlight the need for bias-aware model design and show that conceptualized interpretability methods offer practical tools for diagnosing and mitigating representational bias in MIR.

[1083] MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, Jiaya Jia

Main category: cs.SD

TL;DR: MGM-Omni is a unified Omni LLM that combines multimodal understanding with expressive, long-horizon speech generation using a dual-track “brain-mouth” architecture that separates reasoning from real-time speech synthesis.

Details

Motivation: To overcome the limitations of cascaded pipelines that isolate speech synthesis, enabling efficient cross-modal interaction and low-latency streaming speech generation with better timbre preservation and context awareness.

Method: Uses a dual-track, token-based architecture with unified training strategy, dual audio encoder for long-form audio perception, and chunk-based parallel decoding to bridge text-speech token-rate gap for streaming zero-shot voice cloning.

Result: Outperforms existing open-source models in timbre identity preservation, natural context-aware speech generation, and superior long-form audio/omnimodal understanding with data-efficient training.

Conclusion: Establishes an efficient end-to-end paradigm for omnimodal understanding and controllable, personalized long-horizon speech generation.

Abstract: We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a “brain-mouth” design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omnimodal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omnimodal understanding and controllable, personalised long-horizon speech generation.

[1084] Discovering “Words” in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music

Tianle Wang, Sirui Zhang, Xinyi Tong, Peiyang Yu, Jishang Chen, Liangke Zhao, Xinpu Gao, Yves Zhu, Tiezheng Ge, Bo Zheng, Duo Xu, Yang Liu, Xin Jin, Feng Yu, Songchun Zhu

Main category: cs.SD

TL;DR: An unsupervised algorithm discovers “music-words” (recurring patterns) from symbolic music data using a two-stage EM framework, achieving 0.61 IoU against human annotations.

Details

Motivation: To identify fundamental musical patterns that reflect cognitive composition processes and address the challenge of semantic ambiguity in musical interpretation.

Method: Two-stage Expectation-Maximization (EM) framework: 1) Develop music-word dictionary, 2) Reconstruct music data, formulated as a statistical optimization problem.

Result: Achieved 0.61 Intersection over Union (IoU) score when evaluated against human expert annotations, demonstrating effective pattern discovery.

Conclusion: Minimizing code length effectively addresses semantic ambiguity, suggesting human optimization shapes musical semantics; enables extraction of basic building blocks for AI music tasks and musicology analysis.

Abstract: This paper presents an unsupervised machine learning algorithm that identifies recurring patterns – referred to as music-words'' -- from symbolic music data. These patterns are fundamental to musical structure and reflect the cognitive processes involved in composition. However, extracting these patterns remains challenging because of the inherent semantic ambiguity in musical interpretation. We formulate the task of music-word discovery as a statistical optimization problem and propose a two-stage Expectation-Maximization (EM)-based learning framework: 1. Developing a music-word dictionary; 2. Reconstructing the music data. When evaluated against human expert annotations, the algorithm achieved an Intersection over Union (IoU) score of 0.61. Our findings indicate that minimizing code length effectively addresses semantic ambiguity, suggesting that human optimization of encoding systems shapes musical semantics. This approach enables computers to extract basic building blocks’’ from music data, facilitating structural analysis and sparse encoding. The method has two primary applications. First, in AI music, it supports downstream tasks such as music generation, classification, style transfer, and improvisation. Second, in musicology, it provides a tool for analyzing compositional patterns and offers insights into the principle of minimal encoding across diverse musical styles and composers.

[1085] When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks

Zeyu Xie, Chenxing Li, Xuenan Xu, Mengyue Wu, Wenfu Wang, Ruibo Fu, Meng Yu, Dong Yu, Yuexian Zou

Main category: cs.SD

TL;DR: This paper pioneers using generative features to enhance audio understanding, showing that combining generative and discriminative approaches provides both detailed perception and semantic awareness.

Details

Motivation: Conventional discriminative features lose fine-grained details while focusing on semantic abstraction. Audio generation models inherently encode both spatiotemporal perception (local acoustic texture) and semantic prior (knowing what to generate), motivating exploration of their complementary strengths.

Method: Systematic investigation of differences and complementary relationships between generative and discriminative features, followed by an effective fusion strategy that combines both approaches.

Result: Experiments across multiple tasks (sound event classification, tagging, and audio captioning) demonstrate consistent performance gains, with particularly strong results on fine-grained tasks like audio captioning.

Conclusion: This work introduces a new perspective on audio representation learning, highlighting that generative-discriminative complementarity can provide both detailed perception and semantic awareness for audio understanding.

Abstract: This work pioneers the utilization of generative features in enhancing audio understanding. Unlike conventional discriminative features that directly optimize posterior and thus emphasize semantic abstraction while losing fine grained details, audio generation models inherently encode both spatiotemporal perception (capturing local acoustic texture across time and frequency) and semantic prior (knowing what to generate). It motivates us to explore the bridge of these complementary strengths. We provide a systematic investigation of their differences and complementary relationships, and ultimately propose an effective fusion strategy. Experiments across multiple tasks, including sound event classification, tagging, and particularly the fine grained task of audio captioning, demonstrate consistent performance gains. Beyond empirical improvements, this work more importantly introduces a new perspective on audio representation learning, highlighting that generative discriminative complementarity can provide both detailed perception and semantic awareness for audio understanding.

[1086] VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Ye, Weiyue Sun, Jiancheng Gui, Kehan Li, Zhiyong Wu, Zhiyuan Liu

Main category: cs.SD

TL;DR: VoxCPM is a tokenizer-free TTS model that uses hierarchical semantic-acoustic modeling with semi-discrete residual representations to overcome the trade-off between discrete tokens (stable but less expressive) and continuous signals (expressive but error-prone).

Details

Motivation: To resolve the fundamental trade-off in speech synthesis between discrete tokens (stable but sacrificing expressivity) and continuous signals (acoustically rich but suffering from error accumulation), and eliminate dependency on external speech tokenizers that create semantic-acoustic divides.

Method: Hierarchical semantic-acoustic modeling with differentiable quantization bottleneck: Text-Semantic Language Model generates semantic-prosodic plans, Residual Acoustic Model recovers fine-grained acoustic details, and local diffusion-based decoder generates high-fidelity speech latents. Entire architecture trained end-to-end under diffusion objective.

Result: VoxCPM-0.5B model achieves state-of-the-art zero-shot TTS performance among open-source systems, demonstrating expressive and stable synthesis with context-aware expressiveness and natural flow. Trained on 1.8 million hours of bilingual corpus.

Conclusion: The approach successfully delivers expressive and stable speech synthesis without dependency on external tokenizers, showing capability to comprehend text and generate appropriate prosody and style. Model is publicly accessible under Apache 2.0 license.

Abstract: Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has driven the field towards multi-stage pipelines that rely on pre-trained speech tokenizers, but these create a semantic-acoustic divide, limiting holistic and expressive speech generation. We resolve these dilemma through hierarchical semantic-acoustic modeling with semi-discrete residual representations and present a novel tokenizer-free TTS model VoxCPM. Our framework introduces a differentiable quantization bottleneck that induces natural specialization: a Text-Semantic Language Model (TSLM) generates semantic-prosodic plans, while a Residual Acoustic Model (RALM) recovers fine-grained acoustic details. This hierarchical semantic-acoustic representation guides a local diffusion-based decoder to generate high-fidelity speech latents. Critically, the entire architecture is trained end-to-end under a simple diffusion objective, eliminating dependency on external speech tokenizers. Trained on a massive 1.8 million hours of bilingual corpus, our VoxCPM-0.5B model achieves state-of-the-art zero-shot TTS performance among open-source systems, demonstrating that our approach delivers expressive and stable synthesis. Besides, VoxCPM shows the capability to comprehend text to infer and generate appropriate prosody and style, delivering speech with context-aware expressiveness and natural flow. To facilitate community-driven research and development, VoxCPM is publicly accessible under Apache 2.0.

[1087] Sparse Autoencoders Make Audio Foundation Models more Explainable

Théo Mariotte, Martin Lebourdais, Antonio Almudévar, Marie Tahon, Alfonso Ortega, Nicolas Dugué

Main category: cs.SD

TL;DR: SAEs are used to analyze hidden representations of audio pretrained models, showing they retain original information and enhance disentanglement of vocal attributes.

Details

Motivation: Current analysis of audio pretrained models is limited to linear probing, leaving representations unclear and their internal structure poorly understood.

Method: Use Sparse Autoencoders (SAEs) to analyze hidden representations of pretrained models, with a case study in singing technique classification.

Result: SAEs retain information about original representations and class labels, and enhance disentanglement of vocal attributes.

Conclusion: SAEs are an effective tool for identifying underlying factors encoded in representations of self-supervised learning systems.

Abstract: Audio pretrained models are widely employed to solve various tasks in speech processing, sound event detection, or music information retrieval. However, the representations learned by these models are unclear, and their analysis mainly restricts to linear probing of the hidden representations. In this work, we explore the use of Sparse Autoencoders (SAEs) to analyze the hidden representations of pretrained models, focusing on a case study in singing technique classification. We first demonstrate that SAEs retain both information about the original representations and class labels, enabling their internal structure to provide insights into self-supervised learning systems. Furthermore, we show that SAEs enhance the disentanglement of vocal attributes, establishing them as an effective tool for identifying the underlying factors encoded in the representations.

Qing Wang, Ya Jiang, Hang Chen, Sabato Marco Siniscalchi, Jun Du, Jianqing Gao

Main category: cs.SD

TL;DR: HDA-SELD is a unified framework using hierarchical cross-modal distillation and multi-level data augmentation to improve audio-visual sound event localization and detection in low-resource scenarios.

Details

Motivation: To address the challenge of low-resource audio-visual sound event localization and detection by leveraging knowledge from audio-only models and enhancing learning through data augmentation.

Method: Uses an audio-only SELD model as teacher to transfer knowledge to AV student model through output responses and intermediate features. Applies multi-level data augmentation by mixing features from different network layers with tailored loss functions.

Result: Achieves 21%-38% relative improvement in overall metric over baselines on DCASE 2023 and 2024 datasets. Performs comparably or better than teacher models trained on larger datasets, surpassing state-of-the-art methods.

Conclusion: HDA-SELD effectively addresses low-resource AV SELD through hierarchical distillation and multi-level augmentation, achieving superior performance on benchmark datasets.

Abstract: This work presents HDA-SELD, a unified framework that combines hierarchical cross-modal distillation (HCMD) and multi-level data augmentation to address low-resource audio-visual (AV) sound event localization and detection (SELD). An audio-only SELD model acts as the teacher, transferring knowledge to an AV student model through both output responses and intermediate feature representations. To enhance learning, data augmentation is applied by mixing features randomly selected from multiple network layers and associated loss functions tailored to the SELD task. Extensive experiments on the DCASE 2023 and 2024 Challenge SELD datasets show that the proposed method significantly improves AV SELD performance, yielding relative gains of 21%-38% in the overall metric over the baselines. Notably, our proposed HDA-SELD achieves results comparable to or better than teacher models trained on much larger datasets, surpassing state-of-the-art methods on both DCASE 2023 and 2024 Challenge SELD tasks.

[1089] Enhanced Automatic Drum Transcription via Drum Stem Source Separation

Xavier Riley, Simon Dixon

Main category: cs.SD

TL;DR: The paper presents a method to improve drum transcription realism by combining ADTOF package with drum stem separation models, expanding from 5 to 7 drum classes and estimating MIDI velocities.

Details

Motivation: To enhance the realism of automatic drum transcriptions by leveraging existing drum stem separation capabilities to expand beyond the current 5-class limitation of ADT systems.

Method: A simple post-processing step that combines ADTOF transcription output with drum stem separation models, expanding transcription from 5 to 7 classes and estimating MIDI velocity values from separated stems.

Result: The solution achieves strong performance against an 8-class drum transcription baseline and produces realistic MIDI transcriptions suitable for MIR and music production tasks.

Conclusion: Combining existing drum transcription and stem separation tools effectively improves drum transcription realism by expanding class coverage and enabling velocity estimation.

Abstract: Automatic Drum Transcription (ADT) remains a challenging task in MIR but recent advances allow accurate transcription of drum kits with up 5 classes - kick, snare, hi-hats, toms and cymbals - via the ADTOF package. In addition, several drum kit \emph{stem} separation models in the open source community support separation for more than 6 stem classes, including distinct crash and ride cymbals. In this work we explore the benefits of combining these tools to improve the realism of drum transcriptions. We describe a simple post-processing step which expands the transcription output from five to seven classes and furthermore, we are able to estimate MIDI velocity values based on the separated stems. Our solution achieves strong performance when assessed against a baseline of 8-class drum transcription and produces realistic MIDI transcriptions suitable for MIR or music production tasks.

[1090] Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

Lukas Rauch, René Heinrich, Houtan Ghaffari, Lukas Miklautz, Ilyass Moummad, Bernhard Sick, Christoph Scholz

Main category: cs.SD

TL;DR: This paper challenges the standard practice of fine-tuning for audio self-supervised learning by showing that linear probing fails due to global pooling bottlenecks. The authors introduce binarized prototypical probes, a lightweight pooling method that outperforms existing probing approaches.

Details

Motivation: Current self-supervised learning in audio defaults to fine-tuning because linear probing misrepresents embedding quality due to global pooling creating an information bottleneck, particularly for localized events in multi-label audio.

Method: The authors investigate the global pooling bottleneck across 13 datasets and 6 spectrogram-based encoders, then introduce binarized prototypical probes - a lightweight pooling method that learns prototypes for class-wise information aggregation.

Result: The proposed method notably outperforms linear and attentive probing despite its simplicity, establishing probing as a competitive and efficient paradigm for evaluating audio SSL models.

Conclusion: This work challenges the reliance on costly fine-tuning by demonstrating that probing can be a competitive evaluation paradigm for audio self-supervised learning models when using appropriate pooling methods.

Abstract: Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning. A key reason is that global pooling creates an information bottleneck causing linear probes to misrepresent the embedding quality: The $\texttt{cls}$-token discards crucial token information about dispersed, localized events in multi-label audio. This weakness is rooted in the mismatch between the pretraining objective (operating globally) and the downstream task (localized events). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we first investigate the global pooling bottleneck. We then introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.

[1091] The Shape of Surprise: Structured Uncertainty and Co-Creativity in AI Music Tools

Eric Browne

Main category: cs.SD

TL;DR: This paper presents a thematic review of how AI music systems incorporate randomness through structured uncertainty to balance novelty with musical coherence.

Details

Motivation: To examine the paradoxical role of randomness in computational music creativity - it can spark novelty but risks incoherence if unchecked, and to analyze how designers constrain stochastic processes within musical frameworks.

Method: Comparative analysis of six AI music systems (Musika, MIDI-DDSP, Melody RNN, RAVE, Wekinator, and Somax 2) to identify recurring design patterns that support musical coherence, user control, and co-creativity.

Result: Identified recurring design patterns that support musical coherence, user control, and co-creativity through structured uncertainty in AI music systems.

Conclusion: This is the first thematic review examining randomness in AI music through structured uncertainty, offering practical insights for designers and artists aiming to support expressive, collaborative, or improvisational interactions.

Abstract: Randomness plays a pivotal yet paradoxical role in computational music creativity: it can spark novelty, but unchecked chance risks incoherence. This paper presents a thematic review of contemporary AI music systems, examining how designers incorporate randomness and uncertainty into creative practice. I draw on the concept of structured uncertainty to analyse how stochastic processes are constrained within musical and interactive frameworks. Through a comparative analysis of six systems - Musika (Pasini and Schl"uter, 2022), MIDI-DDSP (Wu et al., 2021), Melody RNN (Magenta Project), RAVE (Caillon and Esling, 2021), Wekinator (Fiebrink and Cook, 2010), and Somax 2 (Borg, 2019) - we identify recurring design patterns that support musical coherence, user control, and co-creativity. To my knowledge, this is the first thematic review examining randomness in AI music through structured uncertainty, offering practical insights for designers and artists aiming to support expressive, collaborative, or improvisational interactions.

[1092] M6(GPT)3: Generating Multitrack Modifiable Multi-Minute MIDI Music from Text using Genetic algorithms, Probabilistic methods and GPT Models in any Progression and Time Signature

Jakub Poćwiardowski, Mateusz Modrzejewski, Marek S. Tatara

Main category: cs.SD

TL;DR: M6(GPT)3 composer system generates multi-minute musical compositions from natural language prompts using transformers and genetic algorithms with adaptive emotional parameters.

Details

Motivation: To create a system that can generate complete, structurally complex musical compositions from natural language descriptions, providing an alternative to purely neural network-based approaches.

Method: Uses autoregressive transformer to map text prompts to JSON composition parameters, then applies genetic algorithm for melody generation with musical mutations and fitness functions, plus probabilistic methods for percussion generation.

Result: The system outperforms baselines on musically meaningful metrics through both human and objective evaluations, demonstrating viable musical composition generation.

Conclusion: The approach offers a successful alternative to purely neural network-based music generation systems, capable of producing complex musical structures from natural language input.

Abstract: This work introduces the M6(GPT)3 composer system, capable of generating complete, multi-minute musical compositions with complex structures in any time signature, in the MIDI domain from input descriptions in natural language. The system utilizes an autoregressive transformer language model to map natural language prompts to composition parameters in JSON format. The defined structure includes time signature, scales, chord progressions, and valence-arousal values, from which accompaniment, melody, bass, motif, and percussion tracks are created. We propose a genetic algorithm for the generation of melodic elements. The algorithm incorporates mutations with musical significance and a fitness function based on normal distribution and predefined musical feature values. The values adaptively evolve, influenced by emotional parameters and distinct playing styles. The system for generating percussion in any time signature utilises probabilistic methods, including Markov chains. Through both human and objective evaluations, we demonstrate that our music generation approach outperforms baselines on specific, musically meaningful metrics, offering a viable alternative to purely neural network-based systems.

[1093] EnvSDD: Benchmarking Environmental Sound Deepfake Detection

Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley

Main category: cs.SD

TL;DR: EnvSDD is a large-scale dataset for environmental sound deepfake detection, addressing limitations in existing datasets and methods designed for speech/singing.

Details

Motivation: Environmental sounds have different characteristics than speech/singing, making existing deepfake detection methods less effective. Current datasets are limited in scale and audio types.

Method: Created EnvSDD dataset (45.25h real + 316.74h fake audio) with diverse test conditions. Proposed detection system based on pre-trained audio foundation model.

Result: The proposed system outperforms state-of-the-art systems from speech and singing domains on the EnvSDD dataset.

Conclusion: EnvSDD addresses the gap in environmental sound deepfake detection and the proposed foundation model-based system shows superior performance compared to existing methods.

Abstract: Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate the generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state-of-the-art systems from speech and singing domains.

[1094] GRAM: Spatial general-purpose audio representation models for real-world applications

Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden

Main category: cs.SD

TL;DR: GRAM is a multi-channel masked autoencoder that learns spatial audio representations from simulated real-world scenes, outperforming existing models on real-world audio tasks including sound localization.

Details

Motivation: Current audio foundation models are trained on dry, single-channel audio and fail in real-world acoustic environments with reverberation and noise, overlooking spatial audio properties.

Method: Uses multi-channel masked autoencoder approach to learn spatial audio representations from high-quality simulated real-world scenes, supporting both binaural (2-channel) and Ambisonics (4-channel) formats.

Result: GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR benchmarks, achieves state-of-the-art localization performance, and shows robust transfer to real-world recordings.

Conclusion: GRAM represents a significant advancement towards robust, spatial audio foundation models for real-world applications, using only a fraction of training data compared to other models.

Abstract: Although audio foundations models have seen great progress on a wide variety of tasks, their application in real-world acoustic environments with reverberation and noise has been less successful. Moreover, as audio foundation models are typically trained on dry, single-channel audio clips, the inherent spatial nature of real-world sound scenes is overlooked and tasks involving sound localization ruled out. To address these limitations, we propose GRAM: a General-purpose Real-world Audio Model utilizing a multi-channel masked auto-encoder approach to efficiently learn spatial audio representations from high-quality simulated real-world scenes. To evaluate the performance of GRAM and other audio foundation models in real-world sound scenes, we release Nat-HEAR: A naturalistic version of the HEAR benchmark suite comprising a simulated real-world version, as well as two new sound localization tasks. We show that the performance of GRAM surpasses all state-of-the-art self-supervised audio foundation models and speech models on both HEAR and Nat-HEAR, while using only a fraction of the training data. GRAM also showcases state-of-the-art localization performance, surpassing even supervised sound localization approaches, and can be flexibly applied either to a two-channel, binaural sound format or a four-channel, Ambisonics format. Validating GRAM’s performance on real-world sound recordings demonstrates robust transfer to real-world scenes. Taken together, GRAM presents a significant advancement towards robust, spatial audio foundation models for real-world applications.

[1095] Discrete Audio Tokens: More Than a Survey!

Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

Main category: cs.SD

TL;DR: Systematic review and benchmark of discrete audio tokenizers across speech, music, and general audio domains, evaluating reconstruction, downstream performance, and acoustic language modeling.

Details

Motivation: Existing surveys lack unified comparison across benchmarks and focus on specific domains, while discrete audio tokens enable efficient storage, inference, and integration with LLMs.

Method: Proposed taxonomy based on encoder-decoder, quantization, training paradigm, streamability, and domains; evaluated tokenizers on multiple benchmarks with controlled ablation studies.

Result: Findings highlight key limitations, practical considerations, and trade-offs in audio tokenization performance across different domains and tasks.

Conclusion: Provides insights and guidance for future research in discrete audio tokenization, addressing open challenges in this rapidly evolving field.

Abstract: Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

[1096] ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals

Yucong Zhang, Juan Liu, Ming Li

Main category: cs.SD

TL;DR: ECHO is a foundation model for general machine signal processing that handles arbitrary sampling rates through band-split architecture and frequency positional embeddings, achieving SOTA performance in anomaly detection and fault classification.

Details

Motivation: Pre-trained foundation models have succeeded in audio, vision and language domains, but their potential for general machine signal modeling with arbitrary sampling rates (covering acoustic, vibration, and industrial sensor data) remains under-explored.

Method: Integrates band-split architecture with frequency positional embeddings for spectral localization across arbitrary sampling configurations, and uses sliding patches to support variable-length inputs without padding/cropping, producing embeddings that retain temporal and spectral fidelity.

Result: Demonstrates consistent state-of-the-art performance in machine signal anomaly detection and fault classification across various datasets including DCASE task 2 challenges (2020-2025) and industrial signal corpora.

Conclusion: The proposed ECHO model shows effectiveness and strong generalization capability for machine signal processing, and has been open-sourced for community use.

Abstract: Pre-trained foundation models have demonstrated remarkable success in audio, vision and language, yet their potential for general machine signal modeling with arbitrary sampling rates-covering acoustic, vibration, and other industrial sensor data-remains under-explored. In this work, we propose a novel foundation model ECHO that integrates an advanced band-split architecture with frequency positional embeddings, enabling spectral localization across arbitrary sampling configurations. Moreover, the model incorporates sliding patches to support inputs of variable length without padding or cropping, producing a concise embedding that retains both temporal and spectral fidelity and naturally extends to streaming scenarios. We evaluate our method on various kinds of machine signal datasets, including previous DCASE task 2 challenges (2020-2025), and widely-used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in machine signal anomaly detection and fault classification, confirming the effectiveness and generalization capability of the proposed model. We open-sourced ECHO on https://github.com/yucongzh/ECHO.

[1097] Versatile Symbolic Music-for-Music Modeling via Function Alignment

Junyan Jiang, Daniel Chin, Liwei Lin, Xuanjie Liu, Gus Xia

Main category: cs.SD

TL;DR: Unifies music understanding and generation tasks under a music-for-music paradigm using pretrained language models connected via lightweight adapters.

Details

Motivation: Many music AI models rely on human-defined labels, but music annotations can be naturally expressed within the music modality itself, enabling unified treatment of understanding and generation tasks.

Method: Use pretrained language models for both reference and target sequences, connecting them with lightweight adapters for parameter-efficient music-for-music modeling.

Result: Achieves superior performance across various tasks including chord recognition, melody generation, and drum track generation.

Conclusion: The proposed music-for-music paradigm with parameter-efficient adapter connections enables effective unified modeling of diverse symbolic music tasks.

Abstract: Many music AI models learn a map between music content and human-defined labels. However, many annotations, such as chords, can be naturally expressed within the music modality itself, e.g., as sequences of symbolic notes. This observation enables both understanding tasks (e.g., chord recognition) and conditional generation tasks (e.g., chord-conditioned melody generation) to be unified under a music-for-music sequence modeling paradigm. In this work, we propose parameter-efficient solutions for a variety of symbolic music-for-music tasks. The high-level idea is that (1) we utilize a pretrained Language Model (LM) for both the reference and the target sequence and (2) we link these two LMs via a lightweight adapter. Experiments show that our method achieves superior performance among different tasks such as chord recognition, melody generation, and drum track generation. All demos, code and model weights are publicly available.

[1098] Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks

Linus Stuhlmann, Michael Alexander Saxer

Main category: cs.SD

TL;DR: Evaluation of Wav2Vec 2.0, XLS-R, and Whisper speech encoders for speaker identification, analyzing layer-wise representations and determining optimal transformer layers for fine-tuning.

Details

Motivation: To assess and compare the performance of three advanced speech encoder models in speaker identification tasks and understand how they capture speaker-specific features across different layers.

Method: Fine-tuned Wav2Vec 2.0, XLS-R, and Whisper models, then analyzed their layer-wise representations using SVCCA, k-means clustering, and t-SNE visualizations to identify optimal transformer layers.

Result: Wav2Vec 2.0 and XLS-R effectively capture speaker-specific features in early layers, with fine-tuning improving stability and performance. Whisper performs better in deeper layers. Optimal number of transformer layers for each model was determined.

Conclusion: Different speech encoder models have varying optimal layer configurations for speaker identification, with Wav2Vec 2.0 and XLS-R benefiting from early layers and Whisper from deeper layers, and fine-tuning enhances model performance and stability.

Abstract: This study evaluates the performance of three advanced speech encoder models, Wav2Vec 2.0, XLS-R, and Whisper, in speaker identification tasks. By fine-tuning these models and analyzing their layer-wise representations using SVCCA, k-means clustering, and t-SNE visualizations, we found that Wav2Vec 2.0 and XLS-R capture speaker-specific features effectively in their early layers, with fine-tuning improving stability and performance. Whisper showed better performance in deeper layers. Additionally, we determined the optimal number of transformer layers for each model when fine-tuned for speaker identification tasks.

[1099] Xi+: Uncertainty Supervision for Robust Speaker Embedding

Junjie Li, Kong Aik Lee, Duc-Tuan Truong, Tianchi Liu, Man-Wai Mak

Main category: cs.SD

TL;DR: Proposed xi+ architecture improves speaker recognition by adding temporal attention and explicit uncertainty supervision through Stochastic Variance Loss, achieving ~10-11% performance gains.

Details

Motivation: Current xi-vector model's uncertainty estimation is implicitly trained through classification loss alone and ignores temporal relationships between frames, leading to suboptimal performance.

Method: Enhanced xi-vector with temporal attention module for context-aware frame-level uncertainty estimation and introduced Stochastic Variance Loss for explicit uncertainty supervision.

Result: Achieved consistent performance improvements: ~10% on VoxCeleb1-O set and ~11% on NIST SRE 2024 evaluation set.

Conclusion: xi+ architecture with temporal attention and explicit uncertainty supervision significantly improves speaker recognition performance over baseline xi-vector.

Abstract: There are various factors that can influence the performance of speaker recognition systems, such as emotion, language and other speaker-related or context-related variations. Since individual speech frames do not contribute equally to the utterance-level representation, it is essential to estimate the importance or reliability of each frame. The xi-vector model addresses this by assigning different weights to frames based on uncertainty estimation. However, its uncertainty estimation model is implicitly trained through classification loss alone and does not consider the temporal relationships between frames, which may lead to suboptimal supervision. In this paper, we propose an improved architecture, xi+. Compared to xi-vector, xi+ incorporates a temporal attention module to capture frame-level uncertainty in a context-aware manner. In addition, we introduce a novel loss function, Stochastic Variance Loss, which explicitly supervises the learning of uncertainty. Results demonstrate consistent performance improvements of about 10% on the VoxCeleb1-O set and 11% on the NIST SRE 2024 evaluation set.

[1100] FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman

Main category: cs.SD

TL;DR: FuseCodec is a novel speech tokenization method that unifies acoustic, semantic, and contextual representations through cross-modal alignment and global supervision, achieving state-of-the-art performance in speech tasks.

Details

Motivation: Existing neural codecs focus on low-level acoustic features but overlook semantic and contextual cues in human speech, creating challenges in aligning semantic and contextual representations.

Method: Three complementary techniques: (1) Latent Representation Fusion integrating semantic/contextual features into encoder latent space, (2) Global Semantic-Contextual Supervision with globally pooled representations, and (3) Temporally Aligned Contextual Supervision with dynamic token matching within local windows.

Result: Achieves SOTA performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Also demonstrates applicability to zero-shot speech synthesis with FuseCodec-TTS.

Conclusion: The approach effectively enables contextually and semantically guided tokenization for speech processing and downstream tasks, with code and models publicly available.

Abstract: Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology’s applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.

[1101] Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing

Wataru Nakata, Yuki Saito, Yota Ueda, Hiroshi Saruwatari

Main category: cs.SD

TL;DR: Sidon is a fast, open-source speech restoration model that converts noisy speech into studio-quality speech across multiple languages, achieving performance comparable to Google’s internal model while being 500x faster than real-time.

Details

Motivation: Large-scale TTS systems are limited by scarce clean, multilingual recordings, so there's a need for efficient speech restoration to create high-quality training data.

Method: Uses two models: w2v-BERT 2.0 finetuned feature predictor to cleanse features from noisy speech, and a vocoder trained to synthesize restored speech from cleansed features.

Result: Achieves restoration performance comparable to Google’s Miipher model, runs 500x faster than real-time on single GPU, and improves TTS quality in zero-shot settings when used for dataset cleansing.

Conclusion: Sidon provides an efficient open-source solution for speech restoration and dataset cleansing that benefits the research community and improves TTS system quality.

Abstract: Large-scale text-to-speech (TTS) systems are limited by the scarcity of clean, multilingual recordings. We introduce Sidon, a fast, open-source speech restoration model that converts noisy in-the-wild speech into studio-quality speech and scales to dozens of languages. Sidon consists of two models: w2v-BERT 2.0 finetuned feature predictor to cleanse features from noisy speech and vocoder trained to synthesize restored speech from the cleansed features. Sidon achieves restoration performance comparable to Miipher: Google’s internal speech restoration model with the aim of dataset cleansing for speech synthesis. Sidon is also computationally efficient, running up to 500 times faster than real time on a single GPU. We further show that training a TTS model using a Sidon-cleansed automatic speech recognition corpus improves the quality of synthetic speech in a zero-shot setting. Code and model are released to facilitate reproducible dataset cleansing for the research community.

[1102] SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions

Massa Baali, Sarthak Bisht, Francisco Teixeira, Kateryna Shapovalenko, Rita Singh, Bhiksha Raj

Main category: cs.SD

TL;DR: SVeritas is a comprehensive benchmark suite for speaker verification systems that evaluates robustness across multiple real-world challenges including recording conditions, signal degradations, demographic factors, and security threats, revealing performance gaps in current models.

Details

Motivation: Existing speaker verification benchmarks are incomplete, evaluating only subsets of real-world challenges while missing important conditions like cross-language trials, age mismatches, and codec compression that significantly impact system performance in practical applications.

Method: Developed SVeritas benchmark suite that systematically assesses speaker verification systems under diverse stressors including recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, spoofing, and adversarial attacks.

Result: Evaluation of state-of-the-art SV models revealed substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression, with additional disparities identified across demographic subgroups (age, gender, linguistic backgrounds).

Conclusion: SVeritas provides standardized evaluation under realistic conditions, enabling precise diagnosis of model weaknesses and establishing a foundation for developing more equitable and reliable speaker verification systems.

Abstract: Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These include a variety of natural and maliciously created conditions causing signal degradations or mismatches between enrollment and test data, impacting performance. Existing benchmarks evaluate only subsets of these conditions, missing others entirely. We introduce SVeritas, a comprehensive Speaker Verification tasks benchmark suite, assessing SV systems under stressors like recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks. While several benchmarks do exist that each cover some of these issues, SVeritas is the first comprehensive evaluation that not only includes all of these, but also several other entirely new, but nonetheless important, real-life conditions that have not previously been benchmarked. We use SVeritas to evaluate several state-of-the-art SV models and observe that while some architectures maintain stability under common distortions, they suffer substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression. Extending our analysis across demographic subgroups, we further identify disparities in robustness across age groups, gender, and linguistic backgrounds. By standardizing evaluation under realistic and synthetic stress conditions, SVeritas enables precise diagnosis of model weaknesses and establishes a foundation for advancing equitable and reliable speaker verification systems.

[1103] i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents

Anupam Purwar, Aditya Choudhary

Main category: cs.SD

TL;DR: This paper analyzes optimization of voice-to-voice (V-2-V) systems for real-time applications, finding that TTS components have the highest impact on latency, and proposes reducing RVQ iterations and codebooks in the Mimi TTS system as key optimizations.

Details

Motivation: To optimize voice-to-voice communication models for low-latency real-time conversational applications while maintaining high-quality interactions.

Method: Analyzed V-2-V system components (ASR, TTS, dialog management), experimented with CSM1b architecture that understands tone and context, and optimized Residual Vector Quantization (RVQ) iterations in the TTS decoder.

Result: Identified TTS as the component with highest impact on Real Time Factor (RTF), found that reducing RVQ iterations and codebooks in Mimi TTS system provides the most important optimizations for CSM-based V-2-V implementations.

Conclusion: Key optimizations for real-time V-2-V systems involve reducing TTS processing time through fewer RVQ iterations and codebooks, though this comes at a cost of voice quality degradation.

Abstract: We experiment with a low-latency, end-to-end voice-to-voice communication model to optimize it for real-time conversational applications. By analyzing components essential to voice to voice (V-2-V) system viz. automatic speech recognition (ASR), text-to-speech (TTS), and dialog management, our work analyzes how to reduce processing time while maintaining high-quality interactions to identify the levers for optimizing V-2-V system. Our work identifies that TTS component which generates life-like voice, full of emotions including natural pauses and exclamations has highest impact on Real time factor (RTF). The experimented V-2-V architecture utilizes CSM1b has the capability to understand tone as well as context of conversation by ingesting both audio and text of prior exchanges to generate contextually accurate speech. We explored optimization of Residual Vector Quantization (RVQ) iterations by the TTS decoder which come at a cost of decrease in the quality of voice generated. Our experimental evaluations also demonstrate that for V-2-V implementations based on CSM most important optimizations can be brought by reducing the number of RVQ Iterations along with the codebooks used in Mimi.

[1104] Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning

Junchuan Zhao, Xintong Wang, Ye Wang

Main category: cs.SD

TL;DR: A voice conversion model using VALLE-X framework with prosody-aware audio codec encoder (PACE) for improved prosody control and speaker adaptation.

Details

Motivation: Leverage recent advances in discrete audio codecs and codec language models for zero-shot speech synthesis, specifically for voice conversion with better prosody control.

Method: Proposed a voice conversion model within VALLE-X framework with a prosody-aware audio codec encoder (PACE) module that isolates and refines prosody from other sources.

Result: Outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.

Conclusion: The integration of PACE into VC model enables greater flexibility in prosody manipulation while preserving speaker timbre, achieving superior performance over baseline systems.

Abstract: Recent advances in discrete audio codecs have significantly improved speech representation modeling, while codec language models have enabled in-context learning for zero-shot speech synthesis. Inspired by this, we propose a voice conversion (VC) model within the VALLE-X framework, leveraging its strong in-context learning capabilities for speaker adaptation. To enhance prosody control, we introduce a prosody-aware audio codec encoder (PACE) module, which isolates and refines prosody from other sources, improving expressiveness and control. By integrating PACE into our VC model, we achieve greater flexibility in prosody manipulation while preserving speaker timbre. Experimental evaluation results demonstrate that our approach outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness, surpassing baseline VC systems.

cs.LG

[1105] Localizing Adversarial Attacks To Produces More Imperceptible Noise

Pavan Reddy, Aditya Sanjay Gujral

Main category: cs.LG

TL;DR: Localized adversarial attacks using binary masks achieve better imperceptibility with lower pixel perturbations and higher PSNR/SSIM scores than global attacks, but require more computation and slightly reduce attack success rates.

Details

Motivation: To systematically evaluate the underexplored potential of localized adversarial noise compared to traditional global perturbations in machine learning security.

Method: Introduce binary masks to constrain adversarial noise to specific regions, then evaluate localized versions of FGSM, PGD, and C&W attacks across effectiveness, imperceptibility, and computational efficiency metrics.

Result: Localized attacks achieve significantly lower mean pixel perturbations, higher PSNR, and improved SSIM compared to global attacks, though with increased computational effort and modest reduction in Attack Success Rate.

Conclusion: Iterative methods (PGD, C&W) are more robust to localization constraints than single-step methods (FGSM), providing practical insights for advancing attack strategies and designing robust defensive systems.

Abstract: Adversarial attacks in machine learning traditionally focus on global perturbations to input data, yet the potential of localized adversarial noise remains underexplored. This study systematically evaluates localized adversarial attacks across widely-used methods, including FGSM, PGD, and C&W, to quantify their effectiveness, imperceptibility, and computational efficiency. By introducing a binary mask to constrain noise to specific regions, localized attacks achieve significantly lower mean pixel perturbations, higher Peak Signal-to-Noise Ratios (PSNR), and improved Structural Similarity Index (SSIM) compared to global attacks. However, these benefits come at the cost of increased computational effort and a modest reduction in Attack Success Rate (ASR). Our results highlight that iterative methods, such as PGD and C&W, are more robust to localization constraints than single-step methods like FGSM, maintaining higher ASR and imperceptibility metrics. This work provides a comprehensive analysis of localized adversarial attacks, offering practical insights for advancing attack strategies and designing robust defensive systems.

[1106] In-Context Learning can Perform Continual Learning Like Humans

Liuwang Kang, Fan Wang, Shaoshan Liu, Hung-Chyun Chou, Chuan Lin, Ning Ding

Main category: cs.LG

TL;DR: The paper introduces in-context continual learning (ICCL) as an inference-only paradigm that enables large language models to achieve long-term retention and cross-task knowledge accumulation through task scheduling and prompt rearrangement, showing human-like retention patterns.

Details

Motivation: To explore whether in-context learning can achieve long-term retention and cross-task knowledge accumulation in sequential multitask settings, inspired by human memory studies.

Method: Extend in-context learning to ICCL through task scheduling and prompt rearrangement, using Markov-Chain benchmarks and proposing a human-retention similarity metric to evaluate alignment with human retention dynamics.

Result: ICCL benefits from distributed practice similar to humans, showing a spacing “sweet spot” for retention. Linear-attention models like MAMBA and RWKV exhibit particularly human-like retention patterns despite lower performance than Transformer-based LLMs.

Conclusion: ICCL is both cognitively plausible and practically effective, providing an inference-only continual learning paradigm that mitigates catastrophic forgetting and addresses the stability-plasticity dilemma in conventional CL methods.

Abstract: Large language models (LLMs) can adapt to new tasks via in-context learning (ICL) without parameter updates, making them powerful learning engines for fast adaptation. While extensive research has examined ICL as a few-shot learner, whether it can achieve long-term retention and cross-task knowledge accumulation when multitasks arrive sequentially remains underexplored. Motivated by human memory studies, we investigate the retention characteristics of ICL in multitask settings and extend it to in-context continual learning (ICCL), where continual learning ability emerges through task scheduling and prompt rearrangement. Experiments on Markov-Chain benchmarks demonstrate that, for specific large-language models, ICCL benefits from distributed practice (DP) in a manner analogous to humans, consistently revealing a spacing “sweet spot” for retention. Beyond retention performance, we propose a human-retention similarity metric to quantify how closely a continual-learning (CL) method aligns with human retention dynamics. Using this metric, we show that linear-attention models such as MAMBA and RWKV exhibit particularly human-like retention patterns, despite their retention performance lagging behind that of Transformer-based LLMs. Overall, our results establish ICCL as both cognitively plausible and practically effective, providing an inference-only CL paradigm that mitigates catastrophic forgetting and addresses the stability-plasticity dilemma in conventional CL methods.

[1107] Communication-Efficient and Interoperable Distributed Learning

Mounssif Krouka, Mehdi Bennis

Main category: cs.LG

TL;DR: A communication-efficient distributed learning framework that supports model heterogeneity and privacy preservation through modular composition and shared fusion-layer outputs.

Details

Motivation: To address challenges in collaborative learning across heterogeneous model architectures while ensuring interoperability and privacy preservation.

Method: Partition models into personalized base blocks and generalized modular blocks with common fusion-layer output dimensions; share only fusion-layer outputs while keeping model parameters private.

Result: Achieves superior communication efficiency compared to federated learning (FL) and federated split learning (FSL) baselines while maintaining stable training performance.

Conclusion: The proposed framework effectively enables collaborative learning across heterogeneous architectures with improved communication efficiency and privacy protection.

Abstract: Collaborative learning across heterogeneous model architectures presents significant challenges in ensuring interoperability and preserving privacy. We propose a communication-efficient distributed learning framework that supports model heterogeneity and enables modular composition during inference. To facilitate interoperability, all clients adopt a common fusion-layer output dimension, which permits each model to be partitioned into a personalized base block and a generalized modular block. Clients share their fusion-layer outputs, keeping model parameters and architectures private. Experimental results demonstrate that the framework achieves superior communication efficiency compared to federated learning (FL) and federated split learning (FSL) baselines, while ensuring stable training performance across heterogeneous architectures.

[1108] On the Capacity of Self-Attention

Micah Adler

Main category: cs.LG

TL;DR: Self-attention capacity scales as D_K = Θ(m’ log m’ / d_model) to recover m’ relations, and distributing key-query budget across multiple heads mitigates interference when embeddings are compressed.

Details

Motivation: To formally understand the capacity of self-attention layers in recovering distinct relations between tokens, and provide principled design rules for allocating key-query budget.

Method: Introduce Relational Graph Recognition (RGR) framework, derive information-theoretic lower bounds and explicit constructions, and validate with controlled single-layer experiments.

Result: Established capacity scaling law showing D_K = Θ(m’ log m’ / d_model) is necessary and sufficient for recovering m’ relations. Multi-head attention benefits emerge from interference mitigation in compressed embeddings.

Conclusion: Provides concrete scaling law for self-attention capacity and principled design rule for distributing key-query budget across heads to maximize recoverable relations.

Abstract: While self-attention is known to learn relations among tokens, we lack a formal understanding of its capacity: how many distinct relations can a single layer reliably recover for a given budget? To formalize this, we introduce Relational Graph Recognition (RGR), where the key-query channel represents a graph on $m$ items with $m’$ directed edges, and, given a context of items, must recover the neighbors of each item. We measure resources by the total key dimension $D_K = h,d_k$. Within this framework, we analytically derive a capacity scaling law and validate it empirically. We show that $D_K = \Theta(m’ \log m’ / d_{\text{model}})$ is both necessary (information-theoretic lower bound) and sufficient (explicit construction) in a broad class of graphs to recover $m’$ relations. This scaling law directly leads to a new, capacity-based rationale for multi-head attention that applies even when each item only attends to a single target. When embeddings are uncompressed ($m = d_{\text{model}}$) and the graph is a permutation, a single head suffices. However, compression ($m > d_{\text{model}}$) forces relations into overlapping subspaces, creating interference that a single large head cannot disentangle. Our analysis shows that allocating a fixed $D_K$ across many small heads mitigates this interference, increasing the number of recoverable relations. Controlled single-layer experiments mirror the theory, revealing a sharp performance threshold that matches the predicted capacity scaling and confirms the benefit of distributing $D_K$ across multiple heads. Altogether, these results provide a concrete scaling law for self-attention capacity and a principled design rule for allocating key-query budget across heads.

[1109] Boundary on the Table: Efficient Black-Box Decision-Based Attacks for Structured Data

Roie Kazoom, Yuval Ratzabi, Etamar Rothstein, Ofer Hadar

Main category: cs.LG

TL;DR: A novel black-box, decision-based adversarial attack for tabular data that achieves over 90% success rates with minimal queries, revealing critical vulnerabilities in tabular models.

Details

Motivation: Adversarial robustness in structured data remains underexplored compared to vision and language domains, creating a need for specialized attacks for tabular data.

Method: Combines gradient-free direction estimation with iterative boundary search to efficiently navigate discrete and continuous feature spaces under minimal oracle access.

Result: Successfully compromises nearly entire test sets across diverse models (classical ML to LLM-based pipelines) with success rates consistently above 90% using only small number of queries per instance.

Conclusion: Tabular models are critically vulnerable to adversarial perturbations, highlighting urgent need for stronger defenses in real-world decision-making systems.

Abstract: Adversarial robustness in structured data remains an underexplored frontier compared to vision and language domains. In this work, we introduce a novel black-box, decision-based adversarial attack tailored for tabular data. Our approach combines gradient-free direction estimation with an iterative boundary search, enabling efficient navigation of discrete and continuous feature spaces under minimal oracle access. Extensive experiments demonstrate that our method successfully compromises nearly the entire test set across diverse models, ranging from classical machine learning classifiers to large language model (LLM)-based pipelines. Remarkably, the attack achieves success rates consistently above 90%, while requiring only a small number of queries per instance. These results highlight the critical vulnerability of tabular models to adversarial perturbations, underscoring the urgent need for stronger defenses in real-world decision-making systems.

[1110] Adaptive Margin RLHF via Preference over Preferences

Yaswanth Chittepu, Prasann Singhal, Greg Durrett, Scott Niekum

Main category: cs.LG

TL;DR: DPO-PoP extends Direct Preference Optimization by using preference-over-preference annotations to infer adaptive margins per datapoint, improving both discriminative and generative performance over existing margin methods.

Details

Motivation: Existing margin methods in RLHF reward learning use no margins, fixed margins, or simplistic rating-based margins that fail to account for varying preference strengths and rely on noisy human ratings. Modeling preference strength can improve generalization and alignment.

Method: Leverages preference-over-preference annotations (comparing which of two preferences reflects stronger distinction) to infer adaptive margins per datapoint. Extends DPO with these learned margins as DPO-PoP.

Result: Outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on UltraFeedback dataset. Shows tradeoff between discriminative and generative performance - improving test classification accuracy can reduce generative quality.

Conclusion: Proposes two sampling strategies for preference-over-preference labels to navigate the discriminative-generative tradeoff: one favoring discriminative performance and one favoring generative performance.

Abstract: Margin-based optimization is fundamental to improving generalization and robustness in classification tasks. In the context of reward model learning from preferences within Reinforcement Learning from Human Feedback (RLHF), existing methods typically rely on no margins, fixed margins, or margins that are simplistic functions of preference ratings. However, such formulations often fail to account for the varying strengths of different preferences, for example some preferences are associated with larger margins between responses, or they rely on noisy margin information derived from ratings. We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. Furthermore, many existing methods that use adaptive margins assume access to accurate preference scores, which can be difficult for humans to provide reliably. We propose an approach that leverages preferences over preferences, that is annotations indicating which of two preferences reflects a stronger distinction. We use this ordinal signal to infer adaptive margins on a per-datapoint basis. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision, enabling improved discriminative and generative performance. Empirically, our method outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on the UltraFeedback dataset. Additionally, we show that there is a tradeoff between discriminative and generative performance: improving test classification accuracy, particularly by correctly labeling weaker preferences at the expense of stronger ones, can lead to a decline in generative quality. To navigate this tradeoff, we propose two sampling strategies to gather preference-over-preference labels: one favoring discriminative performance and one favoring generative performance.

[1111] Observation-Free Attacks on Online Learning to Rank

Sameep Chattopadhyay, Nikhil Karamchandani, Sharayu Mohair

Main category: cs.LG

TL;DR: The paper presents adversarial attack strategies against online learning to rank algorithms that can promote target items to top-K recommendations while causing linear regret to the learning algorithm, requiring only O(log T) manipulations.

Details

Motivation: Online learning to rank algorithms are widely used in search engines and recommendation systems, but their vulnerability to coordinated adversarial attacks remains poorly understood despite extensive adoption.

Method: Proposed two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB, designed to promote target items to top-K recommendations while inducing linear regret.

Result: Theoretical guarantees show both attack strategies require only O(log T) manipulations to succeed, with empirical validation on real-world data.

Conclusion: The framework demonstrates significant vulnerabilities in widely used OLTR algorithms, highlighting the need for robust defenses against coordinated adversarial attacks in ranking systems.

Abstract: Online learning to rank (OLTR) plays a critical role in information retrieval and machine learning systems, with a wide range of applications in search engines and content recommenders. However, despite their extensive adoption, the susceptibility of OLTR algorithms to coordinated adversarial attacks remains poorly understood. In this work, we present a novel framework for attacking some of the widely used OLTR algorithms. Our framework is designed to promote a set of target items so that they appear in the list of top-K recommendations for T - o(T) rounds, while simultaneously inducing linear regret in the learning algorithm. We propose two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB . We provide theoretical guarantees showing that both strategies require only O(log T) manipulations to succeed. Additionally, we supplement our theoretical analysis with empirical results on real-world data.

[1112] Neighborhood Sampling Does Not Learn the Same Graph Neural Network

Zehao Niu, Mihai Anitescu, Jie Chen

Main category: cs.LG

TL;DR: The paper analyzes neighborhood sampling in graph neural networks using neural tangent kernels, showing different sampling methods produce different posterior distributions that converge with more samples but have incomparable performance.

Details

Motivation: To understand the systemic behaviors of neighborhood sampling in large-scale graph neural networks, which is widely used in practice but lacks theoretical analysis.

Method: Used neural tangent kernels and Gaussian processes to theoretically analyze different neighborhood sampling approaches and their corresponding posterior distributions.

Result: Different sampling approaches produce different posterior distributions with limited samples, but converge to the same distribution as sample size increases. Posterior covariances are incomparable, explaining why no sampling method dominates.

Conclusion: Neighborhood sampling methods have distinct behaviors with limited samples but converge asymptotically, and their performance differences are theoretically incomparable, aligning with empirical observations.

Abstract: Neighborhood sampling is an important ingredient in the training of large-scale graph neural networks. It suppresses the exponential growth of the neighborhood size across network layers and maintains feasible memory consumption and time costs. While it becomes a standard implementation in practice, its systemic behaviors are less understood. We conduct a theoretical analysis by using the tool of neural tangent kernels, which characterize the (analogous) training dynamics of neural networks based on their infinitely wide counterparts – Gaussian processes (GPs). We study several established neighborhood sampling approaches and the corresponding posterior GP. With limited samples, the posteriors are all different, although they converge to the same one as the sample size increases. Moreover, the posterior covariance, which lower-bounds the mean squared prediction error, is uncomparable, aligning with observations that no sampling approach dominates.

[1113] From Noise to Knowledge: A Comparative Study of Acoustic Anomaly Detection Models in Pumped-storage Hydropower Plants

Karim Khamaisi, Nicolas Keller, Stefan Krummenacher, Valentin Huber, Bernhard Fässler, Bruno Rodrigues

Main category: cs.LG

TL;DR: Comparative analysis of acoustic-based anomaly detection methods for hydropower plants, benchmarking LSTM AE, K-Means, and OC-SVM on real-world datasets with OC-SVM achieving best accuracy-training time trade-off.

Details

Motivation: Unplanned outages in industrial factories and energy producers are highly costly and difficult to service, but existing acoustic-anomaly detection studies rely on generic datasets with limited focus on hydropower plants due to access constraints.

Method: Address acoustic preprocessing challenges under noisy conditions, extract time- and frequency-domain features, and benchmark three ML models (LSTM AE, K-Means, OC-SVM) on two real-world datasets from a pumped-storage plant with induced and real-world anomalies.

Result: One-Class SVM achieved best trade-off with ROC AUC 0.966-0.998 and minimal training time, while LSTM autoencoder delivered strong detection (ROC AUC 0.889-0.997) at higher computational cost.

Conclusion: Acoustic-based anomaly detection can effectively improve predictive maintenance in hydropower plants, with OC-SVM providing the most practical solution due to its optimal balance of accuracy and computational efficiency.

Abstract: In the context of industrial factories and energy producers, unplanned outages are highly costly and difficult to service. However, existing acoustic-anomaly detection studies largely rely on generic industrial or synthetic datasets, with few focused on hydropower plants due to limited access. This paper presents a comparative analysis of acoustic-based anomaly detection methods, as a way to improve predictive maintenance in hydropower plants. We address key challenges in the acoustic preprocessing under highly noisy conditions before extracting time- and frequency-domain features. Then, we benchmark three machine learning models: LSTM AE, K-Means, and OC-SVM, which are tested on two real-world datasets from the Rodundwerk II pumped-storage plant in Austria, one with induced anomalies and one with real-world conditions. The One-Class SVM achieved the best trade-off of accuracy (ROC AUC 0.966-0.998) and minimal training time, while the LSTM autoencoder delivered strong detection (ROC AUC 0.889-0.997) at the expense of higher computational cost.

[1114] FedAgentBench: Towards Automating Real-world Federated Medical Image Analysis with Server-Client LLM Agents

Pramit Saha, Joshua Strong, Divyanshu Mishra, Cheng Ouyang, J. Alison Noble

Main category: cs.LG

TL;DR: This paper introduces an agent-driven federated learning framework and FedAgentBench benchmark to automate healthcare FL workflows, addressing practical orchestration challenges that hinder real-world deployment.

Details

Motivation: Real-world FL deployment faces complex operational challenges requiring substantial human effort, including client selection, coordination, data preprocessing, data harmonization, and algorithm selection. Existing FL works overlook these practical orchestration issues.

Method: Proposed an agent-driven FL framework capturing key phases of real-world FL workflows and FedAgentBench benchmark with 40 FL algorithms, 201 datasets across 6 healthcare modalities, evaluating 24 LLM agents’ autonomous coordination capabilities.

Result: While some agent cores like GPT-4.1 and DeepSeek V3 can automate various FL pipeline stages, more complex interdependent tasks based on implicit goals remain challenging even for the strongest models.

Conclusion: Agent-driven FL systems show promise for automating healthcare FL workflows but face limitations in handling complex interdependent tasks, highlighting the need for continued development in autonomous FL coordination.

Abstract: Federated learning (FL) allows collaborative model training across healthcare sites without sharing sensitive patient data. However, real-world FL deployment is often hindered by complex operational challenges that demand substantial human efforts. This includes: (a) selecting appropriate clients (hospitals), (b) coordinating between the central server and clients, (c) client-level data pre-processing, (d) harmonizing non-standardized data and labels across clients, and (e) selecting FL algorithms based on user instructions and cross-client data characteristics. However, the existing FL works overlook these practical orchestration challenges. These operational bottlenecks motivate the need for autonomous, agent-driven FL systems, where intelligent agents at each hospital client and the central server agent collaboratively manage FL setup and model training with minimal human intervention. To this end, we first introduce an agent-driven FL framework that captures key phases of real-world FL workflows from client selection to training completion and a benchmark dubbed FedAgentBench that evaluates the ability of LLM agents to autonomously coordinate healthcare FL. Our framework incorporates 40 FL algorithms, each tailored to address diverse task-specific requirements and cross-client characteristics. Furthermore, we introduce a diverse set of complex tasks across 201 carefully curated datasets, simulating 6 modality-specific real-world healthcare environments, viz., Dermatoscopy, Ultrasound, Fundus, Histopathology, MRI, and X-Ray. We assess the agentic performance of 14 open-source and 10 proprietary LLMs spanning small, medium, and large model scales. While some agent cores such as GPT-4.1 and DeepSeek V3 can automate various stages of the FL pipeline, our results reveal that more complex, interdependent tasks based on implicit goals remain challenging for even the strongest models.

[1115] FedCF: Fair Federated Conformal Prediction

Anutam Srinivasan, Aditya T. Vadlamani, Amin Meghrazi, Srinivasan Parthasarathy

Main category: cs.LG

TL;DR: Extends Conformal Fairness to Federated Learning for auditing model fairness across demographic groups.

Details

Motivation: Standard Conformal Prediction provides uncertainty quantification but ignores sensitive attributes, while existing fairness methods don't address federated settings.

Method: Extends Conformal Fairness framework to Federated Learning, analyzing fairness gaps across demographic groups using exchangeability assumption.

Result: Empirically validated on multiple datasets across domains, fully leveraging exchangeability.

Conclusion: Successfully adapts fairness auditing to federated learning through extended Conformal Fairness framework.

Abstract: Conformal Prediction (CP) is a widely used technique for quantifying uncertainty in machine learning models. In its standard form, CP offers probabilistic guarantees on the coverage of the true label, but it is agnostic to sensitive attributes in the dataset. Several recent works have sought to incorporate fairness into CP by ensuring conditional coverage guarantees across different subgroups. One such method is Conformal Fairness (CF). In this work, we extend the CF framework to the Federated Learning setting and discuss how we can audit a federated model for fairness by analyzing the fairness-related gaps for different demographic groups. We empirically validate our framework by conducting experiments on several datasets spanning multiple domains, fully leveraging the exchangeability assumption.

[1116] GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning

Umang Garg, Bowen Zhang, Anantanjit Subrahmanya, Chandrakanth Gudavalli, BS Manjunath

Main category: cs.LG

TL;DR: GPS-MTM is a foundation model for mobility data that decomposes trajectories into states and actions, using a bi-directional Transformer with masked modeling to learn semantic patterns without manual labels.

Details

Motivation: To leverage foundation model breakthroughs from text/vision/video domains for trajectory modeling and capture normal human movement patterns in large-scale mobility data.

Method: Decomposes trajectories into states (POI categories) and actions (transitions), uses bi-directional Transformer with self-supervised masked modeling to reconstruct missing segments across modalities.

Result: Outperforms benchmarks on trajectory infilling and next-stop prediction across Numosim-LA, Urban Anomalies, and Geolife datasets, with strongest advantages in dynamic tasks requiring contextual reasoning.

Conclusion: GPS-MTM establishes mobility data as a first-class modality for large-scale representation learning and serves as a robust foundation model for trajectory analytics.

Abstract: Foundation models have driven remarkable progress in text, vision, and video understanding, and are now poised to unlock similar breakthroughs in trajectory modeling. We introduce the GPSMasked Trajectory Transformer (GPS-MTM), a foundation model for large-scale mobility data that captures patterns of normalcy in human movement. Unlike prior approaches that flatten trajectories into coordinate streams, GPS-MTM decomposes mobility into two complementary modalities: states (point-of-interest categories) and actions (agent transitions). Leveraging a bi-directional Transformer with a self-supervised masked modeling objective, the model reconstructs missing segments across modalities, enabling it to learn rich semantic correlations without manual labels. Across benchmark datasets, including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM consistently outperforms on downstream tasks such as trajectory infilling and next-stop prediction. Its advantages are most pronounced in dynamic tasks (inverse and forward dynamics), where contextual reasoning is critical. These results establish GPS-MTM as a robust foundation model for trajectory analytics, positioning mobility data as a first-class modality for large-scale representation learning. Code is released for further reference.

[1117] Guided Manifold Alignment with Geometry-Regularized Twin Autoencoders

Jake S. Rhodes, Adam G. Rustad, Marshall S. Nielsen, Morgan Chase McClellan, Dallan Gardner, Dawson Hedges

Main category: cs.LG

TL;DR: A guided representation learning framework using twin autoencoders with geometry regularization to improve manifold alignment and enable out-of-sample generalization.

Details

Motivation: Traditional manifold alignment methods lack out-of-sample extension capability, limiting their real-world applicability.

Method: Geometry-regularized twin autoencoder architecture with structured cross-modal mappings, pre-trained alignment model, and multitask learning formulation.

Result: Improved embedding consistency, information preservation, cross-domain transfer, and enhanced Alzheimer’s disease diagnosis accuracy using multi-modal data.

Conclusion: The framework successfully enables generalization to unseen data while maintaining alignment fidelity and improves predictive performance in real-world applications.

Abstract: Manifold alignment (MA) involves a set of techniques for learning shared representations across domains, yet many traditional MA methods are incapable of performing out-of-sample extension, limiting their real-world applicability. We propose a guided representation learning framework leveraging a geometry-regularized twin autoencoder (AE) architecture to enhance MA while enabling generalization to unseen data. Our method enforces structured cross-modal mappings to maintain geometric fidelity in learned embeddings. By incorporating a pre-trained alignment model and a multitask learning formulation, we improve cross-domain generalization and representation robustness while maintaining alignment fidelity. We evaluate our approach using several MA methods, showing improvements in embedding consistency, information preservation, and cross-domain transfer. Additionally, we apply our framework to Alzheimer’s disease diagnosis, demonstrating its ability to integrate multi-modal patient data and enhance predictive accuracy in cases limited to a single domain by leveraging insights from the multi-modal problem.

[1118] Rethinking Large Language Model Distillation: A Constrained Markov Decision Process Perspective

Matthieu Zimmer, Xiaotong Ji, Tu Nguyen, Haitham Bou Ammar

Main category: cs.LG

TL;DR: A constrained reinforcement learning approach for LLM distillation that maximizes task rewards while constraining divergence from teacher model, achieving better constraint satisfaction and reasoning than baselines.

Details

Motivation: Existing LLM distillation methods use ad-hoc reward weighting; need principled optimization framework that balances task rewards with teacher model fidelity.

Method: Formulates distillation as constrained RL problem with modified reward function, maximizing task rewards while keeping divergence from teacher below threshold, avoiding dual Lagrangian methods.

Result: Achieves better constraint satisfaction rates and reasoning on mathematical reasoning tasks compared to soft Lagrangian baselines, while maintaining competitive task performance.

Conclusion: Provides theoretically grounded and efficient solution for reward-aware distillation in resource-constrained settings without teacher model access during deployment.

Abstract: We introduce a novel approach to large language model (LLM) distillation by formulating it as a constrained reinforcement learning problem. While recent work has begun exploring the integration of task-specific rewards into distillation processes, existing methods typically rely on ad-hoc reward weighting. We propose a principled optimization framework that maximizes task-specific rewards while constraining the divergence from the teacher model to remain below a specified threshold. Our approach adapts constrained state augmented reinforcement learning to the distillation setting, introducing a modified reward function that maintains theoretical guarantees of constraint satisfaction without requiring state augmentation or teacher model access during deployment and without the computational overhead of the dual Lagrangian methods. Through extensive experiments on mathematical reasoning tasks, we demonstrate that our method achieves better constraint satisfaction rates and better reasoning compared to the soft Lagrangian relaxation baselines while maintaining competitive task performance. Our framework provides a theoretically grounded and practically efficient solution for reward-aware distillation in resource-constrained settings.

[1119] MonoCon: A general framework for learning ultra-compact high-fidelity representations using monotonicity constraints

Shreyas Gokhale

Main category: cs.LG

TL;DR: MonoCon introduces functional constraints using a monotonic MLP head attached to pre-trained encoders to learn robust, disentangled, and ultra-compact embeddings with minimal performance loss.

Details

Motivation: To address the challenge of learning high-quality, robust, efficient, and disentangled representations in AI, moving beyond architectural and optimization constraints.

Method: Attach a small monotonic multi-layer perceptron (MLP) head to any pre-trained encoder and train with contrastive loss and monotonicity constraints for co-adaptation.

Result: On CIFAR-100: 9x more compact and 1.5x more robust embeddings while retaining 99% of baseline accuracy. On SNLI: 3.4x more compact and 1.4x more robust with marginal STSb score reduction.

Conclusion: MonoCon provides a general domain-agnostic framework that offers unified solutions for edge computing to cloud-scale retrieval through functionally constrained robust and ultra-compact representations.

Abstract: Learning high-quality, robust, efficient, and disentangled representations is a central challenge in artificial intelligence (AI). Deep metric learning frameworks tackle this challenge primarily using architectural and optimization constraints. Here, we introduce a third approach that instead relies on $\textit{functional}$ constraints. Specifically, we present MonoCon, a simple framework that uses a small monotonic multi-layer perceptron (MLP) head attached to any pre-trained encoder. Due to co-adaptation between encoder and head guided by contrastive loss and monotonicity constraints, MonoCon learns robust, disentangled, and highly compact embeddings at a practically negligible performance cost. On the CIFAR-100 image classification task, MonoCon yields representations that are nearly 9x more compact and 1.5x more robust than the fine-tuned encoder baseline, while retaining 99% of the baseline’s 5-NN classification accuracy. We also report a 3.4x more compact and 1.4x more robust representation on an SNLI sentence similarity task for a marginal reduction in the STSb score, establishing MonoCon as a general domain-agnostic framework. Crucially, these robust, ultra-compact representations learned via functional constraints offer a unified solution to critical challenges in disparate contexts ranging from edge computing to cloud-scale retrieval.

[1120] Compute-Optimal Quantization-Aware Training

Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun

Main category: cs.LG

TL;DR: The paper shows that the optimal ratio of QAT to FP training increases with total compute, and can be predicted using tokens-per-parameter-byte statistic. It introduces a scaling law for QAT planning and a novel cooldown+QAT fusion method for compute savings.

Details

Motivation: Previous work showed FP phase followed by QAT phase improves accuracy over QAT alone, but optimal compute allocation between phases remains unclear, especially across different model sizes and quantization widths.

Method: Extensive experiments with various compute budgets, QAT bit widths (86.0M to 2.2B models), derivation of loss scaling law, and proposed cooldown+QAT fusion that combines learning rate decay with quantization-aware training.

Result: Loss-optimal QAT/FP ratio increases with total compute; optimal fraction predictable via tokens-per-parameter-byte; scaling law predicts performance across strategies; cooldown+QAT fusion eliminates redundant FP updates with significant compute savings.

Conclusion: Findings enable efficient QAT planning and training of higher-quality quantized models with same compute budget through optimal phase allocation and novel fusion approach.

Abstract: Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.

[1121] Understanding SOAP from the Perspective of Gradient Whitening

Yanqing Lu, Letao Wang, Jinbo Liu

Main category: cs.LG

TL;DR: SOAP shows similar convergence to Shampoo and no significant advantage over Adam or Shampoo in final loss, aligning with theoretical equivalence between SOAP and Shampoo.

Details

Motivation: To analyze Adam, Shampoo, and SOAP from the perspective of gradient whitening and understand their preconditioners as approximations to the whitening matrix that captures second-order curvature information.

Method: Theoretical analysis establishing equivalence between idealized SOAP and Shampoo under Kronecker product assumption, followed by empirical evaluation using nanoGPT for language modeling and grayscale image colorization experiments.

Result: SOAP exhibits similar convergence rate as Shampoo, with no significant advantage over both Adam and Shampoo in the final loss achieved.

Conclusion: The empirical results align with the theoretical equivalence between SOAP and Shampoo, showing that SOAP doesn’t provide significant practical advantages over existing methods despite its promising initial appearance.

Abstract: Shampoo with Adam in the Preconditioner’s eigenbasis (SOAP) has recently emerged as a promising optimization algorithm for neural network training, achieving superior training efficiency over both Adam and Shampoo in language modeling tasks. In this work, we analyze Adam, Shampoo, and SOAP from the perspective of gradient whitening, interpreting their preconditioners as approximations to the whitening matrix, which captures second-order curvature information. We further establish a theoretical equivalence between idealized versions of SOAP and Shampoo under the Kronecker product assumption. To empirically evaluate these insights, we reproduce the language modeling experiments using nanoGPT and grayscale image colorization. Our results show that SOAP exhibits similar convergence rate as Shampoo, and no significant advantage over both Adam and Shampoo in the final loss achieved, which aligns with their equivalence in theory.

[1122] SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights

Lorenz K. Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, Lukas Cavigelli

Main category: cs.LG

TL;DR: SINQ introduces a second-axis scale factor and Sinkhorn-Knopp algorithm to normalize per-row/column variances, improving post-training quantization for LLMs at ≤4 bits by addressing outlier precision issues.

Details

Motivation: Current post-training quantization methods show perplexity degradation at ≤4 bits due to precision issues from outliers in parameters sharing the same scales, especially problematic for calibration-free uniform quantization.

Method: Augments existing quantizers with second-axis scale factor and Sinkhorn-Knopp algorithm to normalize per-row/column variances, minimizing matrix imbalance. Layer-independent and applicable to any linear layers.

Result: Significantly improves WikiText2 and C4 perplexity against uncalibrated uniform quantization baselines on Qwen3 and DeepSeek-V2.5 models, with further enhancement possible through calibration and non-uniform quantization.

Conclusion: SINQ provides an effective solution for low-precision quantization of LLMs, addressing outlier precision issues while maintaining layer independence and architectural flexibility.

Abstract: Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.

[1123] Meta-Learning Fourier Neural Operators for Hessian Inversion and Enhanced Variational Data Assimilation

Hamidreza Moazzami, Asma Jamali, Nicholas Kevlahan, Rodrigo A. Vargas-Hernández

Main category: cs.LG

TL;DR: A meta-learning framework using Fourier Neural Operator (FNO) to approximate the inverse Hessian in data assimilation problems, reducing computational costs and improving efficiency compared to standard conjugate gradient methods.

Details

Motivation: Variational data assimilation methods are computationally expensive, especially when Hessian information is involved, creating a need for more efficient approaches.

Method: Proposed FNO-CG approach that uses Fourier Neural Operator to approximate the inverse Hessian operator across DA problems, providing better initialization for conjugate gradient method.

Result: Numerical experiments on linear advection equation show 62% reduction in average relative error and 17% reduction in number of iterations compared to standard CG, with best improvements in ill-conditioned scenarios.

Conclusion: FNO-CG demonstrates robustness and efficiency for challenging data assimilation problems, particularly in ill-conditioned cases.

Abstract: Data assimilation (DA) is crucial for enhancing solutions to partial differential equations (PDEs), such as those in numerical weather prediction, by optimizing initial conditions using observational data. Variational DA methods are widely used in oceanic and atmospheric forecasting, but become computationally expensive, especially when Hessian information is involved. To address this challenge, we propose a meta-learning framework that employs the Fourier Neural Operator (FNO) to approximate the inverse Hessian operator across a family of DA problems, thereby providing an effective initialization for the conjugate gradient (CG) method. Numerical experiments on a linear advection equation demonstrate that the resulting FNO-CG approach reduces the average relative error by $62%$ and the number of iterations by $17%$ compared to the standard CG. These improvements are most pronounced in ill-conditioned scenarios, highlighting the robustness and efficiency of FNO-CG for challenging DA problems.

[1124] GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes

Valentyn Melnychuk, Stefan Feuerriegel

Main category: cs.LG

TL;DR: The paper introduces GDR-learners, a suite of generative Neyman-orthogonal models for estimating potential outcomes distributions with theoretical guarantees of quasi-oracle efficiency and double robustness.

Details

Motivation: Existing deep generative models for estimating potential outcomes distributions lack Neyman-orthogonality and its associated theoretical benefits like quasi-oracle efficiency and double robustness.

Method: Developed GDR-learners using four state-of-the-art deep generative models: conditional normalizing flows (GDR-CNFs), conditional generative adversarial networks (GDR-CGANs), conditional variational autoencoders (GDR-CVAEs), and conditional diffusion models (GDR-CDMs).

Result: GDR-learners demonstrate superior performance in estimating conditional distributions of potential outcomes compared to existing methods in (semi-)synthetic experiments.

Conclusion: The proposed GDR-learners provide asymptotically optimal estimation of potential outcomes distributions with desirable theoretical properties that existing methods lack.

Abstract: Various deep generative models have been proposed to estimate potential outcomes distributions from observational data. However, none of them have the favorable theoretical property of general Neyman-orthogonality and, associated with it, quasi-oracle efficiency and double robustness. In this paper, we introduce a general suite of generative Neyman-orthogonal (doubly-robust) learners that estimate the conditional distributions of potential outcomes. Our proposed GDR-learners are flexible and can be instantiated with many state-of-the-art deep generative models. In particular, we develop GDR-learners based on (a) conditional normalizing flows (which we call GDR-CNFs), (b) conditional generative adversarial networks (GDR-CGANs), (c) conditional variational autoencoders (GDR-CVAEs), and (d) conditional diffusion models (GDR-CDMs). Unlike the existing methods, our GDR-learners possess the properties of quasi-oracle efficiency and rate double robustness, and are thus asymptotically optimal. In a series of (semi-)synthetic experiments, we demonstrate that our GDR-learners are very effective and outperform the existing methods in estimating the conditional distributions of potential outcomes.

[1125] Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

Luke Guerdan, Justin Whitehouse, Kimberly Truong, Kenneth Holstein, Zhiwei Steven Wu

Main category: cs.LG

TL;DR: This paper proposes a doubly-robust estimation framework that combines LLM-generated persona ratings with biased human ratings to produce valid system quality estimates for Generative AI evaluations, addressing evaluation sampling bias.

Details

Motivation: To address external validity concerns in GenAI evaluations where lab-based assessments don't generalize to real-world deployment due to sampling bias in human rater selection and system outputs.

Method: Uses a doubly-robust estimation framework that combines imperfect LLM-generated persona ratings (simulating human raters with specific sociodemographic characteristics) with biased human ratings, requiring either good prediction models or proper reweighting for valid estimates.

Result: The framework produces statistically valid system quality estimates when either the rating prediction model or sampling bias correction model is sufficiently accurate, validated through theoretical analysis and Persona Simulation Framework experiments.

Conclusion: Provides a principled approach for combining imperfect persona ratings with biased human ratings to obtain valid system quality estimates, addressing evaluation sampling bias in GenAI assessments.

Abstract: As Generative AI (GenAI) systems see growing adoption, a key concern involves the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of “persona” ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.

[1126] Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Haitong Ma, Ofir Nabati, Aviv Rosenberg, Bo Dai, Oran Lang, Idan Szpektor, Craig Boutilier, Na Li, Shie Mannor, Lior Shani, Guy Tenneholtz

Main category: cs.LG

TL;DR: A novel framework using discrete diffusion models as policies for RL in large combinatorial action spaces, achieving state-of-the-art performance through stable online training and policy mirror descent.

Details

Motivation: Reinforcement learning struggles with large combinatorial action spaces common in real-world problems, requiring more scalable and stable approaches.

Method: Uses discrete diffusion models as policies with efficient online training, leveraging policy mirror descent to define regularized target policy distributions and framing policy updates as distributional matching problems.

Result: Achieves state-of-the-art results and superior sample efficiency across diverse combinatorial benchmarks including DNA sequence generation, RL with macro-actions, and multi-agent systems.

Conclusion: The diffusion policy framework effectively addresses scalability challenges in RL for combinatorial action spaces, demonstrating superior performance compared to other baselines through stable and efficient training.

Abstract: Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.

[1127] Functional Critic Modeling for Provably Convergent Off-Policy Actor-Critic

Qinxun Bai, Yuxuan Han, Wei Xu, Zhengyuan Zhou

Main category: cs.LG

TL;DR: Proposes functional critic modeling to address challenges in off-policy actor-critic RL, providing provable convergence and practical neural network implementation.

Details

Motivation: Off-policy actor-critic methods face two key challenges: the "moving target" problem in critic learning and inefficient actor learning due to difficulty estimating exact off-policy policy gradients.

Method: Introduces functional critic modeling concept, provides theoretical analysis in linear function setting, and designs neural network architecture for practical implementation.

Result: Develops the first provably convergent off-policy target-based AC algorithm and demonstrates effectiveness on DeepMind Control Benchmark tasks.

Conclusion: The functional critic modeling framework successfully addresses both challenges in off-policy actor-critic learning under the deadly triad setting.

Abstract: Off-policy reinforcement learning (RL) with function approximation offers an effective way to improve sample efficiency by reusing past experience. Within this setting, the actor-critic (AC) framework has achieved strong empirical success. However, both the critic and actor learning is challenging for the off-policy AC methods: first of all, in addition to the classic “deadly triad” instability of off-policy evaluation, it also suffers from a “moving target” problem, where the policy being evaluated changes continually; secondly, actor learning becomes less efficient due to the difficulty of estimating the exact off-policy policy gradient. The first challenge essentially reduces the problem to repeatedly performing off-policy evaluation for changing policies. For the second challenge, the off-policy policy gradient theorem requires a complex and often impractical algorithm to estimate an additional emphasis critic, which is typically neglected in practice, thereby reducing to the on-policy policy gradient as an approximation. In this work, we introduce a novel concept of functional critic modeling, which leads to a new AC framework that addresses both challenges for actor-critic learning under the deadly triad setting. We provide a theoretical analysis in the linear function setting, establishing the provable convergence of our framework, which, to the best of our knowledge, is the first convergent off-policy target-based AC algorithm. From a practical perspective, we further propose a carefully designed neural network architecture for the functional critic modeling and demonstrate its effectiveness through preliminary experiments on widely used RL tasks from the DeepMind Control Benchmark.

[1128] Shape-Informed Clustering of Multi-Dimensional Functional Data via Deep Functional Autoencoders

Samuel V. Singh, Shirley Coyle, Mimi Zhang

Main category: cs.LG

TL;DR: FAEclust is a functional autoencoder framework for clustering multi-dimensional functional data, featuring universal-approximator encoder/decoder, innovative regularization, clustering loss integration, and phase-variation resistant shape-informed clustering.

Details

Motivation: To develop a robust framework for cluster analysis of multi-dimensional functional data that can handle complex nonlinear interdependencies and is resistant to phase variations.

Method: Functional autoencoder with universal-approximator encoder and decoder, innovative regularization strategies for functional weights/biases, clustering loss integration, and shape-informed clustering objective for phase variation resistance.

Result: The framework establishes universal approximation property for the nonlinear decoder and demonstrates effectiveness through extensive experiments.

Conclusion: FAEclust provides an effective solution for clustering multi-dimensional functional data with robustness to phase variations and complex interdependencies.

Abstract: We introduce FAEclust, a novel functional autoencoder framework for cluster analysis of multi-dimensional functional data, data that are random realizations of vector-valued random functions. Our framework features a universal-approximator encoder that captures complex nonlinear interdependencies among component functions, and a universal-approximator decoder capable of accurately reconstructing both Euclidean and manifold-valued functional data. Stability and robustness are enhanced through innovative regularization strategies applied to functional weights and biases. Additionally, we incorporate a clustering loss into the network’s training objective, promoting the learning of latent representations that are conducive to effective clustering. A key innovation is our shape-informed clustering objective, ensuring that the clustering results are resistant to phase variations in the functions. We establish the universal approximation property of our non-linear decoder and validate the effectiveness of our model through extensive experiments.

[1129] OptiMind: Teaching LLMs to Think Like Optimization Experts

Zeyi Chen, Xinzhi Zhang, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, Sirui Li

Main category: cs.LG

TL;DR: This paper presents a systematic approach to improve LLM-based mathematical programming formulation by integrating optimization expertise through data cleaning and multi-turn inference strategies.

Details

Motivation: Current LLM approaches for translating natural language to mathematical programs achieve limited accuracy due to scarce/noisy training data and lack of domain knowledge, despite the fundamental importance of mathematical programming across domains.

Method: 1) Clean training data through class-based error analysis to prevent common mistakes; 2) Develop multi-turn inference strategies with class-specific error summaries and solver feedback for iterative refinement.

Result: The approach improves formulation accuracy by 14 percentage points on average across multiple base LLMs, specifically for mixed-integer linear programming problems.

Conclusion: Combining cleaned data with domain-informed prompting and feedback enables significant progress toward robust LLM-assisted optimization formulation.

Abstract: Mathematical programming – the task of expressing operations and decision-making problems in precise mathematical language – is fundamental across domains, yet remains a skill-intensive process requiring operations research expertise. Recent advances in large language models for complex reasoning have spurred interest in automating this task, translating natural language into executable optimization models. Current approaches, however, achieve limited accuracy, hindered by scarce and noisy training data without leveraging domain knowledge. In this work, we systematically integrate optimization expertise to improve formulation accuracy for mixed-integer linear programming, a key family of mathematical programs. Our approach first cleans training data through class-based error analysis to explicitly prevent common mistakes within each optimization class. We then develop multi-turn inference strategies that guide LLMs with class-specific error summaries and solver feedback, enabling iterative refinement. Experiments across multiple base LLMs demonstrate that combining cleaned data with domain-informed prompting and feedback improves formulation accuracy by 14 percentage points on average, enabling further progress toward robust LLM-assisted optimization formulation.

[1130] MDP modeling for multi-stage stochastic programs

David P. Morton, Oscar Dowson, Bernardo K. Pagnoncelli

Main category: cs.LG

TL;DR: The paper extends policy graphs to handle decision-dependent uncertainty in transition probabilities and incorporates statistical learning for multi-stage stochastic programs with continuous state/action spaces.

Details

Motivation: To address limitations in modeling multi-stage stochastic programs that combine features from Markov decision processes, particularly for problems with continuous state/action spaces and decision-dependent uncertainty.

Method: Extends policy graphs to include decision-dependent uncertainty for transition probabilities and statistical learning; develops new variants of stochastic dual dynamic programming with approximations for non-convexities.

Result: The approach demonstrates increased expressiveness through examples of increasing complexity, showing capability to handle more realistic modeling scenarios.

Conclusion: The proposed framework successfully extends policy graphs to incorporate decision-dependent uncertainty and statistical learning, with new SDDP variants providing solution methods for complex multi-stage stochastic programs.

Abstract: We study a class of multi-stage stochastic programs, which incorporate modeling features from Markov decision processes (MDPs). This class includes structured MDPs with continuous state and action spaces. We extend policy graphs to include decision-dependent uncertainty for one-step transition probabilities as well as a limited form of statistical learning. We focus on the expressiveness of our modeling approach, illustrating ideas with a series of examples of increasing complexity. As a solution method, we develop new variants of stochastic dual dynamic programming, including approximations to handle non-convexities.

[1131] T-TAMER: Provably Taming Trade-offs in ML Serving

Yuanyuan Yang, Ruimin Zhang, Jamie Morgenstern, Haifeng Xu

Main category: cs.LG

TL;DR: T-Tamer is a framework that formalizes multi-model serving as a multi-stage decision process, proving that recall (ability to revisit earlier models) is necessary and sufficient for achieving optimal accuracy-latency trade-offs.

Details

Motivation: Current strategies for multi-model serving are largely heuristic and case-specific, lacking theoretical guarantees and general applicability despite the importance of balancing accuracy, latency, and resource usage in growing ML models.

Method: Formalizes the setting as a multi-stage decision process where the objective is to determine both when to exit and which model to consult. Proves that recall-based strategies are necessary for optimal performance.

Result: Shows that strategies without recall cannot achieve constant-factor approximation to optimal trade-off, while recall-based strategies provably attain optimal trade-off in polynomial time. Experiments on synthetic datasets and vision/NLP benchmarks validate the approach.

Conclusion: Provides a principled foundation for bridging heuristic practice with theoretical guarantees in early-exit and cascaded models, demonstrating that recall-based strategies consistently yield efficient accuracy-latency trade-offs.

Abstract: As machine learning models continue to grow in size and complexity, efficient serving faces increasingly broad trade-offs spanning accuracy, latency, resource usage, and other objectives. Multi-model serving further complicates these trade-offs; for example, in cascaded models, each early-exit decision balances latency reduction against potential accuracy loss. Despite the pervasiveness and importance of such trade-offs, current strategies remain largely heuristic and case-specific, limiting both their theoretical guarantees and general applicability. We present a general framework, T-Tamer, which formalizes this setting as a multi-stage decision process, where the objective is to determine both when to exit and which model to consult. Our main result shows that recall (i.e., the ability to revisit earlier models) is both necessary and sufficient for achieving provable performance guarantees. In particular, we prove that strategies without recall cannot obtain any constant-factor approximation to the optimal trade-off, whereas recall-based strategies provably attain the optimal trade-off in polynomial time. We validate our analysis through experiments on synthetic datasets and early-exit workloads for vision and NLP benchmarks. The results show that recall-based strategies consistently yield efficient accuracy-latency trade-offs. We hope this work provides a principled foundation for bridging heuristic practice with theoretical guarantees in the design of early-exit and cascaded models.

[1132] Training-Free Multimodal Guidance for Video to Audio Generation

Eleonora Grassucci, Giuliano Galadini, Giordano Cicchetti, Aurelio Uncini, Fabio Antonacci, Danilo Comminiello

Main category: cs.LG

TL;DR: A training-free multimodal guidance mechanism for video-to-audio generation that uses modality embeddings to enforce unified alignment across video, audio, and text without requiring retraining of diffusion models.

Details

Motivation: Existing video-to-audio generation approaches require costly joint training on large datasets or rely on pairwise similarities that may not capture global multimodal coherence.

Method: Proposes multimodal diffusion guidance (MDG) that leverages the volume spanned by modality embeddings to enforce unified alignment across video, audio, and text as a plug-and-play control signal for pretrained audio diffusion models.

Result: Experiments on VGGSound and AudioCaps show MDG consistently improves perceptual quality and multimodal alignment compared to baselines.

Conclusion: The proposed joint multimodal guidance effectively enhances video-to-audio generation without requiring retraining, proving the value of unified multimodal alignment.

Abstract: Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Although the excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, proving the effectiveness of a joint multimodal guidance for V2A.

[1133] Analysis of Variational Autoencoders

Zachary Baker, Yuxiao Li

Main category: cs.LG

TL;DR: Variational Sparse Autoencoder (vSAE) with probabilistic sampling underperforms standard SAE across core metrics, showing excessive regularization that reduces living features and degrades performance.

Details

Motivation: To investigate if incorporating variational methods into Sparse Autoencoders can improve feature organization and interpretability by creating dispersive pressure for more coherent latent space organization.

Method: Introduced vSAE that replaces deterministic ReLU gating with stochastic sampling from learned Gaussian posteriors and incorporates KL divergence regularization toward standard normal prior. Evaluated against standard TopK SAE on Pythia-70M transformer activations using SAE Bench, feature interpretability analysis, and t-SNE visualization.

Result: vSAE underperformed standard SAE across core evaluation metrics, though it excelled at feature independence and ablation metrics. KL divergence created excessive regularization that substantially reduced living features and caused performance degradation. vSAE features showed improved robustness but many more dead features than baseline.

Conclusion: Naive application of variational methods to SAEs does not improve feature organization or interpretability, as the excessive regularization pressure from KL divergence harms performance despite some benefits in feature independence.

Abstract: Sparse Autoencoders (SAEs) have emerged as a promising approach for interpreting neural network representations by learning sparse, human-interpretable features from dense activations. We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability. We introduce the variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with stochastic sampling from learned Gaussian posteriors and incorporates KL divergence regularization toward a standard normal prior. Our hypothesis is that this probabilistic sampling creates dispersive pressure, causing features to organize more coherently in the latent space while avoiding overlap. We evaluate a Topk vSAE against a standard TopK SAE on Pythia-70M transformer residual steam activations using comprehensive benchmarks including SAE Bench, individual feature interpretability analysis, and global latent space visualization through t-SNE. The vSAE underperforms standard SAE across core evaluation metrics, though excels at feature independence and ablation metrics. The KL divergence term creates excessive regularization pressure that substantially reduces the fraction of living features, leading to observed performance degradation. While vSAE features demonstrate improved robustness, they exhibit many more dead features than baseline. Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.

[1134] Sample-efficient Multiclass Calibration under $\ell_{p}$ Error

Konstantina Bairaktari, Huy L. Nguyen

Main category: cs.LG

TL;DR: Proposes a new calibration error definition that interpolates between two established notions, with polynomial sample complexity for most cases and improved error dependence at one endpoint.

Details

Motivation: Multiclass predictor calibration is challenging due to exponential number of possible prediction values, requiring better calibration error definitions and algorithms.

Method: Novel calibration error definition interpolating between two established notions; algorithm using adaptive data analysis with logarithmic overhead in sample complexity.

Result: Can calibrate predictors for entire interpolation range (except one endpoint) with polynomial samples; achieves nearly optimal error dependence at other endpoint.

Conclusion: Provides efficient calibration for multiclass predictors with improved sample complexity and error bounds through novel calibration error definition and adaptive analysis techniques.

Abstract: Calibrating a multiclass predictor, that outputs a distribution over labels, is particularly challenging due to the exponential number of possible prediction values. In this work, we propose a new definition of calibration error that interpolates between two established calibration error notions, one with known exponential sample complexity and one with polynomial sample complexity for calibrating a given predictor. Our algorithm can calibrate any given predictor for the entire range of interpolation, except for one endpoint, using only a polynomial number of samples. At the other endpoint, we achieve nearly optimal dependence on the error parameter, improving upon previous work. A key technical contribution is a novel application of adaptive data analysis with high adaptivity but only logarithmic overhead in the sample complexity.

[1135] Physically Plausible Multi-System Trajectory Generation and Symmetry Discovery

Jiayin Liu, Yulong Yang, Vineet Bansal, Christine Allen-Blanchette

Main category: cs.LG

TL;DR: SPS-GAN is a novel neural network model that captures dynamics of multiple systems and generalizes to unseen physical parameters without requiring prior knowledge of system configuration space, using a Hamiltonian neural network embedded in a conditional GAN architecture.

Details

Motivation: Existing neural network models based on classical mechanics typically capture only single systems with fixed parameters and require known configuration spaces. There's a need for models that can handle multiple systems, generalize to unseen parameters, and discover configuration space structure from arbitrary measurements.

Method: Embed a Hamiltonian neural network recurrent module in a conditional GAN backbone, optimizing with an additional physically motivated term that encourages sparse representation of configuration space. Can discover configuration space structure from various measurement types including state-space measurements and video frames.

Result: SPS-GAN captures multiple systems and achieves performance comparable to supervised models designed for single systems. Demonstrates utility for trajectory prediction, video generation, and symmetry discovery.

Conclusion: The proposed SPS-GAN successfully addresses limitations of existing physics-inspired neural networks by enabling multi-system dynamics capture, generalization to unseen parameters, and automatic discovery of configuration space structure from arbitrary measurements.

Abstract: From metronomes to celestial bodies, mechanics underpins how the world evolves in time and space. With consideration of this, a number of recent neural network models leverage inductive biases from classical mechanics to encourage model interpretability and ensure forecasted states are physical. However, in general, these models are designed to capture the dynamics of a single system with fixed physical parameters, from state-space measurements of a known configuration space. In this paper we introduce Symplectic Phase Space GAN (SPS-GAN) which can capture the dynamics of multiple systems, and generalize to unseen physical parameters from. Moreover, SPS-GAN does not require prior knowledge of the system configuration space. In fact, SPS-GAN can discover the configuration space structure of the system from arbitrary measurement types (e.g., state-space measurements, video frames). To achieve physically plausible generation, we introduce a novel architecture which embeds a Hamiltonian neural network recurrent module in a conditional GAN backbone. To discover the structure of the configuration space, we optimize the conditional time-series GAN objective with an additional physically motivated term to encourages a sparse representation of the configuration space. We demonstrate the utility of SPS-GAN for trajectory prediction, video generation and symmetry discovery. Our approach captures multiple systems and achieves performance on par with supervised models designed for single systems.

[1136] Moving Out: Physically-grounded Human-AI Collaboration

Xuhui Kang, Sung-Wook Lee, Haolin Liu, Yuyan Wang, Yen-Ling Kuo

Main category: cs.LG

TL;DR: Moving Out is a new human-AI collaboration benchmark that tests agents’ ability to adapt to physical constraints and diverse human behaviors in continuous action spaces, with BASS method showing superior performance.

Details

Motivation: Physical human-AI collaboration requires adaptation to continuous state-action spaces and constrained dynamics caused by physical constraints, which existing methods struggle with.

Method: Proposed BASS (Behavior Augmentation, Simulation, and Selection) to enhance agent diversity and action outcome understanding through data augmentation and simulation.

Result: BASS outperforms state-of-the-art models in both AI-AI and human-AI collaboration tasks on the Moving Out benchmark.

Conclusion: The Moving Out benchmark and BASS method effectively address physical collaboration challenges, showing improved adaptation to diverse human behaviors and physical constraints.

Abstract: The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. In this paper, we introduce Moving Out, a new human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and maintaining consistent actions to move a big item around a corner. Using Moving Out, we designed two tasks and collected human-human interaction data to evaluate models’ abilities to adapt to diverse human behaviors and unseen physical attributes. To address the challenges in physical environments, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. Our experiments show that BASS outperforms state-of-the-art models in AI-AI and human-AI collaboration. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.

[1137] T-MLP: Tailed Multi-Layer Perceptron for Level-of-Detail Signal Representation

Chuanxiang Yang, Yuanfeng Zhou, Guangshun Wei, Siyu Ren, Yuan Liu, Junhui Hou, Wenping Wang

Main category: cs.LG

TL;DR: T-MLP extends MLP with output branches at each hidden layer to enable level-of-detail signal representation from single-resolution supervision.

Details

Motivation: Standard MLPs lack native level-of-detail support for efficient signal modeling and transmission, requiring a multi-scale architecture.

Method: Tailed Multi-Layer Perceptron (T-MLP) attaches output branches to each hidden layer that refine residuals between current predictions and ground-truth signals.

Result: T-MLP outperforms existing neural LoD baselines across diverse signal representation tasks.

Conclusion: The proposed T-MLP architecture successfully enables level-of-detail signal representation with only single-resolution supervision.

Abstract: Level-of-detail (LoD) representation is critical for efficiently modeling and transmitting various types of signals, such as images and 3D shapes. In this work, we propose a novel network architecture that enables LoD signal representation. Our approach builds on a modified Multi-Layer Perceptron (MLP), which inherently operates at a single scale and thus lacks native LoD support. Specifically, we introduce the Tailed Multi-Layer Perceptron (T-MLP), which extends the MLP by attaching an output branch, also called tail, to each hidden layer. Each tail refines the residual between the current prediction and the ground-truth signal, so that the accumulated outputs across layers correspond to the target signals at different LoDs, enabling multi-scale modeling with supervision from only a single-resolution signal. Extensive experiments demonstrate that our T-MLP outperforms existing neural LoD baselines across diverse signal representation tasks.

[1138] MoE-PHDS: One MoE checkpoint for flexible runtime sparsity

Lauren. A Hannah, Soheil Zibakhsh, Kumari Nishu, Arnav Kundu, Mohammad Samragh Razlighi, Mehrdad Farajtabar, Minsik Cho

Main category: cs.LG

TL;DR: MoE-PHDS enables runtime sparsity control for Mixture of Experts models through lightweight fine-tuning, allowing practitioners to adjust sparsity levels at inference without model swapping.

Details

Motivation: Current sparse MoEs require training multiple models for different sparsity levels, increasing serving complexity and costs. This limits flexibility in meeting diverse latency and efficiency requirements.

Method: MoE-PHDS uses a lightweight SFT method with mixed training across sparsity levels and a curriculum at high sparsity, requiring no architectural changes.

Result: PHDS matches or exceeds oracle models, improves cross-sparsity agreement by up to 22%, and enables flexible runtime deployment by making global sparsity a serving primitive.

Conclusion: MoE-PHDS provides predictable accuracy/latency tradeoffs from a single model, allowing practitioners to dynamically adjust sparsity at inference time without architectural changes.

Abstract: Sparse Mixtures of Experts (MoEs) are typically trained to operate at a fixed sparsity level, e.g. $k$ in a top-$k$ gating function. This global sparsity level determines an operating point on the accuracy/latency curve; currently, meeting multiple efficiency targets means training and maintaining multiple models. This practice complicates serving, increases training and maintenance costs, and limits flexibility in meeting diverse latency, efficiency, and energy requirements. We show that pretrained MoEs are more robust to runtime sparsity shifts than commonly assumed, and introduce MoE-PHDS ({\bf P}ost {\bf H}oc {\bf D}eclared {\bf S}parsity), a lightweight SFT method that turns a single checkpoint into a global sparsity control surface. PHDS mixes training across sparsity levels and anchors with a short curriculum at high sparsity, requiring no architectural changes. The result is predictable accuracy/latency tradeoffs from one model: practitioners can ``dial $k$’’ at inference time without swapping checkpoints, changing architecture, or relying on token-level heuristics. Experiments on OLMoE-1B-7B-0125, Qwen1.5-MoE-A2.7B, and proprietary models fit on multiple operating points show that PHDS matches or exceeds well-specified oracle models, improves cross-sparsity agreement by up to 22% vs. well-specified oracle models, and enables simplified, flexible runtime MoE deployment by making global sparsity a first-class serving primitive.

[1139] Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula

Brennen Hill

Main category: cs.LG

TL;DR: Adversarial co-evolution between procedurally generated attackers and cooperative defenders creates self-scaling environments for training intelligent agents, driving emergence of complex behaviors without manual environment design.

Details

Motivation: Current hand-crafted environments are finite and biased, limiting development of generalizable agent skills. Scaling environmental complexity, diversity, and interactivity is crucial for advancing general-purpose intelligent agents.

Method: Framing environment generation as adversarial game: cooperative multi-agent defenders learn to survive against procedurally generative attacker that creates increasingly challenging enemy configurations tailored to exploit defenders’ weaknesses.

Result: Minimal training leads to emergence of complex intelligent behaviors - attacker develops flanking and shielding, defenders develop focus-fire and spreading strategies. Creates effectively infinite stream of novel training data.

Conclusion: Adversarial co-evolution is a powerful mechanism for automatically scaling environmental complexity, driving agents towards greater robustness and strategic depth.

Abstract: The advancement of general-purpose intelligent agents is intrinsically linked to the environments in which they are trained. While scaling models and datasets has yielded remarkable capabilities, scaling the complexity, diversity, and interactivity of environments remains a crucial bottleneck. Hand-crafted environments are finite and often contain implicit biases, limiting the potential for agents to develop truly generalizable and robust skills. In this work, we propose a paradigm for generating a boundless and adaptive curriculum of challenges by framing the environment generation process as an adversarial game. We introduce a system where a team of cooperative multi-agent defenders learns to survive against a procedurally generative attacker. The attacker agent learns to produce increasingly challenging configurations of enemy units, dynamically creating novel worlds tailored to exploit the defenders’ current weaknesses. Concurrently, the defender team learns cooperative strategies to overcome these generated threats. This co-evolutionary dynamic creates a self-scaling environment where complexity arises organically from the adversarial interaction, providing an effectively infinite stream of novel and relevant training data. We demonstrate that with minimal training, this approach leads to the emergence of complex, intelligent behaviors, such as flanking and shielding by the attacker, and focus-fire and spreading by the defenders. Our findings suggest that adversarial co-evolution is a powerful mechanism for automatically scaling environmental complexity, driving agents towards greater robustness and strategic depth.

[1140] On the Sheafification of Higher-Order Message Passing

Jacob Hume, Pietro Liò

Main category: cs.LG

TL;DR: This paper proposes using sheaf theory to enhance higher-order message passing in topological deep learning, addressing limitations of the Hodge Laplacian by developing a more expressive sheaf Laplacian framework.

Details

Motivation: The Hodge Laplacian's inductive bias becomes opaque and potentially degenerate in higher dimensions (k>0), limiting its effectiveness for higher-order message passing in topological deep learning.

Method: Develops sheaf theory as a principled formalism to modify the Hodge Laplacian, creating a sheaf Laplacian that correlates dimension-k data features with dimension-k sheaf cohomology.

Result: Provides novel theory and practice for higher-order sheaf diffusion, extending prior graph learning approaches to more complex topological structures.

Conclusion: Sheaf theory offers a natural and expressive framework for improving higher-order message passing in topological deep learning by generalizing beyond the limitations of the standard Hodge Laplacian.

Abstract: Recent work in Topological Deep Learning (TDL) seeks to generalize graph learning’s preeminent $message \ passing$ paradigm to more complex relational structures: simplicial complexes, cell complexes, hypergraphs, and combinations thereof. Many approaches to such ${higher\text{-}order \ message \ passing}$ (HOMP) admit formulation in terms of nonlinear diffusion with the Hodge (combinatorial) Laplacian, a graded operator which carries an inductive bias that dimension-$k$ data features correlate with dimension-$k$ topological features encoded in the (singular) cohomology of the underlying domain. For $k=0$ this recovers the graph Laplacian and its well-studied homophily bias. In higher gradings, however, the Hodge Laplacian’s bias is more opaque and potentially even degenerate. In this essay, we position sheaf theory as a natural and principled formalism for modifying the Hodge Laplacian’s diffusion-mediated interface between local and global descriptors toward more expressive message passing. The sheaf Laplacian’s inductive bias correlates dimension-$k$ data features with dimension-$k$ $sheaf$ cohomology, a data-aware generalization of singular cohomology. We will contextualize and novelly extend prior theory on sheaf diffusion in graph learning ($k=0$) in such a light – and explore how it fails to generalize to $k>0$ – before developing novel theory and practice for the higher-order setting. Our exposition is accompanied by a self-contained introduction shepherding sheaves from the abstract to the applied.

[1141] Tracing the Representation Geometry of Language Models from Pretraining to Post-training

Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, Blake A. Richards

Main category: cs.LG

TL;DR: The paper identifies three geometric phases in LLM training: initial representational collapse, followed by entropy-seeking expansion, then compression-seeking consolidation that improves downstream performance. Post-training methods like SFT/DPO drive entropy-seeking while RLVR induces compression-seeking.

Details

Motivation: Standard training metrics fail to explain emergent capabilities in LLMs, so the authors investigate the geometry of learned representations to understand training dynamics.

Method: Spectral analysis using effective rank (RankMe) and eigenspectrum decay (α-ReQ) on OLMo (1B-7B) and Pythia (160M-12B) models across pretraining and post-training phases.

Result: Identified consistent non-monotonic sequence of three geometric phases during pretraining: warmup (collapse), entropy-seeking (expansion with peak n-gram memorization), compression-seeking (anisotropic consolidation with downstream performance improvement). Post-training methods show distinct geometric transformations.

Conclusion: Training phases emerge from cross-entropy optimization under skewed token frequencies and representational bottlenecks. Different post-training methods drive either entropy-seeking (SFT/DPO) or compression-seeking (RLVR) dynamics with trade-offs between performance and robustness/diversity.

Abstract: Standard training metrics like loss fail to explain the emergence of complex capabilities in large language models. We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training, measuring effective rank (RankMe) and eigenspectrum decay ($\alpha$-ReQ). With OLMo (1B-7B) and Pythia (160M-12B) models, we uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. The initial “warmup” phase exhibits rapid representational collapse. This is followed by an “entropy-seeking” phase, where the manifold’s dimensionality expands substantially, coinciding with peak n-gram memorization. Subsequently, a “compression-seeking” phase imposes anisotropic consolidation, selectively preserving variance along dominant eigendirections while contracting others, a transition marked with significant improvement in downstream task performance. We show these phases can emerge from a fundamental interplay of cross-entropy optimization under skewed token frequencies and representational bottlenecks ($d \ll |V|$). Post-training further transforms geometry: SFT and DPO drive “entropy-seeking” dynamics to integrate specific instructional or preferential data, improving in-distribution performance while degrading out-of-distribution robustness. Conversely, RLVR induces “compression-seeking”, enhancing reward alignment but reducing generation diversity.

[1142] Understanding Catastrophic Interference On the Identifibility of Latent Representations

Yuke Li, Yujia Zheng, Tianyi Xiong, Zhenyi Wang, Heng Huang

Main category: cs.LG

TL;DR: The paper proposes a novel theoretical framework that formulates catastrophic interference as an identification problem and introduces a method with two-stage training to mitigate forgetting by identifying shared latent variables.

Details

Motivation: To better understand and model catastrophic interference (catastrophic forgetting) from a latent representation learning perspective, where trained models lose performance on previously learned tasks when adapting to new ones.

Method: Proposes a two-stage training strategy: first uses maximum likelihood estimation to learn latent representations from partial-task aware (PTA) and all-task aware (ATA) setups, then optimizes KL divergence to identify and learn shared latent variables.

Result: Theoretical analysis shows forgetting can be quantified by distance between PTA and ATA setups, and empirical validations demonstrate that identifying shared representations effectively mitigates catastrophic interference.

Conclusion: Identifying and learning shared latent variables between task configurations provides both theoretical guarantees and practical performance improvements for mitigating catastrophic interference in machine learning systems.

Abstract: Catastrophic interference, also known as catastrophic forgetting, is a fundamental challenge in machine learning, where a trained learning model progressively loses performance on previously learned tasks when adapting to new ones. In this paper, we aim to better understand and model the catastrophic interference problem from a latent representation learning point of view, and propose a novel theoretical framework that formulates catastrophic interference as an identification problem. Our analysis demonstrates that the forgetting phenomenon can be quantified by the distance between partial-task aware (PTA) and all-task aware (ATA) setups. Building upon recent advances in identifiability theory, we prove that this distance can be minimized through identification of shared latent variables between these setups. When learning, we propose our method \ourmeos with two-stage training strategy: First, we employ maximum likelihood estimation to learn the latent representations from both PTA and ATA configurations. Subsequently, we optimize the KL divergence to identify and learn the shared latent variables. Through theoretical guarantee and empirical validations, we establish that identifying and learning these shared representations can effectively mitigate catastrophic interference in machine learning systems. Our approach provides both theoretical guarantees and practical performance improvements across both synthetic and benchmark datasets.

[1143] DPFNAS: Differential Privacy-Enhanced Federated Neural Architecture Search for 6G Edge Intelligence

Yang Lv, Jin Cao, Ben Niu, Zhe Sun, Fengwei Wang, Fenghua Li, Hui Li

Main category: cs.LG

TL;DR: A novel federated learning framework combining personalized differential privacy and adaptive model design to address data sensitivity and heterogeneity in 6G edge networks, achieving strong privacy protection while significantly improving model performance and efficiency.

Details

Motivation: To enable pervasive AI in 6G networks through edge intelligence while addressing key challenges: parameter sharing risks data reconstruction attacks, and unified global models struggle with diverse local data distributions.

Method: Integrates personalized differential privacy using sample-level representations for knowledge sharing, and develops a privacy-aware neural architecture search algorithm to generate locally customized architectures and hyperparameters under privacy constraints.

Result: Achieves strong privacy guarantees while significantly outperforming state-of-the-art methods. On CIFAR-10 and CIFAR-100, improves accuracy by 6.82% over PerFedRLNAS, reduces model size to 1/10 and communication cost to 1/20.

Conclusion: The proposed framework is the first personalized DP solution for representation-based FL with theoretical convergence guarantees, successfully balancing privacy protection with model performance and efficiency in edge computing environments.

Abstract: The Sixth-Generation (6G) network envisions pervasive artificial intelligence (AI) as a core goal, enabled by edge intelligence through on-device data utilization. To realize this vision, federated learning (FL) has emerged as a key paradigm for collaborative training across edge devices. However, the sensitivity and heterogeneity of edge data pose key challenges to FL: parameter sharing risks data reconstruction, and a unified global model struggles to adapt to diverse local distributions. In this paper, we propose a novel federated learning framework that integrates personalized differential privacy (DP) and adaptive model design. To protect training data, we leverage sample-level representations for knowledge sharing and apply a personalized DP strategy to resist reconstruction attacks. To ensure distribution-aware adaptation under privacy constraints, we develop a privacy-aware neural architecture search (NAS) algorithm that generates locally customized architectures and hyperparameters. To the best of our knowledge, this is the first personalized DP solution tailored for representation-based FL with theoretical convergence guarantees. Our scheme achieves strong privacy guarantees for training data while significantly outperforming state-of-the-art methods in model performance. Experiments on benchmark datasets such as CIFAR-10 and CIFAR-100 demonstrate that our scheme improves accuracy by 6.82% over the federated NAS method PerFedRLNAS, while reducing model size to 1/10 and communication cost to 1/20.

[1144] GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Javad Forough, Mohammad Maheri, Hamed Haddadi

Main category: cs.LG

TL;DR: GuardNet is a hierarchical filtering framework that detects and filters jailbreak prompts in LLMs using graph neural networks on structured graphs combining sequential, syntactic, and attention-based token relations.

Details

Motivation: LLMs are vulnerable to jailbreak attacks that bypass safety constraints, posing critical risks in domains like healthcare, finance, and legal compliance by enabling unauthorized or harmful behaviors.

Method: Constructs structured graphs combining sequential links, syntactic dependencies, and attention-derived token relations, then applies graph neural networks at two levels: prompt-level filter for global detection and token-level filter for fine-grained adversarial span identification.

Result: Substantially outperforms prior defenses, raising prompt-level F1 scores from 66.4% to 99.8% on LLM-Fuzzer and from 67-79% to over 94% on PLeak datasets. Token-level F1 improved from 48-75% to 74-91%, with IoU gains up to +28%.

Conclusion: GuardNet is a practical and robust defense that maintains acceptable latency and generalizes well in cross-domain evaluations, making it suitable for real-world LLM deployments against jailbreak threats.

Abstract: Large Language Models (LLMs) are increasingly susceptible to jailbreak attacks, which are adversarial prompts that bypass alignment constraints and induce unauthorized or harmful behaviors. These vulnerabilities undermine the safety, reliability, and trustworthiness of LLM outputs, posing critical risks in domains such as healthcare, finance, and legal compliance. In this paper, we propose GuardNet, a hierarchical filtering framework that detects and filters jailbreak prompts prior to inference. GuardNet constructs structured graphs that combine sequential links, syntactic dependencies, and attention-derived token relations to capture both linguistic structure and contextual patterns indicative of jailbreak behavior. It then applies graph neural networks at two levels: (i) a prompt-level filter that detects global adversarial prompts, and (ii) a token-level filter that pinpoints fine-grained adversarial spans. Extensive experiments across three datasets and multiple attack settings show that GuardNet substantially outperforms prior defenses. It raises prompt-level F$_1$ scores from 66.4% to 99.8% on LLM-Fuzzer, and from 67-79% to over 94% on PLeak datasets. At the token level, GuardNet improves F$_1$ from 48-75% to 74-91%, with IoU gains up to +28%. Despite its structural complexity, GuardNet maintains acceptable latency and generalizes well in cross-domain evaluations, making it a practical and robust defense against jailbreak threats in real-world LLM deployments.

[1145] IsingFormer: Augmenting Parallel Tempering With Learned Proposals

Saleh Bunaiyan, Corentin Delacour, Shuvro Chowdhury, Kyle Lee, Kerem Y. Camsari

Main category: cs.LG

TL;DR: IsingFormer uses a Transformer trained on equilibrium samples to generate global spin configurations as proposals in Parallel Tempering, accelerating MCMC sampling and optimization in Ising models and spin glasses.

Details

Motivation: Traditional MCMC methods like Parallel Tempering mix slowly near critical points and in rough landscapes due to reliance on local updates. There's a need for global moves that can capture complex correlations and accelerate sampling/optimization.

Method: Train a Transformer on equilibrium samples to generate entire spin configurations. Use these neural proposals as global moves within Metropolis steps in Parallel Tempering, complementing standard single-spin flips.

Result: On 2D Ising models: reproduces magnetization/free-energy curves, generalizes to unseen temperatures including critical region, sharply reduces equilibration time. On 3D spin glasses: finds substantially lower-energy states. On factorization problems: transfers successfully to unseen semiprimes, boosting success rates beyond training distribution.

Conclusion: Neural proposals that capture global structure can systematically accelerate Monte Carlo methods, enabling faster sampling and stronger performance in combinatorial optimization across problem families.

Abstract: Markov Chain Monte Carlo (MCMC) underlies both statistical physics and combinatorial optimization, but mixes slowly near critical points and in rough landscapes. Parallel Tempering (PT) improves mixing by swapping replicas across temperatures, yet each replica still relies on slow local updates to change its configuration. We introduce IsingFormer, a Transformer trained on equilibrium samples that can generate entire spin configurations resembling those from the target distribution. These uncorrelated samples are used as proposals for global moves within a Metropolis step in PT, complementing the usual single-spin flips. On 2D Ising models (sampling), IsingFormer reproduces magnetization and free-energy curves and generalizes to unseen temperatures, including the critical region. Injecting even a single proposal sharply reduces equilibration time, replacing thousands of local updates. On 3D spin glasses (optimization), PT enhanced with IsingFormer finds substantially lower-energy states, demonstrating how global moves accelerate search in rugged landscapes. Finally, applied to integer factorization encoded as Ising problems, IsingFormer trained on a limited set of semiprimes transfers successfully to unseen semiprimes, boosting success rates beyond the training distribution. Since factorization is a canonical hard benchmark, this ability to generalize across instances highlights the potential of learning proposals that move beyond single problems to entire families of instances. The IsingFormer demonstrates that Monte Carlo methods can be systematically accelerated by neural proposals that capture global structure, yielding faster sampling and stronger performance in combinatorial optimization.

[1146] Beyond Aggregation: Guiding Clients in Heterogeneous Federated Learning

Zijian Wang, Xiaofei Zhang, Xin Zhang, Yukun Liu, Qiong Zhang

Main category: cs.LG

TL;DR: A novel federated learning paradigm where the central server not only aggregates models but also actively guides new queries to the most appropriate client using an empirical likelihood-based framework.

Details

Motivation: To address statistical heterogeneity in FL systems by leveraging the central server's potential to guide new tasks to the best-suited clients, inspired by healthcare scenarios where hospitals with different specialties could be matched to patients.

Method: An empirical likelihood-based framework that simultaneously learns effective local models on each client and finds the best matching client for new queries.

Result: Empirical results show improvements in both model accuracy and client guidance precision compared to standard FL approaches on benchmark datasets.

Conclusion: This work opens a new direction for building intelligent federated systems that treat heterogeneity as a feature rather than a problem, enabling more resource-efficient FL deployments.

Abstract: Federated learning (FL) is increasingly adopted in domains like healthcare, where data privacy is paramount. A fundamental challenge in these systems is statistical heterogeneity-the fact that data distributions vary significantly across clients (e.g., different hospitals may treat distinct patient demographics). While current FL algorithms focus on aggregating model updates from these heterogeneous clients, the potential of the central server remains under-explored. This paper is motivated by a healthcare scenario: could a central server not only build a model but also guide a new patient to the hospital best equipped for their specific condition? We generalize this idea to propose a novel paradigm for FL systems where the server actively guides the allocation of new tasks or queries to the most appropriate client in the network. To enable this, we introduce an empirical likelihood-based framework that simultaneously addresses two goals: (1) learning effective local models on each client, and (2) finding the best matching client for a new query. Empirical results demonstrate the framework’s effectiveness on benchmark datasets, showing improvements in both model accuracy and the precision of client guidance compared to standard FL approaches. This work opens a new direction for building more intelligent and resource-efficient federated systems that leverage heterogeneity as a feature, not just a bug. Code is available at https://github.com/zijianwang0510/FedDRM.git.

[1147] Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Lin Long, Changdae Oh, Seongheon Park, Yixuan Li

Main category: cs.LG

TL;DR: The paper analyzes language prior in large vision-language models through chain-of-embedding analysis, identifying Visual Integration Points (VIP) and introducing Total Visual Integration (TVI) estimator to quantify visual influence.

Details

Motivation: Large vision-language models often rely on memorized textual patterns from pre-training rather than visual evidence, but existing analysis methods fail to reveal the internal mechanisms of when and how vision influences model behavior.

Method: Systematic analysis through chain-of-embedding, examining layer-wise representation dynamics to identify Visual Integration Points (VIP) and developing Total Visual Integration (TVI) estimator to quantify visual influence.

Result: Across 54 model-dataset combinations spanning 9 LVLMs and 6 benchmarks, VIP consistently emerges and TVI reliably predicts the strength of language prior, showing universal phenomenon of critical layers where visual information reshapes representations.

Conclusion: Provides a principled toolkit for diagnosing and understanding language prior in LVLMs through VIP identification and TVI quantification.

Abstract: Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP) – memorized textual patterns from pre-training while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.

[1148] Dynamics of Learning: Generative Schedules from Latent ODEs

Matt L. Sampson, Peter Melchior

Main category: cs.LG

TL;DR: A new learning rate scheduler that models training as a dynamical system, using hyperparameter search data to predict optimal future learning rates for better long-term validation performance and generalization.

Details

Motivation: Current learning rate schedules are either simple parametric functions or react only to short-term signals, lacking a comprehensive temporal view of neural network training performance.

Method: Models training performance as a dynamical system, leverages hyperparameter search runs to learn latent representations of training process, predicts future learning rate schedules based on current metrics for optimal long-term validation performance.

Result: Achieves SOTA results for image classification (CNN/ResNet) and next-token prediction (transformer), produces models in flatter loss landscape regions with better generalization, computationally efficient and optimizer-agnostic.

Conclusion: The scheduler generalizes beyond observed training dynamics, creates specialized schedules that outperform common parametric functions, and can be easily integrated with ML experiment-tracking platforms.

Abstract: The learning rate schedule is one of the most impactful aspects of neural network optimization, yet most schedules either follow simple parametric functions or react only to short-term training signals. None of them are supported by a comprehensive temporal view of how well neural networks actually train. We present a new learning rate scheduler that models the training performance of neural networks as a dynamical system. It leverages training runs from a hyperparameter search to learn a latent representation of the training process. Given current training metrics, it predicts the future learning rate schedule with the best long-term validation performance. Our scheduler generalizes beyond previously observed training dynamics and creates specialized schedules that deviate noticeably from common parametric functions. It achieves SOTA results for image classification with CNN and ResNet models as well as for next-token prediction with a transformer model. The trained models are located in flatter regions of the loss landscape and thus provide better generalization than those trained with other schedules. Our method is computationally efficient, optimizer-agnostic, and can easily be layered on top of ML experiment-tracking platforms. An implementation of our scheduler will be made available after acceptance.

[1149] Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li

Main category: cs.LG

TL;DR: The paper introduces a predictability-aligned diagnostic framework using spectral coherence to address the flaw in standard evaluation metrics that conflate model performance with data’s intrinsic unpredictability.

Details

Motivation: Standard evaluation metrics for time series forecasting fail to distinguish between model performance and data's inherent unpredictability, leading to unfair model comparisons on benchmark leaderboards.

Method: Proposes Spectral Coherence Predictability (SCP) - a computationally efficient score to quantify forecasting difficulty, and Linear Utilization Ratio (LUR) - a frequency-resolved diagnostic tool to measure how effectively models exploit linearly predictable information.

Result: Reveals two key insights: evidence of ‘predictability drift’ showing forecasting difficulty varies over time, and an architectural trade-off where complex models excel on low-predictability data while linear models are effective on predictable tasks.

Conclusion: Advocates for a paradigm shift from simplistic aggregate scores to predictability-aware evaluation for fairer model comparisons and deeper understanding of model behavior.

Abstract: In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model’s performance with the data’s intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework’s effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of “predictability drift”, demonstrating that a task’s forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.

[1150] CLAD-Net: Continual Activity Recognition in Multi-Sensor Wearable Systems

Reza Rahimi Azghan, Gautham Krishna Gudur, Mohit Malu, Edison Thomaz, Giulia Pedrielli, Pavan Turaga, Hassan Ghasemzadeh

Main category: cs.LG

TL;DR: CLAD-Net is a continual learning framework that addresses catastrophic forgetting in human activity recognition from wearable sensors, combining self-supervised transformers with supervised CNNs to maintain performance across subject-wise distribution shifts.

Details

Motivation: Real-world wearable sensor data suffers from distribution shifts between subjects and catastrophic forgetting in continual learning settings, compounded by limited labeled data availability.

Method: CLAD-Net integrates a self-supervised transformer (long-term memory) with a supervised CNN trained via knowledge distillation. The transformer captures global activity patterns through cross-attention across body sensors, while the CNN retains prior knowledge during subject-wise fine-tuning.

Result: On PAMAP2 dataset, CLAD-Net achieves 91.36% final accuracy with only 8.78% forgetting, outperforming memory-based and regularization-based baselines. It maintains strong performance even with only 10-20% labeled data in semi-supervised settings.

Conclusion: CLAD-Net effectively addresses catastrophic forgetting and distribution shifts in wearable sensor-based HAR, demonstrating robust performance across different subjects and limited labeled data scenarios through its dual-architecture approach.

Abstract: The rise of deep learning has greatly advanced human behavior monitoring using wearable sensors, particularly human activity recognition (HAR). While deep models have been widely studied, most assume stationary data distributions

an assumption often violated in real-world scenarios. For example, sensor data from one subject may differ significantly from another, leading to distribution shifts. In continual learning, this shift is framed as a sequence of tasks, each corresponding to a new subject. Such settings suffer from catastrophic forgetting, where prior knowledge deteriorates as new tasks are learned. This challenge is compounded by the scarcity and inconsistency of labeled data in human studies. To address these issues, we propose CLAD-Net (Continual Learning with Attention and Distillation), a framework enabling wearable-sensor models to be updated continuously without sacrificing performance on past tasks. CLAD-Net integrates a self-supervised transformer, acting as long-term memory, with a supervised Convolutional Neural Network (CNN) trained via knowledge distillation for activity classification. The transformer captures global activity patterns through cross-attention across body-mounted sensors, learning generalizable representations without labels. Meanwhile, the CNN leverages knowledge distillation to retain prior knowledge during subject-wise fine-tuning. On PAMAP2, CLAD-Net achieves 91.36 percent final accuracy with only 8.78 percent forgetting, surpassing memory-based and regularization-based baselines such as Experience Replay and Elastic Weight Consolidation. In semi-supervised settings with only 10-20 percent labeled data, CLAD-Net still delivers strong performance, demonstrating robustness to label scarcity. Ablation studies further validate each module’s contribution.

[1151] Signal Preserving Weight Initialization for Odd-Sigmoid Activations

Hyunwoo Lee, Hayoung Choi, Hyunju Kim

Main category: cs.LG

TL;DR: The paper proposes a novel initialization method tailored to odd sigmoid activation functions that prevents saturation and variance collapse, enabling reliable training without normalization layers.

Details

Motivation: Activation functions and weight initialization are interdependent, and standard initialization methods often fail with certain nonlinearities, causing saturation, variance collapse, and learning rate sensitivity.

Method: Defines an odd sigmoid function class and develops a closed-form initialization method that selects noise scale to keep forward activations well dispersed up to target layers, avoiding collapse to zero or saturation.

Result: The proposed method trains reliably without normalization layers, shows strong data efficiency, and enables learning for activations where standard initialization methods (Xavier, He, Orthogonal) often fail to converge.

Conclusion: Tailored initialization methods for specific activation function classes can overcome limitations of standard approaches and enable reliable training without normalization layers.

Abstract: Activation functions critically influence trainability and expressivity, and recent work has therefore explored a broad range of nonlinearities. However, activations and weight initialization are interdependent: without an appropriate initialization method, nonlinearities can cause saturation, variance collapse, and increased learning rate sensitivity. We address this by defining an odd sigmoid function class and, given any activation f in this class, proposing an initialization method tailored to f. The method selects a noise scale in closed form so that forward activations remain well dispersed up to a target layer, thereby avoiding collapse to zero or saturation. Empirically, the approach trains reliably without normalization layers, exhibits strong data efficiency, and enables learning for activations under which standard initialization methods (Xavier, He, Orthogonal) often do not converge reliably.

[1152] Unleashing Flow Policies with Distributional Critics

Deshu Chen, Yuchen Liu, Zhijian Zhou, Chao Qu, Yuan Qi

Main category: cs.LG

TL;DR: The paper introduces Distributional Flow Critic (DFC), a novel critic architecture that models the complete state-action return distribution using flow matching, enabling more stable and informative learning for flow-based policies in offline and offline-to-online RL.

Details

Motivation: Flow-based policies can model complex multimodal behaviors, but their potential is limited by traditional critics that only learn single scalar value estimates, creating a bottleneck in learning performance.

Method: DFC uses flow matching to model the return distribution as a continuous transformation from a simple base distribution to the complex target distribution, providing rich distributional Bellman targets for flow-based policies.

Result: Extensive experiments on D4RL and OGBench benchmarks show strong performance, particularly on tasks requiring multimodal action distributions, and superior results in both offline and offline-to-online fine-tuning compared to existing methods.

Conclusion: DFC effectively addresses the limitations of traditional scalar critics by providing distributional learning signals that enhance the performance of expressive flow-based policies in complex RL scenarios.

Abstract: Flow-based policies have recently emerged as a powerful tool in offline and offline-to-online reinforcement learning, capable of modeling the complex, multimodal behaviors found in pre-collected datasets. However, the full potential of these expressive actors is often bottlenecked by their critics, which typically learn a single, scalar estimate of the expected return. To address this limitation, we introduce the Distributional Flow Critic (DFC), a novel critic architecture that learns the complete state-action return distribution. Instead of regressing to a single value, DFC employs flow matching to model the distribution of return as a continuous, flexible transformation from a simple base distribution to the complex target distribution of returns. By doing so, DFC provides the expressive flow-based policy with a rich, distributional Bellman target, which offers a more stable and informative learning signal. Extensive experiments across D4RL and OGBench benchmarks demonstrate that our approach achieves strong performance, especially on tasks requiring multimodal action distributions, and excels in both offline and offline-to-online fine-tuning compared to existing methods.

[1153] Demystifying Network Foundation Models

Sylee, Beltiukov, Satyandra Guthula, Wenbo Guo, Walter Willinger, Arpit Gupta

Main category: cs.LG

TL;DR: Systematic analysis of Network Foundation Models (NFMs) reveals significant limitations in latent knowledge encoding, including anisotropy, inconsistent feature sensitivity, and inability to separate high-level context, with fixes improving performance by up to +0.35 F1 score.

Details

Motivation: To investigate the latent knowledge encoded within Network Foundation Models beyond just downstream task performance, focusing on hidden representations analysis that existing efforts have overlooked.

Method: Three-part evaluation: Embedding Geometry Analysis (representation space utilization), Metric Alignment Assessment (correspondence with domain-expert features), and Causal Sensitivity Testing (robustness to protocol perturbations) using five diverse network datasets and four state-of-the-art NFMs.

Result: All evaluated NFMs exhibit significant anisotropy, inconsistent feature sensitivity patterns, inability to separate high-level context, payload dependency, and other limitations. Addressing these issues can significantly improve model performance by up to +0.35 F1 score without architectural changes.

Conclusion: Current Network Foundation Models have fundamental limitations in their latent knowledge encoding that need to be addressed, and systematic analysis reveals concrete ways to improve their performance significantly.

Abstract: This work presents a systematic investigation into the latent knowledge encoded within Network Foundation Models (NFMs) that focuses on hidden representations analysis rather than pure downstream task performance. Different from existing efforts, we analyze the models through a three-part evaluation: Embedding Geometry Analysis to assess representation space utilization, Metric Alignment Assessment to measure correspondence with domain-expert features, and Causal Sensitivity Testing to evaluate robustness to protocol perturbations. Using five diverse network datasets spanning controlled and real-world environments, we evaluate four state-of-the-art NFMs, revealing that they all exhibit significant anisotropy, inconsistent feature sensitivity patterns, an inability to separate the high-level context, payload dependency, and other properties. Our work identifies numerous limitations across all models and demonstrates that addressing them can significantly improve model performance (by up to +0.35 $F_1$ score without architectural changes).

[1154] Sensitivity Analysis for Diffusion Models

Christopher Scarvelis, Justin Solomon

Main category: cs.LG

TL;DR: The paper presents a method to compute directional derivatives of diffusion models, enabling prediction of how model outputs change with training data perturbations without retraining.

Details

Motivation: To predict how diffusion model scores and samples would change under small training set perturbations before committing to costly retraining.

Method: Closed-form procedure using black-box access to pre-trained score models and their derivatives, with extensions to estimate sample sensitivity to target measure perturbations.

Result: Method is robust to numerical/approximation error, and computed sensitivities correlate with actual changes in image diffusion model samples after retraining/fine-tuning.

Conclusion: The approach enables efficient prediction of model behavior changes without retraining, with runtime comparable to standard sampling and log-likelihood computation.

Abstract: Training a diffusion model approximates a map from a data distribution $\rho$ to the optimal score function $s_t$ for that distribution. Can we differentiate this map? If we could, then we could predict how the score, and ultimately the model’s samples, would change under small perturbations to the training set before committing to costly retraining. We give a closed-form procedure for computing this map’s directional derivatives, relying only on black-box access to a pre-trained score model and its derivatives with respect to its inputs. We extend this result to estimate the sensitivity of a diffusion model’s samples to additive perturbations of its target measure, with runtime comparable to sampling from a diffusion model and computing log-likelihoods along the sample path. Our method is robust to numerical and approximation error, and the resulting sensitivities correlate with changes in an image diffusion model’s samples after retraining and fine-tuning.

[1155] Causally-Enhanced Reinforcement Policy Optimization

Xiangqi Wang, Yue Huang, Yujun Zhou, Xiaonan Luo, Kehan Guo, Xiangliang Zhang

Main category: cs.LG

TL;DR: CE-PO is a reward-shaping framework that enhances LLM training by incorporating causal coherence along the generation pathway, reducing shortcut strategies and improving robustness while maintaining accuracy.

Details

Motivation: LLMs trained with reinforcement objectives often use shortcut strategies that produce superficially correct answers with unfaithful reasoning, making them vulnerable to small causal perturbations.

Method: Uses Jacobian-based sensitivities to estimate model-internal influence, counterfactually hardens these signals, and fuses coherence scores with task-accuracy feedback via Minkowski combiner. Integrates with PPO/GRPO without architectural changes.

Result: Improves accuracy by 5.49% on average (up to 9.58%) across 4 datasets, reduces reward hacking and unfaithful chain-of-thought, and improves robustness to correlation-causation flips and counterfactual edits.

Conclusion: CE-PO effectively enhances policy optimization by incorporating causal coherence, achieving better accuracy and robustness while reducing unfaithful reasoning in LLMs.

Abstract: Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable between accuracy and coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%), while improving robustness to correlation-causation flips and light counterfactual edits.

[1156] Towards Quantum-Ready Blockchain Fraud Detection via Ensemble Graph Neural Networks

M. Z. Haider, Tayyaba Noreen, M. Salman

Main category: cs.LG

TL;DR: Proposes an ensemble GNN framework combining GCN, GAT, and GIN for blockchain fraud detection, achieving high recall with low false positives on the Elliptic dataset, with quantum-ready design for future scalability.

Details

Motivation: Blockchain's pseudonymous nature enables illicit activities, challenging AML enforcement. Need for models that capture structural and temporal dependencies while being resilient to noise, imbalance, and adversarial behavior.

Method: Ensemble framework integrating Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Graph Isomorphism Networks (GIN) with soft voting. Uses real-world Elliptic dataset and includes quantum-ready design hooks.

Result: Tuned soft voting ensemble achieves high recall of illicit transactions while maintaining false positive rate below 1%, outperforming individual GNN models and baseline methods.

Conclusion: Ensemble GNNs provide a practical and forward-looking solution for real-time cryptocurrency monitoring, offering immediate AML utility and a pathway toward quantum-enhanced financial security analytics.

Abstract: Blockchain Business applications and cryptocurrencies such as enable secure, decentralized value transfer, yet their pseudonymous nature creates opportunities for illicit activity, challenging regulators and exchanges in anti money laundering (AML) enforcement. Detecting fraudulent transactions in blockchain networks requires models that can capture both structural and temporal dependencies while remaining resilient to noise, imbalance, and adversarial behavior. In this work, we propose an ensemble framework that integrates Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Graph Isomorphism Networks (GIN) to enhance blockchain fraud detection. Using the real-world Elliptic dataset, our tuned soft voting ensemble achieves high recall of illicit transactions while maintaining a false positive rate below 1%, beating individual GNN models and baseline methods. The modular architecture incorporates quantum-ready design hooks, allowing seamless future integration of quantum feature mappings and hybrid quantum classical graph neural networks. This ensures scalability, robustness, and long-term adaptability as quantum computing technologies mature. Our findings highlight ensemble GNNs as a practical and forward-looking solution for real-time cryptocurrency monitoring, providing both immediate AML utility and a pathway toward quantum-enhanced financial security analytics.

[1157] Effective Quantization of Muon Optimizer States

Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, S. Sathiya Keerthi

Main category: cs.LG

TL;DR: 8-bit Muon optimizer using blockwise quantization reduces memory footprint by ~74% while maintaining performance comparable to full-precision Muon, outperforming AdamW and 8-bit AdamW in LLM pretraining.

Details

Motivation: Muon optimizer shows faster convergence than AdamW but is stateful like AdamW, requiring significant memory for accumulated gradients. While 8-bit AdamW variants exist, they are typically stable only under dynamic quantization.

Method: Introduce 8-bit Muon optimizer using blockwise quantization, supporting both linear and dynamic quantization schemes.

Result: 8-bit Muon maintains stability under both quantization schemes, reduces memory footprint by ~74% compared to full-precision Muon, matches Muon’s performance, and outperforms AdamW/8-bit AdamW in pretraining 1.6B model on 4B tokens.

Conclusion: 8-bit Muon provides significant memory savings while maintaining performance, with theoretical explanation for its robustness under quantization.

Abstract: The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and up to 2x computational efficiency over AdamW in LLM pretraining. Like AdamW, Muon is stateful, requiring storage of both model weights and accumulated gradients. While 8-bit AdamW variants mitigate this overhead using blockwise quantization, they are typically stable only under dynamic quantization - which improves stability on linear quantization for extreme values. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization, supporting both linear and dynamic schemes. We demonstrate that 8-bit Muon maintains stability under both, while delivering $\sim$74% reduction in memory footprint compared to full-precision Muon. In extensive experiments, 8-bit Muon closely matches the performance of Muon while outperforming AdamW and 8-bit AdamW in pre-training a 1.6B model on 4B FineWeb tokens. It also shows competitive results when fine-tuning the Llama 3.2 3B model on post-training data. We also provide a theoretical perspective to help explain this robustness under quantization.

[1158] RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang

Main category: cs.LG

TL;DR: RHYTHM is a framework using LLMs for human mobility prediction with temporal tokenization and hierarchical attention to handle long-range dependencies and multi-scale periodic behaviors.

Details

Motivation: Human mobility prediction is challenging due to complex long-range dependencies and multi-scale periodic behaviors that existing methods struggle to capture effectively.

Method: Uses temporal tokenization to partition trajectories into daily segments, encodes them as discrete tokens with hierarchical attention for daily/weekly dependencies, and enriches tokens with pre-computed prompt embeddings from frozen LLMs.

Result: Achieves 2.4% overall accuracy improvement, 5.0% increase on weekends, and 24.6% reduction in training time compared to state-of-the-art methods on three real-world datasets.

Conclusion: RHYTHM effectively leverages LLMs as spatio-temporal predictors with temporal tokenization and hierarchical attention, demonstrating superior performance and efficiency in human mobility prediction.

Abstract: Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors. To address this, we introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework that leverages large language models (LLMs) as general-purpose spatio-temporal predictors and trajectory reasoners. Methodologically, RHYTHM employs temporal tokenization to partition each trajectory into daily segments and encode them as discrete tokens with hierarchical attention that captures both daily and weekly dependencies, thereby significantly reducing the sequence length while preserving cyclical information. Additionally, we enrich token representations by adding pre-computed prompt embeddings for trajectory segments and prediction targets via a frozen LLM, and feeding these combined embeddings back into the LLM backbone to capture complex interdependencies. Computationally, RHYTHM freezes the pretrained LLM’s backbone to reduce attention complexity and memory cost. We evaluate our model against state-of-the-art methods using three real-world datasets. Notably, RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time. Code is publicly available at https://github.com/he-h/rhythm.

[1159] Impute-MACFM: Imputation based on Mask-Aware Flow Matching

Dengyi Liu, Honggang Wang, Hua Fang

Main category: cs.LG

TL;DR: Impute-MACFM is a mask-aware conditional flow matching framework for tabular data imputation that handles different missingness mechanisms (MCAR, MAR, MNAR) with stable training and efficient inference.

Details

Motivation: Tabular data, especially in healthcare, often has missing values that undermine model reliability. Existing methods either make restrictive assumptions, struggle with complex feature dependencies, or suffer from instability and high computational costs.

Method: Uses mask-aware conditional flow matching with trajectories only on missing entries, stability penalties on observed values, consistency regularization, and time-decayed noise injection. Inference uses constraint-preserving ODE integration with per-step projection.

Result: Achieves state-of-the-art performance across diverse benchmarks, providing more robust, efficient, and higher-quality imputation than competing approaches.

Conclusion: Flow matching is a promising direction for tabular missing-data problems, particularly for longitudinal data like in healthcare applications.

Abstract: Tabular data are central to many applications, especially longitudinal data in healthcare, where missing values are common, undermining model fidelity and reliability. Prior imputation methods either impose restrictive assumptions or struggle with complex cross-feature structure, while recent generative approaches suffer from instability and costly inference. We propose Impute-MACFM, a mask-aware conditional flow matching framework for tabular imputation that addresses missingness mechanisms, missing completely at random, missing at random, and missing not at random. Its mask-aware objective builds trajectories only on missing entries while constraining predicted velocity to remain near zero on observed entries, using flexible nonlinear schedules. Impute-MACFM combines: (i) stability penalties on observed positions, (ii) consistency regularization enforcing local invariance, and (iii) time-decayed noise injection for numeric features. Inference uses constraint-preserving ordinary differential equation integration with per-step projection to fix observed values, optionally aggregating multiple trajectories for robustness. Across diverse benchmarks, Impute-MACFM achieves state-of-the-art results while delivering more robust, efficient, and higher-quality imputation than competing approaches, establishing flow matching as a promising direction for tabular missing-data problems, including longitudinal data.

[1160] C$^2$GSPG: Confidence-calibrated Group Sequence Policy Gradient towards Self-aware Reasoning

Haotian Liu, Shuo Wang, Hongteng Xu

Main category: cs.LG

TL;DR: C²GSPG is a confidence-calibration group sequence policy gradient method that enhances reasoning performance while suppressing overconfidence in reinforcement learning models.

Details

Motivation: Existing RL methods like GRPO suffer from overconfidence issues that prevent achieving self-aware reasoning models.

Method: Proposes Group Sequence Policy Gradient (GSPG) framework to eliminate token-level bias, defines model confidence using normalized sequence-level probability, and applies cross-entropy regularizer to calibrate confidence to reward.

Result: Superior performance over state-of-the-art methods in logical and mathematical reasoning tasks, achieving better reasoning accuracy and confidence calibration.

Conclusion: C²GSPG effectively addresses overconfidence while improving reasoning performance through collaborative confidence calibration and policy gradient optimization.

Abstract: Reinforcement Learning (RL) methods, exemplified by Group Relative Policy Optimization (GRPO) and its variants, play a central role in developing reasoning models. However, these methods often suffer from a critical overconfidence issue, which prevents them from achieving self-aware reasoning models. In this study, we propose a simple yet effective confidence-calibration group sequence policy gradient method, called C$^2$GSPG, which simultaneously enhances reasoning performance while suppressing overconfidence. In principle, we propose a Group Sequence Policy Gradient (GSPG) framework for learning reasoning models, which eliminates the token-level bias commonly appearing in GRPO and its variants. In this framework, we define the model confidence for each reasoning problem using the normalized sequence-level probability, and then apply a cross-entropy regularizer to calibrate the model confidence to the sequence’s reward. We demonstrate that the confidence calibration regularizer and GSPG are collaborative for binary rewards, as their objectives always share the same gradient direction. For non-binary rewards, we apply nonlinear reward normalization and adaptive regularizer clipping, mitigating the potential conflict between the two objectives. Applying C$^2$GSPG to post-train large language models in logical and mathematical reasoning tasks, we show its superiority over state-of-the-art methods in both reasoning accuracy and confidence calibration. The code of C$^2$GSPG is available at https://github.com/HaotianLiu123/CCGSPG.

[1161] Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gael Gendron, Libo Zhang, Jiamou Liu, Michael J. Witbrock

Main category: cs.LG

TL;DR: This paper introduces Trust Region Reward Optimization (TRRO), a stable non-adversarial Inverse Reinforcement Learning framework that guarantees monotonic improvement in expert behavior likelihood, with practical instantiation as PIRO algorithm.

Details

Motivation: Modern adversarial IRL methods suffer from unstable training, while recent non-adversarial approaches lack formal guarantees despite improved stability.

Method: Proposes TRRO framework using Minorization-Maximization process to guarantee monotonic improvement, instantiated as PIRO algorithm that jointly learns reward and policy via energy-based formulations.

Result: PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation with high sample efficiency on MuJoCo, Gym-Robotics benchmarks and real-world animal behavior modeling.

Conclusion: TRRO provides IRL counterpart to TRPO’s stability guarantees in forward RL, offering a theoretically grounded and practically effective approach to stable IRL.

Abstract: Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often lead to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery, policy imitation with high sample efficiency on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.

[1162] Beyond Heuristics: Globally Optimal Configuration of Implicit Neural Representations

Sipeng Chen, Yan Zhang, Shibo Li

Main category: cs.LG

TL;DR: OptiINR is a unified framework that uses Bayesian optimization to automatically find optimal configurations for Implicit Neural Representations, replacing manual tuning with systematic optimization across activation functions and initialization parameters.

Details

Motivation: Current INR practice relies on ad-hoc heuristics and grid searches for configuration, leading to inconsistent results across different tasks and modalities. There's a need for principled, automated optimization of INR parameters.

Method: Formulates INR configuration as an optimization problem using Bayesian optimization to jointly explore discrete activation families (SIREN, WIRE, FINER) and continuous initialization parameters in a unified framework.

Result: OptiINR provides globally optimal configurations that consistently maximize performance across diverse signal processing applications, replacing fragmented manual tuning with data-driven optimization.

Conclusion: OptiINR establishes a principled foundation for INR design through systematic optimization, enabling consistent high performance across different modalities without manual intervention.

Abstract: Implicit Neural Representations (INRs) have emerged as a transformative paradigm in signal processing and computer vision, excelling in tasks from image reconstruction to 3D shape modeling. Yet their effectiveness is fundamentally limited by the absence of principled strategies for optimal configuration - spanning activation selection, initialization scales, layer-wise adaptation, and their intricate interdependencies. These choices dictate performance, stability, and generalization, but current practice relies on ad-hoc heuristics, brute-force grid searches, or task-specific tuning, often leading to inconsistent results across modalities. This work introduces OptiINR, the first unified framework that formulates INR configuration as a rigorous optimization problem. Leveraging Bayesian optimization, OptiINR efficiently explores the joint space of discrete activation families - such as sinusoidal (SIREN), wavelet-based (WIRE), and variable-periodic (FINER) - and their associated continuous initialization parameters. This systematic approach replaces fragmented manual tuning with a coherent, data-driven optimization process. By delivering globally optimal configurations, OptiINR establishes a principled foundation for INR design, consistently maximizing performance across diverse signal processing applications.

[1163] TimeExpert: Boosting Long Time Series Forecasting with Temporal Mix of Experts

Xiaowen Ma, Shuning Ge, Fan Yang, Xiangyu Li, Yun Chen, Mengting Ma, Wei Zhang, Zhipeng Liu

Main category: cs.LG

TL;DR: TimeExpert introduces Temporal Mix of Experts (TMOE) to replace vanilla attention in Transformers, addressing lag effects and anomalies in time series by combining local expert selection with global dependency modeling.

Details

Motivation: Standard Transformer attention fails to handle dynamic lag effects and anomalous segments in real-world time series data, leading to degraded forecasting accuracy.

Method: Proposes TMOE that treats key-value pairs as local temporal experts, performs adaptive expert selection via localized filtering, and maintains a shared global expert for long-range dependencies. Replaces attention in existing frameworks like PatchTST and Timer.

Result: TimeExpert and TimeExpert-G outperform state-of-the-art methods on seven real-world long-term forecasting benchmarks.

Conclusion: TMOE effectively addresses temporal challenges in time series modeling while preserving Transformer strengths, demonstrating superior performance across multiple benchmarks.

Abstract: Transformer-based architectures dominate time series modeling by enabling global attention over all timestamps, yet their rigid ‘one-size-fits-all’ context aggregation fails to address two critical challenges in real-world data: (1) inherent lag effects, where the relevance of historical timestamps to a query varies dynamically; (2) anomalous segments, which introduce noisy signals that degrade forecasting accuracy. To resolve these problems, we propose the Temporal Mix of Experts (TMOE), a novel attention-level mechanism that reimagines key-value (K-V) pairs as local experts (each specialized in a distinct temporal context) and performs adaptive expert selection for each query via localized filtering of irrelevant timestamps. Complementing this local adaptation, a shared global expert preserves the Transformer’s strength in capturing long-range dependencies. We then replace the vanilla attention mechanism in popular time-series Transformer frameworks (i.e., PatchTST and Timer) with TMOE, without extra structural modifications, yielding our specific version TimeExpert and general version TimeExpert-G. Extensive experiments on seven real-world long-term forecasting benchmarks demonstrate that TimeExpert and TimeExpert-G outperform state-of-the-art methods. Code is available at https://github.com/xwmaxwma/TimeExpert.

[1164] Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang

Main category: cs.LG

TL;DR: Mirror-Critique is a framework that trains verifiers with informative critiques by contrasting model-generated solutions with ground-truth solutions, using synthetic critique data to improve verification ability and solution accuracy.

Details

Motivation: Current reward model selection in test-time scaling often fails to identify minority-yet-correct answers, limiting effectiveness beyond simple majority voting due to lack of informative critique signals during verifier training.

Method: Leverage rich critique signals by contrasting model-generated solutions with ground-truth solutions, use small instruction-tuned model to synthesize high-quality critique data with rejection sampling, and employ synthetic data to cold-start LLMs in RLVR process for improved verification.

Result: Mirror-Verifier significantly outperforms majority voting in solution accuracy and improves solver’s honesty to recognize and abstain from answering beyond capability boundaries.

Conclusion: The Mirror-Critique framework successfully addresses limitations of current verification methods by providing informative critiques, leading to better solution selection and improved model honesty.

Abstract: Test-time scaling via solution sampling and aggregation has become a key paradigm for improving the reasoning performance of Large Language Models (LLMs). While reward model selection is commonly employed in this approach, it often fails to identify minority-yet-correct answers, which limits its effectiveness beyond that of simple majority voting. We argue that this limitation stems from a lack of informative critique signals during verifier training. To bridge this gap, we introduce Mirror-Critique, a framework that trains a verifier with informative critiques. Our key insight is to leverage the rich critique signal by contrasting model-generated solutions with ground-truth solutions. We deploy a small instruction-tuned model to synthesize high-quality critique data with rejection sampling that teaches the verifier not only what is wrong, but also why. The synthetic data is used to cold-start the LLMs in the RLVR process to further improve the verification ability. The resulting Mirror-Verifier is deployed to evaluate candidate solutions by generating multiple critiques per solution, aggregating them into a verify score used for weighted voting or selective abstention. The experimental results show that our Mirror-Verifier significantly outperforms majority voting in terms of solution accuracy and also improves the solver’s honesty to recognize and abstain from answering beyond its capability boundaries.

[1165] CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning

Prashant Govindarajan, Mathieu Reymond, Antoine Clavaud, Mariano Phielipp, Santiago Miret, Sarath Chandar

Main category: cs.LG

TL;DR: CrystalGym is an open-source RL environment for crystalline material discovery that enables direct DFT feedback in material design through reinforcement learning.

Details

Motivation: To address the limitation of current ML approaches that avoid direct DFT signals due to computational costs, and to enable online reinforcement learning with DFT feedback for material design.

Method: Proposed CrystalGym environment for benchmarking RL algorithms on crystalline material design tasks, with properties like band gap, bulk modulus, and density calculated directly from DFT.

Result: Benchmarked various RL algorithms showing different sample efficiencies and convergence patterns, but none solved all tasks completely. Included case study on fine-tuning LLMs with RL for DFT-based rewards.

Conclusion: CrystalGym serves as a test bed for RL researchers and material scientists, introducing challenges for methods dealing with time-consuming reward signals and enabling future interdisciplinary research.

Abstract: In silico design and optimization of new materials primarily relies on high-accuracy atomic simulators that perform density functional theory (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation mainly due to DFT’s high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose CrystalGym, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark common value- and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties like the band gap, bulk modulus, and density, which are directly calculated from DFT in the environment. While none of the algorithms we benchmark solve all CrystalGym tasks, our extensive experiments and ablations show different sample efficiencies and ease of convergence to optimality for different algorithms and environment settings. Additionally, we include a case study on the scope of fine-tuning large language models with reinforcement learning for improving DFT-based rewards. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. We therefore introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary research for machine learning motivated by real-world applications.

[1166] Deep Learning-Based Detection of Cognitive Impairment from Passive Smartphone Sensing with Routine-Aware Augmentation and Demographic Personalization

Yufei Shen, Ji Hwan Park, Minchao Huang, Jared F. Benge, Justin F. Rousseau, Rosemary A. Lester-Smith, Edison Thomaz

Main category: cs.LG

TL;DR: LSTM model with routine-aware augmentation and demographic personalization improves cognitive impairment detection from smartphone sensing data in older adults.

Details

Motivation: Early detection of cognitive impairment is critical but infrequent clinical assessments lack sensitivity and temporal resolution to capture subtle declines.

Method: Implemented LSTM model with two techniques: routine-aware augmentation (generates synthetic sequences using behaviorally similar alternatives) and demographic personalization (reweights training samples based on demographic similarity).

Result: Joint techniques improved AUPRC from 0.637 to 0.766 on 6-month data from 36 older adults.

Conclusion: Demonstrates potential for scalable monitoring of cognitive impairment in aging populations using passive smartphone sensing.

Abstract: Early detection of cognitive impairment is critical for timely diagnosis and intervention, yet infrequent clinical assessments often lack the sensitivity and temporal resolution to capture subtle cognitive declines in older adults. Passive smartphone sensing has emerged as a promising approach for naturalistic and continuous cognitive monitoring. Building on this potential, we implemented a Long Short-Term Memory (LSTM) model to detect cognitive impairment from sequences of daily behavioral features, derived from multimodal sensing data collected in an ongoing one-year study of older adults. Our key contributions are two techniques to enhance model generalizability across participants: (1) routine-aware augmentation, which generates synthetic sequences by replacing each day with behaviorally similar alternatives, and (2) demographic personalization, which reweights training samples to emphasize those from individuals demographically similar to the test participant. Evaluated on 6-month data from 36 older adults, these techniques jointly improved the Area Under the Precision-Recall Curve (AUPRC) of the model trained on sensing and demographic features from 0.637 to 0.766, highlighting the potential of scalable monitoring of cognitive impairment in aging populations with passive sensing.

[1167] ProtoTS: Learning Hierarchical Prototypes for Explainable Time Series Forecasting

Ziheng Peng, Shijie Ren, Xinyue Gu, Linxiao Yang, Xiting Wang, Liang Sun

Main category: cs.LG

TL;DR: ProtoTS is an interpretable time series forecasting framework that uses prototypical temporal patterns to provide both high accuracy and transparent decision-making through hierarchical prototype organization and multi-level interpretability.

Details

Motivation: Deep learning models lack transparency in decision-making for high-stakes scenarios, and existing interpretable models only provide local/partial explanations without showing how heterogeneous input variables jointly shape temporal patterns.

Method: ProtoTS computes instance-prototype similarity using denoised representations that preserve heterogeneous information, and organizes prototypes hierarchically to capture both global temporal patterns (coarse prototypes) and finer-grained local variations (detailed prototypes).

Result: Experiments on multiple realistic benchmarks including a new LOF dataset show ProtoTS exceeds existing methods in forecast accuracy while providing expert-steerable interpretations for better model understanding and decision support.

Conclusion: ProtoTS successfully achieves both high forecasting accuracy and transparent decision-making through its prototypical pattern modeling approach, enabling multi-level interpretability and expert steering capabilities.

Abstract: While deep learning has achieved impressive performance in time series forecasting, it becomes increasingly crucial to understand its decision-making process for building trust in high-stakes scenarios. Existing interpretable models often provide only local and partial explanations, lacking the capability to reveal how heterogeneous and interacting input variables jointly shape the overall temporal patterns in the forecast curve. We propose ProtoTS, a novel interpretable forecasting framework that achieves both high accuracy and transparent decision-making through modeling prototypical temporal patterns. ProtoTS computes instance-prototype similarity based on a denoised representation that preserves abundant heterogeneous information. The prototypes are organized hierarchically to capture global temporal patterns with coarse prototypes while capturing finer-grained local variations with detailed prototypes, enabling expert steering and multi-level interpretability. Experiments on multiple realistic benchmarks, including a newly released LOF dataset, show that ProtoTS not only exceeds existing methods in forecast accuracy but also delivers expert-steerable interpretations for better model understanding and decision support.

[1168] Dense associative memory on the Bures-Wasserstein space

Chandan Tankala, Krishnakumar Balasubramanian

Main category: cs.LG

TL;DR: Extends dense associative memories from vectors to probability distributions using 2-Wasserstein distance, focusing on Gaussian densities with log-sum-exp energy and optimal transport-based retrieval dynamics.

Details

Motivation: Existing dense associative memory models are limited to vector representations, but many real-world data are naturally represented as probability distributions. This work aims to bridge classical associative memories with modern generative modeling by enabling distributional storage and retrieval.

Method: Defines a log-sum-exp energy over stored distributions and retrieval dynamics that aggregate optimal transport maps in a Gibbs-weighted manner. Focuses on Bures-Wasserstein class of Gaussian densities, with stationary points corresponding to self-consistent Wasserstein barycenters.

Result: Proves exponential storage capacity, provides quantitative retrieval guarantees under Wasserstein perturbations, and validates the model on both synthetic and real-world distributional tasks.

Conclusion: Successfully elevates associative memory from vectors to full distributions, enabling distributional storage and retrieval in memory-augmented learning and bridging classical DAMs with modern generative modeling.

Abstract: Dense associative memories (DAMs) store and retrieve patterns via energy-functional fixed points, but existing models are limited to vector representations. We extend DAMs to probability distributions equipped with the 2-Wasserstein distance, focusing mainly on the Bures-Wasserstein class of Gaussian densities. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics aggregating optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity, provide quantitative retrieval guarantees under Wasserstein perturbations, and validate the model on synthetic and real-world distributional tasks. This work elevates associative memory from vectors to full distributions, bridging classical DAMs with modern generative modeling and enabling distributional storage and retrieval in memory-augmented learning.

[1169] F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning

Hangwei Zhang, Chun Kang, Yan Wang, Difan Zou

Main category: cs.LG

TL;DR: First systematic study of Parameter-efficient Fine-tuning (PEFT) for Large Operator Models in scientific ML, showing adapters outperform LoRA, with proposed F-Adapter achieving SOTA results on 3D Navier-Stokes benchmarks.

Details

Motivation: PEFT has proven effective in vision and language processing but remains unexplored in scientific machine learning for modeling complex physical systems.

Method: Systematic study of PEFT for pre-trained Large Operator Models, theoretical analysis of LoRA vs adapters, and introduction of Frequency-Adaptive Adapter (F-Adapter) that allocates capacity based on spectral complexity.

Result: F-Adapters establish state-of-the-art results on multiple challenging 3D Navier-Stokes benchmarks, markedly enhancing both generalization and spectral fidelity over LoRA and other PEFT techniques.

Conclusion: This work is the first to explore PEFT for scientific machine learning and establishes F-Adapter as an effective paradigm for this domain.

Abstract: Parameter-efficient fine-tuning (PEFT) of powerful pre-trained models for complex downstream tasks has proven effective in vision and language processing, yet this paradigm remains unexplored in scientific machine learning, where the objective is to model complex physical systems. We conduct the first systematic study of PEFT for pre-trained Large Operator Models (LOMs) obtained by scaling variants of Fourier Neural Operator. First, we observe that the widely used Low-Rank Adaptation (LoRA) yields markedly poorer performance on LOMs than Adapter tuning. Then, we further theoretically establish that stacked LoRA incurs a depth-amplified lower bound on approximation error within Fourier layers, whereas adapters retain universal approximation capacity and, by concentrating parameters on energy-dominant low-frequency modes, attain exponentially decaying error with bottleneck width in the Fourier domain. Motivated by the robust empirical gains of adapters and by our theoretical characterization of PDE solutions as spectrally sparse, we introduce Frequency-Adaptive Adapter (F-Adapter). F-Adapter allocates adapter capacity based on spectral complexity, assigning higher-dimension modules to low-frequency components and lower-dimension modules to high-frequency components. Our F-Adapters establish state-of-the-art (SOTA) results on multiple challenging 3D Navier-Stokes benchmarks, markedly enhancing both generalization and spectral fidelity over LoRA and other PEFT techniques commonly used in LLMs. To the best of our knowledge, this work is the first to explore PEFT for scientific machine-learning and establishes F-Adapter as an effective paradigm for this domain.

[1170] ZeroSiam: An Efficient Siamese for Test-Time Entropy Optimization without Collapse

Guohao Chen, Shuaicheng Niu, Deyu Chen, Jiahao Yang, Zitian Zhang, Mingkui Tan, Pengcheng Wu, Zhiqi Shen

Main category: cs.LG

TL;DR: ZeroSiam prevents collapse in test-time entropy minimization using an asymmetric Siamese architecture with divergence alignment, achieving stable performance across vision and language tasks.

Details

Motivation: Pure entropy minimization can lead to collapsed solutions like constant one-hot outputs that trivially minimize entropy without meaningful learning, favoring non-generalizable shortcuts.

Method: Asymmetric Siamese architecture with learnable predictor and stop-gradient operator before classifier to prevent collapse through asymmetric divergence alignment.

Result: ZeroSiam performs more stably than prior methods with negligible overhead, works on both vision adaptation and LLM reasoning tasks, and handles collapse-prone tiny models effectively.

Conclusion: ZeroSiam effectively prevents collapse in test-time entropy minimization while enhancing performance, demonstrating broad applicability across diverse models and challenging scenarios.

Abstract: Test-time entropy minimization helps adapt a model to novel environments and incentivize its reasoning capability, unleashing the model’s potential during inference by allowing it to evolve and improve in real-time using its own predictions, achieving promising performance. However, pure entropy minimization can favor non-generalizable shortcuts, such as inflating the logit norm and driving all predictions to a dominant class to reduce entropy, risking collapsed solutions (e.g., constant one-hot outputs) that trivially minimize the objective without meaningful learning. In this paper, we introduce ZeroSiam, an efficient asymmetric Siamese architecture tailored for test-time entropy minimization. ZeroSiam prevents collapse through asymmetric divergence alignment, which is efficiently achieved by a learnable predictor and a stop-gradient operator before the classifier. We provide empirical and theoretical evidence that ZeroSiam not only prevents collapse solutions, but also absorbs and regularizes biased learning signals, enhancing performance even when no collapse occurs. Despite its simplicity, extensive results show that ZeroSiam performs more stably over prior methods using negligible overhead, demonstrating efficacy on both vision adaptation and large language model reasoning tasks across challenging test scenarios and diverse models, including tiny models that are particularly collapse-prone.

[1171] CoSIFL: Collaborative Secure and Incentivized Federated Learning with Differential Privacy

Zhanhong Xie, Meifan Zhang, Lihua Yin

Main category: cs.LG

TL;DR: CoSIFL is a federated learning framework that combines proactive security alarming, local differential privacy, and Stackelberg-based incentives to defend against attacks while encouraging client participation.

Details

Motivation: Address challenges in federated learning including malicious clients, inference attacks, and difficulties in incentivizing participants to contribute high-quality data under privacy constraints.

Method: Integrates proactive alarming mechanism with robust aggregation for Byzantine and inference attacks, uses Tullock contest-inspired incentive module, and formulates server-client interaction as a two-stage Stackelberg game with equilibrium analysis.

Result: Experimental results show CoSIFL outperforms state-of-the-art solutions in improving model robustness and reducing total server costs on standard benchmarks.

Conclusion: The integrated design of CoSIFL effectively addresses security, privacy, and incentive challenges in federated learning through a game-theoretic approach that achieves unique equilibrium and improved system efficiency.

Abstract: Federated learning (FL) has emerged as a promising paradigm for collaborative model training while preserving data locality. However, it still faces challenges from malicious or compromised clients, as well as difficulties in incentivizing participants to contribute high-quality data under strict privacy requirements. Motivated by these considerations, we propose CoSIFL, a novel framework that integrates proactive alarming for robust security and local differential privacy (LDP) for inference attacks, together with a Stackelberg-based incentive scheme to encourage client participation and data sharing. Specifically, CoSIFL uses an active alarming mechanism and robust aggregation to defend against Byzantine and inference attacks, while a Tullock contest-inspired incentive module rewards honest clients for both data contributions and reliable alarm triggers. We formulate the interplay between the server and clients as a two-stage game: in the first stage, the server determines total rewards, selects participants, and fixes global iteration settings, whereas in the second stage, each client decides its mini-batch size, privacy noise scale, and alerting strategy. We prove that the server-client game admits a unique equilibrium, and analyze how clients’ multi-dimensional attributes - such as non-IID degrees and privacy budgets - jointly affect system efficiency. Experimental results on standard benchmarks demonstrate that CoSIFL outperforms state-of-the-art solutions in improving model robustness and reducing total server costs, highlighting the effectiveness of our integrated design.

[1172] Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

Main category: cs.LG

TL;DR: MR-GPTQ is a new quantization method that addresses limitations of MXFP4 and NVFP4 formats, achieving significant speedups (up to 3.6x layer-wise, 2.2x end-to-end) while maintaining accuracy through block-wise Hadamard transforms and format-specific optimizations.

Details

Motivation: Hardware-accelerated 4-bit floating-point formats (MXFP4, NVFP4) promise to revolutionize LLM inference but their practical benefits remain unproven, with state-of-the-art methods struggling due to format-specific limitations.

Method: Micro-Rotated-GPTQ (MR-GPTQ) - a variant of GPTQ that uses block-wise Hadamard transforms and format-specific optimizations, supported by high-performance GPU kernels with rotation fusion into weights and fast online activation computation.

Result: MR-GPTQ achieves speedups vs. FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and 6x layer-wise and 4x end-to-end on RTX5090, while matching or outperforming state-of-the-art accuracy and significantly boosting MXFP4 performance.

Conclusion: While FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock new accuracy-performance trade-offs for 4-bit floating-point formats.

Abstract: The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4’s small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4’s power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4’s unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.

[1173] Towards Monotonic Improvement in In-Context Reinforcement Learning

Wenhao Zhang, Shao Zhang, Xihuai Wang, Yang Li, Ying Wen

Main category: cs.LG

TL;DR: CV-ICRL addresses Contextual Ambiguity in ICRL by introducing Context Value to ensure monotonic performance improvement during testing, unlike previous methods that fail to maintain training-time improvement patterns.

Details

Motivation: Current ICRL methods trained on monotonic policy improvement data fail to show continued improvement during testing due to Contextual Ambiguity, where the model's stochastic actions create misleading interaction histories.

Method: Proposed Context Value Informed ICRL (CV-ICRL) that uses Context Value as an explicit signal representing ideal performance achievable given current context, with Context Value being non-decreasing as context expands. Also developed two methods for estimating Context Value at training and testing time.

Result: Experiments on Dark Room and Minigrid testbeds show CV-ICRL effectively mitigates performance degradation and improves overall ICRL abilities across various tasks and environments.

Conclusion: CV-ICRL successfully resolves Contextual Ambiguity by incorporating Context Value, enabling monotonic performance improvement in ICRL and tightening the performance gap bound relative to ideal policies.

Abstract: In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm for developing agents that can rapidly adapt to new tasks by leveraging past experiences as context, without updating their parameters. Recent approaches train large sequence models on monotonic policy improvement data from online RL, aiming to a continue improved testing time performance. However, our experimental analysis reveals a critical flaw: these models cannot show a continue improvement like the training data during testing time. Theoretically, we identify this phenomenon as Contextual Ambiguity, where the model’s own stochastic actions can generate an interaction history that misleadingly resembles that of a sub-optimal policy from the training data, initiating a vicious cycle of poor action selection. To resolve the Contextual Ambiguity, we introduce Context Value into training phase and propose Context Value Informed ICRL (CV-ICRL). CV-ICRL use Context Value as an explicit signal representing the ideal performance theoretically achievable by a policy given the current context. As the context expands, Context Value could include more task-relevant information, and therefore the ideal performance should be non-decreasing. We prove that the Context Value tightens the lower bound on the performance gap relative to an ideal, monotonically improving policy. We fruther propose two methods for estimating Context Value at both training and testing time. Experiments conducted on the Dark Room and Minigrid testbeds demonstrate that CV-ICRL effectively mitigates performance degradation and improves overall ICRL abilities across various tasks and environments. The source code and data of this paper are available at https://github.com/Bluixe/towards_monotonic_improvement .

[1174] One-Shot Multi-Label Causal Discovery in High-Dimensional Event Sequences

Hugo Math, Robin Schön, Rainer Lienhart

Main category: cs.LG

TL;DR: OSCAR is a one-shot causal autoregressive method that efficiently infers per-sequence Markov Boundaries using pretrained Transformers, enabling scalable causal discovery on sparse event sequences without costly global conditional independence testing.

Details

Motivation: Current causal discovery methods fail to scale to event sequences with thousands of sparse event types in domains like healthcare, cybersecurity, and vehicle diagnostics, where understanding causality is critical.

Method: OSCAR uses two pretrained Transformers as density estimators to infer per-sequence Markov Boundaries, enabling efficient parallel causal discovery without requiring costly global conditional independence testing.

Result: On a real-world automotive dataset with 29,100 events and 474 labels, OSCAR recovered interpretable causal structures in minutes, while classical methods failed to scale.

Conclusion: OSCAR enables practical scientific diagnostics at production scale by providing efficient causal discovery for sparse event sequences where traditional methods are computationally infeasible.

Abstract: Understanding causality in event sequences with thousands of sparse event types is critical in domains such as healthcare, cybersecurity, or vehicle diagnostics, yet current methods fail to scale. We present OSCAR, a one-shot causal autoregressive method that infers per-sequence Markov Boundaries using two pretrained Transformers as density estimators. This enables efficient, parallel causal discovery without costly global CI testing. On a real-world automotive dataset with 29,100 events and 474 labels, OSCAR recovers interpretable causal structures in minutes, while classical methods fail to scale, enabling practical scientific diagnostics at production scale.

[1175] WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning

Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen

Main category: cs.LG

TL;DR: WirelessMathLM demonstrates that compact models (0.5B-7B) can match or exceed larger models in wireless mathematics through domain-specific reinforcement learning with verifiable rewards, achieving near-GPT-4o performance with 100x fewer parameters.

Details

Motivation: Large language models struggle with specialized technical mathematics in wireless communications, where precise manipulation of information-theoretic bounds and optimization constraints is required.

Method: Use domain-specific reinforcement learning with verifiable correctness rewards (Group Relative Policy Optimization) on WirelessMathBench-XL benchmark (4,027 problems from 970 papers), training directly from base checkpoints without supervised warm-start.

Result: 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using 100x fewer parameters than DeepSeek-R1 (671B, 57.4%). GRPO training nearly doubles performance across all model scales with positive transfer to general mathematics benchmarks (+8.4 points average).

Conclusion: Compact models can excel in specialized technical domains through verifiable-reward reinforcement learning, achieving competitive performance while being much more parameter-efficient, with benefits transferring to general mathematical reasoning.

Abstract: Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property–verifiable correctness–that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks–our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.

[1176] SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts

Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, Jinsong Su

Main category: cs.LG

TL;DR: SPEC-RL is a novel framework that accelerates RLVR training by reusing overlapping trajectory segments from consecutive epochs through speculative decoding, reducing rollout time by 2-3x without quality loss.

Details

Motivation: Current RLVR training is bottlenecked by computationally expensive rollout stages, and existing acceleration methods have limitations like diminishing returns, bias introduction, or overlooking redundancy across iterations.

Method: Integrates speculative decoding with RL rollout process by reusing prior trajectory segments as speculative prefixes and extending them via a draft-and-verify mechanism to avoid redundant generation while ensuring policy consistency.

Result: Reduces rollout time by 2-3x across diverse math reasoning and generalization benchmarks (GSM8K, MATH-500, OlympiadBench, MMLU-STEM) without compromising policy quality.

Conclusion: SPEC-RL offers a general and practical path to scale RLVR for large reasoning models, seamlessly integrating with mainstream algorithms like PPO, GRPO, and DAPO as a purely rollout-stage enhancement.

Abstract: Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods-such as parallelization, objective- and data-driven modifications, and replay buffers-either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including GSM8K, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL

[1177] More Data or Better Algorithms: Latent Diffusion Augmentation for Deep Imbalanced Regression

Shayan Alahyari

Main category: cs.LG

TL;DR: LatentDiff is a novel framework using conditional diffusion models with priority-based generation to synthesize high-quality latent features for deep imbalanced regression, addressing the gap in data-level solutions for high-dimensional imbalanced data.

Details

Motivation: Deep imbalanced regression lacks dedicated data-level solutions for high-dimensional data, as existing approaches mainly focus on algorithmic modifications rather than addressing data imbalance directly in the latent space.

Method: Uses conditional diffusion models with priority-based generation to synthesize high-quality features in the latent representation space, making it computationally efficient and applicable across diverse data modalities.

Result: Experiments on three deep imbalanced regression benchmarks show substantial improvements in minority regions while maintaining overall accuracy.

Conclusion: LatentDiff effectively addresses deep imbalanced regression by providing a data-level solution that works with high-dimensional data across multiple modalities, filling an important gap in the field.

Abstract: In many real-world regression tasks, the data distribution is heavily skewed, and models learn predominantly from abundant majority samples while failing to predict minority labels accurately. While imbalanced classification has been extensively studied, imbalanced regression remains relatively unexplored. Deep imbalanced regression (DIR) represents cases where the input data are high-dimensional and unstructured. Although several data-level approaches for tabular imbalanced regression exist, deep imbalanced regression currently lacks dedicated data-level solutions suitable for high-dimensional data and relies primarily on algorithmic modifications. To fill this gap, we propose LatentDiff, a novel framework that uses conditional diffusion models with priority-based generation to synthesize high-quality features in the latent representation space. LatentDiff is computationally efficient and applicable across diverse data modalities, including images, text, and other high-dimensional inputs. Experiments on three DIR benchmarks demonstrate substantial improvements in minority regions while maintaining overall accuracy.

[1178] Adaptive Token-Weighted Differential Privacy for LLMs: Not All Tokens Require Equal Protection

Manjiang Yu, Priyanka Singh, Xue Li, Yang Cao

Main category: cs.LG

TL;DR: ATDP is a novel differential privacy method that focuses noise on sensitive tokens, reducing training time by 90% while maintaining privacy protection and model performance.

Details

Motivation: Current DP-SGD methods inject uniform noise across all gradients, which significantly increases training time and reduces model accuracy. There's a need for more efficient privacy protection that specifically targets sensitive information.

Method: ATDP adaptively assigns different gradient weights to sensitive and non-sensitive tokens, using larger noise scale early in training to disrupt memorization. It requires only a few additional epochs of lightweight post-processing after standard fine-tuning.

Result: ATDP achieves comparable canary protection to state-of-the-art DP-SGD methods while reducing computational overhead by approximately 90%. It maintains comparable or superior privacy protection with minimal accuracy degradation.

Conclusion: ATDP provides an efficient privacy-enhancing solution that can be seamlessly integrated into existing pipelines, offering targeted protection for sensitive information without significantly impacting model capabilities.

Abstract: Large language models (LLMs) frequently memorize sensitive or personal information, raising significant privacy concerns. Existing variants of differential privacy stochastic gradient descent (DPSGD) inject uniform noise into every gradient step, significantly extending training time and reducing model accuracy. We propose that concentrating noise primarily on gradients associated with sensitive tokens can substantially decrease DP training time, strengthen the protection of sensitive information, and simultaneously preserve the model’s performance on non-sensitive data. We operationalize this insight through Adaptive Token-Weighted Differential Privacy (ATDP), a modification of vanilla DP-SGD that adaptively assigns different gradient weights to sensitive and non-sensitive tokens. By employing a larger noise scale at the early stage of training, ATDP rapidly disrupts memorization of sensitive content. As a result, ATDP only requires a few additional epochs of lightweight post-processing following standard fine-tuning, injecting targeted noise primarily on parameters corresponding to sensitive tokens, thus minimally affecting the model’s general capabilities. ATDP can be seamlessly integrated into any existing DP-based fine-tuning pipeline or directly applied to non-private models as a fast privacy-enhancing measure. Additionally, combined with an initial redacted fine-tuning phase, ATDP forms a streamlined DP pipeline that achieves comparable canary protection to state-of-the-art DP-SGD methods, significantly reduces the computational overhead of DP fine-tuning, shortening training time by approximately 90 percent, while achieving comparable or superior privacy protection and minimal accuracy degradation.

[1179] Deep Learning for Subspace Regression

Vladimir Fanaskov, Vladislav Trifonov, Alexander Rudikov, Ekaterina Muravleva, Ivan Oseledets

Main category: cs.LG

TL;DR: Proposes using neural networks for subspace regression in parametric reduced order modeling, introducing redundancy by predicting larger-than-needed subspaces to improve accuracy and simplify learning.

Details

Motivation: Classical interpolation strategies become infeasible for high-dimensional parameter spaces in reduced order modeling, necessitating a more robust approach to approximate subspaces for unknown parameters.

Method: Relax interpolation to regression, use neural networks to approximate high-dimensional target functions, and introduce redundancy by predicting larger subspaces than required.

Result: Theoretical analysis shows reduced complexity and smoother mappings, with empirical results demonstrating significant accuracy improvements when predicting larger subspaces.

Conclusion: Subspace regression with neural networks and redundancy is effective for various tasks including parametric eigenproblems, PDE solutions, and optimal control.

Abstract: It is often possible to perform reduced order modelling by specifying linear subspace which accurately captures the dynamics of the system. This approach becomes especially appealing when linear subspace explicitly depends on parameters of the problem. A practical way to apply such a scheme is to compute subspaces for a selected set of parameters in the computationally demanding offline stage and in the online stage approximate subspace for unknown parameters by interpolation. For realistic problems the space of parameters is high dimensional, which renders classical interpolation strategies infeasible or unreliable. We propose to relax the interpolation problem to regression, introduce several loss functions suitable for subspace data, and use a neural network as an approximation to high-dimensional target function. To further simplify a learning problem we introduce redundancy: in place of predicting subspace of a given dimension we predict larger subspace. We show theoretically that this strategy decreases the complexity of the mapping for elliptic eigenproblems with constant coefficients and makes the mapping smoother for general smooth function on the Grassmann manifold. Empirical results also show that accuracy significantly improves when larger-than-needed subspaces are predicted. With the set of numerical illustrations we demonstrate that subspace regression can be useful for a range of tasks including parametric eigenproblems, deflation techniques, relaxation methods, optimal control and solution of parametric partial differential equations.

[1180] NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning

Raviteja Anantha, Soheil Hor, Teodor Nicola Antoniu, Layne C. Price

Main category: cs.LG

TL;DR: NanoFlux is an adversarial framework that generates targeted training data for LLM reasoning improvement, where datasets with <200 examples outperform conventional fine-tuning methods across multiple domains with significant computational efficiency gains.

Details

Motivation: To improve LLM reasoning capabilities through targeted data generation rather than large-scale conventional fine-tuning, addressing the need for more efficient and effective training approaches.

Method: Uses a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, to synthesize multi-step questions with explanatory annotations targeting specific reasoning capabilities. Includes embedding-based novelty filtering and automated training data generation.

Result: Fine-tuning a 4B-parameter model on NanoFlux-generated data achieved: +5.9% on GSMHard (mathematical reasoning), +3.6% on GenomeBench (scientific reasoning), and +16.6% on MultiMedQA (medical reasoning), while reducing computational requirements by 3-14x compared to full-benchmark fine-tuning.

Conclusion: Future model improvements may lie in intelligent synthesis of small, precisely targeted training datasets rather than large-scale data collection, with NanoFlux demonstrating that adversarial data generation can significantly enhance reasoning capabilities efficiently.

Abstract: We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets containing fewer than 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations that target specific reasoning capabilities. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by 3-14x. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets.

[1181] ABConformer: Physics-inspired Sliding Attention for Antibody-Antigen Interface Prediction

Zhang-Yu You, Jiahao Ma, Hongzong Li, Ye-Fan Hu, Jian-Dong Huang

Main category: cs.LG

TL;DR: ABCONFORMER is a sequence-based model for predicting antibody-antigen interfaces using Conformer backbone and sliding attention mechanism, achieving state-of-the-art performance.

Details

Motivation: Accurate prediction of antibody-antigen interfaces is critical for vaccine design, immunodiagnostics, and therapeutic antibody development, but reliable predictions from sequences alone remains challenging.

Method: Uses Conformer backbone to capture local and global sequence features, and introduces physics-inspired sliding attention for residue-level contact recovery without 3D structural data.

Result: Achieves state-of-the-art performance on SARS-CoV-2 Ab-Ag dataset, surpasses sequence-based methods for antibody-agnostic epitope prediction, and sliding attention significantly enhances epitope prediction precision.

Conclusion: ABCONFORMER enables accurate paratope and epitope prediction from sequences alone, with sliding attention being a key improvement over conventional cross-attention for epitope prediction.

Abstract: Accurate prediction of antibody-antigen (Ab-Ag) interfaces is critical for vaccine design, immunodiagnostics, and therapeutic antibody development. However, achieving reliable predictions from sequences alone remains a challenge. In this paper, we present ABCONFORMER, a model based on the Conformer backbone that captures both local and global features of a biosequence. To accurately capture Ab-Ag interactions, we introduced the physics-inspired sliding attention, enabling residue-level contact recovery without relying on three-dimensional structural data. ABConformer can accurately predict paratopes and epitopes given the antibody and antigen sequence, and predict pan-epitopes on the antigen without antibody information. In comparison experiments, ABCONFORMER achieves state-of-the-art performance on a recent SARS-CoV-2 Ab-Ag dataset, and surpasses widely used sequence-based methods for antibody-agnostic epitope prediction. Ablation studies further quantify the contribution of each component, demonstrating that, compared to conventional cross-attention, sliding attention significantly enhances the precision of epitope prediction. To facilitate reproducibility, we will release the code under an open-source license upon acceptance.

[1182] CREPE: Controlling Diffusion with Replica Exchange

Jiajun He, Paul Jeha, Peter Potaptchik, Leo Zhang, José Miguel Hernández-Lobato, Yuanqi Du, Saifuddin Syed, Francisco Vargas

Main category: cs.LG

TL;DR: CREPE is a replica exchange-based method for inference-time control of diffusion models that offers sequential particle generation, high sample diversity, and online refinement capabilities.

Details

Motivation: To provide a flexible alternative to existing inference-time control methods like heuristic guidance and SMC, addressing limitations in sample diversity and enabling online refinement.

Method: Uses replica exchange algorithm adapted from sampling problems to control diffusion models at inference time without retraining.

Result: Demonstrates versatility across temperature annealing, reward-tilting, model composition, and classifier-free guidance debiasing with competitive performance compared to SMC methods.

Conclusion: CREPE offers an effective and flexible approach for inference-time control of diffusion models with advantages over SMC in sequential generation, diversity maintenance, and online refinement capabilities.

Abstract: Inference-time control of diffusion models aims to steer model outputs to satisfy new constraints without retraining. Previous approaches have mostly relied on heuristic guidance or have been coupled with Sequential Monte Carlo (SMC) for bias correction. In this paper, we propose a flexible alternative based on replica exchange, an algorithm designed initially for sampling problems. We refer to this method as the CREPE (Controlling with REPlica Exchange). Unlike SMC, CREPE: (1) generates particles sequentially, (2) maintains high diversity in the generated samples after a burn-in period, and (3) enables online refinement or early termination. We demonstrate its versatility across various tasks, including temperature annealing, reward-tilting, model composition and classifier-free guidance debiasing, with competitive performance compared to prior SMC methods.

[1183] Transfer Learning and Machine Learning for Training Five Year Survival Prognostic Models in Early Breast Cancer

Lisa Pilgram, Kai Yang, Ana-Alicia Beltran-Bless, Gregory R. Pond, Lisa Vandermeer, John Hilton, Marie-France Savard, Andréanne Leblanc, Lois Sheperd, Bingshu E. Chen, John M. S. Bartlett, Karen J. Taylor, Jane Bayani, Sarah L. Barker, Melanie Spears, Cornelis J. H. van der Velde, Elma Meershoek-Klein Kranenbarg, Luc Dirix, Elizabeth Mallon, Annette Hasenburg, Christos Markopoulos, Lamin Juwara, Fida K. Dankar, Mark Clemons, Khaled El Emam

Main category: cs.LG

TL;DR: Machine learning approaches including transfer learning, de-novo Random Survival Forests, and ensemble integration improve breast cancer survival prognostication compared to traditional PREDICT v3 tool, especially when dealing with missing data or dataset shifts.

Details

Motivation: To improve breast cancer survival prognostication using machine learning methods that are more accessible and cost-effective than genomic tools, and to address limitations of existing prognostic tools like PREDICT v3 which often has invalid predictions due to missing information.

Method: Used MA.27 trial data for training with external validation on TEAM trial and SEER cohort. Applied three approaches: transfer learning by fine-tuning PREDICT v3, de-novo ML with Random Survival Forests and Extreme Gradient Boosting, and ensemble integration through weighted sum of model predictions.

Result: Transfer learning, de-novo RSF, and ensemble integration improved calibration over PREDICT v3 (ICI reduced from 0.042 to ≤0.007) while maintaining comparable discrimination (AUC increased from 0.738 to 0.744-0.799). ML models could predict survival regardless of missing information, unlike PREDICT v3 which had 23.8-25.8% invalid predictions. External validation in SEER confirmed benefits.

Conclusion: Transfer learning, de-novo RSF, and ensemble integration can improve breast cancer prognostication when PREDICT v3 information is lacking or dataset shifts occur, with key features being patient age, nodal status, pathological grading and tumor size.

Abstract: Prognostic information is essential for decision-making in breast cancer management. Recently trials have predominantly focused on genomic prognostication tools, even though clinicopathological prognostication is less costly and more widely accessible. Machine learning (ML), transfer learning and ensemble integration offer opportunities to build robust prognostication frameworks. We evaluate this potential to improve survival prognostication in breast cancer by comparing de-novo ML, transfer learning from a pre-trained prognostic tool and ensemble integration. Data from the MA.27 trial was used for model training, with external validation on the TEAM trial and a SEER cohort. Transfer learning was applied by fine-tuning the pre-trained prognostic tool PREDICT v3, de-novo ML included Random Survival Forests and Extreme Gradient Boosting, and ensemble integration was realized through a weighted sum of model predictions. Transfer learning, de-novo RSF, and ensemble integration improved calibration in MA.27 over the pre-trained model (ICI reduced from 0.042 in PREDICT v3 to <=0.007) while discrimination remained comparable (AUC increased from 0.738 in PREDICT v3 to 0.744-0.799). Invalid PREDICT v3 predictions were observed in 23.8-25.8% of MA.27 individuals due to missing information. In contrast, ML models and ensemble integration could predict survival regardless of missing information. Across all models, patient age, nodal status, pathological grading and tumor size had the highest SHAP values, indicating their importance for survival prognostication. External validation in SEER, but not in TEAM, confirmed the benefits of transfer learning, RSF and ensemble integration. This study demonstrates that transfer learning, de-novo RSF, and ensemble integration can improve prognostication in situations where relevant information for PREDICT v3 is lacking or where a dataset shift is likely.

[1184] Continuous-Time Reinforcement Learning for Asset-Liability Management

Yilie Huang

Main category: cs.LG

TL;DR: Proposes continuous-time RL with linear-quadratic formulation for ALM, using model-free policy gradient algorithm with adaptive exploration, outperforming traditional and RL methods.

Details

Motivation: To develop a more effective Asset-Liability Management (ALM) approach that dynamically synchronizes assets and liabilities using reinforcement learning.

Method: Model-free policy gradient-based soft actor-critic algorithm with adaptive exploration for actor and scheduled exploration for critic, using continuous-time RL with linear-quadratic formulation.

Result: Achieves higher average rewards than all alternative strategies across 200 randomized market scenarios, with rapid initial gains and sustained superior performance.

Conclusion: The method outperforms traditional and RL approaches by directly learning optimal ALM strategy without learning the environment, using adaptive exploration rather than complex neural networks.

Abstract: This paper proposes a novel approach for Asset-Liability Management (ALM) by employing continuous-time Reinforcement Learning (RL) with a linear-quadratic (LQ) formulation that incorporates both interim and terminal objectives. We develop a model-free, policy gradient-based soft actor-critic algorithm tailored to ALM for dynamically synchronizing assets and liabilities. To ensure an effective balance between exploration and exploitation with minimal tuning, we introduce adaptive exploration for the actor and scheduled exploration for the critic. Our empirical study evaluates this approach against two enhanced traditional financial strategies, a model-based continuous-time RL method, and three state-of-the-art RL algorithms. Evaluated across 200 randomized market scenarios, our method achieves higher average rewards than all alternative strategies, with rapid initial gains and sustained superior performance. The outperformance stems not from complex neural networks or improved parameter estimation, but from directly learning the optimal ALM strategy without learning the environment.

[1185] A Neural ODE Approach to Aircraft Flight Dynamics Modelling

Gabriel Jarry, Ramon Dalmau, Xavier Olive, Philippe Very

Main category: cs.LG

TL;DR: NODE-FDM is a Neural ODE-based Flight Dynamics Model that combines analytical kinematics with data-driven components, achieving more accurate trajectory prediction than BADA-based methods, especially during descent phases.

Details

Motivation: Accurate aircraft trajectory prediction is critical for air traffic management, airline operations, and environmental assessment, requiring improved modeling approaches.

Method: Uses Neural Ordinary Differential Equations trained on Quick Access Recorder data, combining analytical kinematic relations with data-driven components.

Result: Achieves more accurate reproduction of recorded trajectories than BADA-based models, with marked improvements in altitude, speed, and mass dynamics, particularly in descent phase.

Conclusion: Demonstrates potential of physics-informed neural ODEs as high-fidelity, data-driven approach to aircraft performance modeling, with future work planned for lateral dynamics modeling.

Abstract: Accurate aircraft trajectory prediction is critical for air traffic management, airline operations, and environmental assessment. This paper introduces NODE-FDM, a Neural Ordinary Differential Equations-based Flight Dynamics Model trained on Quick Access Recorder (QAR) data. By combining analytical kinematic relations with data-driven components, NODE-FDM achieves a more accurate reproduction of recorded trajectories than state-of-the-art models such as a BADA-based trajectory generation methodology (BADA4 performance model combined with trajectory control routines), particularly in the descent phase of the flight. The analysis demonstrates marked improvements across altitude, speed, and mass dynamics. Despite current limitations, including limited physical constraints and the limited availability of QAR data, the results demonstrate the potential of physics-informed neural ordinary differential equations as a high-fidelity, data-driven approach to aircraft performance modelling. Future work will extend the framework to incorporate a full modelling of the lateral dynamics of the aircraft.

[1186] ASTGI: Adaptive Spatio-Temporal Graph Interactions for Irregular Multivariate Time Series Forecasting

Xvyuan Liu, Xiangfei Qiu, Hanyin Cheng, Xingjian Wu, Chenjuan Guo, Bin Yang, Jilin Hu

Main category: cs.LG

TL;DR: ASTGI framework addresses irregular multivariate time series forecasting by using spatio-temporal point representation, adaptive graph construction, and dynamic propagation to handle asynchronous sampling and complex dependencies.

Details

Motivation: Irregular multivariate time series with asynchronous sampling and irregular intervals pose challenges for accurate representation and capturing complex dynamic dependencies in domains like healthcare and finance.

Method: ASTGI framework with four modules: Spatio-Temporal Point Representation, Neighborhood-Adaptive Graph Construction, Spatio-Temporal Dynamic Propagation, and Query Point-based Prediction using adaptive causal graphs and nearest neighbor search.

Result: Extensive experiments on multiple benchmark datasets show ASTGI outperforms various state-of-the-art methods.

Conclusion: The proposed ASTGI framework effectively addresses the core challenges of irregular multivariate time series forecasting through adaptive spatio-temporal graph interactions.

Abstract: Irregular multivariate time series (IMTS) are prevalent in critical domains like healthcare and finance, where accurate forecasting is vital for proactive decision-making. However, the asynchronous sampling and irregular intervals inherent to IMTS pose two core challenges for existing methods: (1) how to accurately represent the raw information of irregular time series without introducing data distortion, and (2) how to effectively capture the complex dynamic dependencies between observation points. To address these challenges, we propose the Adaptive Spatio-Temporal Graph Interaction (ASTGI) framework. Specifically, the framework first employs a Spatio-Temporal Point Representation module to encode each discrete observation as a point within a learnable spatio-temporal embedding space. Second, a Neighborhood-Adaptive Graph Construction module adaptively builds a causal graph for each point in the embedding space via nearest neighbor search. Subsequently, a Spatio-Temporal Dynamic Propagation module iteratively updates information on these adaptive causal graphs by generating messages and computing interaction weights based on the relative spatio-temporal positions between points. Finally, a Query Point-based Prediction module generates the final forecast by aggregating neighborhood information for a new query point and performing regression. Extensive experiments on multiple benchmark datasets demonstrate that ASTGI outperforms various state-of-the-art methods.

[1187] Two-Scale Latent Dynamics for Recurrent-Depth Transformers

Francesco Pappone, Donato Crisostomi, Emanuele Rodolà

Main category: cs.LG

TL;DR: Recurrent-depth transformers use iterative latent computations before token emission, exhibiting two-scale dynamics: small refinements within blocks and larger drift across blocks. The paper proposes an early-exit mechanism based on second-order step-size differences that outperforms existing methods.

Details

Motivation: To understand the geometric properties of recurrent-depth transformers' iterative computations and develop more efficient early-exit strategies based on observed dynamics.

Method: Analyzed the geometry of iterates in recurrent-depth transformers, identified two-scale operational patterns, and proposed an early-exit mechanism using second-order difference in step-size.

Result: Found that loop steps become smaller and more orthogonal across checkpoints, indicating better local modeling. The proposed second-order exit strategy outperforms KL-divergence and first-order methods in performance, stability, and efficiency.

Conclusion: The two-scale dynamics in recurrent-depth transformers enable effective early-exit strategies, with second-order step-size differences providing superior performance over existing approaches.

Abstract: Recurrent-depth transformers scale test-time compute by iterating latent computations before emitting tokens. We study the geometry of these iterates and argue for a simple, \emph{two-scale} operational picture: (i) within a looped block, updates act as \emph{small-scale refinements}; (ii) across consecutive blocks, states undergo a \emph{larger-scale drift}. Across checkpoints, our measurements show that loop steps become \emph{smaller} and increasingly \emph{orthogonal} to one another, indicating better local modeling of fine structure rather than merely pushing in a single direction. These dynamics motivate an early-exit mechanism based on the model’s second-order difference in step-size, which we show is superior in terms of performance, stability and time-efficiency, when compared to the KL-divergence exit strategy of Geiping et al. and its naive first-order counterpart.

[1188] MELCOT: A Hybrid Learning Architecture with Marginal Preservation for Matrix-Valued Regression

Khang Tran, Hieu Cao, Thinh Pham, Nghiem Diep, Tri Cao, Binh Nguyen

Main category: cs.LG

TL;DR: MELCOT is a hybrid matrix-valued regression model combining classical marginal estimation with deep learning-based optimal transport to handle high-dimensional data while preserving spatial structure efficiently.

Details

Motivation: Existing regression methods struggle with high-dimensional matrix-valued data, often losing spatial structure or requiring excessive storage, necessitating a more efficient approach.

Method: Propose MELCOT with two blocks: ME block for marginal estimation to preserve spatial information, and LCOT block using learnable-cost optimal transport for complex global feature learning.

Result: Extensive experiments show MELCOT consistently outperforms all baseline methods across diverse datasets and domains while maintaining high efficiency.

Conclusion: MELCOT successfully integrates classical and deep learning approaches to achieve superior performance in matrix-valued regression with preserved spatial structure and computational efficiency.

Abstract: Regression is essential across many domains but remains challenging in high-dimensional settings, where existing methods often lose spatial structure or demand heavy storage. In this work, we address the problem of matrix-valued regression, where each sample is naturally represented as a matrix. We propose MELCOT, a hybrid model that integrates a classical machine learning-based Marginal Estimation (ME) block with a deep learning-based Learnable-Cost Optimal Transport (LCOT) block. The ME block estimates data marginals to preserve spatial information, while the LCOT block learns complex global features. This design enables MELCOT to inherit the strengths of both classical and deep learning methods. Extensive experiments across diverse datasets and domains demonstrate that MELCOT consistently outperforms all baselines while remaining highly efficient.

[1189] LLM Interpretability with Identifiable Temporal-Instantaneous Representation

Xiangchen Song, Jiaqi Sun, Zijian Li, Yujia Zheng, Kun Zhang

Main category: cs.LG

TL;DR: The paper introduces a temporal causal representation learning framework for LLMs that captures both time-delayed and instantaneous causal relations, providing theoretical guarantees and improving interpretability.

Details

Motivation: Current mechanistic interpretability tools like sparse autoencoders lack temporal dependency modeling, instantaneous relation representation, and theoretical guarantees, undermining confidence in LLM analysis. Causal representation learning offers theoretical grounding but can't scale to LLMs' rich conceptual space.

Method: An identifiable temporal causal representation learning framework designed for LLMs’ high-dimensional concept space, extending sparse autoencoder techniques with temporal causal modeling to capture both time-delayed and instantaneous causal relations.

Result: The approach demonstrates efficacy on synthetic datasets scaled to real-world complexity and successfully discovers meaningful concept relationships in LLM activations. It provides theoretical guarantees for the learned representations.

Conclusion: Modeling both temporal and instantaneous conceptual relationships advances the interpretability of LLMs, bridging the gap between theoretical foundations and practical analysis needs.

Abstract: Despite Large Language Models’ remarkable capabilities, understanding their internal representations remains challenging. Mechanistic interpretability tools such as sparse autoencoders (SAEs) were developed to extract interpretable features from LLMs but lack temporal dependency modeling, instantaneous relation representation, and more importantly theoretical guarantees, undermining both the theoretical foundations and the practical confidence necessary for subsequent analyses. While causal representation learning (CRL) offers theoretically grounded approaches for uncovering latent concepts, existing methods cannot scale to LLMs’ rich conceptual space due to inefficient computation. To bridge the gap, we introduce an identifiable temporal causal representation learning framework specifically designed for LLMs’ high-dimensional concept space, capturing both time-delayed and instantaneous causal relations. Our approach provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity. By extending SAE techniques with our temporal causal framework, we successfully discover meaningful concept relationships in LLM activations. Our findings show that modeling both temporal and instantaneous conceptual relationships advances the interpretability of LLMs.

[1190] Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Adversarial Scheduling

Jonas Ngnawé, Maxime Heuillet, Sabyasachi Sahoo, Yann Pequignot, Ola Ahmad, Audrey Durand, Frédéric Precioso, Christian Gagné

Main category: cs.LG

TL;DR: Fine-tuning non-robust pretrained models with robust objectives often leads to suboptimal transfer and poor performance. The paper proposes Epsilon-Scheduling to prevent this issue and introduces expected robustness as a comprehensive evaluation metric.

Details

Motivation: To address the challenge of robust fine-tuning (RFT) that simultaneously achieves task adaptation and adversarial robustness, especially when using abundant non-robust pretrained models from open-source repositories.

Method: Systematically examine RFT from non-robust models, identify suboptimal transfer phenomenon, and propose Epsilon-Scheduling - a schedule over perturbation strength during training to promote optimal transfer.

Result: Fine-tuning with robust objectives impedes task adaptation and prevents optimal transfer, but Epsilon-Scheduling successfully prevents suboptimal transfer and consistently improves expected robustness across various configurations.

Conclusion: Epsilon-Scheduling effectively addresses the suboptimal transfer problem in robust fine-tuning and provides a practical solution for achieving better accuracy-robustness trade-offs when using non-robust pretrained models.

Abstract: Fine-tuning pretrained models is a standard and effective workflow in modern machine learning. However, robust fine-tuning (RFT), which aims to simultaneously achieve adaptation to a downstream task and robustness to adversarial examples, remains challenging. Despite the abundance of non-robust pretrained models in open-source repositories, their potential for RFT is less understood. We address this knowledge gap by systematically examining RFT from such non-robust models. Our experiments reveal that fine-tuning non-robust models with a robust objective, even under small perturbations, can lead to poor performance, a phenomenon that we dub \emph{suboptimal transfer}. In challenging scenarios (eg, difficult tasks, high perturbation), the resulting performance can be so low that it may be considered a transfer failure. We find that fine-tuning using a robust objective impedes task adaptation at the beginning of training and eventually prevents optimal transfer. However, we propose a novel heuristic, \emph{Epsilon-Scheduling}, a schedule over perturbation strength used during training that promotes optimal transfer. Additionally, we introduce \emph{expected robustness}, a metric that captures performance across a range of perturbations, providing a more comprehensive evaluation of the accuracy-robustness trade-off for diverse models at test time. Extensive experiments on a wide range of configurations (six pretrained models and five datasets) show that \emph{Epsilon-Scheduling} successfully prevents \emph{suboptimal transfer} and consistently improves expected robustness.

[1191] Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport

Xavier Aramayo Carrasco, Grigoriy Ksenofontov, Aleksei Leonov, Iaroslav Sergeevich Koshelev, Alexander Korotin

Main category: cs.LG

TL;DR: The paper introduces the first benchmark for evaluating Schrödinger bridge (SB) methods on discrete spaces, providing analytically known SB solutions for rigorous assessment of existing and new algorithms.

Details

Motivation: There is growing interest in applying SB methods to discrete domains for generative modeling, but no reliable way exists to evaluate how well these methods actually solve the underlying SB problem.

Method: The authors construct a benchmark that yields pairs of probability distributions with analytically known SB solutions. As a byproduct, they develop two new SB algorithms (DLightSB and DLightSB-M) and extend prior work to create the α-CSBM algorithm.

Result: The benchmark enables rigorous evaluation of both existing and new SB solvers in high-dimensional discrete settings, demonstrating its utility for assessing algorithm performance.

Conclusion: This work provides the first step toward proper evaluation of SB methods on discrete spaces, paving the way for more reproducible future studies in this important area connecting generative modeling with optimal transport theory.

Abstract: The Entropic Optimal Transport (EOT) problem and its dynamic counterpart, the Schr"odinger bridge (SB) problem, play an important role in modern machine learning, linking generative modeling with optimal transport theory. While recent advances in discrete diffusion and flow models have sparked growing interest in applying SB methods to discrete domains, there is still no reliable way to evaluate how well these methods actually solve the underlying problem. We address this challenge by introducing a benchmark for SB on discrete spaces. Our construction yields pairs of probability distributions with analytically known SB solutions, enabling rigorous evaluation. As a byproduct of building this benchmark, we obtain two new SB algorithms, DLightSB and DLightSB-M, and additionally extend prior related work to construct the $\alpha$-CSBM algorithm. We demonstrate the utility of our benchmark by evaluating both existing and new solvers in high-dimensional discrete settings. This work provides the first step toward proper evaluation of SB methods on discrete spaces, paving the way for more reproducible future studies.

[1192] Landing with the Score: Riemannian Optimization through Denoising

Andrey Kharitenko, Zebang Shen, Riccardo de Santi, Niao He, Florian Doerfler

Main category: cs.LG

TL;DR: The paper proposes methods for Riemannian optimization over data manifolds using diffusion model score functions, enabling data-driven design without explicit manifold operations.

Details

Motivation: High-dimensional data lies near low-dimensional manifolds, but classical optimization requires explicit manifold operations which are unavailable when manifolds are given only through data distributions.

Method: Introduces a link function connecting data distribution to geometric operations, leverages diffusion model score functions, and proposes two algorithms: Denoising Landing Flow (DLF) and Denoising Riemannian Gradient Descent (DRGD).

Result: Theoretical guarantees for feasibility (manifold adherence) and optimality (small Riemannian gradient), with effectiveness demonstrated on data-driven control tasks.

Conclusion: The approach enables practical generative and design applications by connecting diffusion models to Riemannian optimization over implicit data manifolds.

Abstract: Under the data manifold hypothesis, high-dimensional data are concentrated near a low-dimensional manifold. We study the problem of Riemannian optimization over such manifolds when they are given only implicitly through the data distribution, and the standard manifold operations required by classical algorithms are unavailable. This formulation captures a broad class of data-driven design problems that are central to modern generative AI. Our key idea is to introduce a link function that connects the data distribution to the geometric operations needed for optimization. We show that this function enables the recovery of essential manifold operations, such as retraction and Riemannian gradient computation. Moreover, we establish a direct connection between our construction and the score function in diffusion models of the data distribution. This connection allows us to leverage well-studied parameterizations, efficient training procedures, and even pretrained score networks from the diffusion model literature to perform optimization. Building on this foundation, we propose two efficient inference-time algorithms – Denoising Landing Flow (DLF) and Denoising Riemannian Gradient Descent (DRGD) – and provide theoretical guarantees for both feasibility (approximate manifold adherence) and optimality (small Riemannian gradient norm). Finally, we demonstrate the effectiveness of our approach on finite-horizon reference tracking tasks in data-driven control, highlighting its potential for practical generative and design applications.

[1193] Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian

Main category: cs.LG

TL;DR: This paper analyzes how the superposition mechanism in continuous Chain of Thought (CoT) emerges during training, revealing a two-stage process where the model balances exploration and exploitation through bounded index-matching logits.

Details

Motivation: To understand how the superposition mechanism in continuous CoT is naturally learned from gradient-based training methods, as previous work showed theoretical capabilities but not the learning dynamics.

Method: Theoretical analysis of training dynamics for a simplified two-layer transformer on directed graph reachability, tracking index-matching logits during two training stages: thought-generation and prediction.

Result: Analysis shows index-matching logit first increases then remains bounded, enabling superposition by balancing exploration (assigning comparable weights to multiple traces) and exploitation (using local problem structures).

Conclusion: The bounded index-matching logit mechanism naturally emerges during training and enables continuous CoT to maintain superposition of multiple reasoning traces, explaining how this capability is learned.

Abstract: Previous work shows that the chain of continuous thought (continuous CoT) improves the reasoning capability of large language models (LLMs) by enabling implicit parallel thinking, and a subsequent work provided theoretical insight by showing that a two-layer transformer equipped with continuous CoT can efficiently solve directed graph reachability by maintaining a superposition of multiple reasoning traces in the continuous thought. However, it remains unclear how the superposition mechanism is naturally learned from gradient-based training methods. To fill this gap, we theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem to unveil how the superposition mechanism emerges during training in two training stages – (i) a thought-generation stage that autoregressively expands the continuous thought, and (ii) a prediction stage that converts the thought into the final answer. Our analysis reveals that during training using continuous thought, the index-matching logit, an important quantity which reflects the strength of the model’s local search ability, will first increase and then remain bounded under mild assumptions. The bounded index-matching logit effectively balances exploration and exploitation during the reasoning process: the model will exploit local problem structures to identify plausible search traces, and assign comparable weights to multiple such traces to explore when it is uncertain about which solution is correct, which results in superposition. Our experimental results tracking the growth of logits further validate our theory.

[1194] Splines-Based Feature Importance in Kolmogorov-Arnold Networks: A Framework for Supervised Tabular Data Dimensionality Reduction

Ange-Clément Akazan, Verlon Roel Mbingui

Main category: cs.LG

TL;DR: KAN-based feature selection methods provide competitive and interpretable alternatives to classical methods for high-dimensional tabular datasets, with different KAN variants excelling in specific scenarios like classification vs regression.

Details

Motivation: High-dimensional datasets require effective feature selection to improve predictive performance, interpretability, and robustness. Traditional methods may miss complex feature interactions and nonlinear relationships.

Method: Proposed four KAN-based feature selectors (KAN-L1, KAN-L2, KAN-SI, KAN-KO) using Kolmogorov-Arnold networks with spline parameterization, compared against classical baselines (LASSO, Random Forest, Mutual Information, SVM-RFE) across multiple classification and regression benchmarks.

Result: KAN-based selectors are competitive with and sometimes superior to classical baselines. KAN-L2 and KAN-SI perform well on noisy regression and heterogeneous datasets, while KAN-L1, KAN-KO, and KAN-SI excel in classification tasks by eliminating redundancy in high-dimensional multi-class data.

Conclusion: KAN-based feature selection provides a powerful and interpretable alternative to traditional methods, capable of uncovering nonlinear and multivariate feature relevance beyond sparsity or impurity-based measures.

Abstract: High-dimensional datasets require effective feature selection to improve predictive performance, interpretability, and robustness. We propose and evaluate feature selection methods for tabular datasets based on Kolmogorov-Arnold networks (KANs), which parameterize feature transformations through splines, enabling direct access to interpretable importance measures. We introduce four KAN-based selectors ($\textit{KAN-L1}$, $\textit{KAN-L2}$, $\textit{KAN-SI}$, $\textit{KAN-KO}$) and compare them against classical baselines (LASSO, Random Forest, Mutual Information, SVM-RFE) across multiple classification and regression tabular dataset benchmarks. Average (over three retention levels: 20%, 40%, and 60%) F1 scores and $R^2$ score results reveal that KAN-based selectors, particularly $\textit{KAN-L2}$, $\textit{KAN-L1}$, $\textit{KAN-SI}$, and $\textit{KAN-KO}$, are competitive with and sometimes superior to classical baselines in structured and synthetic datasets. However, $\textit{KAN-L1}$ is often too aggressive in regression, removing useful features, while $\textit{KAN-L2}$ underperforms in classification, where simple coefficient shrinkage misses complex feature interactions. $\textit{KAN-L2}$ and $\textit{KAN-SI}$ provide robust performance on noisy regression datasets and heterogeneous datasets, aligning closely with ensemble predictors. In classification tasks, KAN selectors such as $\textit{KAN-L1}$, $\textit{KAN-KO}$, and $\textit{KAN-SI}$ sometimes surpass the other selectors by eliminating redundancy, particularly in high-dimensional multi-class data. Overall, our findings demonstrate that KAN-based feature selection provides a powerful and interpretable alternative to traditional methods, capable of uncovering nonlinear and multivariate feature relevance beyond sparsity or impurity-based measures.

[1195] Graph Your Own Prompt

Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao

Main category: cs.LG

TL;DR: Graph Consistency Regularization (GCR) is a novel framework that uses model predictions to create relational graphs that regularize feature learning, promoting semantically meaningful representations through multi-layer graph alignment.

Details

Motivation: Deep networks learn rich representations but often capture noisy inter-class similarities that contradict the model's predicted semantics, requiring a method to enforce class-aware feature relationships.

Method: GCR introduces parameter-free Graph Consistency Layers (GCLs) that build feature similarity graphs and align them with class-aware masked prediction graphs using adaptive weighting based on graph discrepancy magnitudes.

Result: GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization across various networks and datasets without modifying architecture or training procedure.

Conclusion: GCR offers a model-agnostic, lightweight approach to enhance semantic structure in deep networks by learning from prediction structure through graph consistency regularization.

Abstract: We propose Graph Consistency Regularization (GCR), a novel framework that injects relational graph structures, derived from model predictions, into the learning process to promote class-aware, semantically meaningful feature representations. Functioning as a form of self-prompting, GCR enables the model to refine its internal structure using its own outputs. While deep networks learn rich representations, these often capture noisy inter-class similarities that contradict the model’s predicted semantics. GCR addresses this issue by introducing parameter-free Graph Consistency Layers (GCLs) at arbitrary depths. Each GCL builds a batch-level feature similarity graph and aligns it with a global, class-aware masked prediction graph, derived by modulating softmax prediction similarities with intra-class indicators. This alignment enforces that feature-level relationships reflect class-consistent prediction behavior, acting as a semantic regularizer throughout the network. Unlike prior work, GCR introduces a multi-layer, cross-space graph alignment mechanism with adaptive weighting, where layer importance is learned from graph discrepancy magnitudes. This allows the model to prioritize semantically reliable layers and suppress noisy ones, enhancing feature quality without modifying the architecture or training procedure. GCR is model-agnostic, lightweight, and improves semantic structure across various networks and datasets. Experiments show that GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization, offering a new perspective on learning from prediction structure. Project website Code

[1196] Planner Aware Path Learning in Diffusion Language Models Training

Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael Bronstein, Avishek Joey Bose, Alexander Tong

Main category: cs.LG

TL;DR: The paper addresses the mismatch between training and inference in diffusion language models by introducing a new training objective (P-ELBO) and method (PAPL) that incorporate planner-based reverse dynamics, leading to significant improvements across text, code, and protein sequence modeling tasks.

Details

Motivation: Diffusion language models use planning strategies for faster inference, but this creates a mismatch between uniformly random training paths and planned inference paths, which the standard ELBO training objective doesn't account for.

Method: The authors derive a new Planned Evidence Lower Bound (P-ELBO) that incorporates planner-based reverse dynamics, and propose Planner Aware Path Learning (PAPL) - a modification of standard masked discrete diffusion loss that aligns training with planned inference.

Result: PAPL achieves consistent improvements: 40% relative gain in protein sequence modeling, up to 4x improvement in MAUVE for text generation, and 23% relative gain in HumanEval pass@10 for code generation.

Conclusion: The proposed P-ELBO and PAPL method successfully bridge the training-inference gap in diffusion language models, enabling better alignment between training objectives and planned sampling strategies used during inference.

Abstract: Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through flexible and parallel generation paths. This flexibility is enabled by new sampling strategies, or planners, that iteratively choose where to denoise along the sequence rather than sampling uniformly at random. However, by modifying reverse paths, planners introduce a mismatch between the uniformly random denoising paths used during training and the planning-based paths used at inference. In this work, we systematically investigate this mismatch and theoretically show that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser under non-uniform planning. To bridge this gap, we derive a new Planned Evidence Lower Bound (P-ELBO) that directly incorporates planner-based reverse dynamics into the training objective. Building on this, we propose Planner Aware Path Learning (PAPL), a simple and effective modification of the standard masked discrete diffusion loss that aligns training and inference under planned denoisers. Empirically, PAPL delivers consistent improvements across domains, including a 40% relative gain in protein sequence modeling, up to a 4x improvement in MAUVE for text generation, and a 23% relative gain in HumanEval pass@10 for code generation.

[1197] Mind the Links: Cross-Layer Attention for Link Prediction in Multiplex Networks

Devesh Sharma, Aditya Kishore, Ayush Garg, Debajyoti Mazumder, Debasis Mohapatra, Jasabanta Patro

Main category: cs.LG

TL;DR: The paper proposes a multiplex link prediction framework using multi-view edge classification with cross-layer self-attention to fuse evidence across layers, addressing scalability and inter-layer dependency issues.

Details

Motivation: Existing predictors either collapse layers or treat them independently, losing crucial inter-layer dependencies and struggling with scalability in multiplex graphs.

Method: Frame multiplex link prediction as multi-view edge classification, construct sequences of per-layer edge views, apply cross-layer self-attention for fusion. Two models: Trans-SLE (lightweight transformer over static embeddings) and Trans-GAT (layer-specific GAT encoders with transformer fusion).

Result: Experiments on six public multiplex datasets show consistent macro-F1 gains over strong baselines (MELL, HOPLP-MUL, RMNE).

Conclusion: The approach is simple, scalable, and compatible with both precomputed embeddings and GNN encoders, with introduced Union-Set candidate pool and leakage-free protocols ensuring fairness.

Abstract: Multiplex graphs capture diverse relations among shared nodes. Most predictors either collapse layers or treat them independently. This loses crucial inter-layer dependencies and struggles with scalability. To overcome this, we frame multiplex link prediction as multi-view edge classification. For each node pair, we construct a sequence of per-layer edge views and apply cross-layer self-attention to fuse evidence for the target layer. We present two models as instances of this framework: Trans-SLE, a lightweight transformer over static embeddings, and Trans-GAT, which combines layer-specific GAT encoders with transformer fusion. To ensure scalability and fairness, we introduce a Union–Set candidate pool and two leakage-free protocols: cross-layer and inductive subgraph generalization. Experiments on six public multiplex datasets show consistent macro-F_1 gains over strong baselines (MELL, HOPLP-MUL, RMNE). Our approach is simple, scalable, and compatible with both precomputed embeddings and GNN encoders.

[1198] PATCH: Learnable Tile-level Hybrid Sparsity for LLMs

Younes Hourri, Mohammad Mozaffari, Maryam Mehri Dehnavi

Main category: cs.LG

TL;DR: PATCH introduces a hybrid sparsity framework that bridges the gap between unstructured sparsity (accurate but slow) and semi-structured 2:4 sparsity (fast but less accurate) by enabling continuous sparsity ratios between 0% and 50% through tile-based partitioning with learnable mask selection.

Details

Motivation: Existing model pruning approaches face trade-offs: unstructured sparsity preserves accuracy but prevents GPU acceleration due to irregular access patterns, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality.

Method: PATCH partitions weight matrices into tiles and assigns each tile to be either dense or 2:4 sparse using a learnable mask selection mechanism, providing fine-grained control over accuracy-acceleration tradeoffs and supporting non-uniform sparsity across layers.

Result: Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. On LLaMA-2 7B with A6000 GPU, it achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to state-of-the-art 2:4 pruning method MaskLLM.

Conclusion: PATCH successfully bridges the gap between accuracy and acceleration in model pruning by enabling continuous sparsity control, achieving both superior model quality and practical speedups across various model sizes.

Abstract: Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.

[1199] URS: A Unified Neural Routing Solver for Cross-Problem Zero-Shot Generalization

Changliang Zhou, Canhong Yu, Shunyu Yao, Xi Lin, Zhenkun Wang, Yu Zhou, Qingfu Zhang

Main category: cs.LG

TL;DR: URS is a unified neural routing solver that achieves zero-shot generalization across 100+ VRP variants using a single model without fine-tuning, through unified data representation, adaptive parameter generation, and LLM-driven constraint satisfaction.

Details

Motivation: Existing neural solvers rely on predefined constraints or per-problem fine-tuning, limiting their zero-shot generalization to unseen VRP variants.

Method: Proposes Unified Data Representation (UDR) to replace problem enumeration, Mixed Bias Module (MBM) for learning geometric/relational biases, parameter generator for adaptive decoder adjustment, and LLM-driven constraint satisfaction mechanism.

Result: URS consistently produces high-quality solutions for 100+ distinct VRP variants (including 90+ unseen variants) without any fine-tuning.

Conclusion: URS is the first neural solver capable of handling over 100 VRP variants with a single model, demonstrating superior zero-shot generalization capabilities.

Abstract: Multi-task neural routing solvers have emerged as a promising paradigm for their ability to solve multiple vehicle routing problems (VRPs) using a single model. However, existing neural solvers typically rely on predefined problem constraints or require per-problem fine-tuning, which substantially limits their zero-shot generalization ability to unseen VRP variants. To address this critical bottleneck, we propose URS, a unified neural routing solver capable of zero-shot generalization across a wide range of unseen VRPs using a single model without any fine-tuning. The key component of URS is the unified data representation (UDR), which replaces problem enumeration with data unification, thereby broadening the problem coverage and reducing reliance on domain expertise. In addition, we propose a Mixed Bias Module (MBM) to efficiently learn the geometric and relational biases inherent in various problems. On top of the proposed UDR, we further develop a parameter generator that adaptively adjusts the decoder and bias weights of MBM to enhance zero-shot generalization. Moreover, we propose an LLM-driven constraint satisfaction mechanism, which translates raw problem descriptions into executable stepwise masking functions to ensure solution feasibility. Extensive experiments demonstrate that URS can consistently produce high-quality solutions for more than 100 distinct VRP variants without any fine-tuning, which includes more than 90 unseen variants. To the best of our knowledge, URS is the first neural solver capable of handling over 100 VRP variants with a single model.

[1200] LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport

Ashkan Shahbazi, Chayne Thrash, Yikun Bai, Keaton Hamm, Navid NaderiAlizadeh, Soheil Kolouri

Main category: cs.LG

TL;DR: LOTFormer is a linear-time doubly-stochastic attention mechanism that uses entropic optimal transport with learnable pivot measures to achieve efficient scaling while balancing token participation.

Details

Motivation: Standard softmax attention has quadratic complexity that limits scaling to long contexts, and most attention mechanisms produce row-normalized maps that can over-focus on few tokens, degrading robustness. Existing doubly-stochastic attention methods introduce substantial overhead.

Method: Proposes LOTFormer that formulates attention as transportation plans between query and key measures, constrained to be low-rank using learnable pivot measures. Solves two entropic optimal transport problems (queries→pivot and pivot→keys) and composes them into a conditional coupling, yielding doubly-stochastic attention with rank at most r ≪ n.

Result: Achieves state-of-the-art results on Long Range Arena benchmark, surpassing prior linear and transport-based attention methods in both accuracy and efficiency.

Conclusion: LOTFormer provides a principled approach for simultaneously achieving linear-time computation and doubly-stochastic attention, enabling efficient scaling to long contexts while maintaining balanced token participation.

Abstract: Transformers have proven highly effective across a wide range of modalities. However, the quadratic complexity of the standard softmax attention mechanism poses a fundamental barrier to scaling them to long context windows. A large body of work addresses this with linear attention, which reformulates attention as a kernel function and approximates it with finite feature maps to achieve linear-time computation. Orthogonal to computational scaling, most attention mechanisms – both quadratic and linear – produce row-normalized maps that can over-focus on a few tokens, degrading robustness and information flow. Enforcing doubly-stochastic attention alleviates this by balancing token participation across rows and columns, but existing doubly-stochastic attention mechanisms typically introduce substantial overhead, undermining scalability. We propose LOTFormer, a principled attention mechanism that is simultaneously linear-time and doubly-stochastic. Our approach exploits the connection between attention maps and transportation plans between query and key measures. The central idea is to constrain the transport plan to be low-rank by conditioning it on a learnable pivot measure with small support. Concretely, we solve two entropic optimal transport problems (queries $\to$ pivot and pivot $\to$ keys) and compose them into a conditional (glued) coupling. This yields an attention matrix that is provably doubly-stochastic, has rank at most $r \ll n$, and applies to values in $O(nr)$ time without forming the full $n \times n$ map. The pivot locations and masses are learned end-to-end. Empirically, LOTFormer achieves state-of-the-art results on the Long Range Arena benchmark, surpassing prior linear and transport-based attention methods in both accuracy and efficiency.

[1201] Better Hessians Matter: Studying the Impact of Curvature Approximations in Influence Functions

Steve Hong, Runa Eschenhagen, Bruno Mlodozeniec, Richard Turner

Main category: cs.LG

TL;DR: Better Hessian approximations consistently improve influence function data attribution performance in classification tasks, with K-FAC eigenvalue mismatch being the main source of error.

Details

Motivation: To understand how different Hessian approximation methods affect influence function data attribution performance, given that influence functions require Hessian inversion but it's unclear if better approximations actually lead to better attribution.

Method: Controlled experiments in classification setting comparing different Hessian approximations (GGN, K-FAC, EK-FAC), decomposing approximation steps and evaluating each step’s impact on attribution accuracy.

Result: Better Hessian approximations consistently yield better influence score quality. K-FAC eigenvalue mismatch accounts for most of the error and influence loss compared to GGN/EK-FAC.

Conclusion: Justifies research efforts for better Hessian approximations, identifies critical approximation steps, and provides guidance for balancing computational tractability with attribution accuracy.

Abstract: Influence functions offer a principled way to trace model predictions back to training data, but their use in deep learning is hampered by the need to invert a large, ill-conditioned Hessian matrix. Approximations such as Generalised Gauss-Newton (GGN) and Kronecker-Factored Approximate Curvature (K-FAC) have been proposed to make influence computation tractable, yet it remains unclear how the departure from exactness impacts data attribution performance. Critically, given the restricted regime in which influence functions are derived, it is not necessarily clear better Hessian approximations should even lead to better data attribution performance. In this paper, we investigate the effect of Hessian approximation quality on influence-function attributions in a controlled classification setting. Our experiments show that better Hessian approximations consistently yield better influence score quality, offering justification for recent research efforts towards that end. We further decompose the approximation steps for recent Hessian approximation methods and evaluate each step’s influence on attribution accuracy. Notably, the mismatch between K-FAC eigenvalues and GGN/EK-FAC eigenvalues accounts for the majority of the error and influence loss. These findings highlight which approximations are most critical, guiding future efforts to balance computational tractability and attribution accuracy.

[1202] Factor Decorrelation Enhanced Data Removal from Deep Predictive Models

Wenhao Yang, Lin Li, Xiaohui Tao, Kaize Shi

Main category: cs.LG

TL;DR: A novel data removal approach that enhances deep predictive models through factor decorrelation and loss perturbation, improving performance in both in-distribution and out-of-distribution scenarios while protecting user privacy.

Details

Motivation: User privacy protection and regulatory compliance require sensitive data removal in model training, but this process often causes distributional shifts that undermine model performance, especially in out-of-distribution scenarios.

Method: Proposes two key components: (1) discriminative-preserving factor decorrelation module with dynamic adaptive weight adjustment and iterative representation updating to reduce feature redundancy and minimize inter-feature correlations; (2) smoothed data removal mechanism with loss perturbation that creates information-theoretic safeguards against data leakage.

Result: Extensive experiments on five benchmark datasets show the approach outperforms other baselines, achieving high predictive accuracy and robustness even under significant distribution shifts.

Conclusion: The method demonstrates superior efficiency and adaptability in both in-distribution and out-of-distribution scenarios, effectively addressing privacy concerns while maintaining model performance.

Abstract: The imperative of user privacy protection and regulatory compliance necessitates sensitive data removal in model training, yet this process often induces distributional shifts that undermine model performance-particularly in out-of-distribution (OOD) scenarios. We propose a novel data removal approach that enhances deep predictive models through factor decorrelation and loss perturbation. Our approach introduces: (1) a discriminative-preserving factor decorrelation module employing dynamic adaptive weight adjustment and iterative representation updating to reduce feature redundancy and minimize inter-feature correlations. (2) a smoothed data removal mechanism with loss perturbation that creates information-theoretic safeguards against data leakage during removal operations. Extensive experiments on five benchmark datasets show that our approach outperforms other baselines and consistently achieves high predictive accuracy and robustness even under significant distribution shifts. The results highlight its superior efficiency and adaptability in both in-distribution and out-of-distribution scenarios.

[1203] PHASE: Physics-Integrated, Heterogeneity-Aware Surrogates for Scientific Simulations

Dawei Gao, Dali Wang, Zhuowei Gu, Qinglei Cao, Xiao Wang, Peter Thornton, Dan Ricciuto, Yunhe Feng

Main category: cs.LG

TL;DR: PHASE is a physics-integrated AI surrogate framework that accelerates scientific simulations by 60x, achieving near-equilibrium states in biogeochemical modeling with only 20 simulation years instead of 1,200+ years.

Details

Motivation: Large-scale scientific simulations are computationally expensive, and existing AI surrogates face adoption barriers due to concerns about physical plausibility, trustworthiness, and handling heterogeneous data.

Method: PHASE combines data-type-aware encoders for heterogeneous inputs with multi-level physics-based constraints that enforce consistency from local dynamics to global system behavior.

Result: PHASE achieved a 60x reduction in required integration time for biogeochemical spin-up, inferring near-equilibrium states using only 20 simulation years instead of 1,200+ years, and demonstrated strong generalization to higher spatial resolutions.

Conclusion: PHASE captures governing physical regularities rather than surface correlations, enabling practical, physically consistent acceleration of complex scientific workflows like land-surface modeling.

Abstract: Large-scale numerical simulations underpin modern scientific discovery but remain constrained by prohibitive computational costs. AI surrogates offer acceleration, yet adoption in mission-critical settings is limited by concerns over physical plausibility, trustworthiness, and the fusion of heterogeneous data. We introduce PHASE, a modular deep-learning framework for physics-integrated, heterogeneity-aware surrogates in scientific simulations. PHASE combines data-type-aware encoders for heterogeneous inputs with multi-level physics-based constraints that promote consistency from local dynamics to global system behavior. We validate PHASE on the biogeochemical (BGC) spin-up workflow of the U.S. Department of Energy’s Energy Exascale Earth System Model (E3SM) Land Model (ELM), presenting-to our knowledge-the first scientifically validated AI-accelerated solution for this task. Using only the first 20 simulation years, PHASE infers a near-equilibrium state that otherwise requires more than 1,200 years of integration, yielding an effective reduction in required integration length by at least 60x. The framework is enabled by a pipeline for fusing heterogeneous scientific data and demonstrates strong generalization to higher spatial resolutions with minimal fine-tuning. These results indicate that PHASE captures governing physical regularities rather than surface correlations, enabling practical, physically consistent acceleration of land-surface modeling and other complex scientific workflows.

[1204] Data-Efficient Training by Evolved Sampling

Ziheng Cheng, Zhong Li, Jiang Bian

Main category: cs.LG

TL;DR: Evolved Sampling (ES) is a dynamic data selection framework that accelerates training by selecting informative data samples based on loss dynamics and loss differences, reducing back propagation time while maintaining model performance.

Details

Motivation: To accelerate machine learning training while preserving performance by identifying and selecting the most informative data samples that contribute significantly to the training process.

Method: Proposes Evolved Sampling (ES) framework that performs batch-level data selection using loss dynamics and augmented loss differences, enabling flexible frequency tuning. Also extends to set-level selection (ESWP) for further acceleration.

Result: Achieves lossless training accelerations across various pre-training and post-training tasks, saving up to 45% wall-clock time while maintaining model performance.

Conclusion: The framework demonstrates effective data efficiency improvements for large-scale machine learning, motivating further research in this direction.

Abstract: Data selection is designed to accelerate learning with preserved performance. To achieve this, a fundamental thought is to identify informative data samples with significant contributions to the training. In this work, we propose \textbf{Evolved Sampling} (\textbf{ES}), a simple yet effective framework for \emph{dynamic} sampling along the training process. This method conducts \em batch \em level data selection based on the dynamics of losses and augmented \emph{loss differences}, which enables flexible \emph{frequency tuning}, and hence significantly reduces the back propagation time with maintained model performance. Due to its conciseness, ES is also readily extensible to incorporate \em set \em level data selection (to form ES with pruning, \textbf{ESWP}) for further accelerations. As a plug-and-play framework, ES(WP) consistently achieves lossless training accelerations across various pre-training and post-training tasks, saving up to nearly 45% wall-clock time. Our results motivate further investigations on the data efficiency aspect of modern large-scale machine learning.

[1205] Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Learning

Alakh Sharma, Gaurish Trivedi, Kartikey Bhandari, Yash Sinha, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa

Main category: cs.LG

TL;DR: GEMS is a scalable MARL framework that replaces explicit policy populations with latent anchors and an amortized generator, achieving 6x speedup and 1.3x less memory than PSRO while maintaining game-theoretic guarantees.

Details

Motivation: Existing population-based MARL methods like PSRO suffer from quadratic computation and linear memory costs due to storing explicit policy populations and constructing full payoff matrices, limiting scalability.

Method: Uses latent anchors and a single amortized generator instead of explicit populations, employs Monte Carlo rollouts, multiplicative-weights meta-dynamics, and model-free empirical-Bernstein UCB oracle for adaptive policy expansion, with advantage-based trust-region objective for best responses.

Result: GEMS achieves up to 6x faster computation, 1.3x less memory usage than PSRO, and higher rewards in Two-player and Multi-Player games including Deceptive Messages Game, Kuhn Poker, and Multi-Particle environment.

Conclusion: GEMS overcomes PSRO’s fundamental inefficiencies while retaining game-theoretic guarantees, enabling scalable multi-agent learning across multiple domains.

Abstract: Scalable multi-agent reinforcement learning (MARL) remains a central challenge for AI. Existing population-based methods, like Policy-Space Response Oracles, PSRO, require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta-Solver (GEMS), a surrogate-free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative-weights meta-dynamics, and a model-free empirical-Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage-based trust-region objective, eliminating the need to store and train separate actors. We evaluated GEMS in a variety of Two-player and Multi-Player games such as the Deceptive Messages Game, Kuhn Poker and Multi-Particle environment. We find that GEMS is up to ~6x faster, has 1.3x less memory usage than PSRO, while also reaps higher rewards simultaneously. These results demonstrate that GEMS retains the game theoretic guarantees of PSRO, while overcoming its fundamental inefficiencies, hence enabling scalable multi-agent learning in multiple domains.

[1206] Solve Smart, Not Often: Policy Learning for Costly MILP Re-solving

Rui Ai, Hugo De Oliveira Barbalho, Sirui Li, Alexei Robsky, David Simchi-Levi, Ishai Menache

Main category: cs.LG

TL;DR: A framework called POC (Proximal Policy Optimization with Change Point Detection) is proposed to determine optimal re-solving times for computationally intensive MILP problems, balancing performance and cost.

Details

Motivation: Real-time operations require frequent solving of computationally intensive MILP problems, but re-solving too often is costly while not re-solving enough leads to suboptimal solutions. Existing methods focus on heuristics and low-data settings, lacking systematic approaches for NP-hard MILPs.

Method: Proximal Policy Optimization with Change Point Detection (POC) framework that systematically balances performance and cost by detecting environmental changes and selecting beneficial samples for MILP solving.

Result: POC consistently outperforms existing baselines by 2%-17% across eight synthetic and real-world datasets, and establishes theoretical relationship between number of re-solves and re-solving cost.

Conclusion: POC provides an effective solution for determining appropriate re-solving times for MILP problems, filling a gap in literature by introducing real-time MILP benchmarks and evaluation criteria.

Abstract: A common challenge in real-time operations is deciding whether to re-solve an optimization problem or continue using an existing solution. While modern data platforms may collect information at high frequencies, many real-time operations require repeatedly solving computationally intensive optimization problems formulated as Mixed-Integer Linear Programs (MILPs). Determining when to re-solve is, therefore, an economically important question. This problem poses several challenges: 1) How to characterize solution optimality and solving cost; 2) How to detect environmental changes and select beneficial samples for solving the MILP; 3) Given the large time horizon and non-MDP structure, vanilla reinforcement learning (RL) methods are not directly applicable and tend to suffer from value function explosion. Existing literature largely focuses on heuristics, low-data settings, and smooth objectives, with little focus on common NP-hard MILPs. We propose a framework called Proximal Policy Optimization with Change Point Detection (POC), which systematically offers a solution for balancing performance and cost when deciding appropriate re-solving times. Theoretically, we establish the relationship between the number of re-solves and the re-solving cost. To test our framework, we assemble eight synthetic and real-world datasets, and show that POC consistently outperforms existing baselines by 2%-17%. As a side benefit, our work fills the gap in the literature by introducing real-time MILP benchmarks and evaluation criteria.

[1207] Drift-Adapter: A Practical Approach to Near Zero-Downtime Embedding Model Upgrades in Vector Databases

Harshil Vejendla

Main category: cs.LG

TL;DR: Drift-Adapter is a lightweight transformation layer that maps new query embeddings to legacy embedding spaces, enabling continued use of existing ANN indexes during model upgrades without full re-computation.

Details

Motivation: Traditional embedding model upgrades require re-encoding entire corpora and rebuilding ANN indexes, causing significant operational disruption and high computational costs.

Method: Three adapter parameterizations: Orthogonal Procrustes, Low-Rank Affine, and compact Residual MLP, trained on small samples of paired old and new embeddings to map new queries to legacy embedding space.

Result: Recovers 95-99% of retrieval recall (Recall@10, MRR) compared to full re-embedding, adds less than 10μs query latency, reduces recompute costs by over 100x, and enables near-zero operational interruption upgrades.

Conclusion: Drift-Adapter provides a pragmatic solution for agile model deployment by enabling embedding model upgrades without the need for full corpus re-computation and index rebuilding.

Abstract: Upgrading embedding models in production vector databases typically requires re-encoding the entire corpus and rebuilding the Approximate Nearest Neighbor (ANN) index, leading to significant operational disruption and computational cost. This paper presents Drift-Adapter, a lightweight, learnable transformation layer designed to bridge embedding spaces between model versions. By mapping new queries into the legacy embedding space, Drift-Adapter enables the continued use of the existing ANN index, effectively deferring full re-computation. We systematically evaluate three adapter parameterizations: Orthogonal Procrustes, Low-Rank Affine, and a compact Residual MLP, trained on a small sample of paired old and new embeddings. Experiments on MTEB text corpora and a CLIP image model upgrade (1M items) show that Drift-Adapter recovers 95-99% of the retrieval recall (Recall@10, MRR) of a full re-embedding, adding less than 10 microseconds of query latency. Compared to operational strategies like full re-indexing or dual-index serving, Drift-Adapter reduces recompute costs by over 100 times and facilitates upgrades with near-zero operational interruption. We analyze robustness to varied model drift, training data size, scalability to billion-item systems, and the impact of design choices like diagonal scaling, demonstrating Drift-Adapter’s viability as a pragmatic solution for agile model deployment.

[1208] Memory-Efficient Fine-Tuning via Low-Rank Activation Compression

Jiang-Xin Shi, Wen-Da Wei, Jin-Fei Qi, Xuanyu Chen, Tong Wei, Yu-Feng Li

Main category: cs.LG

TL;DR: LoRAct is a memory-efficient fine-tuning method that compresses activations using low-rank approximation, reducing activation memory by ~80% compared to LoRA while maintaining performance.

Details

Motivation: Parameter-efficient fine-tuning methods still have substantial memory overhead due to model activations, especially with large batch sizes and long contexts. The authors observed that activation ranks remain consistently low, suggesting compression potential.

Method: LoRAct uses low-rank activation compression applied online during forward pass without calibration data. It employs a novel sampling-based orthogonal decomposition algorithm for low-rank matrices with better computational efficiency and tighter error bounds than RSVD.

Result: Experiments on vision and language tasks show LoRAct reduces activation memory by approximately 80% compared to LoRA while maintaining competitive performance.

Conclusion: LoRAct provides an effective memory-efficient fine-tuning approach that significantly reduces activation memory overhead without compromising model performance.

Abstract: The parameter-efficient fine-tuning paradigm has garnered significant attention with the advancement of foundation models. Although numerous methods have been proposed to reduce the number of trainable parameters, their substantial memory overhead remains a critical bottleneck that hinders practical deployment. In this paper, we observe that model activations constitute a major source of memory consumption, especially under large batch sizes and long context lengths; however, the rank of the activations remains consistently low. Motivated by this insight, we propose a memory-efficient fine-tuning approach Low-Rank Activation Compression (LoRAct). Unlike prior work, LoRAct provides a more flexible and versatile compressing strategy that can be applied online during the forward pass without the need for any calibration data. Moreover, LoRAct incorporates a novel sampling-based orthogonal decomposition algorithm specifically designed for low-rank matrices, offering improved computational efficiency and a tighter error bound compared to the widely used RSVD. Experiments on both vision and language tasks demonstrate the effectiveness of LoRAct. Notably, LoRAct further reduces activation memory by approximately 80% in comparison with the widely adopted LoRA method, while maintaining competitive performance. The source code is available at https://github.com/shijxcs/meft.

[1209] Statistical Learning Guarantees for Group-Invariant Barron Functions

Yahong Yang, Wei Zhu

Main category: cs.LG

TL;DR: Group-invariant neural networks show improved generalization for symmetric functions through a group-dependent approximation factor, while maintaining similar estimation error compared to non-invariant networks.

Details

Motivation: To provide rigorous theoretical foundation for the statistical advantages of encoding group-invariant structures in neural networks when learning symmetric target functions.

Method: Analyze generalization error within Barron framework, examining approximation rates with group-dependent factor δ and Rademacher complexity comparisons between invariant and non-invariant networks.

Result: Group invariance introduces δ≤1 factor that can substantially improve approximation accuracy when small. Rademacher complexity remains unchanged, so overall generalization error improves significantly for symmetric functions.

Conclusion: Encoding group-invariant structures in neural networks provides clear statistical advantages for learning symmetric target functions, with both favorable (δ≈|G|⁻¹) and unfavorable (δ≈1) cases demonstrated.

Abstract: We investigate the generalization error of group-invariant neural networks within the Barron framework. Our analysis shows that incorporating group-invariant structures introduces a group-dependent factor $\delta_{G,\Gamma,\sigma} \le 1$ into the approximation rate. When this factor is small, group invariance yields substantial improvements in approximation accuracy. On the estimation side, we establish that the Rademacher complexity of the group-invariant class is no larger than that of the non-invariant counterpart, implying that the estimation error remains unaffected by the incorporation of symmetry. Consequently, the generalization error can improve significantly when learning functions with inherent group symmetries. We further provide illustrative examples demonstrating both favorable cases, where $\delta_{G,\Gamma,\sigma}\approx |G|^{-1}$, and unfavorable ones, where $\delta_{G,\Gamma,\sigma}\approx 1$. Overall, our results offer a rigorous theoretical foundation showing that encoding group-invariant structures in neural networks leads to clear statistical advantages for symmetric target functions.

[1210] Temporal Generalization: A Reality Check

Divyam Madaan, Sumit Chopra, Kyunghyun Cho

Main category: cs.LG

TL;DR: The paper investigates whether machine learning models can generalize to future data using only past data through parameter interpolation and extrapolation methods, but finds none consistently outperform simply using the latest model parameters.

Details

Motivation: Machine learning models often fail to maintain performance under distribution shifts, leading to inaccurate predictions on unseen future data. The research aims to determine if models can achieve generalization using only past data.

Method: The study explores two approaches: parameter interpolation (convex combinations of past model parameters) and parameter extrapolation (explicit extrapolation beyond the convex hull of past parameters). These methods are benchmarked on diverse temporal tasks including language modeling, news summarization, image classification, and more.

Result: None of the evaluated methods consistently outperforms the simple baseline of using the latest available model parameters across all scenarios. The methods show inconsistent performance depending on the specific task and context.

Conclusion: In the absence of future data or robust assumptions about data-generating processes, there are inherent difficulties in generalizing and extrapolating to future data. The findings warrant caution when evaluating claims of such generalization.

Abstract: Machine learning (ML) models often struggle to maintain performance under distribution shifts, leading to inaccurate predictions on unseen future data. In this work, we investigate whether and under what conditions models can achieve such a generalization when relying solely on past data. We explore two primary approaches: convex combinations of past model parameters (\emph{parameter interpolation}) and explicit extrapolation beyond the convex hull of past parameters (\emph{parameter extrapolation}). We benchmark several methods within these categories on a diverse set of temporal tasks, including language modeling, news summarization, news tag prediction, academic paper categorization, satellite image-based land use classification over time, and historical yearbook photo gender prediction. Our empirical findings show that none of the evaluated methods consistently outperforms the simple baseline of using the latest available model parameters in all scenarios. In the absence of access to future data or robust assumptions about the underlying data-generating process, these results underscore the inherent difficulties of generalizing and extrapolating to future data and warrant caution when evaluating claims of such generalization.

[1211] Revisiting Multivariate Time Series Forecasting with Missing Values

Jie Yang, Yifan Hu, Kexin Zhang, Luyang Niu, Yushun Dong, Philip S. Yu, Kaize Ding

Main category: cs.LG

TL;DR: CRIB is a novel framework that directly predicts from partially observed time series without imputation, using Information Bottleneck principle with unified-variate attention and consistency regularization to handle missing values effectively.

Details

Motivation: Current imputation-then-prediction approaches are problematic because there's no ground truth for missing values, making imputation error-prone and potentially degrading prediction accuracy by corrupting the underlying data distribution.

Method: Proposes Consistency-Regularized Information Bottleneck (CRIB) framework that avoids imputation and directly predicts from partially observed data. Uses unified-variate attention mechanism and consistency regularization to learn robust representations that filter noise while preserving predictive signals.

Result: Comprehensive experiments on four real-world datasets show CRIB predicts accurately even under high missing rates, demonstrating effectiveness of the direct prediction approach without imputation.

Conclusion: The paper advocates for a paradigm shift away from imputation-based approaches to direct prediction from incomplete time series, showing that CRIB framework successfully handles missing values while maintaining prediction accuracy.

Abstract: Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available in https://github.com/Muyiiiii/CRIB.

[1212] Beyond Outliers: A Study of Optimizers Under Quantization

Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh

Main category: cs.LG

TL;DR: The paper investigates how optimizer choice affects model robustness under quantization, finding that traditional outlier metrics fail to predict PTQ performance and that Shampoo optimizer shows best results for quantization-aware training.

Details

Motivation: To understand the interaction between optimizer choice and model quantization, as systematic evidence on this relationship is limited despite progress in both areas.

Method: Trained full-precision models (50M-1.5B parameters) with six optimizers, applied PTQ to evaluate performance degradation, analyzed QAT by training quantized models from scratch, and derived scaling laws for different optimizers.

Result: Outlier metrics (MMR, Kurtosis) fail to predict PTQ performance across optimizers; Shampoo shows lowest accuracy degradation under QAT and achieves highest parameter efficiency.

Conclusion: Optimizer choice significantly impacts quantization robustness, with Shampoo performing best for quantization-aware training, and traditional outlier metrics are insufficient for predicting PTQ performance.

Abstract: As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both areas, systematic evidence on optimizer-quantization interactions remains limited. To fill this gap, we study the impact of optimizer choice on model robustness under quantization, considering both post-training quantization (PTQ), and quantization-aware training (QAT). We first train full-precision models, ranging from 50M to 1.5B parameters, with six optimizers, to explore the hyperparameter landscape, and establish well-tuned baselines. We then apply PTQ to evaluate how model performance degrades when trained with different optimizers. We find that outlier-related metrics, such as the max-to-mean ratio (MMR) and Kurtosis, fail to predict the PTQ performance across different optimizers. We show analytically that this is due to the MMR capturing only isolated layer errors, while ignoring how quantization errors accumulate and propagate through the network. To study the QAT degradation, we train quantized models from scratch and compare them to our original-precision baselines. We find that optimizers performing well in the original pretraining setup may not remain optimal under QAT, and that models trained with Shampoo show the lowest accuracy degradation. Finally, we derive scaling laws for quantization-aware training under different optimizers, showing that Shampoo achieves the highest parameter efficiency of all tested optimizers.

[1213] Disentanglement of Variations with Multimodal Generative Modeling

Yijie Zhang, Yiyang Shen, Weiran Wang

Main category: cs.LG

TL;DR: IDMVAE is a multimodal VAE that improves disentanglement between shared and private information using mutual information regularizations and diffusion models for better latent priors.

Details

Motivation: Existing multimodal generative models struggle with disentangling shared and private information, especially in challenging datasets where likelihood models are insufficient.

Method: Proposes mutual information-based regularizations including cross-view mutual information maximization for shared variables and cycle-consistency loss for redundancy removal, plus diffusion models for improved latent priors.

Result: IDMVAE achieves clean separation between shared and private information, showing superior generation quality and semantic coherence on challenging datasets.

Conclusion: The proposed components are complementary and effectively address the disentanglement problem in multimodal representation learning.

Abstract: Multimodal data are prevalent across various domains, and learning robust representations of such data is paramount to enhancing generation quality and downstream task performance. To handle heterogeneity and interconnections among different modalities, recent multimodal generative models extract shared and private (modality-specific) information with two separate variables. Despite attempts to enforce disentanglement between these two variables, these methods struggle with challenging datasets where the likelihood model is insufficient. In this paper, we propose Information-disentangled Multimodal VAE (IDMVAE) to explicitly address this issue, with rigorous mutual information-based regularizations, including cross-view mutual information maximization for extracting shared variables, and a cycle-consistency style loss for redundancy removal using generative augmentations. We further introduce diffusion models to improve the capacity of latent priors. These newly proposed components are complementary to each other. Compared to existing approaches, IDMVAE shows a clean separation between shared and private information, demonstrating superior generation quality and semantic coherence on challenging datasets.

[1214] Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN-XGBoost Ensemble

Md. Saiful Bari Siddiqui, Nowshin Tarannum

Main category: cs.LG

TL;DR: AMR-EnsembleNet combines sequence-based 1D CNN and feature-based XGBoost to predict antimicrobial resistance, achieving superior performance on E. coli strains across multiple antibiotics.

Details

Motivation: Current AMR prediction methods either ignore sequence context of SNPs or require large datasets, creating limitations for moderately-sized genomic datasets in this domain.

Method: Developed ensemble framework with lightweight 1D CNN for sequence motif learning from SNP data and XGBoost for non-local feature interactions, trained on 809 E. coli strains.

Result: Achieved MCC of 0.926 for Ciprofloxacin and highest Macro F1-score of 0.691 for Gentamicin, with model focusing on known AMR genes like fusA and parC.

Conclusion: Fusing sequence-aware CNN with feature-based XGBoost creates powerful ensemble that overcomes limitations of standalone sequence or feature models.

Abstract: Antimicrobial Resistance (AMR) is a rapidly escalating global health crisis. While genomic sequencing enables rapid prediction of resistance phenotypes, current computational methods have limitations. Standard machine learning models treat the genome as an unordered collection of features, ignoring the sequential context of Single Nucleotide Polymorphisms (SNPs). State-of-the-art sequence models like Transformers are often too data-hungry and computationally expensive for the moderately-sized datasets that are typical in this domain. To address these challenges, we propose AMR-EnsembleNet, an ensemble framework that synergistically combines sequence-based and feature-based learning. We developed a lightweight, custom 1D Convolutional Neural Network (CNN) to efficiently learn predictive sequence motifs from high-dimensional SNP data. This sequence-aware model was ensembled with an XGBoost model, a powerful gradient boosting system adept at capturing complex, non-local feature interactions. We trained and evaluated our framework on a benchmark dataset of 809 E. coli strains, predicting resistance across four antibiotics with varying class imbalance. Our 1D CNN-XGBoost ensemble consistently achieved top-tier performance across all the antibiotics, reaching a Matthews Correlation Coefficient (MCC) of 0.926 for Ciprofloxacin (CIP) and the highest Macro F1-score of 0.691 for the challenging Gentamicin (GEN) AMR prediction. We also show that our model consistently focuses on SNPs within well-known AMR genes like fusA and parC, confirming it learns the correct genetic signals for resistance. Our work demonstrates that fusing a sequence-aware 1D CNN with a feature-based XGBoost model creates a powerful ensemble, overcoming the limitations of using either an order-agnostic or a standalone sequence model.

[1215] Improving constraint-based discovery with robust propagation and reliable LLM priors

Ruiqi Lyu, Alistair Turcan, Martin Jinye Zhang, Bryan Wilder

Main category: cs.LG

TL;DR: MosaCD is a causal discovery method that combines CI tests and LLM annotations to create high-confidence seed edges, then uses confidence-down propagation to build causal graphs more reliably than existing methods.

Details

Motivation: Traditional constraint-based methods like PC rely on perfect CI tests and exhaustive search, which often fail in practice. LLM-based approaches assume perfect experts, but LLMs are prone to hallucinations. MosaCD addresses both limitations.

Method: Uses both CI tests and LLM annotations to derive high-confidence seed edges, filters LLM hallucinations using shuffled queries to exploit positional bias, and applies confidence-down propagation that orients the most reliable edges first.

Result: MosaCD achieves higher accuracy in final graph construction across multiple real-world graphs compared to existing constraint-based methods, due to improved reliability of initial seeds and robust propagation strategies.

Conclusion: The proposed MosaCD method successfully overcomes limitations of both traditional constraint-based methods and LLM-based approaches by combining their strengths while mitigating their weaknesses through careful seed selection and propagation strategies.

Abstract: Learning causal structure from observational data is central to scientific modeling and decision-making. Constraint-based methods aim to recover conditional independence (CI) relations in a causal directed acyclic graph (DAG). Classical approaches such as PC and subsequent methods orient v-structures first and then propagate edge directions from these seeds, assuming perfect CI tests and exhaustive search of separating subsets – assumptions often violated in practice, leading to cascading errors in the final graph. Recent work has explored using large language models (LLMs) as experts, prompting sets of nodes for edge directions, and could augment edge orientation when assumptions are not met. However, such methods implicitly assume perfect experts, which is unrealistic for hallucination-prone LLMs. We propose MosaCD, a causal discovery method that propagates edges from a high-confidence set of seeds derived from both CI tests and LLM annotations. To filter hallucinations, we introduce shuffled queries that exploit LLMs’ positional bias, retaining only high-confidence seeds. We then apply a novel confidence-down propagation strategy that orients the most reliable edges first, and can be integrated with any skeleton-based discovery method. Across multiple real-world graphs, MosaCD achieves higher accuracy in final graph construction than existing constraint-based methods, largely due to the improved reliability of initial seeds and robust propagation strategies.

[1216] EVO-LRP: Evolutionary Optimization of LRP for Interpretable Model Explanations

Emerald Zhang, Julian Weaver, Edward Castillo

Main category: cs.LG

TL;DR: EVO-LRP uses evolutionary optimization (CMA-ES) to automatically tune LRP hyperparameters for better explainable AI, outperforming traditional methods in interpretability metrics and visual quality.

Details

Motivation: Traditional XAI methods face trade-offs between detail and interpretability, while LRP implementations rely on heuristic rules not optimized for clarity or model alignment.

Method: EVO-LRP applies Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimize LRP hyperparameters based on quantitative interpretability metrics like faithfulness and sparseness.

Result: EVO-LRP outperforms traditional XAI approaches in both interpretability metric performance and visual coherence, with strong sensitivity to class-specific features.

Conclusion: Attribution quality in explainable AI can be systematically improved through principled, task-specific optimization rather than relying on heuristic rules.

Abstract: Explainable AI (XAI) methods help identify which image regions influence a model’s prediction, but often face a trade-off between detail and interpretability. Layer-wise Relevance Propagation (LRP) offers a model-aware alternative. However, LRP implementations commonly rely on heuristic rule sets that are not optimized for clarity or alignment with model behavior. We introduce EVO-LRP, a method that applies Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to tune LRP hyperparameters based on quantitative interpretability metrics, such as faithfulness or sparseness. EVO-LRP outperforms traditional XAI approaches in both interpretability metric performance and visual coherence, with strong sensitivity to class-specific features. These findings demonstrate that attribution quality can be systematically improved through principled, task-specific optimization.

[1217] Sketching Low-Rank Plus Diagonal Matrices

Andres Fernandez, Felix Dangel, Philipp Hennig, Frank Schneider

Main category: cs.LG

TL;DR: SKETCHLORD is a method that simultaneously estimates both low-rank and diagonal components of linear operators, outperforming sequential approaches and providing high-fidelity approximations for large-scale operators.

Details

Motivation: Many machine learning tasks involve high-dimensional linear operators that are costly to access via matrix-vector products. Existing sketched methods only construct either low-rank or diagonal approximations, leading to errors from assuming simpler structure.

Method: SKETCHLORD simultaneously estimates both low-rank and diagonal components using a convex optimization formulation, targeting Low-Rank plus Diagonal (LoRD) linear operators.

Result: Theoretical and empirical results show SKETCHLORD’s joint estimation is superior to sequential variants (diagonal-then-low-rank or low-rank-then-diagonal) and accurately recovers LoRD structures in synthetic experiments.

Conclusion: SKETCHLORD provides a valuable addition to structured approximation tools, particularly for high-fidelity approximations of large-scale operators like deep learning Hessians.

Abstract: Many relevant machine learning and scientific computing tasks involve high-dimensional linear operators accessible only via costly matrix-vector products. In this context, recent advances in sketched methods have enabled the construction of either low-rank or diagonal approximations from few matrix-vector products. This provides great speedup and scalability, but approximation errors arise due to the assumed simpler structure. This work introduces SKETCHLORD, a method that simultaneously estimates both low-rank and diagonal components, targeting the broader class of Low-Rank plus Diagonal (LoRD) linear operators. We demonstrate theoretically and empirically that this joint estimation is superior also to any sequential variant (diagonal-then-low-rank or low-rank-then-diagonal). Then, we cast SKETCHLORD as a convex optimization problem, leading to a scalable algorithm. Comprehensive experiments on synthetic (approximate) LoRD matrices confirm SKETCHLORD’s performance in accurately recovering these structures. This positions it as a valuable addition to the structured approximation toolkit, particularly when high-fidelity approximations are desired for large-scale operators, such as the deep learning Hessian.

[1218] Toward a Holistic Approach to Continual Model Merging

Hoang Phan, Sungmin Cha, Tung Lam Tran, Qi Lei

Main category: cs.LG

TL;DR: A holistic framework for continual model merging that addresses catastrophic forgetting through pre-merging tangent space fine-tuning, functional-aware merging using optimizer states, and post-merging representation correction, all without accessing old data.

Details

Motivation: To overcome scalability issues of per-domain task vectors and functional information loss in weight-space merging when old data is inaccessible, enabling efficient continual learning under memory constraints.

Method: Three-stage intervention: 1) Pre-merging: fine-tune in tangent space for weight disentanglement; 2) Merging: use functional information from optimizer states; 3) Post-merging: correct representation discrepancy between pre- and post-merged models.

Result: Achieves competitive performance on class-incremental and domain-incremental benchmarks while providing scalable and efficient solution to catastrophic forgetting.

Conclusion: The framework effectively addresses continual learning challenges by maintaining performance without accessing historical data and operating under constant memory constraints.

Abstract: We present a holistic framework for continual model merging that intervenes at three critical stages: pre-merging, during merging, and post-merging-to address two fundamental challenges in continual learning. In particular, conventional approaches either maintain a growing list of per-domain task vectors, leading to scalability issues or rely solely on weight-space merging when old data is inaccessible, thereby losing crucial functional information. Our method overcomes these limitations by first fine-tuning the main model within its tangent space on domain-specific data; this linearization amplifies per-task weight disentanglement, effectively mitigating across-task interference. During merging, we leverage functional information from available optimizer states beyond mere parameter averages to avoid the need to revisit old data. Finally, a post-merging correction aligns the representation discrepancy between pre- and post-merged models, reducing bias and enhancing overall performance-all while operating under constant memory constraints without accessing historical data. Extensive experiments on standard class-incremental and domain-incremental benchmarks demonstrate that our approach not only achieves competitive performance but also provides a scalable and efficient solution to the catastrophic forgetting problem.

[1219] Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models

Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan

Main category: cs.LG

TL;DR: Proposes a rank-1 EWC variant for continual learning in diffusion models, leveraging gradient collinearity in low SNR regimes to capture dominant curvature direction, combined with replay to mitigate forgetting.

Details

Motivation: Catastrophic forgetting in continual learning; limitations of existing methods like replay (requires strong generator, prone to drift) and EWC (assumes shared optimum, uses diagonal Fisher approximation).

Method: Rank-1 EWC variant that captures dominant curvature direction using empirical Fisher that becomes effectively rank-1 in low SNR regimes, paired with replay-based approach.

Result: Consistently improves average FID and reduces forgetting on class-incremental image generation datasets (MNIST, FashionMNIST, CIFAR-10, ImageNet-1k); nearly eliminates forgetting on MNIST/FashionMNIST, halves forgetting on ImageNet-1k.

Conclusion: Diffusion models admit approximately rank-1 Fisher; with better Fisher estimate, EWC becomes strong complement to replay - replay encourages parameter sharing while EWC constrains replay-induced drift.

Abstract: Catastrophic forgetting remains a central obstacle for continual learning in neural models. Popular approaches – replay and elastic weight consolidation (EWC) – have limitations: replay requires a strong generator and is prone to distributional drift, while EWC implicitly assumes a shared optimum across tasks and typically uses a diagonal Fisher approximation. In this work, we study the gradient geometry of diffusion models, which can already produce high-quality replay data. We provide theoretical and empirical evidence that, in the low signal-to-noise ratio (SNR) regime, per-sample gradients become strongly collinear, yielding an empirical Fisher that is effectively rank-1 and aligned with the mean gradient. Leveraging this structure, we propose a rank-1 variant of EWC that is as cheap as the diagonal approximation yet captures the dominant curvature direction. We pair this penalty with a replay-based approach to encourage parameter sharing across tasks while mitigating drift. On class-incremental image generation datasets (MNIST, FashionMNIST, CIFAR-10, ImageNet-1k), our method consistently improves average FID and reduces forgetting relative to replay-only and diagonal-EWC baselines. In particular, forgetting is nearly eliminated on MNIST and FashionMNIST and is roughly halved on ImageNet-1k. These results suggest that diffusion models admit an approximately rank-1 Fisher. With a better Fisher estimate, EWC becomes a strong complement to replay: replay encourages parameter sharing across tasks, while EWC effectively constrains replay-induced drift.

[1220] Characteristic Root Analysis and Regularization for Linear Time Series Forecasting

Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li, Tobias Schlagenhauf

Main category: cs.LG

TL;DR: This paper systematically studies linear models for time series forecasting, focusing on characteristic roots’ role in temporal dynamics. It analyzes noise-free and noisy regimes, identifies data-scaling challenges, and proposes two robust root restructuring strategies that achieve state-of-the-art results.

Details

Motivation: Recent studies show simple linear models are surprisingly competitive in time series forecasting, warranting deeper theoretical investigation into their robustness and interpretability compared to complex models.

Method: Systematic analysis of linear models focusing on characteristic roots, with two proposed strategies: (1) rank reduction techniques (Reduced-Rank Regression, Direct Weight Rank Reduction) to recover latent dynamics, and (2) Root Purge - a novel adaptive method that learns noise-suppressing null space during training.

Result: Extensive experiments on standard benchmarks demonstrate both approaches’ effectiveness, validating theoretical insights and achieving state-of-the-art results in several settings.

Conclusion: Integrating classical linear systems theories with modern learning techniques can build robust, interpretable, and data-efficient forecasting models.

Abstract: Time series forecasting remains a critical challenge across numerous domains, yet the effectiveness of complex models often varies unpredictably across datasets. Recent studies highlight the surprising competitiveness of simple linear models, suggesting that their robustness and interpretability warrant deeper theoretical investigation. This paper presents a systematic study of linear models for time series forecasting, with a focus on the role of characteristic roots in temporal dynamics. We begin by analyzing the noise-free setting, where we show that characteristic roots govern long-term behavior and explain how design choices such as instance normalization and channel independence affect model capabilities. We then extend our analysis to the noisy regime, revealing that models tend to produce spurious roots. This leads to the identification of a key data-scaling property: mitigating the influence of noise requires disproportionately large training data, highlighting the need for structural regularization. To address these challenges, we propose two complementary strategies for robust root restructuring. The first uses rank reduction techniques, including Reduced-Rank Regression and Direct Weight Rank Reduction, to recover the low-dimensional latent dynamics. The second, a novel adaptive method called Root Purge, encourages the model to learn a noise-suppressing null space during training. Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings. Our findings underscore the potential of integrating classical theories for linear systems with modern learning techniques to build robust, interpretable, and data-efficient forecasting models.

[1221] GraphIFE: Rethinking Graph Imbalance Node Classification via Invariant Learning

Fanlong Zeng, Wensheng Gan, Philip S. Yu

Main category: cs.LG

TL;DR: GraphIFE is a novel framework that addresses class imbalance in graph data by mitigating quality inconsistency in synthesized nodes through graph invariant learning and enhanced embedding space representation.

Details

Motivation: Class imbalance in graph-structured data causes biased learning and degraded performance on minority classes, as most GNNs assume balanced distributions and fail to handle quality inconsistency in synthesized nodes.

Method: Proposes GraphIFE framework that incorporates graph invariant learning concepts and strategies to strengthen embedding space representation, enabling better identification of invariant features.

Result: Extensive experiments show GraphIFE consistently outperforms various baselines across multiple datasets, demonstrating efficiency and robust generalization.

Conclusion: GraphIFE effectively mitigates quality inconsistency in synthesized nodes for imbalanced graph learning, providing a robust solution that generalizes well across different datasets.

Abstract: The class imbalance problem refers to the disproportionate distribution of samples across different classes within a dataset, where the minority classes are significantly underrepresented. This issue is also prevalent in graph-structured data. Most graph neural networks (GNNs) implicitly assume a balanced class distribution and therefore often fail to account for the challenges introduced by class imbalance, which can lead to biased learning and degraded performance on minority classes. We identify a quality inconsistency problem in synthesized nodes, which leads to suboptimal performance under graph imbalance conditions. To mitigate this issue, we propose GraphIFE (Graph Invariant Feature Extraction), a novel framework designed to mitigate quality inconsistency in synthesized nodes. Our approach incorporates two key concepts from graph invariant learning and introduces strategies to strengthen the embedding space representation, thereby enhancing the model’s ability to identify invariant features. Extensive experiments demonstrate the framework’s efficiency and robust generalization, as GraphIFE consistently outperforms various baselines across multiple datasets. The code is publicly available at https://github.com/flzeng1/GraphIFE.

[1222] DRIK: Distribution-Robust Inductive Kriging without Information Leakage

Chen Yang, Changhao Zhao, Chen Wang, Jiansheng Fan

Main category: cs.LG

TL;DR: The paper identifies information leakage issues in conventional inductive kriging evaluation setups and proposes a 3x3 partition method to eliminate leakage. It then introduces DRIK, a distribution-robust inductive kriging approach with three-tier strategy at node, edge, and subgraph levels to enhance out-of-distribution generalization.

Details

Motivation: Conventional training-evaluation setups for inductive kriging suffer from information leakage and poor out-of-distribution generalization, where test data can influence model selection through early stopping, obscuring true OOD characteristics.

Method: Proposes a 3x3 partition to cleanly separate training, validation, and test sets, eliminating leakage. Introduces DRIK with three-tier strategy: perturbing node coordinates to capture spatial relationships, dropping edges to reduce ambiguity and increase topological diversity, and adding pseudo-labeled subgraphs to strengthen domain generalization.

Result: Experiments on six diverse spatio-temporal datasets show DRIK consistently outperforms existing methods, achieving up to 12.48% lower MAE while maintaining strong scalability.

Conclusion: The proposed 3x3 partition and DRIK approach effectively address information leakage and enhance out-of-distribution generalization in inductive kriging, demonstrating superior performance across diverse datasets.

Abstract: Inductive kriging supports high-resolution spatio-temporal estimation with sparse sensor networks, but conventional training-evaluation setups often suffer from information leakage and poor out-of-distribution (OOD) generalization. We find that the common 2x2 spatio-temporal split allows test data to influence model selection through early stopping, obscuring the true OOD characteristics of inductive kriging. To address this issue, we propose a 3x3 partition that cleanly separates training, validation, and test sets, eliminating leakage and better reflecting real-world applications. Building on this redefined setting, we introduce DRIK, a Distribution-Robust Inductive Kriging approach designed with the intrinsic properties of inductive kriging in mind to explicitly enhance OOD generalization, employing a three-tier strategy at the node, edge, and subgraph levels. DRIK perturbs node coordinates to capture continuous spatial relationships, drops edges to reduce ambiguity in information flow and increase topological diversity, and adds pseudo-labeled subgraphs to strengthen domain generalization. Experiments on six diverse spatio-temporal datasets show that DRIK consistently outperforms existing methods, achieving up to 12.48% lower MAE while maintaining strong scalability.

[1223] PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference

Enda Yu, Zhaoning Zhang, Dezun Dong, Yongwei Wu, Xiangke Liao

Main category: cs.LG

TL;DR: PreScope is a prediction-driven expert scheduling system that addresses memory and PCIe latency bottlenecks in Mixture-of-Experts models by using learnable prediction, cross-layer scheduling, and asynchronous I/O optimization.

Details

Motivation: Mixture-of-Experts models face memory and PCIe latency bottlenecks on commodity hardware, where offloading expert weights to CPU memory causes PCIe transfer latency that exceeds GPU computation by several folds.

Method: 1) Learnable Layer-Aware Predictor (LLaPor) for capturing layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched) for generating globally optimal plans; 3) Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation.

Result: PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.

Conclusion: PreScope effectively addresses the key challenges of inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity in MoE model deployment.

Abstract: Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware. Offloading expert weights to CPU memory results in PCIe transfer latency that exceeds GPU computation by several folds. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity. Our solution includes: 1) Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans balancing prefetching costs and loading overhead; 3) Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.

[1224] Virtual Nodes based Heterogeneous Graph Convolutional Neural Network for Efficient Long-Range Information Aggregation

Ranhui Yan, Jia cai

Main category: cs.LG

TL;DR: VN-HGCN introduces virtual nodes to enhance information flow in heterogeneous graphs, enabling efficient long-range information aggregation with only 4 layers while avoiding over-smoothing.

Details

Motivation: Existing heterogeneous graph neural networks struggle to capture long-range information efficiently, requiring many layers that lead to high computational complexity and over-smoothing issues.

Method: Proposes a Virtual Nodes based Heterogeneous Graph Convolutional Network (VN-HGCN) that uses auxiliary virtual nodes connected to all nodes of specific types, facilitating efficient long-range information aggregation across different node and edge types.

Result: VN-HGCN achieves effective information aggregation with only 4 layers and can be applied as a versatile framework to other HGNN models. Empirical evaluations show superiority over state-of-the-art baselines on three real-world datasets.

Conclusion: Virtual nodes provide an effective mechanism for enhancing information flow in heterogeneous graphs, enabling efficient long-range dependency learning with minimal layers while maintaining performance superiority.

Abstract: Heterogeneous Graph Neural Networks (HGNNs) have exhibited powerful performance in heterogeneous graph learning by aggregating information from various types of nodes and edges. However, existing heterogeneous graph models often struggle to capture long-range information or necessitate stacking numerous layers to learn such dependencies, resulting in high computational complexity and encountering over-smoothing issues. In this paper, we propose a Virtual Nodes based Heterogeneous Graph Convolutional Network (VN-HGCN), which leverages virtual nodes to facilitate enhanced information flow within the graph. Virtual nodes are auxiliary nodes interconnected with all nodes of a specific type in the graph, facilitating efficient aggregation of long-range information across different types of nodes and edges. By incorporating virtual nodes into the graph structure, VN-HGCN achieves effective information aggregation with only $4$ layers. Additionally, we demonstrate that VN-HGCN can serve as a versatile framework that can be seamlessly applied to other HGNN models, showcasing its generalizability. Empirical evaluations validate the effectiveness of VN-HGCN, and extensive experiments conducted on three real-world heterogeneous graph datasets demonstrate the superiority of our model over several state-of-the-art baselines.

[1225] Pure Node Selection for Imbalanced Graph Node Classification

Fanlong Zeng, Wensheng Gan, Jiayang Wu, Philip S. Yu

Main category: cs.LG

TL;DR: PNS (Pure Node Sampling) is a plug-and-play module that addresses the Randomness Anomalous Connectivity Problem (RACP) in graph neural networks by operating during node synthesis to eliminate random seed effects and improve performance on imbalanced graph data.

Details

Motivation: Class imbalance in graph-structured data is often overlooked by GNNs, which assume class balance. The authors identified RACP - where random seeds cause significant performance degradation in off-the-shelf models, highlighting the need to eliminate random factor influences.

Method: Proposed PNS (Pure Node Sampling) as a novel plug-and-play module that operates directly during node synthesis to mitigate RACP. Unlike specialized algorithms for quantity or topological imbalance, PNS addresses the randomness problem at the node synthesis stage and also alleviates performance degradation from abnormal neighbor distribution.

Result: Experimental results show PNS effectively eliminates the effect of unfavorable random seeds and outperforms baselines across various benchmark datasets with different GNN backbones. The method demonstrates both effectiveness and stability.

Conclusion: PNS successfully addresses the RACP problem in graph neural networks, providing a stable solution that mitigates random seed effects while improving overall performance on imbalanced graph datasets.

Abstract: The problem of class imbalance refers to an uneven distribution of quantity among classes in a dataset, where some classes are significantly underrepresented compared to others. Class imbalance is also prevalent in graph-structured data. Graph neural networks (GNNs) are typically based on the assumption of class balance, often overlooking the issue of class imbalance. In our investigation, we identified a problem, which we term the Randomness Anomalous Connectivity Problem (RACP), where certain off-the-shelf models are affected by random seeds, leading to a significant performance degradation. To eliminate the influence of random factors in algorithms, we proposed PNS (Pure Node Sampling) to address the RACP in the node synthesis stage. Unlike existing approaches that design specialized algorithms to handle either quantity imbalance or topological imbalance, PNS is a novel plug-and-play module that operates directly during node synthesis to mitigate RACP. Moreover, PNS also alleviates performance degradation caused by abnormal distribution of node neighbors. We conduct a series of experiments to identify what factors are influenced by random seeds. Experimental results demonstrate the effectiveness and stability of our method, which not only eliminates the effect of unfavorable random seeds but also outperforms the baseline across various benchmark datasets with different GNN backbones. Data and code are available at https://github.com/flzeng1/PNS.

[1226] Calibration Meets Reality: Making Machine Learning Predictions Trustworthy

Kristina P. Sinaga, Arjun S. Nair

Main category: cs.LG

TL;DR: Theoretical analysis of post-hoc calibration methods (Platt scaling and isotonic regression) with convergence guarantees, computational bounds, and investigation of how feature informativeness affects calibration performance.

Details

Motivation: Lack of comprehensive theoretical understanding of post-hoc calibration methods, particularly regarding performance across different datasets and model architectures, and the uninvestigated interplay between feature quality and calibration performance.

Method: Rigorous theoretical analysis with convergence guarantees and computational complexity bounds, controlled synthetic experiments to explore feature informativeness impact, and empirical evaluation across diverse real-world datasets and model architectures.

Result: Derived convergence guarantees and computational bounds for Platt scaling and isotonic regression, demonstrated consistent improvements in calibration metrics across various scenarios, and provided insights into calibration performance under varying feature conditions.

Conclusion: Provides practical guidelines for selecting calibration methods based on dataset characteristics and computational constraints, bridging the gap between theoretical understanding and practical implementation in uncertainty quantification.

Abstract: Post-hoc calibration methods are widely used to improve the reliability of probabilistic predictions from machine learning models. Despite their prevalence, a comprehensive theoretical understanding of these methods remains elusive, particularly regarding their performance across different datasets and model architectures. Input features play a crucial role in shaping model predictions and, consequently, their calibration. However, the interplay between feature quality and calibration performance has not been thoroughly investigated. In this work, we present a rigorous theoretical analysis of post-hoc calibration methods, focusing on Platt scaling and isotonic regression. We derive convergence guarantees, computational complexity bounds, and finite-sample performance metrics for these methods. Furthermore, we explore the impact of feature informativeness on calibration performance through controlled synthetic experiments. Our empirical evaluation spans a diverse set of real-world datasets and model architectures, demonstrating consistent improvements in calibration metrics across various scenarios. By examining calibration performance under varying feature conditions utilizing only informative features versus complete feature spaces including noise dimensions, we provide fundamental insights into the robustness and reliability of different calibration approaches. Our findings offer practical guidelines for selecting appropriate calibration methods based on dataset characteristics and computational constraints, bridging the gap between theoretical understanding and practical implementation in uncertainty quantification. Code and experimental data are available at: https://github.com/Ajwebdevs/calibration-analysis-experiments.

[1227] Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability

Divya Jyoti Bajpai, Manjesh Kumar Hanawal

Main category: cs.LG

TL;DR: UAT proposes an adaptive threshold method for early-exit DNNs using Multi-Armed Bandit framework to dynamically adjust exit decisions online, improving computational efficiency while maintaining accuracy.

Details

Motivation: Existing early-exit strategies use static thresholds that can be problematic due to overconfidence in wrong predictions and lack robustness to distribution shifts, undermining model trustworthiness.

Method: UAT adapts exit thresholds using Multi-Armed Bandit framework with a new reward function that balances predictive certainty, reliability, and computational efficiency while penalizing unnecessary late exits.

Result: UAT achieves consistent speedup (1.70-2.10x) with minimal performance drop (<2%) compared to full model performance across vision-language understanding, text generation, and classification tasks.

Conclusion: The proposed UAT framework enables online, unsupervised adaptation of exit decisions, providing guarantees on risk and demonstrating improved computational efficiency while maintaining prediction quality.

Abstract: Early-Exit Deep Neural Networks enable adaptive inference by allowing prediction at intermediary layers, significantly reducing computational costs and latency. Most of the early exit strategies greedily exit a sample at an intermediary layer if the confidence in class prediction exceeds a predefined threshold that is set using a static validation set. This is problematic as the model might be overconfident in a wrong class. Also, they are not robust to distribution shifts encountered in deployment, which can undermine model trustworthiness and accuracy. To address these challenges, we propose UAT that adapts the threshold for exit decisions using a Multi-Armed Bandit framework, enabling online, unsupervised adjustment of exit decisions. UAT makes decisions based on a new reward function that assesses predictive certainty and its reliability to balance computational efficiency and prediction quality while penalizing unnecessary late exits. We provide guarantees on risk achieved by UAT and validate its performance on diverse tasks spanning vision-language understanding, text generation, and classification. Our framework demonstrates consistent improvements in speedup (1.70-2.10x) with a minimal performance drop (<2%) as compared to full model performance. Our source code is available at https://github.com/Div290/UAT.

[1228] Why Alignment Must Precede Distillation: A Minimal Working Explanation

Sungmin Cha, Kyunghyun Cho

Main category: cs.LG

TL;DR: Standard KD -> Align workflow limits alignment of rare desirable behaviors due to low distributional recall. Reversing to Align -> KD pipeline enables better alignment by first aligning on high-recall reference before distillation.

Details

Motivation: Current practice of performing preference alignment on knowledge-distilled models overlooks the importance of the reference model's distributional recall, which limits alignment of rare but desirable behaviors.

Method: Propose Align -> KD pipeline where alignment is first performed on high-recall reference model before distillation. Validate through theoretical explanation, Mixture-of-Gaussians experiment, and LLM alignment with SmolLM2 family.

Result: Models aligned after KD fail to effectively align target behaviors with substantially lower reward and target precision. Align -> KD pipeline robustly aligns behaviors with superior target-oriented metrics and lower variance.

Conclusion: Reference-model recall is a first-order design choice in alignment. Alignment must precede distillation to effectively capture rare desirable behaviors.

Abstract: For efficiency, preference alignment is often performed on compact, knowledge-distilled (KD) models. We argue this common practice introduces a significant limitation by overlooking a key property of the alignment’s reference model: its distributional recall. We show that the standard KD -> Align workflow diminishes the model’s capacity to align rare yet desirable behaviors, even under strong preference signals. We instead demonstrate that reversing the pipeline (i.e., Align -> KD) is essential: alignment must first be performed on a high-recall reference before distillation. Our contributions are threefold. First, we provide a minimal working explanation of how the reference model constrains preference alignment objectives at a fundamental level. Second, we validate this theory in a controllable Mixture-of-Gaussians experiment, where low-recall anchoring consistently results in suboptimal model performance. Finally, we demonstrate that the same phenomenon holds in LLM alignment with the SmolLM2 family: models aligned after KD fail to effectively align target behaviors, resulting in substantially lower reward and target precision. In contrast, our proposed Align -> KD pipeline robustly aligns these behaviors, yielding models with superior target-oriented metrics and lower variance. Together, these results establish reference-model recall as a first-order design choice in alignment, offering a clear principle: alignment must precede distillation.

[1229] Multi-Scale Spatial-Temporal Hypergraph Network with Lead-Lag Structures for Stock Time Series Forecasting

Xiangfei Qiu, Liu Yang, Hanyin Cheng, Xingjian Wu, Rongjia Wu, Zhigang Zhang, Ding Tu, Chenjuan Guo, Bin Yang, Christian S. Jensen, Jilin Hu

Main category: cs.LG

TL;DR: Hermes framework improves stock time series forecasting by better capturing industry correlations through hyperedge-based moving aggregation for lead-lag relationships and multi-scale fusion with cross-scale message passing.

Details

Motivation: Stock time series exhibit industry correlations that existing hypergraph methods capture only superficially, lacking consideration of inter-industry lead-lag interactions and multi-scale information within and among industries.

Method: Proposes Hermes framework with hyperedge-based moving aggregation module using sliding window and dynamic temporal aggregation for lead-lag relationships, and cross-scale edge-to-edge message passing for multi-scale information integration.

Result: Experimental results on multiple real-world stock datasets show Hermes outperforms state-of-the-art methods in both efficiency and accuracy.

Conclusion: Hermes effectively addresses limitations of existing methods by better exploiting industry correlations through improved lead-lag relationship modeling and multi-scale information integration.

Abstract: Time series forecasting occurs in a range of financial applications providing essential decision-making support to investors, regulatory institutions, and analysts. Unlike multivariate time series from other domains, stock time series exhibit industry correlation. Exploiting this kind of correlation can improve forecasting accuracy. However, existing methods based on hypergraphs can only capture industry correlation relatively superficially. These methods face two key limitations: they do not fully consider inter-industry lead-lag interactions, and they do not model multi-scale information within and among industries. This study proposes the Hermes framework for stock time series forecasting that aims to improve the exploitation of industry correlation by eliminating these limitations. The framework integrates moving aggregation and multi-scale fusion modules in a hypergraph network. Specifically, to more flexibly capture the lead-lag relationships among industries, Hermes proposes a hyperedge-based moving aggregation module. This module incorporates a sliding window and utilizes dynamic temporal aggregation operations to consider lead-lag dependencies among industries. Additionally, to effectively model multi-scale information, Hermes employs cross-scale, edge-to-edge message passing to integrate information from different scales while maintaining the consistency of each scale. Experimental results on multiple real-world stock datasets show that Hermes outperforms existing state-of-the-art methods in both efficiency and accuracy.

[1230] Graph Neural Networks with Diversity-aware Neighbor Selection and Dynamic Multi-scale Fusion for Multivariate Time Series Forecasting

Jingqi Xu, Guibin Chen, Jingxi Lu, Yuzhang Lin

Main category: cs.LG

TL;DR: DIMIGNN is a GNN-based method for multivariate time series forecasting that addresses redundant information aggregation and single-scale dependency through diversity-aware neighbor selection and dynamic multi-scale fusion.

Details

Motivation: Existing GNN-based methods for MTS forecasting often overlook neighbor diversity, leading to redundant information aggregation, and rely solely on single temporal scale representations for final predictions.

Method: Proposes DIMIGNN with two key components: Diversity-aware Neighbor Selection Mechanism (DNSM) to ensure high informational similarity with neighbors while maintaining diversity, and Dynamic Multi-Scale Fusion Module (DMFM) to dynamically adjust contributions from different temporal scales.

Result: Extensive experiments on real-world datasets demonstrate that DIMIGNN consistently outperforms prior methods in multivariate time series forecasting.

Conclusion: DIMIGNN effectively addresses the limitations of existing GNN-based MTS forecasting methods by incorporating diversity-aware neighbor selection and dynamic multi-scale fusion, achieving superior performance.

Abstract: Recently, numerous deep models have been proposed to enhance the performance of multivariate time series (MTS) forecasting. Among them, Graph Neural Networks (GNNs)-based methods have shown great potential due to their capability to explicitly model inter-variable dependencies. However, these methods often overlook the diversity of information among neighbors, which may lead to redundant information aggregation. In addition, their final prediction typically relies solely on the representation from a single temporal scale. To tackle these issues, we propose a Graph Neural Networks (GNNs) with Diversity-aware Neighbor Selection and Dynamic Multi-scale Fusion (DIMIGNN). DIMIGNN introduces a Diversity-aware Neighbor Selection Mechanism (DNSM) to ensure that each variable shares high informational similarity with its neighbors while maintaining diversity among neighbors themselves. Furthermore, a Dynamic Multi-Scale Fusion Module (DMFM) is introduced to dynamically adjust the contributions of prediction results from different temporal scales to the final forecasting result. Extensive experiments on real-world datasets demonstrate that DIMIGNN consistently outperforms prior methods.

[1231] Towards a Comprehensive Scaling Law of Mixture-of-Experts

Guoliang Zhao, Yuhan Fu, Shuaipeng Li, Xingwu Sun, Ruobing Xie, An Wang, Weidong Han, Zhen Yang, Weixuan Sun, Yudong Zhang, Cheng-zhong Xu, Di Wang, Jie Jiang

Main category: cs.LG

TL;DR: The paper develops comprehensive scaling laws specifically for Mixture-of-Experts (MoE) models, identifying five key factors and their optimal configurations through systematic experiments.

Details

Motivation: Existing scaling laws for dense models don't apply to MoE models due to multiple influencing factors, intricate coupling relationships, and non-monotonic performance impacts, requiring MoE-specific scaling analysis.

Method: Systematic decomposition of MoE settings into 5 key factors (data size, total model size, activated model size, number of active experts, shared expert ratio) with 446 controlled experiments to characterize marginal effects and construct joint scaling laws.

Result: Optimal settings for number of active experts (G) and shared expert ratio (S) are independent of model architecture and data size. Optimal activation parameter ratio (Na/N) becomes sparser as total model size (N) scales up.

Conclusion: The proposed MoE scaling law provides accurate and insightful guidance for future MoE model design and training, addressing the unique challenges of MoE scaling that differ from dense models.

Abstract: Mixture-of-Experts (MoE) models have become the consensus approach for enabling parameter-efficient scaling and cost-effective deployment in large language models. However, existing scaling laws for dense models are inapplicable to MoE models, which stems from three critical challenges: the multiplicity of influencing factors, their intricate coupling relationships and the non-monotonic nature of their performance impacts. They collectively necessitate a fine-grained investigation into MoE-specific scaling laws. In this work, we perform a systematic decomposition of MoE settings, identifying five key factors that influence model performance from both size and structural perspectives (data size ($D$), total model size ($N$), activated model size ($N_a$), number of active experts ($G$) and the ratio of shared experts ($S$)). Specifically, we design $446$ controlled experiments to characterize their marginal effects, ultimately constructing a comprehensive and precise joint MoE scaling law that considers all essential factors. Furthermore, we derive the theoretically optimal and practically efficiency-aware optimal configurations for $G$, $S$ and $N_a/N$ with detailed analyses. Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and data size. With the scaling of $N$, the optimal activation parameter ratio of $N_a/N$ becomes sparser. Our proposed MoE scaling law could function as an accurate and insightful guidance to facilitate future MoE model design and training.

[1232] Decentralized Dynamic Cooperation of Personalized Models for Federated Continual Learning

Danni Yang, Zhikang Chen, Sen Cui, Mengyue Yang, Ding Li, Abudukelimu Wuerkaixi, Haoxuan Li, Jinke Ren, Mingming Gong

Main category: cs.LG

TL;DR: A decentralized dynamic cooperation framework for federated continual learning that forms dynamic coalitions between clients to address catastrophic forgetting in heterogeneous environments.

Details

Motivation: Existing federated continual learning approaches face challenges with catastrophic forgetting due to temporal and cross-client shifts, and global model aggregation may introduce interference in heterogeneous scenarios.

Method: Proposes decentralized dynamic cooperation where clients form non-overlapping coalitions based on selective cooperation, using coalitional affinity games to model relationships and assessing gradient coherence and model similarity to quantify cooperation benefits.

Result: Comprehensive experiments demonstrate superiority over various baselines in handling catastrophic forgetting and improving model performance in federated continual learning scenarios.

Conclusion: The proposed framework effectively balances new knowledge acquisition with prior learning retention, achieving personalized models through dynamic coalitions that adapt to changing tasks in federated environments.

Abstract: Federated continual learning (FCL) has garnered increasing attention for its ability to support distributed computation in environments with evolving data distributions. However, the emergence of new tasks introduces both temporal and cross-client shifts, making catastrophic forgetting a critical challenge. Most existing works aggregate knowledge from clients into a global model, which may not enhance client performance since irrelevant knowledge could introduce interference, especially in heterogeneous scenarios. Additionally, directly applying decentralized approaches to FCL suffers from ineffective group formation caused by task changes. To address these challenges, we propose a decentralized dynamic cooperation framework for FCL, where clients establish dynamic cooperative learning coalitions to balance the acquisition of new knowledge and the retention of prior learning, thereby obtaining personalized models. To maximize model performance, each client engages in selective cooperation, dynamically allying with others who offer meaningful performance gains. This results in non-overlapping, variable coalitions at each stage of the task. Moreover, we use coalitional affinity game to simulate coalition relationships between clients. By assessing both client gradient coherence and model similarity, we quantify the client benefits derived from cooperation. We also propose a merge-blocking algorithm and a dynamic cooperative evolution algorithm to achieve cooperative and dynamic equilibrium. Comprehensive experiments demonstrate the superiority of our method compared to various baselines. Code is available at: https://github.com/ydn3229/DCFCL.

[1233] Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs

Tanya Chowdhury, Atharva Nijasure, Yair Zick, James Allan

Main category: cs.LG

TL;DR: A mechanistic interpretability framework using coalitional game theory to identify stable groups of neurons that work synergistically in LLM MLP layers, revealing higher-order structure beyond individual neuron analysis.

Details

Motivation: To understand how neurons in LLM MLP layers work together rather than in isolation, as prior research shows statistical priors may strengthen, split, or vanish across depth, and empirical inspection reveals new features concentrate in mid-layer MLPs.

Method: Introduced a framework based on coalitional game theory where neurons are agents in a hedonic game, using top-responsive utilities and PAC-Top-Cover algorithm to extract stable coalitions of neurons with non-additive joint ablation effects.

Result: Applied to LLaMA, Mistral, and Pythia rerankers fine-tuned on scalar IR tasks, the method found coalitions with consistently higher synergy than clustering baselines, revealing how neurons cooperate to encode features.

Conclusion: Hedonic coalitions uncover higher-order structure beyond disentanglement and yield computational units that are functionally important, interpretable, and predictive across domains.

Abstract: Fine-tuned Large Language Models (LLMs) encode rich task-specific features, but the form of these representations, especially within MLP layers, remains unclear. Empirical inspection of LoRA updates shows that new features concentrate in mid-layer MLPs, yet the scale of these layers obscures meaningful structure. Prior probing suggests that statistical priors may strengthen, split, or vanish across depth, motivating the need to study how neurons work together rather than in isolation. We introduce a mechanistic interpretability framework based on coalitional game theory, where neurons mimic agents in a hedonic game whose preferences capture their synergistic contributions to layer-local computations. Using top-responsive utilities and the PAC-Top-Cover algorithm, we extract stable coalitions of neurons: groups whose joint ablation has non-additive effects. We then track their transitions across layers as persistence, splitting, merging, or disappearance. Applied to LLaMA, Mistral, and Pythia rerankers fine-tuned on scalar IR tasks, our method finds coalitions with consistently higher synergy than clustering baselines. By revealing how neurons cooperate to encode features, hedonic coalitions uncover higher-order structure beyond disentanglement and yield computational units that are functionally important, interpretable, and predictive across domains.

[1234] FedDAPL: Toward Client-Private Generalization in Federated Learning

Soroosh Safari Loaliyan, Jose-Luis Ambite, Paul M. Thompson, Neda Jahanshad, Greg Ver Steeg

Main category: cs.LG

TL;DR: This paper integrates Domain-Adversarial Neural Network (DANN) with Federated Learning to address scanner-induced domain shift in medical imaging while maintaining privacy constraints.

Details

Motivation: Federated Learning is suitable for medical imaging due to privacy laws, but scanner-induced domain shift causes models to fail on external sites. Existing harmonization methods violate FL privacy constraints by requiring data comparison across sites.

Method: Proposed a federated DANN approach with proximal regularization to stabilize adversarial training among clients, addressing convergence issues of naive federated DANN.

Result: Experiments on T1-weighted 3-D brain MRIs from OpenBHB dataset showed superior cross-site generalization over FedAvg and ERM when training on 15 sites and testing on 19 unseen sites, while preserving data privacy.

Conclusion: The proposed federated DANN with proximal regularization effectively addresses domain shift in medical imaging under FL constraints, achieving better generalization than baseline methods without compromising privacy.

Abstract: Federated Learning (FL) trains models locally at each research center or clinic and aggregates only model updates, making it a natural fit for medical imaging, where strict privacy laws forbid raw data sharing. A major obstacle is scanner-induced domain shift: non-biological variations in hardware or acquisition protocols can cause models to fail on external sites. Most harmonization methods correct this shift by directly comparing data across sites, conflicting with FL’s privacy constraints. Domain Generalization (DG) offers a privacy-friendly alternative - learning site-invariant representations without sharing raw data - but standard DG pipelines still assume centralized access to multi-site data, again violating FL’s guarantees. This paper meets these difficulties with a straightforward integration of a Domain-Adversarial Neural Network (DANN) within the FL process. After demonstrating that a naive federated DANN fails to converge, we propose a proximal regularization method that stabilizes adversarial training among clients. Experiments on T1-weighted 3-D brain MRIs from the OpenBHB dataset, performing brain-age prediction on participants aged 6-64 y (mean 22+/-6 y; 45 percent male) in training and 6-79 y (mean 19+/-13 y; 55 percent male) in validation, show that training on 15 sites and testing on 19 unseen sites yields superior cross-site generalization over FedAvg and ERM while preserving data privacy.

[1235] Merge Now, Regret Later: The Hidden Cost of Model Merging is Adversarial Transferability

Ankit Gangwal, Aaryan Ajay Sharma

Main category: cs.LG

TL;DR: Model Merging (MM) does not provide reliable defense against transfer attacks, with over 95% relative transfer attack success rate. Stronger MM methods actually increase vulnerability to transfer attacks.

Details

Motivation: To study the impact of Model Merging on the transferability of adversarial examples, challenging the prevailing notion that MM confers free adversarial robustness.

Method: Comprehensive evaluations and statistical analysis using 8 MM methods, 7 datasets, and 6 attack methods across 336 distinct attack settings.

Result: MM cannot reliably defend against transfer attacks, with over 95% relative transfer attack success rate. Stronger MM methods increase vulnerability, mitigating representation bias increases vulnerability, and weight averaging is the most vulnerable MM method.

Conclusion: Model Merging increases vulnerability to transfer attacks, and practitioners need to consider these security implications when designing systems using MM.

Abstract: Model Merging (MM) has emerged as a promising alternative to multi-task learning, where multiple fine-tuned models are combined, without access to tasks’ training data, into a single model that maintains performance across tasks. Recent works have explored the impact of MM on adversarial attacks, particularly backdoor attacks. However, none of them have sufficiently explored its impact on transfer attacks using adversarial examples, i.e., a black-box adversarial attack where examples generated for a surrogate model successfully mislead a target model. In this work, we study the effect of MM on the transferability of adversarial examples. We perform comprehensive evaluations and statistical analysis consisting of 8 MM methods, 7 datasets, and 6 attack methods, sweeping over 336 distinct attack settings. Through it, we first challenge the prevailing notion of MM conferring free adversarial robustness, and show MM cannot reliably defend against transfer attacks, with over 95% relative transfer attack success rate. Moreover, we reveal 3 key insights for machine-learning practitioners regarding MM and transferability for a robust system design: (1) stronger MM methods increase vulnerability to transfer attacks; (2) mitigating representation bias increases vulnerability to transfer attacks; and (3) weight averaging, despite being the weakest MM method, is the most vulnerable MM method to transfer attacks. Finally, we analyze the underlying reasons for this increased vulnerability, and provide potential solutions to the problem. Our findings offer critical insights for designing more secure systems employing MM.

[1236] Estimating Time Series Foundation Model Transferability via In-Context Learning

Qingren Yao, Ming Jin, Chengqi Zhang, Chao-Han Huck Yang, Jun Qi, Shirui Pan

Main category: cs.LG

TL;DR: TimeTic is a transferability estimation framework that uses in-context learning to predict how time series foundation models will perform after fine-tuning on target datasets, achieving strong correlation with actual performance.

Details

Motivation: With the growing number of time series foundation models, efficiently identifying the best model for downstream fine-tuning becomes challenging, especially in domains with limited public data.

Method: TimeTic recasts model selection as an in-context-learning problem, organizing observed model-data relationships as contextual information using tabular foundation models. It introduces a novel model characterization based on entropy evolution across model layers to capture embedding-space distinctions.

Result: On a comprehensive benchmark with 10 datasets, 10 foundation models, and 3 forecasting tasks, TimeTic achieves a mean rank correlation of approximately 0.6 with actual fine-tuned performance, representing a 30% improvement over using zero-shot performance as transferability score.

Conclusion: TimeTic provides an effective framework for transferability estimation that can generalize across arbitrary model sets and adapt to various test-time scenarios, enabling better model selection for time series forecasting tasks.

Abstract: Time series foundation models (TSFMs) offer strong zero-shot forecasting via large-scale pre-training, yet fine-tuning remains critical for boosting performance in domains with limited public data. With the growing number of TSFMs, efficiently identifying the best model for downstream fine-tuning becomes increasingly challenging. In this work, we introduce TimeTic, a transferability estimation framework that recasts model selection as an in-context-learning problem: given observations on known (source) datasets, it predicts how a TSFM will perform after fine-tuning on a downstream (target) dataset. TimeTic flexibly organizes the observed model-data relationships as contextual information, allowing it to adapt seamlessly to various test-time scenarios. Leveraging the natural tabular structure formed by dataset meta-features, model characteristics, and fine-tuned performance, we employ tabular foundation models to serve as in-context learners. We further introduce a novel model characterization based on entropy evolution across model layers, capturing embedding-space distinctions and enabling TimeTic to generalize across arbitrary model sets. We establish a comprehensive benchmark for transferability estimation including 10 datasets, 10 foundation models, and 3 forecasting tasks. On this benchmark, TimeTic’s estimation demonstrates strong alignment with actual fine-tuned performance for previously unseen datasets, achieving a mean rank correlation of approximately 0.6 and a 30% improvement compared to using zero-shot performance as the transferability score.

[1237] Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization

Ziheng Cheng, Xin Guo, Yufei Zhang

Main category: cs.LG

TL;DR: Proposes CT-DDPG, a continuous-time deterministic policy gradient method that addresses sensitivity to time discretization in RL, offering improved stability and faster convergence.

Details

Motivation: Real-world RL applications are often continuous and complex, but discrete-time algorithms struggle with time discretization sensitivity, leading to poor stability and slow convergence in continuous-time settings.

Method: Derives a continuous-time policy gradient formula using an analogue of the advantage function and establishes its martingale characterization, leading to the CT-DDPG algorithm for stable deterministic policy learning.

Result: Numerical experiments show CT-DDPG offers improved stability and faster convergence compared to existing discrete-time and continuous-time methods across various control tasks with different time discretizations and noise levels.

Conclusion: The proposed continuous-time deterministic policy gradient framework provides a stable and efficient approach for RL in continuous environments, overcoming limitations of discrete-time methods.

Abstract: The theory of discrete-time reinforcement learning (RL) has advanced rapidly over the past decades. Although primarily designed for discrete environments, many real-world RL applications are inherently continuous and complex. A major challenge in extending discrete-time algorithms to continuous-time settings is their sensitivity to time discretization, often leading to poor stability and slow convergence. In this paper, we investigate deterministic policy gradient methods for continuous-time RL. We derive a continuous-time policy gradient formula based on an analogue of the advantage function and establish its martingale characterization. This theoretical foundation leads to our proposed algorithm, CT-DDPG, which enables stable learning with deterministic policies in continuous-time environments. Numerical experiments show that the proposed CT-DDPG algorithm offers improved stability and faster convergence compared to existing discrete-time and continuous-time methods, across a wide range of control tasks with varying time discretizations and noise levels.

[1238] FraudTransformer: Time-Aware GPT for Transaction Fraud Detection

Gholamali Aminian, Andrew Elliott, Tiger Li, Timothy Cheuk Hin Wong, Victor Claude Dehon, Lukasz Szpruch, Carsten Maple, Christopher Read, Martin Brown, Gesine Reinert, Mo Mamouei

Main category: cs.LG

TL;DR: FraudTransformer is a sequence model for payment fraud detection that enhances GPT-style architecture with time encoding and learned positional encoding, outperforming classical baselines and transformer ablations on industrial transaction data.

Details

Motivation: Real-world banking fraud detection requires models that can utilize both event order and irregular time gaps between transactions, which existing methods may not fully exploit.

Method: Augments GPT-style transformer with dedicated time encoder (for absolute timestamps or inter-event values) and learned positional encoder to preserve relative order of events.

Result: Outperforms four classical baselines (Logistic Regression, XGBoost, LightGBM) and transformer ablations without time or positional components, achieving highest AUROC and PRAUC on held-out test set.

Conclusion: FraudTransformer effectively combines time and positional information in sequence modeling for superior fraud detection performance in real-world banking streams.

Abstract: Detecting payment fraud in real-world banking streams requires models that can exploit both the order of events and the irregular time gaps between them. We introduce FraudTransformer, a sequence model that augments a vanilla GPT-style architecture with (i) a dedicated time encoder that embeds either absolute timestamps or inter-event values, and (ii) a learned positional encoder that preserves relative order. Experiments on a large industrial dataset – tens of millions of transactions and auxiliary events – show that FraudTransformer surpasses four strong classical baselines (Logistic Regression, XGBoost and LightGBM) as well as transformer ablations that omit either the time or positional component. On the held-out test set it delivers the highest AUROC and PRAUC.

[1239] A Self-Adaptive Frequency Domain Network for Continuous Intraoperative Hypotension Prediction

Xian Zeng, Tianze Xu, Kai Yang, Jie Sun, Youran Wang, Jun Xu, Mucheng Ren

Main category: cs.LG

TL;DR: SAFDNet is a novel deep learning model that combines frequency domain analysis with adaptive noise filtering and attention mechanisms to provide early warnings for intraoperative hypotension, achieving superior performance over existing methods.

Details

Motivation: Intraoperative hypotension is strongly associated with serious postoperative complications, but existing AI models have limitations in handling frequency domain information, temporal dependencies, and noise sensitivity in biosignal data.

Method: Proposed SAFDNet with adaptive spectral block using Fourier analysis for frequency-domain feature extraction and self-adaptive thresholding for noise reduction, plus interactive attention block to capture both long-term and short-term dependencies.

Result: Achieved up to 97.3% AUROC in IOH early warning on two large-scale real-world datasets, outperforming state-of-the-art models with robust predictive performance and low sensitivity to noise.

Conclusion: SAFDNet is well-suited for practical clinical applications due to its superior performance, noise robustness, and ability to effectively handle both time and frequency domain information for intraoperative hypotension prediction.

Abstract: Intraoperative hypotension (IOH) is strongly associated with postoperative complications, including postoperative delirium and increased mortality, making its early prediction crucial in perioperative care. While several artificial intelligence-based models have been developed to provide IOH warnings, existing methods face limitations in incorporating both time and frequency domain information, capturing short- and long-term dependencies, and handling noise sensitivity in biosignal data. To address these challenges, we propose a novel Self-Adaptive Frequency Domain Network (SAFDNet). Specifically, SAFDNet integrates an adaptive spectral block, which leverages Fourier analysis to extract frequency-domain features and employs self-adaptive thresholding to mitigate noise. Additionally, an interactive attention block is introduced to capture both long-term and short-term dependencies in the data. Extensive internal and external validations on two large-scale real-world datasets demonstrate that SAFDNet achieves up to 97.3% AUROC in IOH early warning, outperforming state-of-the-art models. Furthermore, SAFDNet exhibits robust predictive performance and low sensitivity to noise, making it well-suited for practical clinical applications.

[1240] GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Data

Yewang Chen, Junfeng Li, Shuyin Xia, Qinghong Lai, Xinbo Gao, Guoyin Wang, Dongdong Cheng, Yi Liu, Yi Wang

Main category: cs.LG

TL;DR: GBSK is a scalable skeleton clustering algorithm that uses granular-ball technique to efficiently cluster large datasets by approximating data structure through multi-grained granular-balls, achieving high efficiency on datasets up to 100 million instances.

Details

Motivation: To handle clustering tasks for large-scale datasets efficiently while maintaining accuracy, addressing the computational challenges of traditional clustering methods on massive data.

Method: Uses granular-ball technique with multi-sampling to construct multi-grained granular-balls that progressively uncover a statistical skeleton approximating the essential data structure and distribution. Also introduces adaptive version AGBSK with simplified parameters.

Result: Achieves dramatic reduction in computational overhead while maintaining high clustering accuracy. Successfully handles datasets with up to 100 million instances across 256 dimensions on standard computing hardware.

Conclusion: GBSK provides an efficient and scalable solution for large-scale clustering tasks, with the adaptive version enhancing usability for real-world deployment.

Abstract: To effectively handle clustering task for large-scale datasets, we propose a novel scalable skeleton clustering algorithm, namely GBSK, which leverages the granular-ball technique to capture the underlying structure of data. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical “skeleton” – a spatial abstraction that approximates the essential structure and distribution of the original data. This strategy enables GBSK to dramatically reduce computational overhead while maintaining high clustering accuracy. In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios. Extensive experiments conducted on standard computing hardware demonstrate that GBSK achieves high efficiency and strong clustering performance on large-scale datasets, including one with up to 100 million instances across 256 dimensions. Our implementation and experimental results are available at: https://github.com/XFastDataLab/GBSK/.

[1241] Time-Shifted Token Scheduling for Symbolic Music Generation

Ting-Kang Wang, Chih-Pin Tan, Yi-Hsuan Yang

Main category: cs.LG

TL;DR: A delay-based scheduling mechanism (DP) is proposed to address the trade-off between efficiency and quality in symbolic music generation by expanding compound-like tokens across decoding steps, enabling autoregressive modeling of intra-token dependencies without additional parameters.

Details

Motivation: Symbolic music generation faces a fundamental trade-off between efficiency and quality - fine-grained tokenizations achieve strong coherence but incur long sequences and high complexity, while compact tokenizations improve efficiency at the expense of intra-token dependencies.

Method: Adapt a delay-based scheduling mechanism (DP) that expands compound-like tokens across decoding steps, enabling autoregressive modeling of intra-token dependencies while preserving efficiency. DP is a lightweight strategy that introduces no additional parameters and can be seamlessly integrated into existing representations.

Result: Experiments on symbolic orchestral MIDI datasets show that the method improves all metrics over standard compound tokenizations and narrows the gap to fine-grained tokenizations.

Conclusion: The delay-based scheduling mechanism successfully addresses the efficiency-quality trade-off in symbolic music generation by enabling intra-token dependency modeling while maintaining computational efficiency.

Abstract: Symbolic music generation faces a fundamental trade-off between efficiency and quality. Fine-grained tokenizations achieve strong coherence but incur long sequences and high complexity, while compact tokenizations improve efficiency at the expense of intra-token dependencies. To address this, we adapt a delay-based scheduling mechanism (DP) that expands compound-like tokens across decoding steps, enabling autoregressive modeling of intra-token dependencies while preserving efficiency. Notably, DP is a lightweight strategy that introduces no additional parameters and can be seamlessly integrated into existing representations. Experiments on symbolic orchestral MIDI datasets show that our method improves all metrics over standard compound tokenizations and narrows the gap to fine-grained tokenizations.

[1242] An Investigation of Batch Normalization in Off-Policy Actor-Critic Algorithms

Li Wang, Sudun, Xingjian Zhang, Wenjun Wu, Lei Huang

Main category: cs.LG

TL;DR: Batch Normalization (BN) can be effectively used in deep reinforcement learning despite non-i.i.d. data and shifting distributions, with proposed MA-BN method improving training stability and performance.

Details

Motivation: BN has been underutilized in DRL due to non-i.i.d. data and dynamic distribution shifts, but it retains unique advantages like stochasticity and training ease that could benefit DRL.

Method: Conducted empirical study on BN in off-policy actor-critic algorithms, identified failure modes, and proposed Mode-Aware Batch Normalization (MA-BN) with practical recommendations.

Result: MA-BN accelerates and stabilizes training, broadens effective learning rate range, enhances exploration, and reduces optimization difficulty in RL settings.

Conclusion: BN can be effectively integrated into DRL pipelines with proper methods like MA-BN, overcoming previous limitations and improving training outcomes.

Abstract: Batch Normalization (BN) has played a pivotal role in the success of deep learning by improving training stability, mitigating overfitting, and enabling more effective optimization. However, its adoption in deep reinforcement learning (DRL) has been limited due to the inherent non-i.i.d. nature of data and the dynamically shifting distributions induced by the agent’s learning process. In this paper, we argue that, despite these challenges, BN retains unique advantages in DRL settings, particularly through its stochasticity and its ability to ease training. When applied appropriately, BN can adapt to evolving data distributions and enhance both convergence speed and final performance. To this end, we conduct a comprehensive empirical study on the use of BN in off-policy actor-critic algorithms, systematically analyzing how different training and evaluation modes impact performance. We further identify failure modes that lead to instability or divergence, analyze their underlying causes, and propose the Mode-Aware Batch Normalization (MA-BN) method with practical actionable recommendations for robust BN integration in DRL pipelines. We also empirically validate that, in RL settings, MA-BN accelerates and stabilizes training, broadens the effective learning rate range, enhances exploration, and reduces overall optimization difficulty. Our code is available at: https://github.com/monster476/ma-bn.git.

[1243] Anchored Supervised Fine-Tuning

He Zhu, Junyou Su, Peng Lai, Ren Ma, Wenjia Zhang, Linyi Yang, Guanhua Chen

Main category: cs.LG

TL;DR: Anchored Supervised Fine-Tuning (ASFT) improves upon Dynamic Fine-Tuning by adding KL regularization to prevent distributional drift, achieving better performance than both SFT and DFT across multiple domains with minimal computational overhead.

Details

Motivation: Address the trade-off between SFT's efficiency but memorization tendency and RL's better generalization but high computational cost, while fixing DFT's instability issues caused by distributional drift.

Method: Propose ASFT which augments DFT’s reweighting with lightweight KL regularization to preserve tightness while ensuring stability, analyzed through the reward-weighted regression framework.

Result: ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead.

Conclusion: The RWR framework provides a systematic understanding of post-training methods, showing that principled theoretical analysis leads to both stronger guarantees and practical gains.

Abstract: Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generalization at higher computational cost. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks. We provide a analysis of DFT through the reward-weighted regression (RWR) framework, revealing that it corresponds to a specific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT. However, our analysis also uncovers a critical limitation: this construction lacks distributional anchoring, leading to progressive drift that undermines training stability. To address this, we propose Anchored Supervised Fine-Tuning (ASFT), which augments DFT’s reweighting with lightweight KL regularization to preserve tightness while ensuring stability. Empirically, ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead. Our RWR framework provides a systematic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains.

[1244] SHAPoint: Task-Agnostic, Efficient, and Interpretable Point-Based Risk Scoring via Shapley Values

Tomer D. Meirman, Bracha Shapira, Noa Dagan, Lior S. Rokach

Main category: cs.LG

TL;DR: SHAPoint is a task-agnostic framework that combines gradient boosted trees’ predictive power with interpretable point-based risk scores for clinical decision support.

Details

Motivation: Traditional methods for interpretable risk scores rely on manual preprocessing, task-specific modeling, and simplified assumptions that limit flexibility and predictive power.

Method: Integrates gradient boosted trees with interpretable point-based risk scores, supporting classification, regression, and survival tasks while handling missing data and supporting monotonic constraints.

Result: SHAPoint produces compact interpretable scores with predictive performance comparable to state-of-the-art methods but at a fraction of the runtime.

Conclusion: SHAPoint is a powerful tool for transparent and scalable risk stratification with superior flexibility, reduced manual preprocessing, and faster runtime compared to existing frameworks.

Abstract: Interpretable risk scores play a vital role in clinical decision support, yet traditional methods for deriving such scores often rely on manual preprocessing, task-specific modeling, and simplified assumptions that limit their flexibility and predictive power. We present SHAPoint, a novel, task-agnostic framework that integrates the predictive accuracy of gradient boosted trees with the interpretability of point-based risk scores. SHAPoint supports classification, regression, and survival tasks, while also inheriting valuable properties from tree-based models, such as native handling of missing data and support for monotonic constraints. Compared to existing frameworks, SHAPoint offers superior flexibility, reduced reliance on manual preprocessing, and faster runtime performance. Empirical results show that SHAPoint produces compact and interpretable scores with predictive performance comparable to state-of-the-art methods, but at a fraction of the runtime, making it a powerful tool for transparent and scalable risk stratification.

[1245] Knowledge Homophily in Large Language Models

Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang

Main category: cs.LG

TL;DR: LLMs show knowledge homophily patterns where related entities have similar knowledge levels. A GNN model predicts entity knowledgeability scores to optimize active labeling and improve knowledge coverage.

Details

Motivation: To understand the structural organization of knowledge in LLMs and leverage cognitive neuroscience principles like semantic clustering to improve knowledge-intensive applications.

Method: Map LLM knowledge into graphs, analyze knowledge homophily, and develop a GNN regression model to predict entity-level knowledgeability scores for triplets.

Result: The approach enables prioritizing less-known triplets for labeling, improving knowledge coverage efficiency and enhancing multi-hop reasoning in question answering.

Conclusion: Knowledge homophily in LLMs can be leveraged through GNN-based prediction to optimize knowledge injection and improve reasoning performance in knowledge-intensive tasks.

Abstract: Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.

[1246] Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression

Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie

Main category: cs.LG

TL;DR: Theoretical analysis shows Mamba SSMs achieve in-context learning for linear regression through online gradient descent, with comparable performance to Transformers but different mechanisms.

Details

Motivation: While Mamba shows competitive in-context learning capabilities with Transformers, there's limited theoretical understanding of its mechanisms, especially for fundamental tasks like linear regression ICL.

Method: Developed novel techniques for non-convex optimization with gradient descent related to Mamba’s structure, analyzing training dynamics on linear regression ICL task.

Result: Established exponential convergence rate to ICL solution with comparable loss bound to Transformers, revealing Mamba performs online gradient descent to learn latent functions in context.

Conclusion: Mamba achieves in-context learning through online gradient descent, a different mechanism than Transformers’ gradient descent emulation, with theoretical results verified experimentally.

Abstract: State-space models (SSMs), particularly Mamba, emerge as an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate Mamba’s in-context learning (ICL) capabilities competitive with Transformers, a critical capacity for large foundation models. However, theoretical understanding of Mamba’s ICL remains limited, restricting deeper insights into its underlying mechanisms. Even fundamental tasks such as linear regression ICL, widely studied as a standard theoretical benchmark for Transformers, have not been thoroughly analyzed in the context of Mamba. To address this gap, we study the training dynamics of Mamba on the linear regression ICL task. By developing novel techniques tackling non-convex optimization with gradient descent related to Mamba’s structure, we establish an exponential convergence rate to ICL solution, and derive a loss bound that is comparable to Transformer’s. Importantly, our results reveal that Mamba can perform a variant of \textit{online gradient descent} to learn the latent function in context. This mechanism is different from that of Transformer, which is typically understood to achieve ICL through gradient descent emulation. The theoretical results are verified by experimental simulation.

[1247] Visual CoT Makes VLMs Smarter but More Fragile

Chunxue Xu, Yiwei Wang, Yujun Cai, Bryan Hooi, Songze Li

Main category: cs.LG

TL;DR: This paper presents the first systematic evaluation of Visual Chain-of-Thought (CoT) robustness against image corruption, revealing that while Visual CoT improves accuracy, it increases sensitivity to visual perturbations.

Details

Motivation: To investigate the unexplored robustness of Visual CoT-based Vision-Language Models against image-level noise and visual perturbations.

Method: Systematic evaluation across 12 image corruption types on 4 VQA datasets, comparing Visual CoT VLMs with standard VLMs, and proposing a plug-and-play robustness enhancement using Grounding DINO.

Result: Visual CoT improves absolute accuracy on both clean and corrupted images but increases sensitivity to perturbations, with edited image patches identified as the primary fragility source.

Conclusion: The work reveals fragility patterns in Visual CoT and provides an effective, architecture-agnostic solution using Grounding DINO to enhance visual robustness.

Abstract: Chain-of-Thought (CoT) techniques have significantly enhanced reasoning in Vision-Language Models (VLMs). Extending this paradigm, Visual CoT integrates explicit visual edits, such as cropping or annotating regions of interest, into the reasoning process, achieving superior multimodal performance. However, the robustness of Visual CoT-based VLMs against image-level noise remains unexplored. In this paper, we present the first systematic evaluation of Visual CoT robustness under visual perturbations. Our benchmark spans 12 image corruption types across 4 Visual Question Answering (VQA) datasets, enabling a comprehensive comparison between VLMs that use Visual CoT, and VLMs that do not. The results reveal that integrating Visual CoT consistently improves absolute accuracy regardless of whether the input images are clean or corrupted by noise; however, it also increases sensitivity to input perturbations, resulting in sharper performance degradation compared to standard VLMs. Through extensive analysis, we identify the intermediate reasoning components of Visual CoT, i.e., the edited image patches , as the primary source of fragility. Building on this analysis, we propose a plug-and-play robustness enhancement method that integrates Grounding DINO model into the Visual CoT pipeline, providing high-confidence local visual cues to stabilize reasoning. Our work reveals clear fragility patterns in Visual CoT and offers an effective, architecture-agnostic solution for enhancing visual robustness.

[1248] Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement

Anyi Wang, Xuansheng Wu, Dong Shu, Yunpu Ma, Ninghao Liu

Main category: cs.LG

TL;DR: SAE-RSV refines steering vectors for LLM control using sparse autoencoders to remove noise and enhance task-relevant features from limited data, outperforming supervised fine-tuning.

Details

Motivation: Existing steering methods require large datasets, limiting real-world applicability. Small datasets produce noisy steering vectors with task-irrelevant features that reduce effectiveness.

Method: Use sparse autoencoders (SAEs) to semantically denoise and augment steering vectors by removing task-irrelevant features and enriching missing task-relevant features through semantic similarity.

Result: SAE-RSV substantially outperforms all baseline methods including supervised fine-tuning, demonstrating effective steering vector construction from limited data.

Conclusion: Effective steering vectors can be constructed from limited training data by refining original vectors through SAEs, enabling better LLM control without parameter modification.

Abstract: Steering has emerged as a promising approach in controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which limits their applicability in many real-world scenarios. The steering vectors extracted from small dataset often contain task-irrelevant noising features, which degrades their effectiveness. To refine the steering vectors learned from limited data, we introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV) that leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features. Extensive experiments demonstrate that the proposed SAE-RSV substantially outperforms all the baseline methods including supervised fine-tuning. Our findings show that effective steering vector can be constructed from limited training data by refining the original steering vector through SAEs.

[1249] STAIR: Addressing Stage Misalignment through Temporal-Aligned Preference Reinforcement Learning

Yao Luan, Ni Mu, Yiqin Yang, Bo Xu, Qing-Shan Jia

Main category: cs.LG

TL;DR: STAIR addresses stage misalignment in PbRL for multi-stage tasks by learning stage approximations via temporal distance and prioritizing same-stage comparisons, improving policy learning.

Details

Motivation: PbRL struggles with multi-stage tasks due to stage misalignment, where comparing segments from different stages (e.g., navigation vs manipulation) provides uninformative feedback that hinders learning.

Method: STAIR learns stage approximations using temporal distance via contrastive learning, grouping temporally close states into coherent stages without predefined knowledge, then prioritizes comparisons within the same stage.

Result: STAIR shows superior performance in multi-stage tasks and competitive performance in single-stage tasks. Human studies confirm that its learned stages align with human cognition.

Conclusion: STAIR effectively mitigates stage misalignment in PbRL for multi-stage tasks through temporal distance-based stage approximation and same-stage comparison prioritization, enabling better policy learning.

Abstract: Preference-based reinforcement learning (PbRL) bypasses complex reward engineering by learning rewards directly from human preferences, enabling better alignment with human intentions. However, its effectiveness in multi-stage tasks, where agents sequentially perform sub-tasks (e.g., navigation, grasping), is limited by stage misalignment: Comparing segments from mismatched stages, such as movement versus manipulation, results in uninformative feedback, thus hindering policy learning. In this paper, we validate the stage misalignment issue through theoretical analysis and empirical experiments. To address this issue, we propose STage-AlIgned Reward learning (STAIR), which first learns a stage approximation based on temporal distance, then prioritizes comparisons within the same stage. Temporal distance is learned via contrastive learning, which groups temporally close states into coherent stages, without predefined task knowledge, and adapts dynamically to policy changes. Extensive experiments demonstrate STAIR’s superiority in multi-stage tasks and competitive performance in single-stage tasks. Furthermore, human studies show that stages approximated by STAIR are consistent with human cognition, confirming its effectiveness in mitigating stage misalignment.

[1250] Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang

Main category: cs.LG

TL;DR: The paper challenges the traditional exploration-exploitation trade-off view in RLVR, showing it’s an artifact of measurement level rather than fundamental constraint. By analyzing hidden-state space using Effective Rank metrics, the authors demonstrate exploration and exploitation can be decoupled and enhanced simultaneously through their VERL method.

Details

Motivation: To re-examine the prevailing exploration-exploitation trade-off perspective in RLVR, which may be an artifact of token-level metrics rather than a fundamental constraint, and investigate whether both capacities can be enhanced simultaneously at the hidden-state level.

Method: Proposed VERL (Velocity-Exploiting Rank-Learning) method that shifts analysis to hidden-state space using Effective Rank (ER) metrics and their derivatives (ERV, ERA). Uses ERA as a predictive meta-controller to shape the RL advantage function, creating a dual-channel incentive structure that prospectively amplifies rewards for exploration and reinforces exploitative gains.

Result: Experiments show consistent gains across diverse LLMs and reasoning benchmarks, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset. Demonstrates that exploration and exploitation can be decoupled and enhanced simultaneously.

Conclusion: The exploration-exploitation trade-off is not fundamental but an artifact of measurement level. By analyzing hidden-state dynamics with ER metrics, both capacities can be synergistically enhanced through VERL’s dual-channel incentive structure, leading to significant performance improvements.

Abstract: A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.

[1251] Tequila: Trapping-free Ternary Quantization for Large Language Models

Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, Dapeng Wu

Main category: cs.LG

TL;DR: Tequila is a ternary weight quantization method that solves deadzone trapping by repurposing trapped weights as dynamic biases, achieving near full-precision performance with 3x inference speedup.

Details

Motivation: Current ternary quantization methods suffer from significant accuracy degradation due to deadzone trapping, where weights get stuck at boundaries and receive uninformative gradients, limiting model capacity.

Method: Proposes Tequila which reactivates deadzone-trapped weights by converting them into dynamic biases, allowing continuous forward signals and meaningful gradient updates during backpropagation with minimal inference overhead.

Result: Outperforms SOTA ternary quantization methods across five benchmarks, achieving >4% accuracy gain on ARC benchmark and nearly matching full-precision performance (<1% gap) with 3.0x inference speedup.

Conclusion: Tequila provides a practical and efficient solution for deploying advanced LLMs in resource-constrained environments by effectively addressing deadzone trapping in ternary quantization.

Abstract: Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making it not feasible. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as deadzone trapping: a large number of weights are trapped at the deadzone boundary. This occurs because these weights receive only noisy, uninformative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within <1% gap) with a 3.0x inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim.

[1252] IndexNet: Timestamp and Variable-Aware Modeling for Time Series Forecasting

Beiliang Wu, Peiyuan Liu, Yifan Hu, Luyan Zhang, Ao Hu, Zenglin Xu

Main category: cs.LG

TL;DR: IndexNet is an MLP-based framework for multivariate time series forecasting that incorporates timestamp and variable index embeddings to capture periodic patterns and distinguish between heterogeneous variables, achieving comparable performance with strong generality and interpretability.

Details

Motivation: Most existing MTSF methods overlook index-related descriptive information like timestamps and variable indices, which carry rich contextual semantics that could improve forecasting accuracy.

Method: Proposes IndexNet with Index Embedding module containing Timestamp Embedding (transforms timestamps into vectors for periodic pattern capture) and Channel Embedding (assigns unique identity embeddings to variables to prevent homogenized predictions).

Result: Extensive experiments on 12 real-world datasets show IndexNet achieves comparable performance to mainstream baselines, with plug-and-play experiments and visualizations demonstrating strong generality and interpretability.

Conclusion: The temporally and variably aware design of IndexNet effectively captures long-term periodic patterns and distinguishes heterogeneous variables, addressing underexplored aspects of generality and interpretability in MTSF research.

Abstract: Multivariate time series forecasting (MTSF) plays a vital role in a wide range of real-world applications, such as weather prediction and traffic flow forecasting. Although recent advances have significantly improved the modeling of temporal dynamics and inter-variable dependencies, most existing methods overlook index-related descriptive information, such as timestamps and variable indices, which carry rich contextual semantics. To unlock the potential of such information and take advantage of the lightweight and powerful periodic capture ability of MLP-based architectures, we propose IndexNet, an MLP-based framework augmented with an Index Embedding (IE) module. The IE module consists of two key components: Timestamp Embedding (TE) and Channel Embedding (CE). Specifically, TE transforms timestamps into embedding vectors and injects them into the input sequence, thereby improving the model’s ability to capture long-term complex periodic patterns. In parallel, CE assigns each variable a unique and trainable identity embedding based on its index, allowing the model to explicitly distinguish between heterogeneous variables and avoid homogenized predictions when input sequences seem close. Extensive experiments on 12 diverse real-world datasets demonstrate that IndexNet achieves comparable performance across mainstream baselines, validating the effectiveness of our temporally and variably aware design. Moreover, plug-and-play experiments and visualization analyses further reveal that IndexNet exhibits strong generality and interpretability, two aspects that remain underexplored in current MTSF research.

[1253] Test-time GNN Model Evaluation on Dynamic Graphs

Bo Li, Xin Zheng, Ming Jin, Can Wang, Shirui Pan

Main category: cs.LG

TL;DR: The paper introduces DyGEval, a framework for evaluating Dynamic Graph Neural Networks (DGNNs) on unseen test graphs by estimating their performance during deployment, addressing distribution shifts between training and test data.

Details

Motivation: Well-trained DGNNs face performance uncertainty when inferring on unseen dynamic test graphs due to evolving data distributions over time. Evaluating deployed DGNNs is crucial to determine their suitability for inference on unseen graphs.

Method: Proposes DyGEval with a two-stage framework: (1) test-time dynamic graph simulation that captures training-test distributional differences as supervision signals, and (2) DyGEval development and training that accurately estimates DGNN performance on test-time dynamic graphs.

Result: Extensive experiments demonstrate that DyGEval effectively evaluates various DGNN backbones across different dynamic graphs under distribution shifts.

Conclusion: DyGEval serves as an effective evaluator for assessing DGNN performance on unseen dynamic graphs, addressing the challenge of distribution shifts in dynamic graph learning.

Abstract: Dynamic graph neural networks (DGNNs) have emerged as a leading paradigm for learning from dynamic graphs, which are commonly used to model real-world systems and applications. However, due to the evolving nature of dynamic graph data distributions over time, well-trained DGNNs often face significant performance uncertainty when inferring on unseen and unlabeled test graphs in practical deployment. In this case, evaluating the performance of deployed DGNNs at test time is crucial to determine whether a well-trained DGNN is suited for inference on an unseen dynamic test graph. In this work, we introduce a new research problem: DGNN model evaluation, which aims to assess the performance of a specific DGNN model trained on observed dynamic graphs by estimating its performance on unseen dynamic graphs during test time. Specifically, we propose a Dynamic Graph neural network Evaluator, dubbed DyGEval, to address this new problem. The proposed DyGEval involves a two-stage framework: (1) test-time dynamic graph simulation, which captures the training-test distributional differences as supervision signals and trains an evaluator; and (2) DyGEval development and training, which accurately estimates the performance of the well-trained DGNN model on the test-time dynamic graphs. Extensive experiments demonstrate that the proposed DyGEval serves as an effective evaluator for assessing various DGNN backbones across different dynamic graphs under distribution shifts.

[1254] Space Group Conditional Flow Matching

Omri Puny, Yaron Lipman, Benjamin Kurt Miller

Main category: cs.LG

TL;DR: Space Group Conditional Flow Matching is a generative framework that samples highly-symmetric crystals by conditioning on space groups and Wyckoff positions, achieving state-of-the-art results in crystal structure prediction.

Details

Motivation: Most generative models overlook crystallographic symmetry constraints, leading to unrealistic crystal structures with limited symmetry. This work addresses the need to generate crystals that respect the fundamental symmetry operations of space groups and Wyckoff positions.

Method: The method conditions generation on space groups and Wyckoff positions, using a symmetric noise base distribution and group-conditioned equivariant vector field that restricts atom motion to their initial Wyckoff positions. It employs efficient group averaging tailored for symmetric crystals.

Result: The approach achieves state-of-the-art results on crystal structure prediction and de novo generation benchmarks, with significantly reduced computational overhead for symmetrization.

Conclusion: The proposed framework successfully generates highly-symmetric, stable crystals by explicitly incorporating crystallographic symmetry constraints, outperforming existing methods that overlook these fundamental constraints.

Abstract: Inorganic crystals are periodic, highly-symmetric arrangements of atoms in three-dimensional space. Their structures are constrained by the symmetry operations of a crystallographic \emph{space group} and restricted to lie in specific affine subspaces known as \emph{Wyckoff positions}. The frequency an atom appears in the crystal and its rough positioning are determined by its Wyckoff position. Most generative models that predict atomic coordinates overlook these symmetry constraints, leading to unrealistically high populations of proposed crystals exhibiting limited symmetry. We introduce Space Group Conditional Flow Matching, a novel generative framework that samples significantly closer to the target population of highly-symmetric, stable crystals. We achieve this by conditioning the entire generation process on a given space group and set of Wyckoff positions; specifically, we define a conditionally symmetric noise base distribution and a group-conditioned, equivariant, parametric vector field that restricts the motion of atoms to their initial Wyckoff position. Our form of group-conditioned equivariance is achieved using an efficient reformulation of \emph{group averaging} tailored for symmetric crystals. Importantly, it reduces the computational overhead of symmetrization to a negligible level. We achieve state of the art results on crystal structure prediction and de novo generation benchmarks. We also perform relevant ablations.

[1255] Electric Currents for Discrete Data Generation

Alexander Kolesov, Stepan Manukhov, Vladimir V. Palyulin, Alexander Korotin

Main category: cs.LG

TL;DR: ECD²G is a novel discrete data generation method that uses electrical current flow analogies to transfer probability mass between source and target distributions, with neural networks learning the probability flow.

Details

Motivation: To develop a theoretically grounded method for discrete data generation that guarantees distribution transfer using principles from electrical engineering.

Method: Analogizes source distribution samples as current input nodes and target distribution as output nodes in a circuit. Uses neural networks to learn electric currents representing probability flow, then transports source samples along circuit pathways according to learned currents.

Result: The method provably guarantees transfer between data distributions, with proof-of-concept experiments demonstrating its effectiveness.

Conclusion: ECD²G provides a novel, theoretically sound approach for discrete data generation that bridges electrical engineering concepts with probability theory to ensure reliable distribution transfer.

Abstract: We propose $\textbf{E}$lectric $\textbf{C}$urrent $\textbf{D}$iscrete $\textbf{D}$ata $\textbf{G}$eneration (ECD$^{2}$G), a pioneering method for data generation in discrete settings that is grounded in electrical engineering theory. Our approach draws an analogy between electric current flow in a circuit and the transfer of probability mass between data distributions. We interpret samples from the source distribution as current input nodes of a circuit and samples from the target distribution as current output nodes. A neural network is then used to learn the electric currents to represent the probability flow in the circuit. To map the source distribution to the target, we sample from the source and transport these samples along the circuit pathways according to the learned currents. This process provably guarantees transfer between data distributions. We present proof-of-concept experiments to illustrate our ECD$^{2}$G method.

[1256] Bayesian Mixture-of-Experts: Towards Making LLMs Know What They Don’t Know

Albus Yizhuo Li

Main category: cs.LG

TL;DR: This paper proposes a Bayesian Mixture-of-Experts routing framework that replaces deterministic routing with probabilistic routing to address brittleness, miscalibration, and overconfidence in large language models.

Details

Motivation: Standard deterministic routing in MoE architectures is brittle and contributes to model miscalibration and overconfidence, preventing models from knowing what they don't know.

Method: A Bayesian MoE routing framework that models probability distributions over routing decisions, investigated through three approaches: weight-space, logit-space, and selection-space uncertainty.

Result: The framework significantly improves routing stability, in-distribution calibration, and out-of-distribution detection in a 3-billion parameter MoE model, creating more reliable internal uncertainty signals.

Conclusion: The Bayesian routing framework provides a practical pathway toward building more robust and self-aware LLMs that can better understand their own limitations.

Abstract: The Mixture-of-Experts (MoE) architecture has enabled the creation of massive yet efficient Large Language Models (LLMs). However, the standard deterministic routing mechanism presents a significant limitation: its inherent brittleness is a key contributor to model miscalibration and overconfidence, resulting in systems that often do not know what they don’t know. This thesis confronts this challenge by proposing a structured \textbf{Bayesian MoE routing framework}. Instead of forcing a single, deterministic expert selection, our approach models a probability distribution over the routing decision itself. We systematically investigate three families of methods that introduce this principled uncertainty at different stages of the routing pipeline: in the \textbf{weight-space}, the \textbf{logit-space}, and the final \textbf{selection-space}. Through a series of controlled experiments on a 3-billion parameter MoE model, we demonstrate that this framework significantly improves routing stability, in-distribution calibration, and out-of-distribution (OoD) detection. The results show that by targeting this core architectural component, we can create a more reliable internal uncertainty signal. This work provides a practical and computationally tractable pathway towards building more robust and self-aware LLMs, taking a crucial step towards making them know what they don’t know.

[1257] Adversarial Diffusion for Robust Reinforcement Learning

Daniele Foffano, Alessio Russo, Alexandre Proutiere

Main category: cs.LG

TL;DR: AD-RRL uses diffusion models to train robust RL policies by generating worst-case trajectories during training, optimizing CVaR for improved robustness to environment uncertainties.

Details

Motivation: Address the challenge of robustness to modeling errors and uncertainties in reinforcement learning, which remains a central problem in the field.

Method: Leverage diffusion models with conditional sampling to generate worst-case trajectories during training, building on the connection between CVaR optimization and robust RL.

Result: Empirical results across standard benchmarks show AD-RRL achieves superior robustness and performance compared to existing robust RL methods.

Conclusion: AD-RRL effectively addresses robustness challenges in RL by using diffusion models to optimize for worst-case scenarios through CVaR optimization.

Abstract: Robustness to modeling errors and uncertainties remains a central challenge in reinforcement learning (RL). In this work, we address this challenge by leveraging diffusion models to train robust RL policies. Diffusion models have recently gained popularity in model-based RL due to their ability to generate full trajectories “all at once”, mitigating the compounding errors typical of step-by-step transition models. Moreover, they can be conditioned to sample from specific distributions, making them highly flexible. We leverage conditional sampling to learn policies that are robust to uncertainty in environment dynamics. Building on the established connection between Conditional Value at Risk (CVaR) optimization and robust RL, we introduce Adversarial Diffusion for Robust Reinforcement Learning (AD-RRL). AD-RRL guides the diffusion process to generate worst-case trajectories during training, effectively optimizing the CVaR of the cumulative return. Empirical results across standard benchmarks show that AD-RRL achieves superior robustness and performance compared to existing robust RL methods.

[1258] Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li

Main category: cs.LG

TL;DR: DART is a decoupled RL training framework for GUI agents that addresses slow interactions and insufficient data by separating training into four asynchronous modules, achieving significant efficiency improvements and state-of-the-art performance on OSWorld benchmark.

Details

Motivation: VLM-based GUI agents face challenges in RL training due to slow multi-turn interactions with GUI environments and insufficient high-quality agent-environment interactions for effective policy learning.

Method: DART separates training into four asynchronous modules (environment cluster, rollout service, data manager, trainer) with non-blocking communication, and uses adaptive data curation including pre-collecting successful trajectories, dynamic rollout adjustment, selective training on high-entropy steps, and truncated importance sampling.

Result: Achieves 42.13% task success rate on OSWorld benchmark (14.61% gain over base model, 7.34% higher than SOTA), with 1.6× GPU utilization for rollout, 1.9× training throughput, and 5.5× environment utilization.

Conclusion: DART provides an efficient decoupled training framework for GUI agents that significantly improves RL training efficiency and performance, and will be open-sourced to benefit the agentic RL community.

Abstract: Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving the system efficiency: 1.6GPU utilization for rollout, 1.9 training throughput, and 5.5* environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement sparse success in online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling for policy mismatch between policy rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.

[1259] Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox

Main category: cs.LG

TL;DR: Subliminal learning occurs when language models transfer hidden biases during distillation, even with hard distillation where students only see sampled tokens. This happens through divergence tokens - rare cases where teachers with different biases predict different tokens.

Details

Motivation: To understand when and how subliminal learning actually occurs, particularly why it happens under hard distillation where students only see sampled tokens rather than full distributions.

Method: Conducted controlled experiments and mechanistic analysis to identify the mechanism behind subliminal learning, focusing on divergence tokens and early layer importance.

Result: Subliminal learning doesn’t require token entanglement or logit leakage. It occurs through divergence tokens, and masking these tokens mostly removes bias transfer. Early layers are critical - finetuning even a single early layer enables subliminal learning.

Conclusion: Subliminal learning is fragile and can be suppressed by small changes like paraphrasing prompts. The mechanism relies on divergence tokens and early layer processing rather than global token patterns.

Abstract: Language models can transfer hidden biases during distillation. For example, a teacher that “likes owls” can make its student “like owls” too, even when the training data consists only of lists of numbers. This surprising phenomenon is called subliminal learning. Subliminal learning can be expected under soft distillation, where the student is trained on the teacher’s full next-token distribution. But the fact that this also occurs under hard distillation-where the student only sees sampled tokens-raises a deeper question: when and how does subliminal learning actually occur? We answer this question through controlled experiments and mechanistic analysis. Our results show that subliminal learning does not need (global) token entanglement or logit leakage. Instead, it comes down to a small set of divergence tokens-rare cases where teachers with different biases would predict different tokens. Masking out these tokens mostly removes the hidden bias transfer. Mechanistically, divergence tokens reveal that early layers are critical. Surprisingly, finetuning even a single such early layer is sufficient for subliminal learning. Finally, we find that subliminal learning is fragile. Even small changes, like paraphrasing prompts, are usually sufficient to suppress it.

[1260] Gradient Flow Convergence Guarantee for General Neural Network Architectures

Yash Jakhmola

Main category: cs.LG

TL;DR: A unified proof for linear convergence of gradient descent in training neural networks with various activations, covering both known and new architectures under weaker assumptions.

Details

Motivation: To explain the success of gradient-based optimization in deep learning and provide a united theoretical framework, as current proofs are limited to specific architectures.

Method: Developed a general theorem for linear convergence of continuous gradient descent (gradient flow) for neural networks with piecewise non-zero polynomial, ReLU, and sigmoid activations.

Result: The theoretical proof shows linear convergence in the infinitesimal step size limit, with empirical validation showing excellent agreement with practical gradient descent methods.

Conclusion: The paper presents a unified theoretical framework that consolidates and extends existing convergence results for gradient-based optimization in deep neural networks.

Abstract: A key challenge in modern deep learning theory is to explain the remarkable success of gradient-based optimization methods when training large-scale, complex deep neural networks. Though linear convergence of such methods has been proved for a handful of specific architectures, a united theory still evades researchers. This article presents a unified proof for linear convergence of continuous gradient descent, also called gradient flow, while training any neural network with piecewise non-zero polynomial activations or ReLU, sigmoid activations. Our primary contribution is a single, general theorem that not only covers architectures for which this result was previously unknown but also consolidates existing results under weaker assumptions. While our focus is theoretical and our results are only exact in the infinitesimal step size limit, we nevertheless find excellent empirical agreement between the predictions of our result and those of the practical step-size gradient descent method.

[1261] Dynamic Orthogonal Continual Fine-tuning for Mitigating Catastrophic Forgettings

Zhixin Zhang, Zeming Wei, Meng Sun

Main category: cs.LG

TL;DR: DOC fine-tuning addresses catastrophic forgetting in LLMs by tracking functional direction drift and making new task gradients orthogonal to historical directions.

Details

Motivation: Catastrophic forgetting in continual learning where LLMs lose performance on historical tasks when fine-tuned on new sequential data without access to past datasets.

Method: Dynamic Orthogonal Continual (DOC) fine-tuning that tracks drift of functional directions and dynamically updates them, while adjusting new task gradients to be orthogonal to historical function directions.

Result: Extensive experiments show DOC outperforms prior methods, effectively reducing catastrophic forgetting in various LLM continual learning benchmarks.

Conclusion: DOC provides a robust tool for continuous LLM fine-tuning by mitigating interference between new and old tasks through orthogonal gradient adjustment.

Abstract: Catastrophic forgetting remains a critical challenge in continual learning for large language models (LLMs), where models struggle to retain performance on historical tasks when fine-tuning on new sequential data without access to past datasets. In this paper, we first reveal that the drift of functional directions during the fine-tuning process is a key reason why existing regularization-based methods fail in long-term LLM continual learning. To address this, we propose Dynamic Orthogonal Continual (DOC) fine-tuning, a novel approach that tracks the drift of these functional directions and dynamically updates them during the fine-tuning process. Furthermore, by adjusting the gradients of new task parameters to be orthogonal to the tracked historical function directions, our method mitigates interference between new and old tasks. Extensive experiments on various LLM continual learning benchmarks demonstrate that this approach outperforms prior methods, effectively reducing catastrophic forgetting and providing a robust tool for continuous LLM fine-tuning. Our code is available at https://github.com/meloxxxxxx/DOC.

[1262] Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization

Chris Kolb, Laetitia Frost, Bernd Bischl, David Rügamer

Main category: cs.LG

TL;DR: D-Gating is a differentiable structured overparameterization method that enables neural network compression through group sparsity regularization, providing theoretical equivalence to non-smooth structured penalties while being compatible with standard gradient descent.

Details

Motivation: Structured sparsity regularization is effective for neural network compression but is non-differentiable, requiring specialized optimizers or post-hoc pruning without formal guarantees. D-Gating aims to make structured sparsity compatible with conventional stochastic gradient descent.

Method: D-Gating splits each group of weights into a primary weight vector and multiple scalar gating factors, creating a fully differentiable overparameterization. It proves theoretical equivalence to non-smooth structured L_{2,2/D} penalization and shows exponential convergence to the regularized loss.

Result: D-Gating consistently delivers strong performance-sparsity tradeoffs across vision, language, and tabular tasks, outperforming both direct optimization of structured penalties and conventional pruning baselines.

Conclusion: D-Gating provides a theoretically grounded, differentiable approach to structured sparsity that evolves from non-sparse to sparse optimization while maintaining compatibility with standard gradient-based training methods.

Abstract: Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum using non-smooth structured $L_{2,2/D}$ penalization, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$-regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.

[1263] Integrated Communication and Control for Energy-Efficient UAV Swarms: A Multi-Agent Reinforcement Learning Approach

Tianjiao Sun, Ningyan Guo, Haozhe Gu, Yanyan Peng, Zhiyong Feng

Main category: cs.LG

TL;DR: Proposes an integrated communication-control co-design using multi-agent reinforcement learning to optimize UAV swarm energy efficiency and communication fairness in complex environments.

Details

Motivation: UAV swarm communications face reliability issues in complex environments due to unpredictable wireless channels and energy constraints, requiring improved quality of service.

Method: Formulates joint resource allocation and 3D trajectory control as MDP, develops MARL framework with novel MAHPPO-AM algorithm using action masking for hybrid action spaces.

Result: Achieves 0.99 fairness index and reduces energy consumption by up to 25% compared to baseline methods.

Conclusion: The proposed integrated approach effectively balances communication fairness and energy efficiency in UAV swarm-assisted networks.

Abstract: The deployment of unmanned aerial vehicle (UAV) swarm-assisted communication networks has become an increasingly vital approach for remediating coverage limitations in infrastructure-deficient environments, with especially pressing applications in temporary scenarios, such as emergency rescue, military and security operations, and remote area coverage. However, complex geographic environments lead to unpredictable and highly dynamic wireless channel conditions, resulting in frequent interruptions of air-to-ground (A2G) links that severely constrain the reliability and quality of service in UAV swarm-assisted mobile communications. To improve the quality of UAV swarm-assisted communications in complex geographic environments, we propose an integrated communication and control co-design mechanism. Given the stringent energy constraints inherent in UAV swarms, our proposed mechanism is designed to optimize energy efficiency while maintaining an equilibrium between equitable communication rates for mobile ground users (GUs) and UAV energy expenditure. We formulate the joint resource allocation and 3D trajectory control problem as a Markov decision process (MDP), and develop a multi-agent reinforcement learning (MARL) framework to enable real-time coordinated actions across the UAV swarm. To optimize the action policy of UAV swarms, we propose a novel multi-agent hybrid proximal policy optimization with action masking (MAHPPO-AM) algorithm, specifically designed to handle complex hybrid action spaces. The algorithm incorporates action masking to enforce hard constraints in high-dimensional action spaces. Experimental results demonstrate that our approach achieves a fairness index of 0.99 while reducing energy consumption by up to 25% compared to baseline methods.

[1264] Graph Mixing Additive Networks

Maya Bechler-Speicher, Andrea Zerio, Maor Huri, Marie Vibeke Vestergaard, Ran Gilad-Bachrach, Tine Jess, Samir Bhatt, Aleksejs Sazonovs

Main category: cs.LG

TL;DR: GMAN extends Graph Neural Additive Networks to learn from sparse time-series data by representing trajectories as directed graphs, offering flexible interpretability-expressivity trade-offs and outperforming black-box models on real-world tasks.

Details

Motivation: To create an interpretable yet expressive framework for learning from sparse time-series data that can compete with black-box models while providing domain-aligned explanations.

Method: Represents time-dependent trajectories as directed graphs and applies an enriched Graph Neural Additive Network (GNAN) to each graph, allowing feature grouping and graph grouping to control interpretability-expressivity trade-offs.

Result: Outperforms strong non-interpretable black-box baselines on real-world datasets including mortality prediction from blood tests and fake-news detection.

Conclusion: GMAN successfully bridges the gap between interpretability and performance, delivering both accurate predictions and actionable, domain-aligned explanations for time-series data.

Abstract: We introduce GMAN, a flexible, interpretable, and expressive framework that extends Graph Neural Additive Networks (GNANs) to learn from sets of sparse time-series data. GMAN represents each time-dependent trajectory as a directed graph and applies an enriched, more expressive GNAN to each graph. It allows users to control the interpretability-expressivity trade-off by grouping features and graphs to encode priors, and it provides feature, node, and graph-level interpretability. On real-world datasets, including mortality prediction from blood tests and fake-news detection, GMAN outperforms strong non-interpretable black-box baselines while delivering actionable, domain-aligned explanations.

[1265] HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

Zhinan Xie, Peisong Wang, Jian Cheng

Main category: cs.LG

TL;DR: HiViS is a speculative decoding method for Vision-Language Models that hides visual tokens from the drafter to accelerate inference while maintaining lossless generation quality.

Details

Motivation: Adapting speculative decoding to VLMs is challenging due to semantic misalignment between drafter and target VLM for visual tokens, and the large number of visual tokens slowing down drafter's self-attention.

Method: Explicit-implicit input decomposition framework that removes visual tokens from drafter’s input, reuses target VLM’s hidden states as implicit visual information, and uses multi-step self-feedback training with dynamic data selection.

Result: Compresses drafter’s prefill sequence length to 0.7%-1.3% of target VLM’s input while maintaining lossless generation quality, achieving up to 2.65x speedup across diverse models and tasks.

Conclusion: HiViS effectively accelerates VLM inference through visual token hiding and efficient training strategies, demonstrating significant speed improvements without quality degradation.

Abstract: Speculative decoding is an effective approach for accelerating inference in Large Language models (LLMs), but its adaptation to Vision-Language models (VLMs) remains challenging for additional visual tokens in multimodal inputs. First, owing to the fact that the drafter and the target VLM may derived from different families, the semantic representations of visual tokens in the target VLM are misaligned with those in the drafter, introducing bias into the KV-cache during the prefill stage. Second, the large number of visual tokens substantially slows down the drafter’s self-attention during the decoding stage. We propose Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models (HiViS), an explicit-implicit input decomposition framework that alleviates the above inefficiency. All visual tokens are removed from the drafter’s input, retaining only textual tokens as explicit inputs, while directly reusing the target VLM’s corresponding last-layer hidden states as implicit visual information without additional processing. To train the drafter efficiently, we introduces multi-step self-feedback training strategy with dynamic data selection and sequential embedding supervision to simulate reasoning during training. Our approach compresses the prefill sequence length of the drafter to only 0.7%-1.3% of the target VLM’s input, while maintaining lossless generation quality. Extensive experiments across diverse models and tasks demonstrate up to 2.65x speedup, confirming the effectiveness of HiViS in accelerating VLM inference.

[1266] Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms

Jiahao Ying, Mingbao Lin, Qianru Sun, Yixin Cao

Main category: cs.LG

TL;DR: This paper investigates the internal mechanisms of Mixture-of-Experts (MoE) architectures using MUI metric, revealing insights about neuron utilization, training dynamics, expert collaboration, and activation patterns.

Details

Motivation: Current MoE research is performance-centric with limited understanding of internal mechanisms, constraining broader progress. The authors aim to analyze routing mechanisms and expert-level behaviors to gain deeper insights.

Method: Used MUI (an internal metric) to systematically analyze routing mechanisms and expert-level behaviors across a wide range of publicly available MoE models.

Result: Found that: (1) neuron utilization decreases with model evolution, indicating stronger generalization; (2) training shows dynamic trajectory where MUI reveals deeper insights than benchmarks alone; (3) task completion involves collaborative expert contributions with shared experts driving concentration; (4) neuron-level activation patterns serve as fine-grained proxy for data diversity.

Conclusion: MUI serves as a valuable complementary indicator to benchmark performance, offering new insights into MoE model capacity, dynamics, and specialization.

Abstract: Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. However, current research remains largely performance-centric, with limited understanding of its internal mechanisms, thereby constraining broader progress. In this work, we use an internal metric to investigate the mechanisms of MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. Through systematic analyses of a wide range of publicly available MoE models, we uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal while MUI reveals deeper insights; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity. Together, these results demonstrate the potential of MUI as a complementary indicator to benchmark performance, offering new insights into the capacity, dynamics, and specialization of MoE models. Our project can be found at https://yingjiahao14.github.io/MoE-MUI/.

[1267] Diffusion Models are Kelly Gamblers

Akhil Premkumar

Main category: cs.LG

TL;DR: This paper connects diffusion models to the Kelly criterion for betting games, showing that conditional diffusion models store mutual information between signal X and conditioning Y. Classifier-free guidance boosts this mutual information at sampling time, which is particularly useful for image models where image-label mutual information is low due to the manifold hypothesis.

Details

Motivation: To establish connections between diffusion models and the Kelly criterion from betting theory, and to better understand how conditional diffusion models store and utilize mutual information between signals and conditioning information.

Method: Theoretical analysis connecting diffusion models to Kelly criterion, examination of mutual information storage in conditional diffusion models, and investigation of classifier-free guidance mechanisms.

Result: Found that conditional diffusion models store mutual information equal to I(X;Y), classifier-free guidance effectively boosts this mutual information during sampling, and identified nuances in the perspective of diffusion models as infinitely deep autoencoders.

Conclusion: The paper provides new theoretical insights connecting diffusion models to betting theory and quantum mechanics, while clarifying how mutual information is managed in conditional generation tasks, especially for low mutual information scenarios like image classification.

Abstract: We draw a connection between diffusion models and the Kelly criterion for maximizing returns in betting games. We find that conditional diffusion models store additional information to bind the signal $X$ with the conditioning information $Y$, equal to the mutual information between them. Classifier-free guidance effectively boosts the mutual information between $X$ and $Y$ at sampling time. This is especially helpful in image models, since the mutual information between images and their labels is low, a fact which is intimately connected to the manifold hypothesis. Finally, we point out some nuances in the popular perspective that diffusion models are infinitely deep autoencoders. In doing so, we relate the denoising loss to the Fermi Golden Rule from quantum mechanics.

[1268] Brain-language fusion enables interactive neural readout and in-silico experimentation

Victoria Bosch, Daniel Anthes, Adrien Doerig, Sushrut Thorat, Peter König, Tim Christian Kietzmann

Main category: cs.LG

TL;DR: CorText integrates neural activity into LLM latent space, enabling natural language interaction with brain data for image captioning and question answering using only fMRI data.

Details

Motivation: Current neural decoding methods are static and non-interactive, lacking the ability for open-ended natural language interaction with brain data.

Method: Integrates neural activity directly into LLM latent space, trained on fMRI data from natural scene viewing, enabling zero-shot generalization and counterfactual analysis.

Result: Generates accurate image captions and answers detailed questions better than controls, achieves zero-shot generalization beyond training categories.

Conclusion: Marks a shift from passive decoding to generative, flexible brain-language interfaces.

Abstract: Large language models (LLMs) have revolutionized human-machine interaction, and have been extended by embedding diverse modalities such as images into a shared language space. Yet, neural decoding has remained constrained by static, non-interactive methods. We introduce CorText, a framework that integrates neural activity directly into the latent space of an LLM, enabling open-ended, natural language interaction with brain data. Trained on fMRI data recorded during viewing of natural scenes, CorText generates accurate image captions and can answer more detailed questions better than controls, while having access to neural data only. We showcase that CorText achieves zero-shot generalization beyond semantic categories seen during training. Furthermore, we present a counterfactual analysis that emulates in-silico cortical microstimulation. These advances mark a shift from passive decoding toward generative, flexible interfaces between brain activity and language.

[1269] Efficient Identification of High Similarity Clusters in Polygon Datasets

John N. Daras

Main category: cs.LG

TL;DR: A framework that reduces computational load for large-scale spatial similarity computations by integrating dynamic thresholding, supervised scheduling, and recall-constrained optimization.

Details

Motivation: Advancements in tools like Shapely 2.0 and Triton improve spatial computation efficiency, but face challenges with extremely large datasets due to high computational volume.

Method: Proposes a framework with dynamic similarity index thresholding using KDE, supervised scheduling with ML models to prioritize clusters, and recall-constrained optimization to reduce verification clusters.

Result: Achieves substantial reductions in computational cost without sacrificing accuracy, demonstrating scalability and effectiveness in large-scale geospatial analysis.

Conclusion: Offers a practical solution for efficient large-scale spatial similarity computations by reducing verification workload while maintaining precision and recall requirements.

Abstract: Advancements in tools like Shapely 2.0 and Triton can significantly improve the efficiency of spatial similarity computations by enabling faster and more scalable geometric operations. However, for extremely large datasets, these optimizations may face challenges due to the sheer volume of computations required. To address this, we propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling, and recall-constrained optimization to efficiently identify clusters with the highest spatial similarity while meeting user-defined precision and recall requirements. By leveraging Kernel Density Estimation (KDE) to dynamically determine similarity thresholds and machine learning models to prioritize clusters, our approach achieves substantial reductions in computational cost without sacrificing accuracy. Experimental results demonstrate the scalability and effectiveness of the method, offering a practical solution for large-scale geospatial analysis.

[1270] Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm

Kaisen Yang, Lixuan He, Rushi Shah, Kaicheng Yang, Qinwei Ma, Dianbo Liu, Alex Lamb

Main category: cs.LG

TL;DR: E²C is a structured reasoning framework that decouples reasoning into exploration (generating high-level plans) and execution (carrying out plans), achieving computational efficiency and improved performance.

Details

Motivation: Current Chain-of-Thought methods conflate planning and execution, leading to computational inefficiency, limited path exploration, and reduced interpretability.

Method: Two-phase framework: exploratory phase generates high-level plans stochastically, execution phase deterministically carries out plans. Uses two-stage training with SFT (enforcing plan adherence) and RL (reinforcing execution determinism).

Result: Achieves 58.1% accuracy on AIME'2024 using <10% of decoding tokens compared to Forest-of-Thought. Cross-domain adaptation with EF-SFT uses only 3.5% of tokens but yields up to 14.5% higher accuracy on medical benchmarks.

Conclusion: E²C provides state-of-the-art performance, strong generalization, and greater interpretability by separating planning from execution, while significantly reducing computational overhead.

Abstract: Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs), yet their monolithic and auto-regressive architecture inherently conflates high-level strategic planning with low-level step-by-step execution, leading to computational inefficiency, limited exploration of reasoning paths, and reduced interpretability. To overcome these issues, we propose the Explore-Execute Chain ($E^2C$), a structured reasoning framework that decouples reasoning into two distinct phases: an exploratory phase that stochastically generates succinct high-level plans, followed by an execution phase that deterministically carries out the chosen plan. Our approach incorporates a two-stage training methodology, which combines Supervised Fine-Tuning (SFT) - augmented by a novel data generation algorithm enforcing strict plan adherence - with a subsequent Reinforcement Learning (RL) stage that capitalizes on the informativeness of exploration and reinforces the determinism of execution.This decomposition enables an efficient test-time scaling strategy: on AIME'2024, $E^2C$ Test Time Scaling reaches 58.1% accuracy using <10% of the decoding tokens required by comparable methods (e.g., Forest-of-Thought), sharply cutting self-consistency overhead. For cross-domain adaptation, our Exploration-Focused SFT (EF-SFT) fine-tunes with only 3.5% of the tokens used by standard SFT yet yields up to 14.5% higher accuracy than standard SFT on medical benchmarks, delivering state-of-the-art performance, strong generalization, and greater interpretability by separating planning from execution. The code and pre-trained models for the project are available at: https://github.com/yks23/Explore-Execute-Chain.git

[1271] DiBS-MTL: Transformation-Invariant Multitask Learning with Direction Oracles

Surya Murthy, Kushagra Gupta, Mustafa O. Karabag, David Fridovich-Keil, Ufuk Topcu

Main category: cs.LG

TL;DR: DiBS-MTL is a multitask learning method based on cooperative bargaining theory that achieves Pareto stationary solutions immune to task domination by being invariant to monotonic nonaffine task loss transformations.

Details

Motivation: Existing MTL methods suffer from task domination when task losses are arbitrarily scaled, which degrades overall performance. The Direction-based Bargaining Solution (DiBS) from cooperative bargaining theory offers a promising approach that is immune to such scaling issues.

Method: Proposed DiBS-MTL, a computationally efficient adaptation of DiBS to MTL settings. Proved convergence of DiBS iterates to Pareto stationary points for nonconvex task losses under standard assumptions.

Result: DiBS-MTL achieves competitive performance with state-of-the-art methods on standard MTL benchmarks while maintaining robustness to nonaffine monotonic transformations that degrade existing approaches.

Conclusion: DiBS-MTL provides a theoretically grounded and empirically validated MTL method that addresses the critical issue of task domination through its invariance to task loss scaling, outperforming prior bargaining-inspired methods.

Abstract: Multitask learning (MTL) algorithms typically rely on schemes that combine different task losses or their gradients through weighted averaging. These methods aim to find Pareto stationary points by using heuristics that require access to task loss values, gradients, or both. In doing so, a central challenge arises because task losses can be arbitrarily, nonaffinely scaled relative to one another, causing certain tasks to dominate training and degrade overall performance. A recent advance in cooperative bargaining theory, the Direction-based Bargaining Solution (DiBS), yields Pareto stationary solutions immune to task domination because of its invariance to monotonic nonaffine task loss transformations. However, the convergence behavior of DiBS in nonconvex MTL settings is currently not understood. To this end, we prove that under standard assumptions, a subsequence of DiBS iterates converges to a Pareto stationary point when task losses are possibly nonconvex, and propose DiBS-MTL, a computationally efficient adaptation of DiBS to the MTL setting. Finally, we validate DiBS-MTL empirically on standard MTL benchmarks, showing that it achieves competitive performance with state-of-the-art methods while maintaining robustness to nonaffine monotonic transformations that significantly degrade the performance of existing approaches, including prior bargaining-inspired MTL methods. Code available at https://github.com/suryakmurthy/dibs-mtl.

[1272] Evaluating the Robustness of Chinchilla Compute-Optimal Scaling

Rylan Schaeffer, Noam Levi, Andreas Kirsch, Theo Guenais, Brando Miranda, Elyas Obbad, Sanmi Koyejo

Main category: cs.LG

TL;DR: The paper validates Chinchilla’s compute-optimal scaling principles despite concerns about parameter ambiguity and discrepancies, showing that key results remain robust under different parameter interpretations and perturbations.

Details

Motivation: To address concerns about Chinchilla's scaling laws - including wide confidence intervals, discrepancies between approaches, and incongruities with other scaling laws - and determine if practitioners can still rely on its prescriptions.

Method: Analyzed three possible interpretations of Chinchilla’s model parameters, then deliberately perturbed parameters in four structured ways to test sensitivity of key results.

Result: Found that different parameter interpretations (with up to 15.2% differences) don’t meaningfully affect scaling law estimates or compute-optimal tokens-to-parameter ratio. Key results withstand sizable perturbations, though most sensitive to additive/systematic errors.

Conclusion: Chinchilla’s key results are durable and practitioners can maintain confidence in it as a guide for scaling language models.

Abstract: Hoffman et al (2022)’s Chinchilla paper introduced the principle of compute-optimal scaling, laying a foundation for future scaling of language models. In the years since, however, valid concerns about Chinchilla have been raised: wide confidence intervals, discrepancies between its three approaches, and incongruities with other scaling laws. This raises a critical question for the field: Can practitioners still rely on Chinchilla’s prescriptions? Our work demonstrates the answer is yes. We begin by uncovering that the model parameters central to Chinchilla’s analyses were ambiguous: three interpretations are possible, with relative differences between different interpretations of model parameters as high as 15.2%. We find that, perhaps surprisingly, which model parameters are used for the analyses do not meaningfully affect key results: the scaling law estimates and the compute-optimal tokens-to-parameter ratio. Indeed, under one interpretation, the tokens-to-parameter ratio becomes more constant with the target compute budget. We then ask how distorted the Chinchilla model parameters could have been without meaningfully affecting the key results. By deliberately perturbing model parameters in four structured ways, we find that key Chinchilla results are most sensitive to additive or systematic errors, which can alter the otherwise flat trend of the optimal tokens-to-parameter ratio, but overall, Chinchilla’s key results withstand sizable perturbations. Altogether, our findings offer the field renewed confidence in Chinchilla as a durable guide for scaling language models.

[1273] Detecting and Rectifying Noisy Labels: A Similarity-based Approach

Dang Huu-Tien, Naoya Inoue

Main category: cs.LG

TL;DR: Proposes model-agnostic error detection and rectification using penultimate layer features, leveraging feature similarity to identify mislabeled data points and automatically correct them.

Details

Motivation: Label noise damages neural network performance, and with growing network sizes, there's increasing need for automated error detection tools that work across different models.

Method: Uses penultimate layer features to measure similarity between data points; detects mislabeled data by comparing feature similarity within true class clusters versus other classes; probability of label occurrence in tight clusters helps identify errors.

Result: Method shows high performance across various noise types and successfully rectifies errors to improve dataset quality in extensive experiments.

Conclusion: The proposed post-hoc, model-agnostic approach effectively detects and corrects label errors using penultimate features, enhancing dataset quality for neural network training.

Abstract: Label noise in datasets could damage the performance of neural net training. As the size of modern deep networks grows, there is a growing demand for automated tools for detecting such errors. In this paper, we propose post-hoc, model-agnostic error detection and rectification methods utilizing the penultimate feature from a neural network. Our idea is based on the observation that the similarity between the penultimate feature of a mislabeled data point and its true class data points is higher than that for data points from other classes, making the probability of label occurrence within a tight, similar cluster informative for detecting and rectifying errors. Extensive experiments show our method not only demonstrates high performance across various noises but also automatically rectifies these errors to improve the quality of datasets.

[1274] Curriculum-Guided Reinforcement Learning for Synthesizing Gas-Efficient Financial Derivatives Contracts

Maruf Ahmed Mridul, Oshani Seneviratne

Main category: cs.LG

TL;DR: RL framework generates gas-optimized Solidity smart contracts from CDM specifications, achieving 35.59% gas savings through PPO agent and two-phase curriculum learning.

Details

Motivation: Smart contract automation offers efficiency gains but faces challenges in translating financial specifications into gas-efficient code, particularly from high-level specifications like CDM.

Method: Uses Reinforcement Learning with Proximal Policy Optimization (PPO) agent that selects optimal code snippets from a pre-defined library, employing two-phase curriculum learning (functional correctness first, then gas optimization).

Result: RL agent generates contracts with significant gas savings - up to 35.59% cost reduction on unseen test data compared to unoptimized baselines.

Conclusion: Presents viable methodology for automated synthesis of reliable and economically sustainable smart contracts, bridging gap between high-level financial agreements and efficient on-chain execution.

Abstract: Smart contract-based automation of financial derivatives offers substantial efficiency gains, but its real-world adoption is constrained by the complexity of translating financial specifications into gas-efficient executable code. In particular, generating code that is both functionally correct and economically viable from high-level specifications, such as the Common Domain Model (CDM), remains a significant challenge. This paper introduces a Reinforcement Learning (RL) framework to generate functional and gas-optimized Solidity smart contracts directly from CDM specifications. We employ a Proximal Policy Optimization (PPO) agent that learns to select optimal code snippets from a pre-defined library. To manage the complex search space, a two-phase curriculum first trains the agent for functional correctness before shifting its focus to gas optimization. Our empirical results show the RL agent learns to generate contracts with significant gas savings, achieving cost reductions of up to 35.59% on unseen test data compared to unoptimized baselines. This work presents a viable methodology for the automated synthesis of reliable and economically sustainable smart contracts, bridging the gap between high-level financial agreements and efficient on-chain execution.

[1275] Guide: Generalized-Prior and Data Encoders for DAG Estimation

Amartya Roy, Devharish N, Shreya Ganguly, Kripabandhu Ghosh

Main category: cs.LG

TL;DR: GUIDE is a causal discovery framework that integrates LLM-generated adjacency matrices with observational data using a dual-encoder architecture, achieving significant improvements in computational efficiency and accuracy while scaling to larger node counts.

Details

Motivation: Modern causal discovery methods face limitations in scalability, computational efficiency, and adaptability to mixed data types, with traditional algorithms struggling beyond 70 nodes and exhibiting high energy costs.

Method: GUIDE integrates LLM-generated adjacency matrices with observational data through a dual-encoder architecture, using reinforcement learning to dynamically balance reward maximization (accuracy) and penalty avoidance (DAG constraints).

Result: GUIDE reduces runtime by ≈42% compared to RL-BIC and KCRL methods, achieves ≈117% improvement in accuracy over NOTEARS and GraN-DAG individually, and scales to ≥70 nodes where baseline methods fail.

Conclusion: GUIDE provides a robust solution for causal discovery that optimizes computational efficiency while maintaining high accuracy across mixed data types and large-scale problems.

Abstract: Modern causal discovery methods face critical limitations in scalability, computational efficiency, and adaptability to mixed data types, as evidenced by benchmarks on node scalability (30, $\le 50$, $\ge 70$ nodes), computational energy demands, and continuous/non-continuous data handling. While traditional algorithms like PC, GES, and ICA-LiNGAM struggle with these challenges, exhibiting prohibitive energy costs for higher-order nodes and poor scalability beyond 70 nodes, we propose \textbf{GUIDE}, a framework that integrates Large Language Model (LLM)-generated adjacency matrices with observational data through a dual-encoder architecture. GUIDE uniquely optimizes computational efficiency, reducing runtime on average by $\approx 42%$ compared to RL-BIC and KCRL methods, while achieving an average $\approx 117%$ improvement in accuracy over both NOTEARS and GraN-DAG individually. During training, GUIDE’s reinforcement learning agent dynamically balances reward maximization (accuracy) and penalty avoidance (DAG constraints), enabling robust performance across mixed data types and scalability to $\ge 70$ nodes – a setting where baseline methods fail.

[1276] Does Weak-to-strong Generalization Happen under Spurious Correlations?

Chenruo Liu, Yijun Dong, Qi Lei

Main category: cs.LG

TL;DR: The paper studies weak-to-strong (W2S) generalization in scenarios with spurious correlations from group imbalance. It shows W2S succeeds when labeled and unlabeled data have balanced groups but fails when imbalance differs, and proposes an algorithm to improve W2S by retraining on high-confidence data.

Details

Motivation: To understand when and why weak-to-strong generalization fails in the presence of spurious correlations caused by group imbalance in labeled and unlabeled data, and to develop methods to improve W2S performance when it fails.

Method: Theoretical analysis at the proportional asymptotic limit to characterize W2S gain, extensive experiments on spurious correlation benchmarks, and a proposed algorithm that retrains the strong student on its high-confidence data subset after W2S fine-tuning.

Result: W2S always succeeds with sufficient pseudolabels when labeled and unlabeled data have the same group imbalance (η_u = η_ℓ), but fails when imbalance differs, with W2S gain diminishing as (η_u - η_ℓ)² increases. The proposed algorithm consistently improves over vanilla W2S fine-tuning.

Conclusion: Group imbalance mismatch between labeled and unlabeled data causes W2S failure, but this can be mitigated by retraining the strong student on high-confidence data, providing a practical solution for improving weak-to-strong generalization in real-world scenarios.

Abstract: We initiate a unified theoretical and algorithmic study of a key problem in weak-to-strong (W2S) generalization: when fine-tuning a strong pre-trained student with pseudolabels from a weaker teacher on a downstream task with spurious correlations, does W2S happen, and how to improve it upon failures? We consider two sources of spurious correlations caused by group imbalance: (i) a weak teacher fine-tuned on group-imbalanced labeled data with a minority group of fraction $\eta_\ell$, and (ii) a group-imbalanced unlabeled set pseudolabeled by the teacher with a minority group of fraction $\eta_u$. Theoretically, a precise characterization of W2S gain at the proportional asymptotic limit shows that W2S always happens with sufficient pseudolabels when $\eta_u = \eta_\ell$ but may fail when $\eta_u \ne \eta_\ell$, where W2S gain diminishes as $(\eta_u - \eta_\ell)^2$ increases. Our theory is corroborated by extensive experiments on various spurious correlation benchmarks and teacher-student pairs. To boost W2S performance upon failures, we further propose a simple, effective algorithmic remedy that retrains the strong student on its high-confidence data subset after W2S fine-tuning. Our algorithm is group-label-free and achieves consistent, substantial improvements over vanilla W2S fine-tuning.

[1277] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen

Main category: cs.LG

TL;DR: SLA (Sparse-Linear Attention) accelerates DiT models by fusing sparse and linear attention, reducing computation by 95% without quality loss.

Details

Motivation: Attention latency is a major bottleneck in Diffusion Transformer models for video generation due to quadratic complexity with long sequences.

Method: SLA classifies attention weights into critical (O(N²) attention), marginal (O(N) linear attention), and negligible (skipped), combining them in a single GPU kernel.

Result: 20x reduction in attention computation, 95% computation reduction, 13.7x attention speedup, and 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.

Conclusion: SLA enables significant acceleration of DiT models without compromising generation quality through efficient attention weight classification and computation fusion.

Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.

[1278] Pretraining Scaling Laws for Generative Evaluations of Language Models

Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo

Main category: cs.LG

TL;DR: The paper proposes three scaling laws for predicting performance on generative evaluations (pass-at-k), showing how generative tasks offer new hyperparameters and establishing theoretical connections between scaling approaches.

Details

Motivation: While neural scaling laws are well-established for pretraining losses and discriminative tasks, little research exists for generative evaluations like mathematical problem-solving or software engineering.

Method: Three different pretraining scaling laws are proposed and evaluated: (1) compute-based, (2) model parameters and tokens, (3) log likelihoods of gold reference solutions.

Result: All three scaling laws perform comparably, with compute scaling slightly worse for small k and gold reference likelihood slightly worse for large k. Scaling parameters stabilize at different orders of magnitude across methods.

Conclusion: The framework provides researchers with methodologies to forecast generative performance, establishing that compute scaling emerges as the compute-optimal envelope of parameters-and-tokens scaling.

Abstract: Neural scaling laws have played a central role in modern machine learning, driving the field’s ever-expanding scaling of parameters, data and compute. While much research has gone into fitting scaling laws and predicting performance on pretraining losses and on discriminative evaluations such as multiple-choice question-answering, comparatively little research has been done on fitting scaling laws and predicting performance on generative evaluations such as mathematical problem-solving or software engineering. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using the performance of cheaper models. Our three scaling laws differ in the covariates used: (1) compute, (2) model parameters and tokens, (3) log likelihoods of gold reference solutions. We make four main contributions: (1) We show how generative evaluations offer new hyperparameters (in our setting, $k$) that researchers can use to control the scaling laws parameters and the predictability of performance. (2) In terms of scaling law parameters, we find that the compute scaling law and parameters,+,tokens scaling law stabilize for the last ~$1.5{-}2.5$ orders of magnitude, whereas the gold reference likelihood scaling law stabilizes for the last ~$5$ orders of magnitude. (3) In terms of predictive performance, we find all three scaling laws perform comparably, although the compute scaling law predicts slightly worse for small $k$ and the log likelihoods of gold reference solutions predicts slightly worse for large $k$. (4) We establish a theoretical connection that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens scaling law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance.

[1279] Optimism as Risk-Seeking in Multi-Agent Reinforcement Learning

Runyu Zhang, Na Li, Asuman Ozdaglar, Jeff Shamma, Gioele Zardini

Main category: cs.LG

TL;DR: The paper proposes a principled framework that interprets risk-seeking objectives as optimism in multi-agent reinforcement learning, unifying risk-sensitive learning and optimism to improve cooperation.

Details

Motivation: Existing optimistic methods in cooperative MARL are typically heuristic and lack theoretical grounding, while risk-averse approaches often lead to suboptimal equilibria. The authors aim to provide a theoretically sound approach to promote cooperation through optimism.

Method: Building on dual representation for convex risk measures, the authors introduce optimistic value functions that formalize optimism as divergence-penalized risk-seeking evaluations. They derive a policy-gradient theorem for these functions and develop decentralized optimistic actor-critic algorithms.

Result: Empirical results on cooperative benchmarks show that risk-seeking optimism consistently improves coordination over both risk-neutral baselines and heuristic optimistic methods.

Conclusion: The framework successfully unifies risk-sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.

Abstract: Risk sensitivity has become a central theme in reinforcement learning (RL), where convex risk measures and robust formulations provide principled ways to model preferences beyond expected return. Recent extensions to multi-agent RL (MARL) have largely emphasized the risk-averse setting, prioritizing robustness to uncertainty. In cooperative MARL, however, such conservatism often leads to suboptimal equilibria, and a parallel line of work has shown that optimism can promote cooperation. Existing optimistic methods, though effective in practice, are typically heuristic and lack theoretical grounding. Building on the dual representation for convex risk measures, we propose a principled framework that interprets risk-seeking objectives as optimism. We introduce optimistic value functions, which formalize optimism as divergence-penalized risk-seeking evaluations. Building on this foundation, we derive a policy-gradient theorem for optimistic value functions, including explicit formulas for the entropic risk/KL-penalty setting, and develop decentralized optimistic actor-critic algorithms that implement these updates. Empirical results on cooperative benchmarks demonstrate that risk-seeking optimism consistently improves coordination over both risk-neutral baselines and heuristic optimistic methods. Our framework thus unifies risk-sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.

[1280] Collaborative Device-Cloud LLM Inference through Reinforcement Learning

Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Christopher Brinton

Main category: cs.LG

TL;DR: Proposes a device-cloud collaboration framework where on-device LLMs make routing decisions after problem-solving, using post-training with reward maximization and adaptive policy gradient algorithms.

Details

Motivation: Existing binary classifier routers struggle to determine task difficulty from prompt patterns, limiting effective device-cloud collaboration for LLM deployment.

Method: On-device LLM makes routing decisions after solving, using post-training with reward maximization and group-adaptive policy gradient algorithm with adaptive prompt filtering.

Result: Consistently outperforms existing baselines and significantly narrows the gap to full cloud LLM performance across models and benchmarks.

Conclusion: The proposed framework enables more effective device-cloud collaboration by allowing on-device LLMs to make informed routing decisions after problem-solving.

Abstract: Device-cloud collaboration has emerged as a promising paradigm for deploying large language models (LLMs), combining the efficiency of lightweight on-device inference with the superior performance of powerful cloud LLMs. An essential problem in this scenario lies in deciding whether a given query is best handled locally or delegated to the cloud. Existing approaches typically rely on external routers, implemented as binary classifiers, which often struggle to determine task difficulty from the prompt’s surface pattern. To address these limitations, we propose a framework where the on-device LLM makes routing decisions at the end of its solving process, with this capability instilled through post-training. In particular, we formulate a reward maximization problem with carefully designed rewards that encourage effective problem solving and judicious offloading to the cloud. To solve this problem, we develop a group-adaptive policy gradient algorithm, featuring a group-level policy gradient, designed to yield an unbiased gradient estimator of the reward, and adaptive prompt filtering, developed to enforce the constraint on cloud LLM usage. Extensive experiments across models and benchmarks show that the proposed methodology consistently outperforms existing baselines and significantly narrows the gap to full cloud LLM performance.

[1281] On The Variability of Concept Activation Vectors

Julia Wenkmann, Damien Garreau

Main category: cs.LG

TL;DR: This paper provides a theoretical analysis of Concept Activation Vectors (CAVs) to quantify their variability due to random sampling, finding that variance decreases as 1/N where N is the number of random examples.

Details

Motivation: To address the variability in CAVs that arises from random sampling during their computation, which can lead to different results for different users despite using the same method.

Method: The authors conduct a fine-grained theoretical analysis of CAVs construction and validate their findings through experiments on several real-life datasets.

Result: The analysis reveals a universal result: the variance of CAVs decreases as 1/N, where N is the number of random examples used in the computation.

Conclusion: Based on the theoretical findings, the paper provides practical recommendations for resource-efficient application of the CAV method in explainable AI.

Abstract: One of the most pressing challenges in artificial intelligence is to make models more transparent to their users. Recently, explainable artificial intelligence has come up with numerous method to tackle this challenge. A promising avenue is to use concept-based explanations, that is, high-level concepts instead of plain feature importance score. Among this class of methods, Concept Activation vectors (CAVs), Kim et al. (2018) stands out as one of the main protagonists. One interesting aspect of CAVs is that their computation requires sampling random examples in the train set. Therefore, the actual vectors obtained may vary from user to user depending on the randomness of this sampling. In this paper, we propose a fine-grained theoretical analysis of CAVs construction in order to quantify their variability. Our results, confirmed by experiments on several real-life datasets, point out towards an universal result: the variance of CAVs decreases as $1/N$, where $N$ is the number of random examples. Based on this we give practical recommendations for a resource-efficient application of the method.

[1282] In-Context Compositional Q-Learning for Offline Reinforcement Learning

Qiushui Xu, Yuhao Huang, Yushu Jiang, Lei Song, Jinyu Wang, Wenliang Zheng, Jiang Bian

Main category: cs.LG

TL;DR: ICQL is an offline RL framework that formulates Q-learning as contextual inference using linear Transformers to adaptively infer local Q-functions from retrieved transitions, achieving improved performance in compositional tasks.

Details

Motivation: Existing approaches rely on single global Q-functions that struggle to capture compositional nature of tasks involving diverse subtasks, limiting performance in offline RL.

Method: Proposes In-context Compositional Q-Learning (ICQL) that uses linear Transformers to adaptively infer local Q-functions from retrieved transitions without explicit subtask labels, treating Q-learning as contextual inference problem.

Result: ICQL substantially improves performance: up to 16.4% in kitchen tasks, 8.6% in Gym tasks, and 6.3% in Adroit tasks. Theoretically achieves bounded Q-function approximation error and supports near-optimal policy extraction under certain assumptions.

Conclusion: ICQL demonstrates the potential of in-context learning for robust and compositional value estimation, positioning it as a principled and effective framework for offline reinforcement learning.

Abstract: Accurately estimating the Q-function is a central challenge in offline reinforcement learning. However, existing approaches often rely on a single global Q-function, which struggles to capture the compositional nature of tasks involving diverse subtasks. We propose In-context Compositional Q-Learning (\texttt{ICQL}), the first offline RL framework that formulates Q-learning as a contextual inference problem, using linear Transformers to adaptively infer local Q-functions from retrieved transitions without explicit subtask labels. Theoretically, we show that under two assumptions–linear approximability of the local Q-function and accurate weight inference from retrieved context–\texttt{ICQL} achieves bounded Q-function approximation error, and supports near-optimal policy extraction. Empirically, \texttt{ICQL} substantially improves performance in offline settings: improving performance in kitchen tasks by up to 16.4%, and in Gym and Adroit tasks by up to 8.6% and 6.3%. These results highlight the underexplored potential of in-context learning for robust and compositional value estimation, positioning \texttt{ICQL} as a principled and effective framework for offline RL.

[1283] A Small Math Model: Recasting Strategy Choice Theory in an LLM-Inspired Architecture

Roussel Rahman, Jeff Shrager

Main category: cs.LG

TL;DR: This paper recasts Strategy Choice Theory (SCT) as a Small Math Model (SMM) using neural-network architecture similar to LLMs, extending SCT to include counting practice, number embedding, and gated attention.

Details

Motivation: To provide a unified platform for investigating mathematical reasoning emergence in LLM-based agents by extending SCT with modern neural network approaches.

Method: Developed a Small Math Model (SMM) using neural-network-based architecture with counting practice, symbol embedding, and gated attention mechanisms, analogous to LLM architectures.

Result: The SMM demonstrates constructive and destructive interference between counting and addition, and shows wave-like use of finger-counting as sum recall improves, similar to earlier SCT findings.

Conclusion: The SMM provides a foundation for extending to later aspects of the SCT program, including adaptive strategy choice and strategy discovery, enabling investigation of numerical understanding emergence in LLM-based agents.

Abstract: Strategy Choice Theory (SCT)\footnote{Strategy Choice Theory'', Distributions of Associations’’, and Overlapping Wave Theory'' have been used to refer to this line of work, emphasizing different aspects.}\citep[e.g.,][]{siegler1984strategychoices, siegler2000rebirth} explains important aspects of children's arithmetic learning based upon principles including learning from developmentally naturalistic data, probabilistic representation, confidence-based retrieval, and the phase-like importance of scaffolding strategies, such as finger-counting. Here we recast SCT as a Small Math Model’’ (SMM), employing a neural-network-based architecture analogous to LLMs. The SMM extends SCT to include counting practice\footnote{The original SCT model was pre-biased in accordance with the supposed experience of counting.}, symbol (number) embedding, and gated attention. Similar to earlier work, the SMM demonstrates constructive and destructive interference between counting and addition, and the ``wave-like’’ use of finger-counting as sum recall improves. We plan to extend the SMM to later aspects of the decades-long SCT program, including adaptive strategy choice and eventually strategy discovery, providing a unified platform to investigate the understanding of numerical characteristics and relationships essential for mathematical reasoning – as it can emerge in LLM-based agents.

[1284] AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring

Youssef Sabiri, Walid Houmaidi, Ouail El Maadi, Yousra Chtouki

Main category: cs.LG

TL;DR: AQUAIR is an open-access dataset of indoor environmental quality variables from a fish aquaculture facility, providing over 23,000 time-stamped observations for smart aquaculture research.

Details

Motivation: Public datasets describing air conditions around indoor aquaculture tanks are scarce, limiting development of forecasting and anomaly-detection tools that link air quality with water-quality dynamics.

Method: Used a single Awair HOME monitor to sample six IEQ variables every five minutes from October 2024 to January 2025, with ISO-compliant mounting, calibration checks, and an open-source processing pipeline for quality control.

Result: Produced a quality-controlled dataset showing stable environmental conditions with feeding-time peaks, suitable for short-horizon forecasting, event detection, and sensor drift studies.

Conclusion: AQUAIR fills a critical gap in smart aquaculture informatics and provides a reproducible benchmark for machine learning curricula and environmental sensing research in recirculating aquaculture systems.

Abstract: Smart aquaculture systems depend on rich environmental data streams to protect fish welfare, optimize feeding, and reduce energy use. Yet public datasets that describe the air surrounding indoor tanks remain scarce, limiting the development of forecasting and anomaly-detection tools that couple head-space conditions with water-quality dynamics. We therefore introduce AQUAIR, an open-access public dataset that logs six Indoor Environmental Quality (IEQ) variables–air temperature, relative humidity, carbon dioxide, total volatile organic compounds, PM2.5 and PM10–inside a fish aquaculture facility in Amghass, Azrou, Morocco. A single Awair HOME monitor sampled every five minutes from 14 October 2024 to 9 January 2025, producing more than 23,000 time-stamped observations that are fully quality-controlled and publicly archived on Figshare. We describe the sensor placement, ISO-compliant mounting height, calibration checks against reference instruments, and an open-source processing pipeline that normalizes timestamps, interpolates short gaps, and exports analysis-ready tables. Exploratory statistics show stable conditions (median CO2 = 758 ppm; PM2.5 = 12 micrograms/m3) with pronounced feeding-time peaks, offering rich structure for short-horizon forecasting, event detection, and sensor drift studies. AQUAIR thus fills a critical gap in smart aquaculture informatics and provides a reproducible benchmark for data-centric machine learning curricula and environmental sensing research focused on head-space dynamics in recirculating aquaculture systems.

[1285] A Family of Kernelized Matrix Costs for Multiple-Output Mixture Neural Networks

Bo Hu, José C. Príncipe

Main category: cs.LG

TL;DR: This paper combines Mixture Density Networks (MDNs) with contrastive learning by proposing four kernelized matrix costs for data density approximation.

Details

Motivation: To improve self-supervised and contrastive feature learning by integrating pairwise distance-based costs with mixture density estimation.

Method: Proposes four kernelized matrix costs (scalar cost, vector-matrix cost, matrix-matrix cost using trace of Schur complement, and SVD cost using nuclear norm) combined with Mixture Density Networks for learning multiple centers in mixture densities.

Result: A novel framework that enables data density approximation through contrastive learning with multiple center representations.

Conclusion: The integration of MDNs with contrastive costs using kernelized matrix formulations provides an effective approach for generative modeling and density estimation in self-supervised learning.

Abstract: Pairwise distance-based costs are crucial for self-supervised and contrastive feature learning. Mixture Density Networks (MDNs) are a widely used approach for generative models and density approximation, using neural networks to produce multiple centers that define a Gaussian mixture. By combining MDNs with contrastive costs, this paper proposes data density approximation using four types of kernelized matrix costs: the scalar cost, the vector-matrix cost, the matrix-matrix cost (the trace of Schur complement), and the SVD cost (the nuclear norm), for learning multiple centers required to define a mixture density.

[1286] Demographic-Agnostic Fairness without Harm

Zhongteng Cai, Mohammad Mahdi Khalili, Xueru Zhang

Main category: cs.LG

TL;DR: Proposes a demographic-agnostic fairness optimization method that jointly learns group partitioning and decoupled classifiers to achieve preference-based fairness without requiring demographic information.

Details

Motivation: Address limitations of parity-based fairness (which lowers accuracy) and preference-based fairness (which requires demographic information), aiming to achieve fairness without sacrificing accuracy and without needing demographic data.

Method: DAFH algorithm that jointly learns a group classifier to partition population into groups and a set of decoupled classifiers for these groups, without requiring demographic information during training.

Result: Theoretical analysis shows method outperforms baselines when demographic information is known; experiments on synthetic and real data validate the approach.

Conclusion: Proposed demographic-agnostic method successfully achieves preference-based fairness without requiring demographic information, maintaining accuracy while ensuring fairness.

Abstract: As machine learning (ML) algorithms are increasingly used in social domains to make predictions about humans, there is a growing concern that these algorithms may exhibit biases against certain social groups. Numerous notions of fairness have been proposed in the literature to measure the unfairness of ML. Among them, one class that receives the most attention is \textit{parity-based}, i.e., achieving fairness by equalizing treatment or outcomes for different social groups. However, achieving parity-based fairness often comes at the cost of lowering model accuracy and is undesirable for many high-stakes domains like healthcare. To avoid inferior accuracy, a line of research focuses on \textit{preference-based} fairness, under which any group of individuals would experience the highest accuracy and collectively prefer the ML outcomes assigned to them if they were given the choice between various sets of outcomes. However, these works assume individual demographic information is known and fully accessible during training. In this paper, we relax this requirement and propose a novel \textit{demographic-agnostic fairness without harm (DAFH)} optimization algorithm, which jointly learns a group classifier that partitions the population into multiple groups and a set of decoupled classifiers associated with these groups. Theoretically, we conduct sample complexity analysis and show that our method can outperform the baselines when demographic information is known and used to train decoupled classifiers. Experiments on both synthetic and real data validate the proposed method.

[1287] PEARL: Peer-Enhanced Adaptive Radio via On-Device LLM

Ju-Hyung Lee, Yanqing Lu, Klaus Doppler

Main category: cs.LG

TL;DR: PEARL is a framework using on-device LLMs for cooperative D2D communication optimization, improving performance and reducing energy consumption through peer-aware context and efficient training methods.

Details

Motivation: Extend single-device on-device LLMs to cooperative scenarios by leveraging both publisher and subscriber states for better Wi-Fi Aware parameter selection in D2D communication.

Method: Uses context-aware reward (normalizing latency by application tolerances and modulating energy by battery states) for KL-based finetuning. Two variants: PEARL (Head + LoRA) and PEARL-Lite (Head-only).

Result: PEARL improves objective scores over baselines, reduces energy by up to 16% in cooperative low-battery cases, with PEARL-Lite achieving sub-20ms inference at near-identical performance.

Conclusion: Peer-aware context, reward-aligned training, and head-based efficiency make LLMs practical for always-on, on-device cross-layer control in D2D communication.

Abstract: We present PEARL (Peer-Enhanced Adaptive Radio via On-Device LLM), a framework for cooperative cross-layer optimization in device-to-device (D2D) communication. Building on our previous work on single-device on-device LLMs, PEARL extends the paradigm by leveraging both publisher and subscriber states to guide Wi-Fi Aware (WA) parameter selection. A context-aware reward, which normalizes latency by application tolerances and modulates energy by device battery states, provides richer supervision for KL-based finetuning. We study two lightweight variants: PEARL (Head + Low-Rank Adaptation (LoRA)) achieves the best overall performance, while PEARL-Lite (Head-only) delivers sub-20 ms inference at near-identical objective scores. Across synthetic scenarios grounded in real measurements, PEARL improves objective scores over heuristic and compact model baselines and reduces energy by up to 16% in cooperative low-battery cases. These results demonstrate that peer-aware context, reward-aligned training, and head-based efficiency make LLMs practical for always-on, on-device cross-layer control.

[1288] Clebsch-Gordan Transformer: Fast and Global Equivariant Attention

Owen Lewis Howell, Linfeng Zhao, Xupeng Zhu, Yaoyao Qian, Haojie Huang, Lingfeng Sun, Wil Thomason, Robert Platt, Robin Walters

Main category: cs.LG

TL;DR: The Clebsch-Gordan Transformer introduces efficient global attention using Clebsch-Gordon Convolution on SO(3) irreducible representations, achieving O(N log N) complexity while supporting high-order equivariant features and optional token permutation equivariance.

Details

Motivation: Existing equivariant transformers suffer from quadratic computational costs and limited support for high-order equivariant features, restricting their expressiveness and performance in geometric tasks.

Method: Proposes Clebsch-Gordon Convolution on SO(3) irreducible representations for efficient global attention, exploits sparsity of Clebsch-Gordon matrix for scalability with high-order features, and incorporates token permutation equivariance via weight sharing or data augmentation.

Result: Achieves superior performance in n-body simulation, QM9, ModelNet point cloud classification, and robotic grasping, with clear gains in GPU memory, speed, and accuracy over existing equivariant transformers.

Conclusion: The Clebsch-Gordan Transformer enables efficient global attention with high-order equivariant features while maintaining computational efficiency, advancing the state of equivariant transformers for geometric tasks.

Abstract: The global attention mechanism is one of the keys to the success of transformer architecture, but it incurs quadratic computational costs in relation to the number of tokens. On the other hand, equivariant models, which leverage the underlying geometric structures of problem instance, often achieve superior accuracy in physical, biochemical, computer vision, and robotic tasks, at the cost of additional compute requirements. As a result, existing equivariant transformers only support low-order equivariant features and local context windows, limiting their expressiveness and performance. This work proposes Clebsch-Gordan Transformer, achieving efficient global attention by a novel Clebsch-Gordon Convolution on $\SO(3)$ irreducible representations. Our method enables equivariant modeling of features at all orders while achieving ${O}(N \log N)$ input token complexity. Additionally, the proposed method scales well with high-order irreducible features, by exploiting the sparsity of the Clebsch-Gordon matrix. Lastly, we also incorporate optional token permutation equivariance through either weight sharing or data augmentation. We benchmark our method on a diverse set of benchmarks including n-body simulation, QM9, ModelNet point cloud classification and a robotic grasping dataset, showing clear gains over existing equivariant transformers in GPU memory size, speed, and accuracy.

[1289] ADAPT: Lightweight, Long-Range Machine Learning Force Fields Without Graphs

Evan Dramko, Yihuang Xiong, Yizhi Zhu, Geoffroy Hautier, Thomas Reps, Christopher Jermaine, Anastasios Kyrillidis

Main category: cs.LG

TL;DR: ADAPT is a machine-learning force field using Transformer architecture that outperforms graph neural networks for modeling silicon point defects with 33% lower errors and reduced computational cost.

Details

Motivation: First-principles methods for computing defect properties are computationally expensive, and existing MLFFs based on graph neural networks suffer from oversmoothing and poor long-range interaction modeling, especially problematic for point defects.

Method: ADAPT replaces graph representations with direct coordinates-in-space formulation, treats atoms as tokens, uses Transformer encoder to model all pairwise atomic interactions explicitly.

Result: Achieved ~33% reduction in both force and energy prediction errors compared to state-of-the-art GNN-based model, while requiring only a fraction of the computational cost.

Conclusion: ADAPT provides an effective alternative to GNN-based MLFFs for modeling point defects, addressing key limitations through its coordinate-based Transformer architecture.

Abstract: Point defects play a central role in driving the properties of materials. First-principles methods are widely used to compute defect energetics and structures, including at scale for high-throughput defect databases. However, these methods are computationally expensive, making machine-learning force fields (MLFFs) an attractive alternative for accelerating structural relaxations. Most existing MLFFs are based on graph neural networks (GNNs), which can suffer from oversmoothing and poor representation of long-range interactions. Both of these issues are especially of concern when modeling point defects. To address these challenges, we introduce the Accelerated Deep Atomic Potential Transformer (ADAPT), an MLFF that replaces graph representations with a direct coordinates-in-space formulation and explicitly considers all pairwise atomic interactions. Atoms are treated as tokens, with a Transformer encoder modeling their interactions. Applied to a dataset of silicon point defects, ADAPT achieves a roughly 33 percent reduction in both force and energy prediction errors relative to a state-of-the-art GNN-based model, while requiring only a fraction of the computational cost.

[1290] GeoFunFlow: Geometric Function Flow Matching for Inverse Operator Learning over Complex Geometries

Sifan Wang, Zhikai Wu, David van Dijk, Lu Lu

Main category: cs.LG

TL;DR: GeoFunFlow is a geometric diffusion model framework for solving inverse PDE problems on complex geometries, combining a geometric function autoencoder with latent diffusion to achieve efficient posterior sampling and accurate reconstructions.

Details

Motivation: Inverse PDE problems are challenging due to ill-posedness, data sparsity, and complex geometries. Classical optimization methods are computationally expensive, while existing learning approaches are limited to regular domains or forward modeling.

Method: Combines a geometric function autoencoder (GeoFAE) with Perceiver modules to handle unstructured meshes and varying sizes, and a latent diffusion model trained via rectified flow for posterior sampling from sparse noisy data.

Result: Achieves state-of-the-art reconstruction accuracy across five benchmarks on complex geometries, provides calibrated uncertainty quantification, and delivers efficient inference compared to operator-learning and diffusion model baselines.

Conclusion: GeoFunFlow effectively addresses inverse PDE problems on complex geometries through its geometric diffusion framework, offering accurate reconstructions, uncertainty quantification, and computational efficiency.

Abstract: Inverse problems governed by partial differential equations (PDEs) are crucial in science and engineering. They are particularly challenging due to ill-posedness, data sparsity, and the added complexity of irregular geometries. Classical PDE-constrained optimization methods are computationally expensive, especially when repeated posterior sampling is required. Learning-based approaches improve efficiency and scalability, yet most are designed for regular domains or focus on forward modeling. Here, we introduce {\em GeoFunFlow}, a geometric diffusion model framework for inverse problems on complex geometries. GeoFunFlow combines a novel geometric function autoencoder (GeoFAE) and a latent diffusion model trained via rectified flow. GeoFAE employs a Perceiver module to process unstructured meshes of varying sizes and produces continuous reconstructions of physical fields, while the diffusion model enables posterior sampling from sparse and noisy data. Across five benchmarks, GeoFunFlow achieves state-of-the-art reconstruction accuracy over complex geometries, provides calibrated uncertainty quantification, and delivers efficient inference compared to operator-learning and diffusion model baselines.

[1291] HyMaTE: A Hybrid Mamba and Transformer Model for EHR Representation Learning

Md Mozaharul Mottalib, Thao-Ly T. Phan, Rahmatollah Beheshti

Main category: cs.LG

TL;DR: HyMaTE is a hybrid model combining State Space Models (Mamba) and Transformers for EHR representation learning, addressing challenges of long sequences and computational complexity while improving interpretability.

Details

Motivation: EHR data has complexity with long multivariate sequences, sparsity, and missing values. Transformers face quadratic complexity and limited context length, while SSMs like Mamba focus more on sequence-level than channel-level information.

Method: Proposed HyMaTE model that combines State Space Models (SSMs) with advanced attention mechanisms, creating a hybrid architecture for longitudinal EHR data representation.

Result: Demonstrated HyMaTE’s ability to capture effective, richer, and more nuanced unified representation of EHR data across multiple clinical datasets, with improved interpretability through self-attention mechanisms.

Conclusion: HyMaTE provides a scalable and generalizable solution for real-world healthcare applications, offering better efficiency and interpretability for EHR data modeling.

Abstract: Electronic health Records (EHRs) have become a cornerstone in modern-day healthcare. They are a crucial part for analyzing the progression of patient health; however, their complexity, characterized by long, multivariate sequences, sparsity, and missing values poses significant challenges in traditional deep learning modeling. While Transformer-based models have demonstrated success in modeling EHR data and predicting clinical outcomes, their quadratic computational complexity and limited context length hinder their efficiency and practical applications. On the other hand, State Space Models (SSMs) like Mamba present a promising alternative offering linear-time sequence modeling and improved efficiency for handling long sequences, but focus mostly on mixing sequence-level information rather than channel-level data. To overcome these challenges, we propose HyMaTE (A Hybrid Mamba and Transformer Model for EHR Representation Learning), a novel hybrid model tailored for representing longitudinal data, combining the strengths of SSMs with advanced attention mechanisms. By testing the model on predictive tasks on multiple clinical datasets, we demonstrate HyMaTE’s ability to capture an effective, richer, and more nuanced unified representation of EHR data. Additionally, the interpretability of the outcomes achieved by self-attention illustrates the effectiveness of our model as a scalable and generalizable solution for real-world healthcare applications. Codes are available at: https://github.com/healthylaife/HyMaTE.

[1292] Echo Flow Networks

Hongbo Liu, Jia Xu

Main category: cs.LG

TL;DR: Echo Flow Networks (EFNs) introduce a novel reservoir computing framework that combines extended Echo State Networks with MLP readouts and Matrix-Gated Composite Random Activation, achieving state-of-the-art time-series forecasting performance with significantly improved efficiency.

Details

Motivation: To address the fundamental challenge in time-series forecasting of efficiently capturing long-range temporal dependencies while overcoming the computational complexity vs. information retention trade-off in conventional architectures.

Method: EFNs use a group of extended Echo State Networks (X-ESNs) with MLP readouts, enhanced by Matrix-Gated Composite Random Activation (MCRA) for complex temporal dynamics, and a dual-stream architecture that dynamically selects reservoir features from infinite-horizon memory.

Result: EFNs achieve up to 4x faster training and 3x smaller model size compared to PatchTST, reducing forecasting error from 43% to 35% (20% relative improvement), with EchoFormer achieving state-of-the-art performance across five benchmark datasets.

Conclusion: The EFN framework successfully bridges the gap between computational efficiency and forecasting accuracy in long-range time-series forecasting, offering a promising direction for efficient temporal modeling.

Abstract: At the heart of time-series forecasting (TSF) lies a fundamental challenge: how can models efficiently and effectively capture long-range temporal dependencies across ever-growing sequences? While deep learning has brought notable progress, conventional architectures often face a trade-off between computational complexity and their ability to retain accumulative information over extended horizons. Echo State Networks (ESNs), a class of reservoir computing models, have recently regained attention for their exceptional efficiency, offering constant memory usage and per-step training complexity regardless of input length. This makes them particularly attractive for modeling extremely long-term event history in TSF. However, traditional ESNs fall short of state-of-the-art performance due to their limited nonlinear capacity, which constrains both their expressiveness and stability. We introduce Echo Flow Networks (EFNs), a framework composed of a group of extended Echo State Networks (X-ESNs) with MLP readouts, enhanced by our novel Matrix-Gated Composite Random Activation (MCRA), which enables complex, neuron-specific temporal dynamics, significantly expanding the network’s representational capacity without compromising computational efficiency. In addition, we propose a dual-stream architecture in which recent input history dynamically selects signature reservoir features from an infinite-horizon memory, leading to improved prediction accuracy and long-term stability. Extensive evaluations on five benchmarks demonstrate that EFNs achieve up to 4x faster training and 3x smaller model size compared to leading methods like PatchTST, reducing forecasting error from 43% to 35%, a 20% relative improvement. One instantiation of our framework, EchoFormer, consistently achieves new state-of-the-art performance across five benchmark datasets: ETTh, ETTm, DMV, Weather, and Air Quality.

[1293] The Impossibility of Inverse Permutation Learning in Transformer Models

Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah

Main category: cs.LG

TL;DR: Decoder-only transformers cannot learn inverse permutation tasks, but adding scratch tokens enables this capability, suggesting a mechanism for chain-of-thought reasoning.

Details

Motivation: To study a natural robustness property across reasoning tasks like long-context retrieval, multiple choice QA, and in-context learning through inverse permutation learning.

Method: Analyzing the expressive capacity of decoder-only transformers for inverse permutation tasks, exploring alternative constructions including causal attention masks and scratch token padding.

Result: Proved impossibility of inverse permutation learning in arbitrary depth decoder-only transformers, but showed feasibility with causal attention masks or scratch token padding.

Conclusion: Scratch tokens may provide an alternative mechanism for chain-of-thought reasoning in LLMs, even without meaningful semantic content.

Abstract: In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original (canonical'') string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input with scratch tokens” yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate ``thinking’’ tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).

[1294] A signal separation view of classification

H. N. Mhaskar, Ryan O’Dowd

Main category: cs.LG

TL;DR: A novel classification approach using localized trigonometric polynomial kernels to determine the number of classes and achieve perfect classification with minimal labeled data in compact metric spaces.

Details

Motivation: Traditional classification methods rely on function approximation, but this paper proposes an alternative that can automatically determine the number of classes and achieve perfect classification with minimal labeled data.

Method: Uses localized trigonometric polynomial kernels originally developed for point source signal separation, adapted to separate supports of different probability distributions representing classes. Implements hierarchical MASC algorithm to handle touching/overlapping class boundaries.

Result: The method successfully separates classes in various simulated and real datasets including Salinas and Indian Pines hyperspectral datasets and document datasets.

Conclusion: The localized kernel approach provides an effective alternative to function approximation methods for classification, capable of automatically determining class numbers and achieving perfect classification with minimal labeled data.

Abstract: The problem of classification in machine learning has often been approached in terms of function approximation. In this paper, we propose an alternative approach for classification in arbitrary compact metric spaces which, in theory, yields both the number of classes, and a perfect classification using a minimal number of queried labels. Our approach uses localized trigonometric polynomial kernels initially developed for the point source signal separation problem in signal processing. Rather than point sources, we argue that the various classes come from different probability distributions. The localized kernel technique developed for separating point sources is then shown to separate the supports of these distributions. This is done in a hierarchical manner in our MASC algorithm to accommodate touching/overlapping class boundaries. We illustrate our theory on several simulated and real life datasets, including the Salinas and Indian Pines hyperspectral datasets and a document dataset.

[1295] Evaluation of Machine and Deep Learning Techniques for Cyclone Trajectory Regression and Status Classification by Time Series Data

Ethan Zachary Lo, Dan Chie-Tien Lo

Main category: cs.LG

TL;DR: Machine learning approach using gradient boosting regression and random forest classification achieves 93% accuracy in cyclone trajectory and status forecasting, outperforming traditional methods.

Details

Motivation: Traditional numerical weather prediction models are computationally intensive and error-prone due to atmospheric chaos, creating need for more efficient forecasting methods.

Method: Two-stage ML pipeline: regression model predicts cyclone features using sliding window of historical data, then classification models predict categorical status using gradient boosting, random forest, SVM, and MLP classifiers with SMOTE.

Result: RF classifier achieves 93% accuracy with strong precision, recall, and F1 scores, particularly robust for minority statuses. Regression yields low errors: pressure within 2.2 mb, wind within 2.4 kt.

Conclusion: ML models, especially ensemble-based classifiers, offer effective scalable alternative to traditional forecasting with potential for real-time prediction and decision support integration.

Abstract: Accurate cyclone forecasting is essential for minimizing loss of life, infrastructure damage, and economic disruption. Traditional numerical weather prediction models, though effective, are computationally intensive and prone to error due to the chaotic nature of atmospheric systems. This study proposes a machine learning (ML) approach to forecasting tropical cyclone trajectory and status using time series data from the National Hurricane Center, including recently added best track wind radii. A two-stage ML pipeline is developed: a regression model first predicts cyclone features maximum wind speed, minimum pressure, trajectory length, and directional change using a sliding window of historical data. These outputs are then input into classification models to predict the cyclone’s categorical status. Gradient boosting regression and three classifiers random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP) are evaluated. After hyperparameter tuning and synthetic minority oversampling (SMOTE), the RF classifier achieves the highest performance with 93% accuracy, outperforming SVM and MLP across precision, recall, and F1 score. The RF model is particularly robust in identifying minority cyclone statuses and minimizing false negatives. Regression results yield low mean absolute errors, with pressure and wind predictions within about 2.2 mb and 2.4 kt, respectively. These findings demonstrate that ML models, especially ensemble-based classifiers, offer an effective, scalable alternative to traditional forecasting methods, with potential for real-time cyclone prediction and integration into decision support systems.

[1296] Stable Forgetting: Bounded Parameter-Efficient Unlearning in LLMs

Arpit Garg, Hemanth Saratchandran, Ravi Garg, Simon Lucey

Main category: cs.LG

TL;DR: Proposes Bounded Parameter-Efficient Unlearning, a stable method for machine unlearning in LLMs that addresses gradient instability in existing gradient difference approaches by applying bounded functions to MLP adapters.

Details

Motivation: Existing machine unlearning approaches in LLMs are unstable and unreliable, with gradient difference methods causing unbounded weight growth and training instability when combined with cross-entropy loss.

Method: A parameter-efficient approach that stabilizes LoRA-based fine-tuning by applying bounded functions to MLP adapters, controlling weight dynamics during gradient ascent on forget data.

Result: Achieves substantial improvements in forgetting while preserving retention across TOFU, TDEC, and MUSE benchmarks, and scales from 125M to 8B parameters.

Conclusion: Establishes a theoretically grounded and practically scalable framework for stable unlearning in LLMs.

Abstract: Machine unlearning in large language models (LLMs) is essential for privacy and safety; however, existing approaches remain unstable and unreliable. A widely used strategy, the gradient difference method, applies gradient descent on retained data while performing gradient ascent on forget data, the data whose influence should be removed. However, when combined with cross-entropy loss, this procedure causes unbounded growth of weights and gradients, leading to training instability and degrading both forgetting and retention. We provide a theoretical framework that explains this failure, explicitly showing how ascent on the forget set destabilizes optimization in the feedforward MLP layers of LLMs. Guided by this insight, we propose Bounded Parameter-Efficient Unlearning, a parameter-efficient approach that stabilizes LoRA-based fine-tuning by applying bounded functions to MLP adapters. This simple modification controls the weight dynamics during ascent, enabling the gradient difference method to converge reliably. Across the TOFU, TDEC, and MUSE benchmarks, and across architectures and scales from 125M to 8B parameters, our method achieves substantial improvements in forgetting while preserving retention, establishing a novel theoretically grounded and practically scalable framework for unlearning in LLMs.

[1297] Multi-Scale Geometric Autoencoder

Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Li Shen

Main category: cs.LG

TL;DR: MAE is an asymmetric autoencoder that preserves both global and local geometric structures by applying global constraints to the encoder and local constraints to the decoder, outperforming existing methods.

Details

Motivation: Existing autoencoder approaches either preserve global or local geometric properties separately, leading to accumulated distance errors or distorted large-scale relationships.

Method: Asymmetric architecture with global distance constraints on the encoder and local geometric constraints on the decoder to simultaneously preserve multi-scale geometric structure.

Result: MAE consistently outperforms existing methods across various evaluation metrics on both synthetic manifolds and real-world datasets.

Conclusion: The asymmetric design naturally aligns with encoder-decoder roles and effectively preserves multi-scale geometric structure in latent representations.

Abstract: Autoencoders have emerged as powerful models for visualization and dimensionality reduction based on the fundamental assumption that high-dimensional data is generated from a low-dimensional manifold. A critical challenge in autoencoder design is to preserve the geometric structure of data in the latent space, with existing approaches typically focusing on either global or local geometric properties separately. Global approaches often encounter errors in distance approximation that accumulate, while local methods frequently converge to suboptimal solutions that distort large-scale relationships. We propose Multi-Scale Geometric Autoencoder (MAE), which introduces an asymmetric architecture that simultaneously preserves both scales of the geometric structure by applying global distance constraints to the encoder and local geometric constraints to the decoder. Through theoretical analysis, we establish that this asymmetric design aligns naturally with the distinct roles of the encoder and decoder components. Our comprehensive experiments on both synthetic manifolds and real-world datasets demonstrate that MAE consistently outperforms existing methods across various evaluation metrics.

[1298] Model Correlation Detection via Random Selection Probing

Ruibo Chen, Sheng Zhang, Yihan Wu, Tong Zheng, Peihua Mai, Heng Huang

Main category: cs.LG

TL;DR: RSP is a statistical framework that detects model correlations by testing transferability of optimized prefixes, producing rigorous p-values to quantify evidence of correlation between models.

Details

Motivation: Existing methods for detecting model correlations require parameter access or produce heuristic scores without principled thresholds, limiting their applicability in the growing ecosystem of LLMs and VLMs.

Method: RSP formulates model correlation detection as a statistical test by optimizing textual/visual prefixes on a reference model for random selection tasks and evaluating their transferability to target models, using an unrelated baseline model to filter out generic features.

Result: Experiments show RSP consistently yields small p-values for related models while maintaining high p-values for unrelated ones, demonstrating robustness across LLMs and VLMs under diverse access conditions.

Conclusion: RSP establishes the first principled statistical framework for model correlation detection, enabling transparent and interpretable decisions in machine learning ecosystems.

Abstract: The growing prevalence of large language models (LLMs) and vision-language models (VLMs) has heightened the need for reliable techniques to determine whether a model has been fine-tuned from or is even identical to another. Existing similarity-based methods often require access to model parameters or produce heuristic scores without principled thresholds, limiting their applicability. We introduce Random Selection Probing (RSP), a hypothesis-testing framework that formulates model correlation detection as a statistical test. RSP optimizes textual or visual prefixes on a reference model for a random selection task and evaluates their transferability to a target model, producing rigorous p-values that quantify evidence of correlation. To mitigate false positives, RSP incorporates an unrelated baseline model to filter out generic, transferable features. We evaluate RSP across both LLMs and VLMs under diverse access conditions for reference models and test models. Experiments on fine-tuned and open-source models show that RSP consistently yields small p-values for related models while maintaining high p-values for unrelated ones. Extensive ablation studies further demonstrate the robustness of RSP. These results establish RSP as the first principled and general statistical framework for model correlation detection, enabling transparent and interpretable decisions in modern machine learning ecosystems.

[1299] FM-FoG: A Real-Time Foundation Model-based Wearable System for Freezing-of-Gait Mitigation

Chuntian Chi, John Clapham, Leslie Cloud, Ingrid Pretzer-Aboff, GinaMari Blackwell, Huajie Shao, Gang Zhou

Main category: cs.LG

TL;DR: FM-FoG is a real-time foundation model-based wearable system that detects Freezing-of-Gait in Parkinson’s patients without requiring patient-specific training, achieving 98.5% F1-score on unseen patients while extending smartphone battery life by up to 72%.

Details

Motivation: Current FoG detection systems require extensive patient-specific training data and lack generalization, limiting clinical deployment. FoG affects over 50% of mid-to-late stage PD patients, significantly impairing mobility independence and quality of life.

Method: Combines self-supervised pretraining on diverse IMU datasets with sensor context integration. Uses a lightweight CNN-LSTM activity classifier to selectively activate the foundation model only during walking or standing, avoiding unnecessary computation.

Result: Achieves 98.5% F1-score when tested on previously unseen patients (VCU FoG-IMU dataset with 23 PD patients), substantially outperforming competitive baseline methods. Deployed on Google Pixel 8a smartphone, extends battery life by up to 72% while maintaining sub-20ms intervention latency.

Conclusion: FM-FoG can enable practical, energy-efficient healthcare applications that generalize across patients without individual training requirements, addressing key limitations of current FoG detection systems.

Abstract: Freezing-of-Gait (FoG) affects over 50% of mid-to-late stage Parkinson’s disease (PD) patients, significantly impairing patients’ mobility independence and reducing quality of life. FoG is characterized by sudden episodes where walking cannot start or is interrupted, occurring exclusively during standing or walking, and never while sitting or lying down. Current FoG detection systems require extensive patient-specific training data and lack generalization, limiting clinical deployment. To address these issues, we introduce FM-FoG, a real-time foundation model-based wearable system achieving FoG detection in unseen patients without patient-specific training. Our approach combines self-supervised pretraining on diverse Inertial Measurement Unit (IMU) datasets with sensor context integration. Since FoG occurs only during ambulatory activities, a lightweight CNN-LSTM activity classifier selectively activates the foundation model only during walking or standing, avoiding unnecessary computation. Evaluated on the VCU FoG-IMU dataset with 23 PD patients, FM-FoG achieves a 98.5% F1-score when tested on previously unseen patients, substantially outperforming competitive baseline methods. Deployed on a Google Pixel 8a smartphone, the system extends battery life by up to 72% while maintaining sub-20ms intervention latency. The results indicate that our FM-FoG can enable practical, energy-efficient healthcare applications that generalize across patients without individual training requirements.

[1300] Negative Pre-activations Differentiate Syntax

Linghao Kong, Angelina Ning, Micah Adler, Nir Shavit

Main category: cs.LG

TL;DR: Wasserstein neurons are a sparse but critical class of entangled neurons in LLMs that enable syntax processing through negative differentiation in pre-activation space, particularly for syntactic tokens.

Details

Motivation: To understand the functional role of recently discovered Wasserstein neurons and their unique contribution to language model performance, especially in syntactic processing.

Method: Used targeted ablation experiments that zero only negative pre-activations of entangled neurons, compared with random and perplexity-matched controls, and analyzed layer-specific interventions across training checkpoints.

Result: Targeted removal of negative pre-activations in Wasserstein neurons significantly weakens model function and disrupts grammatical behavior, while controls leave performance unchanged. The effect accumulates across layers and emerges as Wasserstein neurons stabilize during training.

Conclusion: Negative differentiation in a sparse subset of entangled Wasserstein neurons is a crucial mechanism that language models rely on for syntax processing.

Abstract: A recently discovered class of entangled neurons, known as Wasserstein neurons, is disproportionately critical in large language models despite constituting only a very small fraction of the network: their targeted removal collapses the model, consistent with their unique role in differentiating similar inputs. Interestingly, in Wasserstein neurons immediately preceding smooth activation functions, such differentiation manifests in the negative pre-activation space, especially in early layers. Pairs of similar inputs are driven to highly distinct negative values, and these pairs involve syntactic tokens such as determiners and prepositions. We show that this negative region is functional rather than simply favorable for optimization. A minimal, sign-specific intervention that zeroes only the negative pre-activations of a small subset of entangled neurons significantly weakens overall model function and disrupts grammatical behavior, while both random and perplexity-matched controls leave grammatical performance largely unchanged. Part of speech analysis localizes the excess surprisal to syntactic scaffolding tokens, and layer-specific interventions reveal that small local degradations accumulate across depth. Over training checkpoints, the same ablation impairs grammatical behavior as Wasserstein neurons emerge and stabilize. Together, these results identify negative differentiation in a sparse subset of entangled neurons as a crucial mechanism that language models rely on for syntax.

[1301] Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding

Main category: cs.LG

TL;DR: This paper provides a first-principles derivation showing that group-relative REINFORCE can be interpreted as an off-policy algorithm, leading to two principles for adapting REINFORCE to off-policy settings and unifying recent RL methods.

Details

Motivation: The motivation is to address practical constraints in real-world LLM applications, the complexity of LLM-RL infrastructure, and the need for RL methodology innovations by enabling off-policy reinforcement learning for large language models.

Method: The authors present a first-principles derivation of group-relative REINFORCE without assuming specific training data distribution, revealing its native off-policy interpretation. This leads to two principles: regularizing policy updates and actively shaping data distribution.

Result: The analysis demystifies roles of importance sampling and clipping in GRPO, unifies OPMD and AsymRE as regularized REINFORCE forms, provides theoretical justification for data-weighting strategies, and offers actionable insights validated through empirical studies.

Conclusion: The work opens up new opportunities for principled algorithm design in off-policy RL for LLMs by providing a theoretical foundation that enables practical adaptation of REINFORCE to off-policy settings.

Abstract: Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms – Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) – as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.

[1302] MDD-Thinker: Towards Large Reasoning Models for Major Depressive Disorder Diagnosis

Yuyang Sha, Hongxin Pan, Gang Luo, Caijuan Shi, Jing Wang, Kefeng Li

Main category: cs.LG

TL;DR: MDD-Thinker is an LLM-based diagnostic framework that combines supervised fine-tuning and reinforcement learning to improve MDD diagnosis accuracy and interpretability, achieving superior performance over traditional methods and general-purpose LLMs.

Details

Motivation: Current MDD diagnostic approaches rely on subjective assessments and lack multimodal integration. LLMs offer potential for better accuracy but face challenges with interpretability, hallucination, and synthetic data reliance.

Method: Developed MDD-Thinker framework using UK Biobank data (40,000 reasoning samples) plus 10,000 samples from public mental health datasets. Combined supervised fine-tuning with reinforcement learning to enhance reasoning and interpretability.

Result: Achieved accuracy of 0.8268 and F1-score of 0.8081, significantly outperforming SVM, MLP, and general-purpose LLMs. SFT+RL combination yielded 29.0% accuracy gain, 38.1% F1-score gain, and 34.8% AUC gain. Comparable reasoning to larger LLMs with better efficiency.

Conclusion: First reasoning-enhanced LLM framework for MDD diagnosis using real-world clinical data. Successfully balances accuracy, interpretability, and efficiency, offering scalable approach for psychiatric diagnostics and potential broader mental health applications.

Abstract: Background Major depressive disorder (MDD) is a leading cause of global disability, yet current diagnostic approaches often rely on subjective assessments and lack the ability to integrate multimodal clinical information. Large language models (LLMs) hold promise for enhancing diagnostic accuracy through advanced reasoning but face challenges in interpretability, hallucination, and reliance on synthetic data. Methods We developed MDD-Thinker, an LLM-based diagnostic framework that integrates supervised fine-tuning (SFT) with reinforcement learning (RL) to strengthen reasoning ability and interpretability. Using the UK Biobank dataset, we generated 40,000 reasoning samples, supplemented with 10,000 samples from publicly available mental health datasets. The model was fine-tuned on these reasoning corpora, and its diagnostic and reasoning performance was evaluated against machine learning, deep learning, and state-of-the-art LLM baselines. Findings MDD-Thinker achieved an accuracy of 0.8268 and F1-score of 0.8081, significantly outperforming traditional baselines such as SVM and MLP, as well as general-purpose LLMs. Incorporating both SFT and RL yielded the greatest improvements, with relative gains of 29.0% in accuracy, 38.1% in F1-score, and 34.8% in AUC. Moreover, the model demonstrated comparable reasoning performance compared to much larger LLMs, while maintaining computational efficiency. Interpretation This study presents the first reasoning-enhanced LLM framework for MDD diagnosis trained on large-scale real-world clinical data. By integrating SFT and RL, MDD-Thinker balances accuracy, interpretability, and efficiency, offering a scalable approach for intelligent psychiatric diagnostics. These findings suggest that reasoning-oriented LLMs can provide clinically reliable support for MDD detection and may inform broader applications in mental health care.

[1303] CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction

Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, Chris Lott

Main category: cs.LG

TL;DR: CAOTE is a token eviction method that uses attention scores and value vectors to optimize eviction error, improving accuracy when combined with existing attention-based methods.

Details

Motivation: Current token eviction methods use attention scores as importance metrics but lack information about tokens' contributions to attention outputs, causing limitations in memory and compute optimization for large language models.

Method: Proposes CAOTE - a token eviction criterion based on cached tokens’ contributions to attention outputs, integrating attention scores and value vectors in closed-form. Can be used as a meta-heuristic with any token eviction method.

Result: CAOTE consistently improves accuracies on downstream tasks when combined with state-of-the-art attention score-based methods, demonstrating the importance of leveraging value information during token eviction.

Conclusion: Using value tokens alongside attention scores in token eviction processes enhances performance, and CAOTE provides an effective approach that can flexibly improve existing eviction methods.

Abstract: While long context support of large language models has extended their abilities, it also incurs challenges in memory and compute which becomes crucial bottlenecks in resource-restricted devices. Token eviction, a widely adopted post-training methodology designed to alleviate the bottlenecks by evicting less important tokens from the cache, typically uses attention scores as proxy metrics for token importance. However, one major limitation of attention score as a token-wise importance metrics is that it lacks the information about contribution of tokens to the attention output. In this paper, we propose a simple eviction criterion based on the contribution of cached tokens to attention outputs. Our method, CAOTE, optimizes for eviction error due to token eviction, by seamlessly integrating attention scores and value vectors. This is the first method which uses value tokens on top of attention-based eviction scores in closed-form. Additionally, CAOTE can act as a meta-heuristic method with flexible usage with any token eviction method. We show that CAOTE, when combined with the state-of-the-art attention score-based methods, always improves accuracies on the downstream task, indicating the importance of leveraging information from values during token eviction process.

[1304] Conda: Column-Normalized Adam for Training Large Language Models Faster

Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin

Main category: cs.LG

TL;DR: Conda is a novel optimizer that combines the coordinate-wise adaptivity of Adam with improved spectral conditioning through column-wise normalization, achieving 2-2.5x faster convergence than AdamW in LLM pre-training.

Details

Motivation: Address the limitations of Adam-based optimizers which suffer from poor spectral conditioning and low-rank structures, while Muon lacks the per-coordinate adaptivity of Adam.

Method: Projects updates into an orthogonal subspace and applies column-wise second moment normalization based on projected gradients, maintaining coordinate-wise adaptivity while improving spectral conditioning.

Result: Conda consistently outperforms AdamW, Muon, and other baselines in pre-training LLaMA and GPT-2 series, achieving 2-2.5x faster convergence speed than AdamW in both training steps and training time.

Conclusion: Conda is an effective and broadly applicable optimizer for large-scale LLM training that bridges the strengths of Adam and spectral normalization approaches.

Abstract: Large language models (LLMs) have demonstrated impressive generalization and emergent capabilities, yet their pre-training remains computationally expensive and sensitive to optimization dynamics. While Adam-based optimizers offer fast convergence by adapting learning rates coordinate-wise, recent studies reveal that their updates often suffer from poor spectral conditioning and low-rank structures, hindering efficiency. Muon addresses this issue via global spectral normalization but lacks the per-coordinate adaptivity of Adam. In this work, we propose \textbf{Column-Normalized Adam (Conda)}, a novel optimizer that bridges the strengths of both approaches. Conda projects updates into an orthogonal subspace and applies column-wise second moment normalization based on the projected gradients, thereby achieving both improved spectral conditioning and maintaining coordinate-wise adaptivity. This design alleviates the spectral pathologies of Adam while preserving its fast convergence behavior. Extensive experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training. Remarkably, on the LLaMA series, \textbf{Conda achieves $2{\sim}2.5\times$ the convergence speed of AdamW, measured in both training steps and training time.} Further ablations demonstrate its robustness under diverse training setups. These results collectively highlight Conda as an effective and broadly applicable optimizer for large-scale LLM training. The code is released on https://github.com/jie040109/Conda

[1305] Multiplicative-Additive Constrained Models:Toward Joint Visualization of Interactive and Independent Effects

Fumin Wang

Main category: cs.LG

TL;DR: MACMs combine multiplicative and additive components to improve interpretability while capturing feature interactions, outperforming both CESR and GAMs in predictive performance.

Details

Motivation: To address the trade-off between interpretability and predictive performance in machine learning for high-stakes applications like healthcare, where GAMs sacrifice interactions for interpretability and CESR fails to outperform GAMs.

Method: Introduces Multiplicative-Additive Constrained Models (MACMs) that augment CESR with an additive part to disentangle interactive and independent feature effects, using neural network implementations.

Result: Neural network-based MACMs significantly outperform both CESR and state-of-the-art GAMs in predictive performance while maintaining interpretability.

Conclusion: MACMs provide an effective solution that combines the interpretability benefits of GAMs with the interaction-capturing ability of multiplicative models, achieving superior performance in high-stakes applications.

Abstract: Interpretability is one of the considerations when applying machine learning to high-stakes fields such as healthcare that involve matters of life safety. Generalized Additive Models (GAMs) enhance interpretability by visualizing shape functions. Nevertheless, to preserve interpretability, GAMs omit higher-order interaction effects (beyond pairwise interactions), which imposes significant constraints on their predictive performance. We observe that Curve Ergodic Set Regression (CESR), a multiplicative model, naturally enables the visualization of its shape functions and simultaneously incorporates both interactions among all features and individual feature effects. Nevertheless, CESR fails to demonstrate superior performance compared to GAMs. We introduce Multiplicative-Additive Constrained Models (MACMs), which augment CESR with an additive part to disentangle the intertwined coefficients of its interactive and independent terms, thus effectively broadening the hypothesis space. The model is composed of a multiplicative part and an additive part, whose shape functions can both be naturally visualized, thereby assisting users in interpreting how features participate in the decision-making process. Consequently, MACMs constitute an improvement over both CESR and GAMs. The experimental results indicate that neural network-based MACMs significantly outperform both CESR and the current state-of-the-art GAMs in terms of predictive performance.

[1306] Semantic Editing with Coupled Stochastic Differential Equations

Jianxin Zhang, Clayton Scott

Main category: cs.LG

TL;DR: Using coupled stochastic differential equations (SDEs) to guide sampling in pretrained text-to-image models for image editing, preserving source image details while achieving high prompt fidelity.

Details

Motivation: Existing image editing methods with pretrained text-to-image models often distort fine details or introduce artifacts, making content editing challenging.

Method: Propose coupled SDEs that drive both source and edited images with same correlated noise, steering samples toward desired semantics while preserving visual similarity to source.

Result: Achieves high prompt fidelity with near-pixel-level consistency, works out-of-the-box without retraining or auxiliary networks.

Conclusion: Coupled SDEs provide a simple yet powerful tool for controlled generative AI in image editing tasks.

Abstract: Editing the content of an image with a pretrained text-to-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pre-trained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out-of-the-box-without retraining or auxiliary networks-and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI.

[1307] Proposing a Framework for Machine Learning Adoption on Legacy Systems

Ashiqur Rahman, Hamed Alhoori

Main category: cs.LG

TL;DR: API-based framework that decouples ML lifecycle from production to enable ML adoption in legacy systems without costly upgrades or downtime.

Details

Motivation: Overcome prohibitive costs and operational disruptions of upgrading legacy systems for ML integration, especially for SMEs facing financial/logistical barriers.

Method: Lightweight browser-based interface with human-in-the-loop approach, giving domain experts interactive control over model parameters while maintaining zero production downtime.

Result: Enables ML adoption without local hardware upgrades, fosters trust through expert control, and provides scalable pathway for production quality and safety enhancement.

Conclusion: Framework offers accessible ML integration that strengthens manufacturing competitiveness by mitigating financial and operational risks.

Abstract: The integration of machine learning (ML) is critical for industrial competitiveness, yet its adoption is frequently stalled by the prohibitive costs and operational disruptions of upgrading legacy systems. The financial and logistical overhead required to support the full ML lifecycle presents a formidable barrier to widespread implementation, particularly for small and medium-sized enterprises. This paper introduces a pragmatic, API-based framework designed to overcome these challenges by strategically decoupling the ML model lifecycle from the production environment. Our solution delivers the analytical power of ML to domain experts through a lightweight, browser-based interface, eliminating the need for local hardware upgrades and ensuring model maintenance can occur with zero production downtime. This human-in-the-loop approach empowers experts with interactive control over model parameters, fostering trust and facilitating seamless integration into existing workflows. By mitigating the primary financial and operational risks, this framework offers a scalable and accessible pathway to enhance production quality and safety, thereby strengthening the competitive advantage of the manufacturing sector.

[1308] Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms

Wei Wang, Dong-Dong Wu, Ming Li, Jingxiong Zhang, Gang Niu, Masashi Sugiyama

Main category: cs.LG

TL;DR: This paper proposes the first comprehensive benchmark for positive-unlabeled (PU) learning to address inconsistent experimental settings and enable fair comparison of PU learning algorithms.

Details

Motivation: Current PU learning research suffers from highly inconsistent experimental settings, making it difficult to compare algorithm performance fairly. Many algorithms rely on unrealistic validation sets with negative data, and evaluation protocols are biased towards one-sample settings.

Method: The authors develop a systematic PU learning benchmark that investigates model selection criteria without requiring negative data, identifies the internal label shift problem in one-sample settings, and proposes a calibration approach for fair comparisons across different PU learning families.

Result: The benchmark framework addresses critical factors affecting realistic and fair evaluation of PU learning algorithms, including model selection without negative data and calibration for internal label shift in one-sample settings.

Conclusion: The proposed benchmark provides an accessible, realistic, and fair environment for evaluating PU learning algorithms, enabling better comparisons within and across different PU learning families in future research.

Abstract: Positive-unlabeled (PU) learning is a weakly supervised binary classification problem, in which the goal is to learn a binary classifier from only positive and unlabeled data, without access to negative data. In recent years, many PU learning algorithms have been developed to improve model performance. However, experimental settings are highly inconsistent, making it difficult to identify which algorithm performs better. In this paper, we propose the first PU learning benchmark to systematically compare PU learning algorithms. During our implementation, we identify subtle yet critical factors that affect the realistic and fair evaluation of PU learning algorithms. On the one hand, many PU learning algorithms rely on a validation set that includes negative data for model selection. This is unrealistic in traditional PU learning settings, where no negative data are available. To handle this problem, we systematically investigate model selection criteria for PU learning. On the other hand, the problem settings and solutions of PU learning have different families, i.e., the one-sample and two-sample settings. However, existing evaluation protocols are heavily biased towards the one-sample setting and neglect the significant difference between them. We identify the internal label shift problem of unlabeled training data for the one-sample setting and propose a simple yet effective calibration approach to ensure fair comparisons within and across families. We hope our framework will provide an accessible, realistic, and fair environment for evaluating PU learning algorithms in the future.

[1309] ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, Yuan Yao

Main category: cs.LG

TL;DR: ChessArena testbed evaluates LLMs’ strategic reasoning via chess games, revealing current models struggle with complex strategic reasoning and cannot beat amateur-level chess engines.

Details

Motivation: To determine if LLMs possess genuine complex strategic reasoning skills or just sophisticated pattern recognition from training data.

Method: Created ChessArena - a competitive framework where LLMs play chess against each other in four different modes, with ranking algorithms and leaderboards. Evaluated 13 LLMs playing over 800 games.

Result: No LLM could beat Maia-1100 (amateur-level chess engine), some failed to defeat random players. Fine-tuned Qwen3-8B showed substantial improvement, approaching state-of-the-art reasoning models.

Conclusion: Current LLMs have significant shortcomings in strategic reasoning capabilities, though fine-tuning can substantially improve performance in this domain.

Abstract: Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills particularly complex strategic reasoning or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess requires complex strategic reasoning capabilities including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework where LLMs play against each other, under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard. The testbed can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs with different modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), while some even failed to defeat a random player that selects moves arbitrarily. We also present a strong baseline to the testbed: our fine-tuned Qwen3-8B substantially improved performance, approaching much larger state-of-the-art reasoning models.

[1310] Graph Foundation Models: Bridging Language Model Paradigms and Graph Optimization

Yunhao Liang, Pujun Zhang, Yuan Qu, Shaochong Lin, Zuo-jun Max Shen

Main category: cs.LG

TL;DR: GFM is the first framework that applies pretrain-transfer paradigm to solve distance-based optimization problems on graphs, achieving competitive performance with specialized solvers while being much faster.

Details

Motivation: Bridging the gap between LLM pretraining success and graph optimization challenges, addressing the conflict between language flexibility and graph combinatorial constraints.

Method: Self-supervised pre-training on paths from random walks, treating graph connectivity as supervisory signal, then using simple generative heuristics with the foundation model.

Result: Achieves competitive performance against specialized solvers across networks from 20 to 893 nodes, with significantly faster inference times.

Conclusion: Establishes new paradigm for adapting pretrain-transfer framework to graph optimization, enabling foundation model innovations in Operations Research.

Abstract: The pretrain-transfer paradigm, which underpins the success of large language models (LLMs), has demonstrated the immense power of creating foundation models that learn generalizable representations from vast datasets. However, extending this paradigm to Operations Research (OR) problems on graph structures remains challenging due to the fundamental conflict between the statistical flexibility of language and the strict combinatorial constraints of graphs. To bridge this gap, we introduce the Graph Foundation Model (GFM), the first framework capable of solving all distance-based optimization problems on graph structures. By introducing the LLM-like self-supervised pre-training paradigm on the paths generated from random walks in the graph, GFM is compelled to internalize the graph’s complex topological and combinatorial rules, where the connectivity of the structure itself can be treated as the supervisory signal. Unlike existing neural methods that learn complex and task-specific solving policies, our approach leverages the pre-trained GFM as a foundational model of the graph’s intrinsic structure, which in turn enables a simple generative heuristic to tackle a diverse range of optimization challenges effectively. Comprehensive experiments on networks ranging from 20 to 893 nodes demonstrate that GFM achieves competitive performance against specialized solvers across a variety of distinct optimization task classes, while maintaining significantly faster inference times. Our work establishes a new paradigm of adapting the pretrain-transfer framework to graph optimization, opening the door for applying foundation model innovations to OR.

[1311] Adversarial Reinforcement Learning Framework for ESP Cheater Simulation

Inkyu Park, Jeong-Gwan Lee, Taehwan Kwon, Juheon Choi, Seungku Kim, Junsu Kim, Kimin Lee

Main category: cs.LG

TL;DR: A simulation framework for modeling ESP cheaters in games, using reinforcement learning agents and adversarial games to study adaptive cheating behaviors and develop detectors.

Details

Motivation: ESP cheats are hard to detect due to lack of observable evidence and cheaters' adaptive behaviors, making labeled data collection difficult for anti-cheat systems.

Method: Propose a simulation framework with RL agents modeling cheaters/non-cheaters with different observability levels, detectors classifying behavioral trajectories, and adversarial game formulation between cheaters and detectors with structured cheater model that dynamically switches behaviors.

Result: Framework successfully simulates adaptive cheater behaviors that strategically balance reward optimization and detection evasion.

Conclusion: Provides a controllable and extensible platform for studying adaptive cheating behaviors and developing effective cheat detectors.

Abstract: Extra-Sensory Perception (ESP) cheats, which reveal hidden in-game information such as enemy locations, are difficult to detect because their effects are not directly observable in player behavior. The lack of observable evidence makes it difficult to collect reliably labeled data, which is essential for training effective anti-cheat systems. Furthermore, cheaters often adapt their behavior by limiting or disguising their cheat usage, which further complicates detection and detector development. To address these challenges, we propose a simulation framework for controlled modeling of ESP cheaters, non-cheaters, and trajectory-based detectors. We model cheaters and non-cheaters as reinforcement learning agents with different levels of observability, while detectors classify their behavioral trajectories. Next, we formulate the interaction between the cheater and the detector as an adversarial game, allowing both players to co-adapt over time. To reflect realistic cheater strategies, we introduce a structured cheater model that dynamically switches between cheating and non-cheating behaviors based on detection risk. Experiments demonstrate that our framework successfully simulates adaptive cheater behaviors that strategically balance reward optimization and detection evasion. This work provides a controllable and extensible platform for studying adaptive cheating behaviors and developing effective cheat detectors.

[1312] ELASTIQ: EEG-Language Alignment with Semantic Task Instruction and Querying

Muyun Jiang, Shuailei Zhang, Zhenjie Yang, Mengjun Wu, Weibang Jiang, Zhiwei Guo, Wei Zhang, Rui Liu, Shangen Zhang, Yong Li, Yi Ding, Cuntai Guan

Main category: cs.LG

TL;DR: ELASTIQ is an EEG foundation model that integrates language instructions to align EEG representations with semantic knowledge, improving decoding robustness and transferability across various BCI tasks.

Details

Motivation: Existing EEG foundation models struggle to incorporate language instructions as prior constraints, limiting their ability to leverage semantic knowledge to unify different labels and tasks.

Method: Uses joint Spectral-Temporal Reconstruction (STR) for pretraining and Instruction-conditioned Q-Former (IQF) for instruction tuning, which injects instruction embeddings into EEG tokens and aligns them with textual label embeddings.

Result: Achieves state-of-the-art performance on 14 of 20 datasets across motor imagery, emotion recognition, SSVEP, covert speech, and healthcare tasks, with best average results across all five task categories.

Conclusion: Task instructions serve as semantic priors that guide EEG embeddings into coherent and linguistically grounded spaces, demonstrating the effectiveness of language-aligned EEG representation learning.

Abstract: Recent advances in electroencephalography (EEG) foundation models, which capture transferable EEG representations, have greatly accelerated the development of brain-computer interfaces (BCI). However, existing approaches still struggle to incorporate language instructions as prior constraints for EEG representation learning, limiting their ability to leverage the semantic knowledge inherent in language to unify different labels and tasks. To address this challenge, we present ELASTIQ, a foundation model for EEG-Language Alignment with Semantic Task Instruction and Querying. ELASTIQ integrates task-aware semantic guidance to produce structured and linguistically aligned EEG embeddings, thereby enhancing decoding robustness and transferability. In the pretraining stage, we introduce a joint Spectral-Temporal Reconstruction (STR) module, which combines frequency masking as a global spectral perturbation with two complementary temporal objectives: random masking to capture contextual dependencies and causal masking to model sequential dynamics. In the instruction tuning stage, we propose the Instruction-conditioned Q-Former (IQF), a query-based cross-attention transformer that injects instruction embeddings into EEG tokens and aligns them with textual label embeddings through learnable queries. We evaluate ELASTIQ on 20 datasets spanning motor imagery, emotion recognition, steady-state visual evoked potentials, covert speech, and healthcare tasks. ELASTIQ achieves state-of-the-art performance on 14 of the 20 datasets and obtains the best average results across all five task categories. Importantly, our analyses reveal for the first time that explicit task instructions serve as semantic priors guiding EEG embeddings into coherent and linguistically grounded spaces. The code and pre-trained weights will be released.

[1313] Asynchronous Policy Gradient Aggregation for Efficient Distributed Reinforcement Learning

Alexander Tyurin, Andrei Spiridonov, Varvara Rudenko

Main category: cs.LG

TL;DR: The paper introduces two distributed RL algorithms - Rennala NIGT and Malenia NIGT - that handle asynchronous computations and communications, achieving state-of-the-art efficiency in both homogeneous and heterogeneous settings.

Details

Motivation: Distributed reinforcement learning methods remain less explored compared to non-distributed approaches, especially in the presence of heterogeneous asynchronous computations and communication bottlenecks.

Method: Two new algorithms: Rennala NIGT for homogeneous settings with AllReduce operation support, and Malenia NIGT for heterogeneous settings handling asynchronous computations and heterogeneous environments.

Result: Rennala NIGT improves total computational and communication complexity in homogeneous settings. Malenia NIGT provides strictly better theoretical guarantees for heterogeneous environments. Experimental results show significant outperformance over prior approaches.

Conclusion: The proposed distributed RL algorithms effectively address asynchronous computations and communication challenges, achieving superior efficiency and performance in both homogeneous and heterogeneous distributed settings.

Abstract: We study distributed reinforcement learning (RL) with policy gradient methods under asynchronous and parallel computations and communications. While non-distributed methods are well understood theoretically and have achieved remarkable empirical success, their distributed counterparts remain less explored, particularly in the presence of heterogeneous asynchronous computations and communication bottlenecks. We introduce two new algorithms, Rennala NIGT and Malenia NIGT, which implement asynchronous policy gradient aggregation and achieve state-of-the-art efficiency. In the homogeneous setting, Rennala NIGT provably improves the total computational and communication complexity while supporting the AllReduce operation. In the heterogeneous setting, Malenia NIGT simultaneously handles asynchronous computations and heterogeneous environments with strictly better theoretical guarantees. Our results are further corroborated by experiments, showing that our methods significantly outperform prior approaches.

[1314] A study of Universal ODE approaches to predicting soil organic carbon

Satyanarayana Raju G. V. V, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat

Main category: cs.LG

TL;DR: Scientific Machine Learning with Universal Differential Equations can accurately predict Soil Organic Carbon dynamics in low-noise conditions but struggles with high noise, requiring noise-aware methods for field deployment.

Details

Motivation: Soil Organic Carbon prediction is challenging due to complex physical, chemical, and biological processes, and current methods need improvement for accurate forecasting.

Method: Used Universal Differential Equations combining mechanistic physics (advection diffusion transport) with neural networks to learn nonlinear microbial processes, tested on synthetic datasets with varying noise levels.

Result: Achieved near-perfect accuracy in clean conditions (MSE=1.6e-5, R2=0.9999) and remained robust with 7% noise (MSE=3.4e-6, R2=0.99998), but overfitted with 35% noise at t=0 (R2=0.94) and failed with 35% noise at t=50 (negative R2).

Conclusion: UDEs are promising for SOC forecasting but need noise-aware loss functions, probabilistic modeling, and better microbial dynamics integration for field applications.

Abstract: Soil Organic Carbon (SOC) is a foundation of soil health and global climate resilience, yet its prediction remains difficult because of intricate physical, chemical, and biological processes. In this study, we explore a Scientific Machine Learning (SciML) framework built on Universal Differential Equations (UDEs) to forecast SOC dynamics across soil depth and time. UDEs blend mechanistic physics, such as advection diffusion transport, with neural networks that learn nonlinear microbial production and respiration. Using synthetic datasets, we systematically evaluated six experimental cases, progressing from clean, noise free benchmarks to stress tests with high (35%) multiplicative, spatially correlated noise. Our results highlight both the potential and limitations of the approach. In noise free and moderate noise settings, the UDE accurately reconstructed SOC dynamics. In clean terminal profile at 50 years (Case 4) achieved near perfect fidelity, with MSE = 1.6e-5, and R2 = 0.9999. Case 5, with 7% noise, remained robust (MSE = 3.4e-6, R2 = 0.99998), capturing depth wise SOC trends while tolerating realistic measurement uncertainty. In contrast, Case 3 (35% noise at t = 0) showed clear evidence of overfitting: the model reproduced noisy inputs with high accuracy but lost generalization against the clean truth (R2 = 0.94). Case 6 (35% noise at t = 50) collapsed toward overly smooth mean profiles, failing to capture depth wise variability and yielding negative R2, underscoring the limits of standard training under severe uncertainty. These findings suggest that UDEs are well suited for scalable, noise tolerant SOC forecasting, though advancing toward field deployment will require noise aware loss functions, probabilistic modelling, and tighter integration of microbial dynamics.

[1315] Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

Xianhang Li, Chen Huang, Chun-Liang Li, Eran Malach, Josh Susskind, Vimal Thilak, Etai Littwin

Main category: cs.LG

TL;DR: SALT (Static-teacher Asymmetric Latent Training) is a two-stage video representation learning method that replaces EMA-based teacher-student training with a frozen teacher, achieving better performance and compute efficiency than V-JEPA.

Details

Motivation: To address the limitations of EMA-based teacher-student training in V-JEPA, which complicates model selection and couples architectures, by proposing a simpler frozen teacher approach.

Method: Two-stage training: (1) train teacher encoder with pixel reconstruction under V-JEPA masking, (2) freeze teacher and train student to predict teacher’s latents on masked regions without EMA regularization.

Result: SALT outperforms V-JEPA 2 encoders in frozen evaluation, achieves higher compute efficiency, and shows student quality is robust to teacher quality, suggesting compute budget should favor students.

Conclusion: SALT provides a simpler, more scalable and compute-efficient alternative to EMA-based self-distillation for video representation learning.

Abstract: Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representation by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel-reconstruction objective under V-JEPA masking, then (ii) freeze it and train a student to predict the teacher’s latents on masked regions. This leads to a two-stage, unregularized scheme that we refer to as SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), increasing transparency, efficiency, and scalability while preserving the ability of representation to generalize under frozen evaluation. Empirically, our student models outperform recently proposed V-JEPA 2 encoders under frozen backbone evaluation across diverse benchmarks. They are also more compute-optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V-JEPA’s accuracy-FLOPs Pareto frontier. Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even with small, sub-optimal teachers. This points to a compute budget allocation that should overwhelmingly favor the student. These results position SALT as a simple, scalable, and compute-efficient alternative to EMA-based self-distillation for video representation learning.

[1316] AuON: A Linear-time Alternative to Semi-Orthogonal Momentum Updates

Dipan Maity

Main category: cs.LG

TL;DR: AuON is a linear-time optimizer that achieves strong performance without constructing semi-orthogonal matrices, using hyperbolic-cosine RMS scaling with normalization to preserve structural alignment and recondition ill-posed updates.

Details

Motivation: Traditional orthogonal gradient methods like SVD/QR are computationally expensive (O(n^3)) and underperform compared to SGD with momentum. Recent methods like Muon reduce complexity to O(n^2) but quadratic costs remain a bottleneck.

Method: AuON bounds momentum updates under a spectral-norm trust region using hyperbolic-cosine RMS scaling transformations with normalization, preserving directional information without explicit semi-orthogonalization. Also introduces Hybrid-AuON with a single Newton-Schulz iteration.

Result: Experiments across vision and language benchmarks show AuON and Hybrid-AuON achieve performance comparable to strong baselines like AdamW and Muon.

Conclusion: AuON provides an efficient linear-time alternative to orthogonal optimization methods, achieving competitive performance while avoiding the computational bottlenecks of traditional approaches.

Abstract: Orthogonal gradient updates have emerged as a promising direction in optimization for machine learning. However, traditional approaches such as SVD/QR decomposition incur prohibitive computational costs of O(n^3) and underperform compared to well-tuned SGD with momentum, since momentum is applied only after strict orthogonalization. Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and producing semi-orthogonal matrices via Newton-Schulz iterations, reducing complexity to O(n^2). Nevertheless, quadratic costs remain a bottleneck. In this work, we study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region, preserving directional information without requiring explicit semi-orthogonalization. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without constructing semi-orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared to Newton-Schulz methods. We further introduce a hybrid variant (Hybrid-AuON) that applies a single Newton-Schulz iteration. Experiments across vision and language benchmarks show that AuON and its hybrid variant achieve performance comparable to strong baselines such as AdamW and Muon. Code is available at: https://github.com/ryyzn9/AuON

[1317] H+: An Efficient Similarity-Aware Aggregation for Byzantine Resilient Federated Learning

Shiyuan Zuo, Rongfei Fan, Cheng Zhan, Jie Xu, Puning Zhao, Han Hu

Main category: cs.LG

TL;DR: H+ is a novel similarity-aware aggregation method for federated learning that provides Byzantine attack resilience without requiring clean data, outperforming existing methods in both scenarios with and without clean data.

Details

Motivation: Existing similarity-aware aggregation methods for federated learning require clean data to identify malicious clients, making them inapplicable in settings where such data is unavailable, thus limiting their practical deployment.

Method: H+ randomly selects r-dimensional segments from p-dimensional parameter vectors and applies a similarity check function H to compare each segment against a reference vector, preserving the most similar client vectors for aggregation. The reference vector is derived either from existing robust algorithms (when clean data is unavailable) or directly from clean data.

Result: H+ demonstrates substantial robustness improvements over existing approaches under varying Byzantine attack ratios and multiple types of traditional Byzantine attacks, achieving state-of-the-art performance across all evaluated scenarios and benchmark datasets.

Conclusion: H+ is an effective and computationally efficient similarity-aware aggregation approach that extends Byzantine attack resilience to federated learning systems without requiring clean data, while maintaining superior performance in scenarios with clean data.

Abstract: Federated Learning (FL) enables decentralized model training without sharing raw data. However, it remains vulnerable to Byzantine attacks, which can compromise the aggregation of locally updated parameters at the central server. Similarity-aware aggregation has emerged as an effective strategy to mitigate such attacks by identifying and filtering out malicious clients based on similarity between client model parameters and those derived from clean data, i.e., data that is uncorrupted and trustworthy. However, existing methods adopt this strategy only in FL systems with clean data, making them inapplicable to settings where such data is unavailable. In this paper, we propose H+, a novel similarity-aware aggregation approach that not only outperforms existing methods in scenarios with clean data, but also extends applicability to FL systems without any clean data. Specifically, H+ randomly selects $r$-dimensional segments from the $p$-dimensional parameter vectors uploaded to the server and applies a similarity check function $H$ to compare each segment against a reference vector, preserving the most similar client vectors for aggregation. The reference vector is derived either from existing robust algorithms when clean data is unavailable or directly from clean data. Repeating this process $K$ times enables effective identification of honest clients. Moreover, H+ maintains low computational complexity, with an analytical time complexity of $\mathcal{O}(KMr)$, where $M$ is the number of clients and $Kr \ll p$. Comprehensive experiments validate H+ as a state-of-the-art (SOTA) method, demonstrating substantial robustness improvements over existing approaches under varying Byzantine attack ratios and multiple types of traditional Byzantine attacks, across all evaluated scenarios and benchmark datasets.

[1318] Towards Generalizable PDE Dynamics Forecasting via Physics-Guided Invariant Learning

Siyang Li, Yize Chen, Yan Guo, Ming Huang, Hui Xiong

Main category: cs.LG

TL;DR: The paper proposes iMOOE, a physics-guided invariant learning method for spatiotemporal PDE forecasting that achieves superior in-distribution performance and zero-shot generalization across unseen OOD scenarios by capturing fundamental PDE invariance principles.

Details

Motivation: Real-world physical environments have capricious PDE system parameters, making generalization across unseen out-of-distribution forecasting scenarios challenging. Existing methods require extra test-time samples for domain adaptation and fail to achieve true zero-shot generalization because they don't investigate fundamental physical invariance in PDE systems.

Method: Proposes iMOOE with: 1) Explicit definition of two-fold PDE invariance principle (ingredient operators and composition relationships remain invariant), 2) Invariance-aligned Mixture Of Operator Expert architecture, 3) Frequency-enriched invariant learning objective to capture the two-fold PDE invariance.

Result: Extensive experiments across simulated benchmarks and real-world applications validate iMOOE’s superior in-distribution performance and zero-shot generalization capabilities on diverse OOD forecasting scenarios.

Conclusion: The proposed iMOOE method successfully captures fundamental PDE invariance principles, enabling superior forecasting performance and true zero-shot generalization across diverse out-of-distribution scenarios without requiring test-time adaptation samples.

Abstract: Advanced deep learning-based approaches have been actively applied to forecast the spatiotemporal physical dynamics governed by partial differential equations (PDEs), which acts as a critical procedure in tackling many science and engineering problems. As real-world physical environments like PDE system parameters are always capricious, how to generalize across unseen out-of-distribution (OOD) forecasting scenarios using limited training data is of great importance. To bridge this barrier, existing methods focus on discovering domain-generalizable representations across various PDE dynamics trajectories. However, their zero-shot OOD generalization capability remains deficient, since extra test-time samples for domain-specific adaptation are still required. This is because the fundamental physical invariance in PDE dynamical systems are yet to be investigated or integrated. To this end, we first explicitly define a two-fold PDE invariance principle, which points out that ingredient operators and their composition relationships remain invariant across different domains and PDE system evolution. Next, to capture this two-fold PDE invariance, we propose a physics-guided invariant learning method termed iMOOE, featuring an Invariance-aligned Mixture Of Operator Expert architecture and a frequency-enriched invariant learning objective. Extensive experiments across simulated benchmarks and real-world applications validate iMOOE’s superior in-distribution performance and zero-shot generalization capabilities on diverse OOD forecasting scenarios.

[1319] Expanding Horizons of Level Diversity via Multi-objective Evolutionary Learning

Qingquan Zhang, Ziqi Wang, Yuchen Li, Keyuan Zhang, Bo Yuan, Jialin Liu

Main category: cs.LG

TL;DR: This paper proposes a multi-objective evolutionary learning framework for game level generation that optimizes multiple diversity metrics simultaneously, demonstrating enhanced multi-dimensional diversity in Super Mario Bros.

Details

Motivation: Existing level generation approaches fail to comprehensively assess diversity across multiple dimensions, as level diversity metrics are naturally multi-dimensional with conflicted or complementary relationships.

Method: Formulate model training as a multi-objective learning problem where each diversity metric is a distinct objective, and propose a multi-objective evolutionary learning framework that optimizes multiple diversity metrics simultaneously.

Result: The framework enhances multi-dimensional diversity and identifies a Pareto front of generative models providing tradeoffs among playability and two representative diversity metrics (content-based and player-centered).

Conclusion: The proposed framework enables decision-makers to make informed choices when selecting generators for various scenarios and diverse needs of players and designers.

Abstract: In recent years, the generation of diverse game levels has gained increasing interest, contributing to a richer and more engaging gaming experience. A number of level diversity metrics have been proposed in literature, which are naturally multi-dimensional, leading to conflicted, complementary, or both relationships among these dimensions. However, existing level generation approaches often fail to comprehensively assess diversity across those dimensions. This paper aims to expand horizons of level diversity by considering multi-dimensional diversity when training generative models. We formulate the model training as a multi-objective learning problem, where each diversity metric is treated as a distinct objective. Furthermore, a multi-objective evolutionary learning framework that optimises multiple diversity metrics simultaneously throughout the model training process is proposed. Our case study on the commonly used benchmark Super Mario Bros. demonstrates that our proposed framework can enhance multi-dimensional diversity and identify a Pareto front of generative models, which provides a range of tradeoffs among playability and two representative diversity metrics, including a content-based one and a player-centered one. Such capability enables decision-makers to make informed choices when selecting generators accommodating a variety of scenarios and the diverse needs of players and designers.

[1320] Watermarking Diffusion Language Models

Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev

Main category: cs.LG

TL;DR: First watermark for diffusion language models (DLMs) that works with arbitrary token generation order, achieving >99% true positive rate with minimal quality impact.

Details

Motivation: Existing ARLM watermarks rely on sequential token generation, which doesn't work for DLMs that generate tokens in arbitrary order. Need specialized watermark for this new paradigm.

Method: Apply watermark in expectation over context when some tokens are undetermined, and promote tokens that increase watermark strength when used as context for other tokens.

Result: Achieves >99% true positive rate with minimal quality impact and similar robustness to existing ARLM watermarks.

Conclusion: Enables reliable watermarking for diffusion language models for the first time, addressing the unique challenges of arbitrary-order token generation.

Abstract: We introduce the first watermark tailored for diffusion language models (DLMs), an emergent LLM paradigm able to generate tokens in arbitrary order, in contrast to standard autoregressive language models (ARLMs) which generate tokens sequentially. While there has been much work in ARLM watermarking, a key challenge when attempting to apply these schemes directly to the DLM setting is that they rely on previously generated tokens, which are not always available with DLM generation. In this work we address this challenge by: (i) applying the watermark in expectation over the context even when some context tokens are yet to be determined, and (ii) promoting tokens which increase the watermark strength when used as context for other tokens. This is accomplished while keeping the watermark detector unchanged. Our experimental evaluation demonstrates that the DLM watermark leads to a >99% true positive rate with minimal quality impact and achieves similar robustness to existing ARLM watermarks, enabling for the first time reliable DLM watermarking.

[1321] LEAF: A Robust Expert-Based Framework for Few-Shot Continual Event Detection

Bao-Ngoc Dao, Quang Nguyen, Luyen Ngo Dinh, Minh Le, Linh Ngo Van

Main category: cs.LG

TL;DR: LEAF is a novel expert-based framework for Few-shot Continual Event Detection that addresses catastrophic forgetting and limited data challenges through specialized mixture of experts with LoRA parameterization, semantic-aware routing, contrastive learning, and knowledge distillation.

Details

Motivation: Existing FCED approaches suffer from severe forgetting due to full fine-tuning of shared base models causing knowledge interference, and rely on data augmentation that introduces unnatural inputs.

Method: Integrates mixture of experts with LoRA matrices, uses semantic-aware expert selection for dynamic routing, incorporates contrastive learning with label descriptions, and employs knowledge distillation to prevent overfitting on memory buffer.

Result: Extensive experiments on multiple FCED benchmarks show LEAF consistently achieves state-of-the-art performance.

Conclusion: LEAF effectively addresses the dual challenges of few-shot learning and catastrophic forgetting in continual event detection through its expert-based architecture and robust learning strategies.

Abstract: Few-shot Continual Event Detection (FCED) poses the dual challenges of learning from limited data and mitigating catastrophic forgetting across sequential tasks. Existing approaches often suffer from severe forgetting due to the full fine-tuning of a shared base model, which leads to knowledge interference between tasks. Moreover, they frequently rely on data augmentation strategies that can introduce unnatural or semantically distorted inputs. To address these limitations, we propose LEAF, a novel and robust expert-based framework for FCED. LEAF integrates a specialized mixture of experts architecture into the base model, where each expert is parameterized with low-rank adaptation (LoRA) matrices. A semantic-aware expert selection mechanism dynamically routes instances to the most relevant experts, enabling expert specialization and reducing knowledge interference. To improve generalization in limited-data settings, LEAF incorporates a contrastive learning objective guided by label descriptions, which capture high-level semantic information about event types. Furthermore, to prevent overfitting on the memory buffer, our framework employs a knowledge distillation strategy that transfers knowledge from previous models to the current one. Extensive experiments on multiple FCED benchmarks demonstrate that LEAF consistently achieves state-of-the-art performance.

[1322] Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning

Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen

Main category: cs.LG

TL;DR: ES successfully scales to fine-tune billion-parameter LLMs, outperforming RL methods in efficiency, stability, and robustness.

Details

Motivation: ES was previously overlooked for LLM fine-tuning due to scalability concerns, despite its historical success with smaller models.

Method: Scaled up evolution strategies to fine-tune full parameters of large language models.

Result: ES outperforms RL methods in sample efficiency, reward tolerance, model robustness, reduced reward hacking, and stability.

Conclusion: ES provides a viable alternative to RL for LLM fine-tuning, opening new directions in model optimization.

Abstract: Fine-tuning pre-trained large language models (LLMs) for down-stream tasks is a critical step in the AI deployment pipeline. Reinforcement learning (RL) is arguably the most prominent fine-tuning method, contributing to the birth of many state-of-the-art LLMs. In contrast, evolution strategies (ES), which once showed comparable performance to RL on models with a few million parameters, was neglected due to the pessimistic perception of its scalability to larger models. In this work, we report the first successful attempt to scale up ES for fine-tuning the full parameters of LLMs, showing the surprising fact that ES can search efficiently over billions of parameters and outperform existing RL fine-tuning methods in multiple respects, including sample efficiency, tolerance to long-horizon rewards, robustness to different base LLMs, less tendency to reward hacking, and more stable performance across runs. It therefore serves as a basis to unlock a new direction in LLM fine-tuning beyond what current RL techniques provide. The source codes are provided at: https://github.com/VsonicV/es-fine-tuning-paper.

[1323] AXIS: Explainable Time Series Anomaly Detection with Large Language Models

Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

Main category: cs.LG

TL;DR: AXIS is a framework that enhances LLMs for explainable time-series anomaly detection by providing numerical grounding, fine-grained dynamics, and global anomaly characteristics, achieving high-quality explanations and competitive detection accuracy.

Details

Motivation: Current LLM-based approaches for explainable TSAD struggle with processing continuous time-series data due to the modality gap between discrete text tokens and continuous signals, leading to poor contextual grounding and representation alignment.

Method: AXIS conditions a frozen LLM with three complementary hints: (1) symbolic numeric hint for numerical grounding, (2) context-integrated step-aligned hint from pretrained time-series encoder for fine-grained dynamics, and (3) task-prior hint for global anomaly characteristics.

Result: AXIS produces significantly higher quality explanations and achieves competitive detection accuracy compared to general-purpose LLMs, specialized time-series LLMs, and time-series Vision Language Models in extensive experiments.

Conclusion: The AXIS framework effectively bridges the modality gap between time-series and text, enabling LLMs to provide nuanced understanding and high-quality explanations for time-series anomaly detection while maintaining competitive detection performance.

Abstract: Time-series anomaly detection (TSAD) increasingly demands explanations that articulate not only if an anomaly occurred, but also what pattern it exhibits and why it is anomalous. Leveraging the impressive explanatory capabilities of Large Language Models (LLMs), recent works have attempted to treat time series as text for explainable TSAD. However, this approach faces a fundamental challenge: LLMs operate on discrete tokens and struggle to directly process long, continuous signals. Consequently, naive time-to-text serialization suffers from a lack of contextual grounding and representation alignment between the two modalities. To address this gap, we introduce AXIS, a framework that conditions a frozen LLM for nuanced time-series understanding. Instead of direct serialization, AXIS enriches the LLM’s input with three complementary hints derived from the series: (i) a symbolic numeric hint for numerical grounding, (ii) a context-integrated, step-aligned hint distilled from a pretrained time-series encoder to capture fine-grained dynamics, and (iii) a task-prior hint that encodes global anomaly characteristics. Furthermore, to facilitate robust evaluation of explainability, we introduce a new benchmark featuring multi-format questions and rationales that supervise contextual grounding and pattern-level semantics. Extensive experiments, including both LLM-based and human evaluations, demonstrate that AXIS yields explanations of significantly higher quality and achieves competitive detection accuracy compared to general-purpose LLMs, specialized time-series LLMs, and time-series Vision Language Models.

[1324] OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment

Liang Lin, Zhihao Xu, Junhao Dong, Jian Zhao, Yuchen Yuan, Guibin Zhang, Miao Yu, Yiming Zhang, Zhengtao Yao, Huahui Yi, Dongrui Liu, Xinfeng Li, Kun Wang

Main category: cs.LG

TL;DR: OrthAlign uses orthogonal subspace decomposition to resolve gradient conflicts in multi-objective LLM alignment, enabling simultaneous improvement across competing preferences like helpfulness and harmlessness without trade-offs.

Details

Motivation: Address the fundamental limitation in LLM alignment where improving one preference dimension (e.g., helpfulness) often degrades others (e.g., harmlessness), moving beyond constraint-based approaches to resolve conflicts at the parameter level.

Method: Decomposes parameter update spaces into orthogonal subspaces using orthogonal subspace decomposition, ensuring optimization toward different preferences occurs in mathematically non-interfering directions with theoretical guarantees for stable convergence.

Result: Achieves maximum single-preference improvements of 34.61% to 50.89% across helpful, harmless, and truthful dimensions after multi-objective alignment, with 13.96% average overall reward improvement.

Conclusion: OrthAlign provides a fundamental solution to multi-preference alignment conflicts by ensuring orthogonal optimization directions, enabling stable simultaneous improvements across competing objectives without performance trade-offs.

Abstract: Large language model (LLM) alignment faces a critical dilemma when addressing multiple human preferences: improvements in one dimension frequently come at the expense of others, creating unavoidable trade-offs between competing objectives like helpfulness and harmlessness. While prior work mainly focuses on constraint-based optimization algorithms and data selection strategies to mitigate conflicts, these approaches overlook the fundamental issue of resolving conflicts directly at the parameter level. In this paper, we present OrthAlign, an innovative approach that pioneers a new paradigm by leveraging orthogonal subspace decomposition to fundamentally resolve gradient-level conflicts in multi-objective preference alignment. OrthAlign strategically decomposes parameter update spaces into orthogonal subspaces, ensuring that optimization toward different preferences occurs in mathematically non-interfering directions. Building upon this, we provide theoretical guarantees demonstrating that when parameter increments satisfy both orthogonal subspace constraints and spectral norm bounds, the resulting updates exhibit linear Lipschitz growth rather than exponential instability, ensuring stable convergence across all preference dimensions. Extensive experiments show that: I. OrthAlign achieves maximum single-preference improvements ranging from 34.61% to 50.89% after multiple-objective alignment across helpful, harmless, and truthful dimensions. II. With an average overall reward improvement of 13.96%.

[1325] Muon: Training and Trade-offs with Latent Attention and MoE

Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat

Main category: cs.LG

TL;DR: Muon optimizer enables efficient transformer training with 48-52% less computation than AdamW while maintaining or improving perplexity, with additional gains when combined with MLA and MoE architectures.

Details

Motivation: To provide rigorous theoretical analysis of Muon optimizer's convergence properties and demonstrate its practical efficiency for training small to medium transformers, especially when combined with modern architectural optimizations.

Method: Comprehensive theoretical analysis including convergence rate proofs, spectral regularization properties, connections to natural gradient descent, and equivalence to steepest gradient descent. Empirical validation across 100+ training runs with 30M-200M parameter models, combining Muon with Multi-Head Latent Attention and Mixture-of-Experts.

Result: Muon achieves target loss with 48-52% less training computation than AdamW while maintaining or improving final perplexity. MLA+MoE+Muon combination achieves 68% memory reduction, 3.2× inference speedup, and 8-12% perplexity improvement.

Conclusion: Muon is established as a principled, robust alternative to AdamW that excels particularly when combined with modern efficiency techniques and large-batch training, expanding the Pareto frontier in compute-time trade-offs.

Abstract: We present a comprehensive theoretical and empirical study of the Muon optimizer for training transformers only with a small to medium decoder (30M - 200M parameters), with an emphasis on its mathematical foundations, convergence properties and synergistic interactions with modern architectural optimizations. Building on recent work showing Muon’s scalability, we provide rigorous theoretical analysis including: (i)showing the convergence rate under standard assumptions, (ii) spectral regularization properties that prevent gradient explosion, (iii) connection to natural gradient descent on the Stiefel manifold, and (iv) equivalence to steepest gradient descent under the spectral norm. Crucially, we demonstrate that Muon expands the Pareto frontier in the compute-time trade-off by maintaining superior data efficiency at large batch sizes, a key finding of~\cite{essentialai2025muon} that we validate across our model scales. Empirically, Muon reaches the target loss with 48-52% of the training calculated by AdamW while maintaining or improving the final perplexity, consistent with larger-scale results. When combined with Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE), we observe multiplicative efficiency gains: MLA+MoE+Muon achieves 68% memory reduction and 3.2$\times$ inference speedup, while improving perplexity by 8-12%. We provide detailed procedures on 15 architectural and optimizer components, stability analyzes across 100+ training runs, and practical implementation guidelines including Newton-Schulz coefficients $(3.4445, -4.7750, 2.0315)$ optimized by~\cite{su2024muonblog}. Our theoretical analysis and comprehensive experiments establish Muon as a principled, robust alternative to AdamW that particularly excels when combined with modern efficiency techniques and large-batch training regimes.

[1326] ScatterAD: Temporal-Topological Scattering Mechanism for Time Series Anomaly Detection

Tao Yin, Xiaohong Zhang, Shaochen Fu, Zhibin Zhang, Li Huang, Yiyuan Yang, Kaixiang Yang, Meng Yan

Main category: cs.LG

TL;DR: ScatterAD is a novel anomaly detection method for industrial IoT that models representation scattering across temporal and topological dimensions to better capture complex spatio-temporal couplings in multivariate time series data.

Details

Motivation: Traditional anomaly detection methods fail to adequately model complex spatio-temporal couplings in multivariate industrial IoT data, focusing on spatial or temporal dependencies independently, which leads to suboptimal representation learning and limited sensitivity to anomalous dispersion in high-dimensional spaces.

Method: Proposes ScatterAD with a topological encoder for graph-structured scattering and a temporal encoder to constrain over-scattering through MSE minimization between neighboring time steps. Uses contrastive fusion to ensure complementarity of temporal and topological representations, and maximizes conditional mutual information between views for better cross-view consistency.

Result: Extensive experiments on multiple public benchmarks show that ScatterAD achieves state-of-the-art performance on multivariate time series anomaly detection.

Conclusion: The scattering phenomenon can be effectively leveraged as an inductive signal to enhance spatio-temporal anomaly detection, with ScatterAD demonstrating superior performance through its dual-view approach to modeling temporal and topological scattering.

Abstract: One main challenge in time series anomaly detection for industrial IoT lies in the complex spatio-temporal couplings within multivariate data. However, traditional anomaly detection methods focus on modeling spatial or temporal dependencies independently, resulting in suboptimal representation learning and limited sensitivity to anomalous dispersion in high-dimensional spaces. In this work, we conduct an empirical analysis showing that both normal and anomalous samples tend to scatter in high-dimensional space, especially anomalous samples are markedly more dispersed. We formalize this dispersion phenomenon as scattering, quantified by the mean pairwise distance among sample representations, and leverage it as an inductive signal to enhance spatio-temporal anomaly detection. Technically, we propose ScatterAD to model representation scattering across temporal and topological dimensions. ScatterAD incorporates a topological encoder for capturing graph-structured scattering and a temporal encoder for constraining over-scattering through mean squared error minimization between neighboring time steps. We introduce a contrastive fusion mechanism to ensure the complementarity of the learned temporal and topological representations. Additionally, we theoretically show that maximizing the conditional mutual information between temporal and topological views improves cross-view consistency and enhances more discriminative representations. Extensive experiments on multiple public benchmarks show that ScatterAD achieves state-of-the-art performance on multivariate time series anomaly detection. Code is available at this repository: https://github.com/jk-sounds/ScatterAD.

[1327] BiHDTrans: binary hyperdimensional transformer for efficient multivariate time series classification

Jingtao Zhang, Yi Liu, Qi Shen, Changhong Wang

Main category: cs.LG

TL;DR: BiHDTrans is a neurosymbolic binary hyperdimensional Transformer that combines HD computing efficiency with Transformer temporal modeling, achieving superior accuracy and 39.4x lower latency than SOTA binary Transformers.

Details

Motivation: IoT devices generate massive multivariate time series data that requires efficient processing in resource-constrained edge environments. HD computing is efficient but struggles with temporal patterns, while Transformers excel at sequence modeling but have high computational overhead.

Method: Integrates self-attention into HD computing paradigm, creating a binary hyperdimensional Transformer that unifies HD computing’s representational efficiency with Transformer’s temporal modeling power. Uses hardware acceleration on FPGA with pipelined implementation.

Result: Outperforms SOTA HD computing models by at least 14.47% and achieves 6.67% higher accuracy than SOTA binary Transformers. With FPGA acceleration, delivers 39.4x lower inference latency. Remains competitive with 64% reduction in hyperspace dimensionality, surpassing binary Transformers by 1-2% accuracy with 4.4x smaller model size.

Conclusion: BiHDTrans bridges the gap between Transformer expressiveness and HD computing efficiency, enabling accurate, scalable, and low-latency multivariate time series classification for edge environments.

Abstract: The proliferation of Internet-of-Things (IoT) devices has led to an unprecedented volume of multivariate time series (MTS) data, requiring efficient and accurate processing for timely decision-making in resource-constrained edge environments. Hyperdimensional (HD) computing, with its inherent efficiency and parallelizability, has shown promise in classification tasks but struggles to capture complex temporal patterns, while Transformers excel at sequence modeling but incur high computational and memory overhead. We introduce BiHDTrans, an efficient neurosymbolic binary hyperdimensional Transformer that integrates self-attention into the HD computing paradigm, unifying the representational efficiency of HD computing with the temporal modeling power of Transformers. Empirically, BiHDTrans outperforms state-of-the-art (SOTA) HD computing models by at least 14.47% and achieves 6.67% higher accuracy on average than SOTA binary Transformers. With hardware acceleration on FPGA, our pipelined implementation leverages the independent and identically distributed properties of high-dimensional representations, delivering 39.4 times lower inference latency than SOTA binary Transformers. Theoretical analysis shows that binarizing in holographic high-dimensional space incurs significantly less information distortion than directly binarizing neural networks, explaining BiHDTrans’s superior accuracy. Furthermore, dimensionality experiments confirm that BiHDTrans remains competitive even with a 64% reduction in hyperspace dimensionality, surpassing SOTA binary Transformers by 1-2% in accuracy with 4.4 times less model size, as well as further reducing the latency by 49.8% compare to the full-dimensional baseline. Together, these contributions bridge the gap between the expressiveness of Transformers and the efficiency of HD computing, enabling accurate, scalable, and low-latency MTS classification.

[1328] When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training

Sanxing Chen, Xiaoyin Chen, Yukun Huang, Roy Xie, Bhuwan Dhingra

Main category: cs.LG

TL;DR: LLMs trained with SFT and RL outperform pre-trained models on multi-armed bandit tasks, achieving performance comparable to UCB and Thompson Sampling with robust generalization, but exhibit greedier exploitation that can lead to early catastrophic failure.

Details

Motivation: LLMs often explore suboptimally in sequential decision-making, and it's unclear how SFT and RL shape exploration strategies and generalize.

Method: Train LLMs with SFT on expert trajectories and RL with tailored rewards including regret-shaped and algorithmic rewards for oracle imitation.

Result: Trained agents outperform pre-trained models and match UCB/Thompson Sampling performance, generalizing to 6x longer horizons and across bandit families, but show greedier exploitation prone to early failure.

Conclusion: Tailored reward design and evaluation beyond average regret are needed to promote robust exploratory behavior, with each training paradigm having specific advantages.

Abstract: While Large Language Models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to enhance this capability via supervised fine-tuning (SFT) or reinforcement learning (RL), improving regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. We investigate both paradigms by training LLMs with SFT on expert trajectories and RL with a range of tailored reward signals including a strategic, regret-shaped reward to reduce variance, and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6x longer horizons and across bandit families. Behavioral analysis reveals that gains often stem from more sophisticated but greedier exploitation: RL/SFT agents are more prone to early catastrophic failure than pre-trained models, prematurely abandoning exploration. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. Our findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.

[1329] Semantic Compression via Multimodal Representation Learning

Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello

Main category: cs.LG

TL;DR: This paper proposes semantic compression for multimodal embeddings by leveraging modality alignment to replace multiple embeddings with centroids, achieving significant memory savings without performance loss.

Details

Motivation: Multimodal representation learning creates high-dimensional embeddings that align different modalities but introduces scalability challenges in storage and processing. The key problem is how to compress these embeddings while preserving their semantic content across modalities.

Method: The approach proves that reducing the modality gap enables semantic compression. When embeddings from different modalities expressing the same semantics are sufficiently aligned, their centroid can faithfully represent the semantic concept. The method operates directly on pretrained encoders to replace multiple embeddings with single centroids.

Result: The proposed semantic compression approach demonstrates effectiveness across diverse large-scale multimodal downstream tasks, achieving significant compression without sacrificing performance.

Conclusion: Modality alignment is a key enabler for semantic compression, and the proposed centroid-based approach successfully reduces memory footprint while maintaining semantic representation capabilities.

Abstract: Multimodal representation learning produces high-dimensional embeddings that align diverse modalities in a shared latent space. While this enables strong generalization, it also introduces scalability challenges, both in terms of storage and downstream processing. A key open problem is how to achieve semantic compression, reducing the memory footprint of multimodal embeddings while preserving their ability to represent shared semantic content across modalities. In this paper, we prove a strong connection between reducing the modality gap, which is the residual separation of embeddings from different modalities, and the feasibility of post-training semantic compression. When the gap is sufficiently reduced, embeddings from different modalities but expressing the same semantics share a common portion of the space. Therefore, their centroid is a faithful representation of such a semantic concept. This enables replacing multiple embeddings with a single centroid, yielding significant memory savings. We propose a novel approach for semantic compression grounded on the latter intuition, operating directly on pretrained encoders. We demonstrate its effectiveness across diverse large-scale multimodal downstream tasks. Our results highlight that modality alignment is a key enabler for semantic compression, showing that the proposed approach achieves significant compression without sacrificing performance.

[1330] EOE: Evolutionary Optimization of Experts for Training Language Models

Yingshi Chen

Main category: cs.LG

TL;DR: Evolutionary framework for training LLMs using multiple experts with evolutionary operators to reduce model size and accelerate training.

Details

Motivation: To reduce memory usage and increase training throughput for large language models while maintaining accuracy.

Method: Divide model into experts, train one expert per step with AdamW, apply evolutionary operators (crossover, PSO, mutation) between current and best expert.

Result: Best expert achieves nearly same accuracy as full model, reduces model size for inference, accelerates training throughput by 10x+.

Conclusion: Evolutionary training framework enables efficient LLM training with reduced memory and high throughput while maintaining performance.

Abstract: This paper presents an evolutionary framework for the training of large language models(LLM). The models are divided into several experts(sub-networks), which have the same structure but different parameter values. Only one expert is trained at each step. After the classical AdamW optimization, some evolutionary operators(crossover, PSO, and mutation) act on the tensor weights between the current expert and the best expert. So current expert would learn the experience of best expert. The direction of best expert would help current expert’s loss decrease faster. Finally, only save the weight of the best expert. Experiments show that best expert would achieve nearly the same accuracy as the full model. This would greatly reduce the size of the model for inference. Since only one expert is trained at each step, the training needs much less memory and has much higher throughput. Experiments show that the throughput would accelerate more than ten times! Our source code is available. It’s a pure c++/cu framework, which is suitable for easy deployment on PCs and edge computing devices.

[1331] Distributionally Robust Federated Learning with Outlier Resilience

Zifan Wang, Xinlei Yi, Xenia Konti, Michael M. Zavlanos, Karl H. Johansson

Main category: cs.LG

TL;DR: The paper proposes a distributionally robust federated learning method with explicit outlier resilience using an unbalanced Wasserstein distance ambiguity set that handles both geometric distribution shifts and outlier mitigation.

Details

Motivation: Federated learning performance degrades with data distribution perturbations, and existing DRO-based FL methods fail to address the detrimental impact of outliers in local datasets that can bias learned models.

Method: Introduces a novel ambiguity set based on unbalanced Wasserstein distance with KL penalization for outlier mitigation, reformulates as tractable Lagrangian penalty optimization, and proposes distributionally outlier-robust federated learning algorithm.

Result: The approach provides robustness certificates and establishes convergence guarantees, with extensive experiments on synthetic and real-world datasets demonstrating effectiveness.

Conclusion: The proposed method successfully addresses both distribution shifts and outlier resilience in federated learning through a principled DRO framework with theoretical guarantees and empirical validation.

Abstract: Federated learning (FL) enables collaborative model training without direct data sharing, but its performance can degrade significantly in the presence of data distribution perturbations. Distributionally robust optimization (DRO) provides a principled framework for handling this by optimizing performance against the worst-case distributions within a prescribed ambiguity set. However, existing DRO-based FL methods often overlook the detrimental impact of outliers in local datasets, which can disproportionately bias the learned models. In this work, we study distributionally robust federated learning with explicit outlier resilience. We introduce a novel ambiguity set based on the unbalanced Wasserstein distance, which jointly captures geometric distributional shifts and incorporates a non-geometric Kullback–Leibler penalization to mitigate the influence of outliers. This formulation naturally leads to a challenging min–max–max optimization problem. To enable decentralized training, we reformulate the problem as a tractable Lagrangian penalty optimization, which admits robustness certificates. Building on this reformulation, we propose the distributionally outlier-robust federated learning algorithm and establish its convergence guarantees. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approach.

[1332] Scaling with Collapse: Efficient and Predictable Training of LLM Families

Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness

Main category: cs.LG

TL;DR: Training loss curves of LLMs collapse onto a universal trajectory when hyperparameters are optimally scaled, serving as a signature of compute-efficient training with practical applications in diagnostics and hyperparameter tuning.

Details

Motivation: To verify if loss curve collapse phenomenon holds for LLM families trained under practical scaling recipes where multiple parameters (width, depth, learning rate, batch size, weight decay) are scaled jointly.

Method: Analyzed loss curves across different model scales when optimization hyperparameters are set optimally according to empirical scaling laws for given data budgets.

Result: Loss curves collapse across scales precisely when hyperparameters are optimally set, and deviation from collapse serves as an early diagnostic for training pathologies. The predictability enables early stopping in hyperparameter tuning.

Conclusion: Collapse emerges as a signature of compute-efficient training and provides an effective tool for developing efficient LLMs, demonstrated through training the competitive Celerity LLM family.

Abstract: Effective LLM training relies on consistency, meaning that key quantities – such as final losses and optimal hyperparameters – scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can collapse onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under practical scaling recipes, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, Celerity, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.

[1333] Interpretable Kernel Representation Learning at Scale: A Unified Framework Utilizing Nyström Approximation

Maedeh Zarvandi, Michael Timothy, Theresa Wasserer, Debarghya Ghoshdastidar

Main category: cs.LG

TL;DR: KREPES is a scalable framework for kernel-based representation learning using Nyström approximation, enabling efficient learning on large datasets while maintaining interpretability.

Details

Motivation: Kernel methods have strong theoretical foundations but suffer from scalability issues, especially for representation learning with massive unlabeled data in the foundation model era.

Method: Uses Nyström approximation to create a unified framework for kernel-based representation learning that supports various unsupervised and self-supervised losses.

Result: Demonstrated efficiency on large image and tabular datasets, with principled interpretability of learned representations as a key advantage over deep models.

Conclusion: KREPES successfully addresses the scalability limitations of kernel methods for representation learning while providing interpretability benefits over deep learning approaches.

Abstract: Kernel methods provide a theoretically grounded framework for non-linear and non-parametric learning, with strong analytic foundations and statistical guarantees. Yet, their scalability has long been limited by prohibitive time and memory costs. While progress has been made in scaling kernel regression, no framework exists for scalable kernel-based representation learning, restricting their use in the era of foundation models where representations are learned from massive unlabeled data. We introduce KREPES – a unified, scalable framework for kernel-based representation learning via Nystr"om approximation. KREPES accommodates a wide range of unsupervised and self-supervised losses, and experiments on large image and tabular datasets demonstrate its efficiency. Crucially, KREPES enables principled interpretability of the learned representations, an immediate benefit over deep models, which we substantiate through dedicated analysis.

[1334] ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation

Aasheesh Singh, Vishal Vaddina, Dagnachew Birru

Main category: cs.LG

TL;DR: ORPO-Distill is a cross-architecture LLM distillation method that treats distillation as preference optimization, using diverse reasoning traces and an odds-ratio objective to contrast teacher and student outputs.

Details

Motivation: To improve knowledge transfer in LLM distillation by moving beyond standard Chain-of-Thought distillation and enabling more effective cross-architecture knowledge transfer through preference optimization.

Method: Formulates distillation as preference optimization with an Odds-Ratio Preference Optimization objective that contrasts teacher and student reasoning traces, using a mixed-policy strategy for student-generated outputs.

Result: Experiments on five datasets with multiple student models show consistent improvements over conventional black-box knowledge distillation baselines.

Conclusion: ORPO-Distill provides an effective general-purpose approach for cross-architecture LLM distillation that outperforms existing methods through its preference optimization formulation and mixed-policy strategy.

Abstract: We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for utilizing student-generated outputs, outperforming both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.

[1335] FS-KAN: Permutation Equivariant Kolmogorov-Arnold Networks via Function Sharing

Ran Elbaz, Guy Bar-Shalom, Yam Eitan, Fabrizio Frasca, Haggai Maron

Main category: cs.LG

TL;DR: FS-KAN is a principled framework for building permutation equivariant Kolmogorov-Arnold Networks that unifies and extends previous work, offering superior data efficiency while maintaining interpretability.

Details

Motivation: To develop a general framework for applying equivariant KANs to data with permutation symmetries, addressing the lack of principled approaches for arbitrary permutation symmetry groups.

Method: Introduces Function Sharing KAN (FS-KAN) by generalizing parameter-sharing schemes to the Kolmogorov-Arnold setup, constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups.

Result: FS-KANs have the same expressive power as standard parameter-sharing networks and demonstrate superior data efficiency across multiple data types and symmetry groups, particularly in low-data regimes.

Conclusion: FS-KANs are an excellent architecture choice for low-data regimes, combining the interpretability and adaptability of KANs with improved data efficiency through principled equivariant design.

Abstract: Permutation equivariant neural networks employing parameter-sharing schemes have emerged as powerful models for leveraging a wide range of data symmetries, significantly enhancing the generalization and computational efficiency of the resulting models. Recently, Kolmogorov-Arnold Networks (KANs) have demonstrated promise through their improved interpretability and expressivity compared to traditional architectures based on MLPs. While equivariant KANs have been explored in recent literature for a few specific data types, a principled framework for applying them to data with permutation symmetries in a general context remains absent. This paper introduces Function Sharing KAN (FS-KAN), a principled approach to constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups, unifying and significantly extending previous work in this domain. We derive the basic construction of these FS-KAN layers by generalizing parameter-sharing schemes to the Kolmogorov-Arnold setup and provide a theoretical analysis demonstrating that FS-KANs have the same expressive power as networks that use standard parameter-sharing layers, allowing us to transfer well-known and important expressivity results from parameter-sharing networks to FS-KANs. Empirical evaluations on multiple data types and symmetry groups show that FS-KANs exhibit superior data efficiency compared to standard parameter-sharing layers, by a wide margin in certain cases, while preserving the interpretability and adaptability of KANs, making them an excellent architecture choice in low-data regimes.

[1336] Rethinking Entropy Regularization in Large Reasoning Models

Yuxian Jiang, Yafu Li, Guanxu Chen, Dongrui Liu, Yu Cheng, Jing Shao

Main category: cs.LG

TL;DR: SIREN addresses entropy collapse and premature convergence in RLVR for large reasoning models through selective entropy regularization with two-step masking and self-anchored stabilization.

Details

Motivation: RLVR enhances reasoning in LRMs but suffers from entropy collapse and premature convergence. Traditional entropy regularization fails due to vast action spaces and long trajectories causing global entropy explosion.

Method: Proposes SIREN with two-step entropy masking (top-p mask and peak-entropy mask) to confine exploration to meaningful actions/states, plus self-anchored regularization for training stability.

Result: Achieves superior performance across 5 mathematical benchmarks, with +6.6 maj@k improvement on AIME24/25 using Qwen2.5-Math-7B. Promotes response diversity and maintains appropriate entropy levels.

Conclusion: SIREN effectively mitigates premature convergence in RLVR for LRMs by maintaining validation pass@k throughout training through controlled entropy management.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown great promise in enhancing the reasoning abilities of large reasoning models (LRMs). However, it suffers from a critical issue: entropy collapse and premature convergence. Naive entropy regularization, a common approach for encouraging exploration in the traditional RL literature, fails to address this problem in the context of LRM. Our analysis reveals that this failure stems from the vast action space and long trajectories in LRMs, which easily trigger a global entropy explosion as the model indiscriminately explores all possible actions and states. To address this, we propose SIREN (SelectIve entRopy rEgularizatioN), a method that confines exploration to a meaningful subset of actions and states. SIREN achieves this through a two-step entropy masking mechanism, consisting of a top-p mask and a peak-entropy mask. In addition, regularization is transformed into a self-anchored form to stabilize training. Across five mathematical benchmarks, SIREN attains superior average performance over previous entropy-related RLVR approaches, exemplified by a +6.6 maj@k improvement on AIME24/25 with Qwen2.5-Math-7B. Further analysis confirms that SIREN promotes greater response diversity and maintains entropy at an appropriate level, which helps to preserve the validation pass@k throughout training. This effectively mitigates the premature convergence problem common in RLVR for LRM.

[1337] One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

Minh Le, Bao-Ngoc Dao, Huy Nguyen, Quyen Tran, Anh Nguyen, Nhat Ho

Main category: cs.LG

TL;DR: SMoPE is a novel continual learning framework that uses a sparse mixture of experts architecture with shared prompts to achieve both high performance and computational efficiency, outperforming task-specific prompt methods while reducing parameters and costs.

Details

Motivation: To address the trade-off between task-specific prompts (computationally expensive) and shared prompts (suffering from knowledge interference) in continual learning, seeking a solution that combines efficiency with strong performance.

Method: Organizes shared prompts into multiple “prompt experts” in a sparse MoE architecture, activates only relevant experts per input using prompt-attention score aggregation, employs adaptive noise for balanced utilization, and uses prototype-based loss for expert specialization.

Result: Consistently outperforms task-specific prompt methods and achieves competitive performance with state-of-the-art approaches while significantly reducing parameter counts and computational costs across multiple CL benchmarks.

Conclusion: SMoPE successfully reconciles the efficiency-performance trade-off in prompt-based continual learning by combining shared prompt benefits with task-specific advantages through sparse expert activation.

Abstract: Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple “prompt experts” within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.

[1338] Guided Uncertainty Learning Using a Post-Hoc Evidential Meta-Model

Charmaine Barker, Daniel Bethell, Simos Gerasimou

Main category: cs.LG

TL;DR: GUIDE is a lightweight evidential learning meta-model that attaches to frozen deep learning models to teach them how and when to express uncertainty without retraining or architectural changes, improving out-of-distribution detection by ~77% and adversarial attack detection by ~80%.

Details

Motivation: Reliable uncertainty quantification remains a major obstacle for deploying deep learning models under distributional shift, as existing post-hoc approaches either inherit misplaced confidence or merely reshape predictions without teaching models when to be uncertain.

Method: GUIDE identifies salient internal features via calibration, then employs these features to construct a noise-driven curriculum that teaches the model how and when to express uncertainty. It requires no retraining, architectural modifications, or manual intermediate-layer selection.

Result: GUIDE improves out-of-distribution detection by ~77% and adversarial attack detection by ~80% while preserving in-distribution performance, consistently outperforming state-of-the-art approaches across diverse benchmarks.

Conclusion: GUIDE demonstrates the need for actively guiding uncertainty to close the gap between predictive confidence and reliability, providing a broadly applicable solution that avoids distilling overconfidence from base models.

Abstract: Reliable uncertainty quantification remains a major obstacle to the deployment of deep learning models under distributional shift. Existing post-hoc approaches that retrofit pretrained models either inherit misplaced confidence or merely reshape predictions, without teaching the model when to be uncertain. We introduce GUIDE, a lightweight evidential learning meta-model approach that attaches to a frozen deep learning model and explicitly learns how and when to be uncertain. GUIDE identifies salient internal features via a calibration stage, and then employs these features to construct a noise-driven curriculum that teaches the model how and when to express uncertainty. GUIDE requires no retraining, no architectural modifications, and no manual intermediate-layer selection to the base deep learning model, thus ensuring broad applicability and minimal user intervention. The resulting model avoids distilling overconfidence from the base model, improves out-of-distribution detection by ~77% and adversarial attack detection by ~80%, while preserving in-distribution performance. Across diverse benchmarks, GUIDE consistently outperforms state-of-the-art approaches, evidencing the need for actively guiding uncertainty to close the gap between predictive confidence and reliability.

[1339] LLM DNA: Tracing Model Evolution via Functional Representations

Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He

Main category: cs.LG

TL;DR: The paper proposes LLM DNA as a low-dimensional representation to track evolutionary relationships between large language models, addressing the challenge of undocumented fine-tuning and adaptation paths.

Details

Motivation: The rapid proliferation of LLMs has created an opaque landscape where evolutionary relationships through fine-tuning, distillation, or adaptation are often unclear, complicating model management and understanding.

Method: Mathematically define LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior, prove its properties (inheritance and genetic determinism), and develop a training-free pipeline for DNA extraction.

Result: Experiments across 305 LLMs show DNA aligns with prior studies, achieves competitive performance on specific tasks, uncovers undocumented relationships, and enables construction of evolutionary trees that reflect architectural shifts and temporal progression.

Conclusion: LLM DNA provides a scalable framework for understanding and tracking the evolutionary relationships among language models, revealing patterns in their development and adaptation over time.

Abstract: The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.

[1340] SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

Haoming Wen, Yushi Bai, Juanzi Li, Jie Tang

Main category: cs.LG

TL;DR: SIRI is a reinforcement learning method that alternates between compressing and expanding reasoning length during training, improving both performance and efficiency in Large Reasoning Models.

Details

Motivation: Existing methods struggle with repetitive thinking patterns in LRMs, creating a trade-off between reducing redundancy and maintaining performance.

Method: Iteratively alternates between compression (shortening rollout length) and expansion (relaxing length limit) phases during training to dynamically adjust reasoning budget.

Result: SIRI-low improved AIME24 performance by 43.2% while reducing token usage by 46.9%; SIRI-high achieved highest accuracy compared to other methods.

Conclusion: Periodically oscillating output truncation length during training can dynamically balance exploration and efficiency, converging toward an optimal performance-efficiency trade-off.

Abstract: We introduce SIRI, Scaling Iterative Reinforcement Learning with Interleaved Compression, a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Existing studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget, by dynamically adjusting the maximum rollout length during training. The compression phase cuts the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase then relaxes the length limit, providing space for the model to explore and plan in long-horizon settings. Remarkably, we find that after each compression-expansion cycle, the model’s performance improves even as its output length decreases, steadily pushing it closer to the Pareto frontier in the performance-efficiency trade-off. Training on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations, and SIRI-high achieves the highest accuracy compared to all other methods (Figure 1). Our findings shed light on the potential of periodically oscillating the LRM’s output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging towards an optimal “sweet spot” between the two. Our models are publicly available.

[1341] Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

Jonas Hübotter, Patrik Wolf, Alexander Shevchenko, Dennis Jüni, Andreas Krause, Gil Kur

Main category: cs.LG

TL;DR: TTT improves performance by enabling specialization after generalization, focusing model capacity on task-relevant concepts rather than just out-of-distribution adaptation.

Details

Motivation: To understand why test-time training (TTT) works effectively, especially for in-distribution data with foundation models, challenging previous explanations focused on out-of-distribution adaptation.

Method: Proposed a theoretical model under linear representation hypothesis, trained sparse autoencoder on ImageNet to validate assumptions, and conducted scaling studies across image and language tasks.

Result: TTT achieves substantially smaller in-distribution test error than global training; sparse autoencoder analysis shows semantically related data points share few concepts; scaling studies identify regimes where specialization is most effective.

Conclusion: Foundation models are globally underparameterized, and TTT provides effective specialization after generalization by focusing capacity on task-relevant concepts, with practical implications across domains.

Abstract: Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models with most test data being in-distribution questions these explanations. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization, focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model’s key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.

[1342] Trading Carbon for Physics: On the Resource Efficiency of Machine Learning for Spatio-Temporal Forecasting

Sophia N. Wilson, Jens Hesselbjerg Christensen, Raghavendra Selvan

Main category: cs.LG

TL;DR: Physics inductive biases can significantly improve model efficiency (compute, energy, carbon) while maintaining or improving efficacy in spatio-temporal forecasting tasks, offering a principled approach to reduce ML’s environmental impact.

Details

Motivation: Current deep learning focuses excessively on model efficacy, leading to large-scale models with massive resource requirements and significant carbon footprint. The paper explores how physics inductive biases can provide better trade-offs between efficacy and efficiency.

Method: Study various models for spatio-temporal forecasting with different levels of physics inductive bias, including standard physics-informed models and more recent approaches like flow matching as general purpose methods for spatio-temporal forecasting.

Result: Embedding physics inductive biases yields substantial efficiency gains while retaining or even improving task efficacy. Experiments demonstrate these biases provide a principled way to improve efficiency and reduce carbon footprint.

Conclusion: Model efficiency, along with model efficacy, should become a core consideration driving machine learning model development and deployment to address environmental concerns.

Abstract: Development of modern deep learning methods has been driven primarily by the push for improving model efficacy (accuracy metrics). This sole focus on efficacy has steered development of large-scale models that require massive resources, and results in considerable carbon footprint across the model life-cycle. In this work, we explore how physics inductive biases can offer useful trade-offs between model efficacy and model efficiency (compute, energy, and carbon). We study a variety of models for spatio-temporal forecasting, a task governed by physical laws and well-suited for exploring different levels of physics inductive bias. We show that embedding physics inductive biases into the model design can yield substantial efficiency gains while retaining or even improving efficacy for the tasks under consideration. In addition to using standard physics-informed spatio-temporal models, we demonstrate the usefulness of more recent models like flow matching as a general purpose method for spatio-temporal forecasting. Our experiments show that incorporating physics inductive biases offer a principled way to improve the efficiency and reduce the carbon footprint of machine learning models. We argue that model efficiency, along with model efficacy, should become a core consideration driving machine learning model development and deployment.

[1343] Short window attention enables long-term memorization

Loïc Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazaré, Gabriel Synnaeve, Hervé Jégou

Main category: cs.LG

TL;DR: SWAX is a hybrid architecture combining sliding-window attention with xLSTM layers, where stochastic window size training improves performance on both short and long-context tasks by encouraging better use of xLSTM memory.

Details

Motivation: To study the interplay between sliding window attention and linear RNN layers, and address the counter-intuitive finding that larger windows don't improve long-context performance while small windows hurt short-context tasks.

Method: Introduce SWAX architecture with sliding-window attention and xLSTM layers, trained using stochastic window sizes to force the model to leverage both longer context windows and xLSTM memory.

Result: SWAX with stochastic window training significantly outperforms regular window attention on both short and long-context problems.

Conclusion: Stochastic window size training in hybrid architectures enables better utilization of both attention mechanisms and RNN memory, solving the limitations of fixed window sizes.

Abstract: Recent works show that hybrid architectures combining sliding window softmax attention layers with linear recurrent neural network (RNN) layers outperform both of these architectures taken separately. However, the impact of the window length and the interplay between softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding with SWAX is that larger sliding windows do not improve the long-context performance. In fact, short window attention encourages the model to better train the long-term memory of the xLSTM, by relying less on the softmax attention mechanism for long context-retrieval. The issue with small sliding windows is that they are detrimental for short-context tasks, which could be solved with information from moderately larger sliding windows otherwise. Therefore, we train SWAX by stochastically changing the sliding window size, forcing the model to leverage both a longer context window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular window attention both on short and long-context problems.

[1344] Deep Reinforcement Learning in Action: Real-Time Control of Vortex-Induced Vibrations

Hussam Sababha, Bernat Font, Mohammed Daqaq

Main category: cs.LG

TL;DR: Experimental deployment of deep reinforcement learning for active flow control of vortex-induced vibrations in a circular cylinder at Re=3000 using rotary actuation, achieving up to 80-95% vibration suppression.

Details

Motivation: To demonstrate real-time control in challenging experimental settings at high Reynolds numbers, addressing practical constraints like actuator delay that were not covered in prior low-Reynolds-number simulations.

Method: Deep reinforcement learning with rotary actuation control, using state feedback (displacement and velocity) and augmented with past control actions to handle actuation delays.

Result: With state feedback alone: 80% vibration suppression using low-frequency rotary control. With augmented learning: over 95% vibration attenuation using high-frequency rotary control that modifies vortex shedding.

Conclusion: DRL demonstrates adaptability for active flow control in real-world experiments and can overcome instrumental limitations like actuation lag.

Abstract: This study showcases an experimental deployment of deep reinforcement learning (DRL) for active flow control (AFC) of vortex-induced vibrations (VIV) in a circular cylinder at a high Reynolds number (Re = 3000) using rotary actuation. Departing from prior work that relied on low-Reynolds-number numerical simulations, this research demonstrates real-time control in a challenging experimental setting, successfully addressing practical constraints such as actuator delay. When the learning algorithm is provided with state feedback alone (displacement and velocity of the oscillating cylinder), the DRL agent learns a low-frequency rotary control strategy that achieves up to 80% vibration suppression which leverages the traditional lock-on phenomenon. While this level of suppression is significant, it remains below the performance achieved using high-frequency rotary actuation. The reduction in performance is attributed to actuation delays and can be mitigated by augmenting the learning algorithm with past control actions. This enables the agent to learn a high-frequency rotary control strategy that effectively modifies vortex shedding and achieves over 95% vibration attenuation. These results demonstrate the adaptability of DRL for AFC in real-world experiments and its ability to overcome instrumental limitations such as actuation lag.

[1345] Emergent World Representations in OpenVLA

Marco Molinari, Leonardo Nevali, Saharsha Navani, Omar G. Younis

Main category: cs.LG

TL;DR: The paper investigates whether Vision Language Action models (VLAs) implicitly learn world models by probing OpenVLA’s state representations using embedding arithmetic and linear/non-linear probes.

Details

Motivation: To determine if VLAs trained with policy-based RL implicitly learn world models, which is a key characteristic of model-based reinforcement learning, despite not explicitly modeling environmental dynamics.

Method: Used embedding arithmetic on state representations and trained linear/non-linear probes on model activations across layers to test if transition vectors between sequential states are recoverable from intermediate activations.

Result: Found statistically significant predictive ability on state transitions exceeding baseline embeddings, indicating that OpenVLA encodes an internal world model. Also discovered that this world model emerges as training progresses.

Conclusion: OpenVLA does contain latent knowledge of state transitions and implicitly learns a world model, with evidence suggesting this capability develops during training. Proposed a pipeline using Sparse Autoencoders for further analysis.

Abstract: Vision Language Action models (VLAs) trained with policy-based reinforcement learning (RL) encode complex behaviors without explicitly modeling environmental dynamics. However, it remains unclear whether VLAs implicitly learn world models, a hallmark of model-based RL. We propose an experimental methodology using embedding arithmetic on state representations to probe whether OpenVLA, the current state of the art in VLAs, contains latent knowledge of state transitions. Specifically, we measure the difference between embeddings of sequential environment states and test whether this transition vector is recoverable from intermediate model activations. Using linear and non linear probes trained on activations across layers, we find statistically significant predictive ability on state transitions exceeding baselines (embeddings), indicating that OpenVLA encodes an internal world model (as opposed to the probes learning the state transitions). We investigate the predictive ability of an earlier checkpoint of OpenVLA, and uncover hints that the world model emerges as training progresses. Finally, we outline a pipeline leveraging Sparse Autoencoders (SAEs) to analyze OpenVLA’s world model.

[1346] Learning to Solve Optimization Problems Constrained with Partial Differential Equations

Yusuf Guven, Vincenzo Di Vito, Ferdinando Fioretto

Main category: cs.LG

TL;DR: A learning-based framework combining dynamic predictor and optimization surrogate for PDE-constrained optimization, achieving comparable solution quality to classical methods with 4 orders of magnitude speedup.

Details

Motivation: PDE-constrained optimization problems in scientific domains are computationally demanding due to tight coupling between decision variables and PDE state variables, requiring handling of high-dimensional discretization and dynamic constraints.

Method: Dual-network design: dynamic predictor (time-discrete Neural Operator) approximates PDE system trajectories, and optimization surrogate (proxy optimizer techniques) approximates optimal decisions, explicitly capturing decision-PDE coupling.

Result: Validated on benchmark tasks (Burgers’ equation, heat equation, voltage regulation), achieves solution quality comparable to Direct Method and MPC while providing up to 10,000x computational speed improvement.

Conclusion: The proposed learning-based framework enables real-time approximation of optimal strategies for PDE-constrained optimization with significant computational efficiency gains while maintaining solution quality.

Abstract: Partial differential equation (PDE)-constrained optimization arises in many scientific and engineering domains, such as energy systems, fluid dynamics and material design. In these problems, the decision variables (e.g., control inputs or design parameters) are tightly coupled with the PDE state variables, and the feasible set is implicitly defined by the governing PDE constraints. This coupling makes the problems computationally demanding, as it requires handling high dimensional discretization and dynamic constraints. To address these challenges, this paper introduces a learning-based framework that integrates a dynamic predictor with an optimization surrogate. The dynamic predictor, a novel time-discrete Neural Operator (Lu et al.), efficiently approximate system trajectories governed by PDE dynamics, while the optimization surrogate leverages proxy optimizer techniques (Kotary et al.) to approximate the associated optimal decisions. This dual-network design enables real-time approximation of optimal strategies while explicitly capturing the coupling between decisions and PDE dynamics. We validate the proposed approach on benchmark PDE-constrained optimization tasks inlacing Burgers’ equation, heat equation and voltage regulation, and demonstrate that it achieves solution quality comparable to classical control-based algorithms, such as the Direct Method and Model Predictive Control (MPC), while providing up to four orders of magnitude improvement in computational speed.

[1347] SAIP: A Plug-and-Play Scale-adaptive Module in Diffusion-based Inverse Problems

Lingyu Wang, Xiangming Meng

Main category: cs.LG

TL;DR: SAIP is a plug-and-play module that adaptively refines the scale parameter in diffusion-based inverse problem solving, improving performance over fixed-scale methods like DPS, DMPS, and πGDM.

Details

Motivation: Existing diffusion-based inverse problem solvers use fixed, manually tuned scales to balance prior and likelihood scores, which is suboptimal since the ideal balance varies across timesteps and tasks, limiting performance and generalization.

Method: Proposes SAIP, a plug-and-play module that adaptively refines the scale parameter at each timestep without requiring retraining or modifying the diffusion backbone. It integrates seamlessly into existing samplers.

Result: SAIP consistently improves reconstruction quality across diverse image restoration tasks, including challenging scenarios, by providing adaptive scale refinement.

Conclusion: The adaptive scale refinement approach of SAIP addresses limitations of fixed-scale methods and enhances performance in diffusion-based inverse problem solving for image restoration.

Abstract: Solving inverse problems with diffusion models has shown promise in tasks such as image restoration. A common approach is to formulate the problem in a Bayesian framework and sample from the posterior by combining the prior score with the likelihood score. Since the likelihood term is often intractable, estimators like DPS, DMPS, and $\pi$GDM are widely adopted. However, these methods rely on a fixed, manually tuned scale to balance prior and likelihood contributions. Such a static design is suboptimal, as the ideal balance varies across timesteps and tasks, limiting performance and generalization. To address this issue, we propose SAIP, a plug-and-play module that adaptively refines the scale at each timestep without retraining or altering the diffusion backbone. SAIP integrates seamlessly into existing samplers and consistently improves reconstruction quality across diverse image restoration tasks, including challenging scenarios.

[1348] CURA: Size Isnt All You Need – A Compact Universal Architecture for On-Device Intelligence

Jae-Bum Seo, Muhammad Salman, Lismer Andres Caceres-Najarro

Main category: cs.LG

TL;DR: CURA is a compact and lightweight AI architecture inspired by analog audio circuits that achieves high performance across diverse ML tasks with dramatically fewer parameters than existing approaches.

Details

Motivation: Existing on-device AI architectures lack compactness (parameters scale with task complexity) and generalizability (models are domain-specific and cannot adapt across different application domains).

Method: Proposed CURA architecture inspired by analog audio signal processing circuits, designed to be compact and adaptable across multiple domains including regression, classification, NLP, and computer vision.

Result: CURA achieved equivalent accuracy using up to 2,500 times fewer parameters, demonstrated consistent performance across 4 NLP benchmarks and 1 computer vision dataset (F1-scores up to 90%), and delivered superior forecasting accuracy with 1.6x lower MAE and 2.1x lower MSE than competing models.

Conclusion: CURA provides a compact, generalizable solution for resource-constrained environments that can capture complex patterns while maintaining extremely low model complexity across diverse ML domains.

Abstract: Existing on-device AI architectures for resource-constrained environments face two critical limitations: they lack compactness, with parameter requirements scaling proportionally to task complexity, and they exhibit poor generalizability, performing effectively only on specific application domains (e.g., models designed for regression tasks cannot adapt to natural language processing (NLP) applications). In this paper, we propose CURA, an architecture inspired by analog audio signal processing circuits that provides a compact and lightweight solution for diverse machine learning tasks across multiple domains. Our architecture offers three key advantages over existing approaches: (1) Compactness: it requires significantly fewer parameters regardless of task complexity; (2) Generalizability: it adapts seamlessly across regression, classification, complex NLP, and computer vision tasks; and (3) Complex pattern recognition: it can capture intricate data patterns while maintaining extremely low model complexity. We evaluated CURA across diverse datasets and domains. For compactness, it achieved equivalent accuracy using up to 2,500 times fewer parameters compared to baseline models. For generalizability, it demonstrated consistent performance across four NLP benchmarks and one computer vision dataset, nearly matching specialized existing models (achieving F1-scores up to 90%). Lastly, it delivers superior forecasting accuracy for complex patterns, achieving 1.6 times lower mean absolute error and 2.1 times lower mean squared error than competing models.

[1349] Evaluating classification performance across operating contexts: A comparison of decision curve analysis and cost curves

Louise AC Millard, Peter A Flach

Main category: cs.LG

TL;DR: Decision curve analysis (DCA) and cost curves are compared, showing they are closely related. Decision curves are equivalent to Brier curves when scores are calibrated and thresholds are set using relative classification values. Both methods select the same optimal model at any threshold, but Brier curves are more generally applicable across thresholds.

Details

Motivation: To understand the relationship between decision curve analysis and cost curves, and determine their respective strengths and limitations for model evaluation in classification contexts where decision thresholds matter.

Method: Theoretical comparison of decision curve analysis (DCA) and cost curves, specifically Brier curves, by analyzing their mathematical relationships and assumptions about score calibration and threshold setting based on relative classification values.

Result: Decision curves are closely related to Brier curves, with equivalent x-axes. Net benefit (DCA) and Brier loss always choose the same optimal model at any threshold. Brier curves are more generally applicable across thresholds, and the area under Brier curve equals the Brier score.

Conclusion: Brier curves are more generally applicable than DCA for evaluating models across thresholds. The upper envelope decision curve is suggested as a useful comparison for DCA to show potential net benefit gains through recalibration alone.

Abstract: Classification models typically predict a score and use a decision threshold to produce a classification. Appropriate model evaluation should carefully consider the context in which a model will be used, including the relative value of correct classifications of positive versus negative examples, which affects the threshold that should be used. Decision curve analysis (DCA) and cost curves are model evaluation approaches that assess the expected utility and expected loss of prediction models, respectively, across decision thresholds. We compared DCA and cost curves to determine how they are related, and their strengths and limitations. We demonstrate that decision curves are closely related to a specific type of cost curve called a Brier curve. Both curves are derived assuming model scores are calibrated and setting the classification threshold using the relative value of correct positive and negative classifications, and the x-axis of both curves are equivalent. Net benefit (used for DCA) and Brier loss (used for Brier curves) will always choose the same model as optimal at any given threshold. Across thresholds, differences in Brier loss are comparable whereas differences in net benefit cannot be compared. Brier curves are more generally applicable (when a wider range of thresholds are plausible), and the area under the Brier curve is the Brier score. We demonstrate that reference lines common in each space can be included in either and suggest the upper envelope decision curve as a useful comparison for DCA showing the possible gain in net benefit that could be achieved through recalibration alone.

[1350] Learning Hamiltonian Dynamics at Scale: A Differential-Geometric Approach

Katharina Friedl, Noémie Jaquier, Mika Liao, Danica Kragic

Main category: cs.LG

TL;DR: RO-HNN combines Hamiltonian mechanics with model reduction to scale physics-inspired neural networks to high-dimensional systems using a symplectic autoencoder and geometric Hamiltonian network.

Details

Motivation: Existing physics-inspired neural networks enforce conservation laws but struggle with scaling to high-dimensional systems, creating a need for methods that maintain physical consistency while being computationally feasible.

Method: Two core components: 1) geometrically-constrained symplectic autoencoder for learning low-dimensional structure-preserving submanifold, 2) geometric Hamiltonian neural network for modeling dynamics on the submanifold.

Result: RO-HNN provides physically-consistent, stable, and generalizable predictions of complex high-dimensional dynamics, effectively extending Hamiltonian neural networks to high-dimensional systems.

Conclusion: The proposed approach successfully combines conservation laws with scalability, enabling accurate modeling of high-dimensional physical systems while maintaining physical consistency.

Abstract: By embedding physical intuition, network architectures enforce fundamental properties, such as energy conservation laws, leading to plausible predictions. Yet, scaling these models to intrinsically high-dimensional systems remains a significant challenge. This paper introduces Geometric Reduced-order Hamiltonian Neural Network (RO-HNN), a novel physics-inspired neural network that combines the conservation laws of Hamiltonian mechanics with the scalability of model order reduction. RO-HNN is built on two core components: a novel geometrically-constrained symplectic autoencoder that learns a low-dimensional, structure-preserving symplectic submanifold, and a geometric Hamiltonian neural network that models the dynamics on the submanifold. Our experiments demonstrate that RO-HNN provides physically-consistent, stable, and generalizable predictions of complex high-dimensional dynamics, thereby effectively extending the scope of Hamiltonian neural networks to high-dimensional physical systems.

[1351] Identity Bridge: Enabling Implicit Reasoning via Shared Latent Memory

Pengxiao Lin, Zheng-An Chen, Zhi-Qin John Xu

Main category: cs.LG

TL;DR: The Identity Bridge mechanism resolves compositional reasoning failures in LLMs by adding zero-hop identity supervision, enabling successful out-of-distribution two-hop reasoning through latent geometry reshaping.

Details

Motivation: Large language models often fail at compositional reasoning tasks, particularly exhibiting the 'curse of two-hop reasoning' where they cannot perform multi-step reasoning beyond their training distribution.

Method: Introduces Identity Bridge - supervising models on zero-hop identity tasks. Uses theoretical analysis with Emb-MLP model, showing identity supervision reshapes latent geometry through implicit nuclear-norm regularization. Enhances effect with small initialization or weight decay.

Result: Enables models to successfully perform out-of-distribution two-hop reasoning that they otherwise completely fail. Shows alignment is induced by implicit nuclear-norm regularization favoring low-rank solutions with shared structure across tasks.

Conclusion: Identity supervision reshapes latent geometry to enable compositional reasoning. Large-scale models achieve two-hop reasoning through latent memory, providing inspiration for enhancing implicit reasoning abilities.

Abstract: Despite remarkable advances, large language models often fail at compositional reasoning tasks, a phenomenon exemplified by the ``curse of two-hop reasoning’’. This paper introduces the Identity Bridge, a simple yet powerful mechanism that resolves this compositionality gap by supervising the model on a zero-hop identity task. We demonstrate empirically that this addition enables models to successfully perform out-of-distribution two-hop reasoning, a task they otherwise completely fail. To explain this phenomenon, we provide a theoretical analysis using a simplified Emb-MLP model, proving that identity supervision reshapes the model’s latent geometry. We show this alignment is induced by an implicit nuclear-norm regularization during optimization, which favors low-rank solutions that share structure across tasks. For complex tasks, we use small initialization or weight decay to enhance the regularization effect, which enhances the latent space alignment effect and slows down the generalization decay. Finally, we extend our investigation to large-scale models, observing that they still achieve two-hop reasoning through the latent memory, which provides crucial inspiration for enhancing their implicit reasoning abilities.

[1352] HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling

Max van Spengler, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao

Main category: cs.LG

TL;DR: HyperHELM introduces hyperbolic geometry for mRNA language modeling, outperforming Euclidean models on biological tasks by better capturing hierarchical relationships.

Details

Motivation: Euclidean geometry in language models may not align well with the hierarchical structures inherent to biological sequences like mRNA, while hyperbolic geometry offers a better alternative for hierarchical data.

Method: HyperHELM implements masked language model pre-training in hyperbolic space using a hybrid design with hyperbolic layers atop a Euclidean backbone, aligning representations with biological hierarchy between mRNA and amino acids.

Result: Outperforms Euclidean baselines on 9/10 property prediction tasks (10% average improvement), excels in out-of-distribution generalization to long and low-GC content sequences, and surpasses hierarchy-aware Euclidean models by 3% in antibody region annotation accuracy.

Conclusion: Hyperbolic geometry serves as an effective inductive bias for hierarchical language modeling of mRNA sequences, demonstrating superior performance over Euclidean approaches.

Abstract: Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find a way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 tasks involving property prediction, with 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.

[1353] T-POP: Test-Time Personalization with Online Preference Feedback

Zikun Qu, Min Zhang, Mingze Kong, Xiang Li, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Shuang Qiu, Yao Shu, Zhongxiang Dai

Main category: cs.LG

TL;DR: T-POP is a real-time personalization method for LLMs that uses online pairwise preference feedback during text generation, combining test-time alignment with dueling bandits to personalize responses without model fine-tuning.

Details

Motivation: Current LLM personalization methods require extensive user data or slow fine-tuning, creating a cold-start problem for new users.

Method: T-POP learns user preferences online through pairwise feedback queries, using dueling bandits to balance exploration and exploitation while steering decoding of a frozen LLM.

Result: T-POP achieves rapid, data-efficient personalization, significantly outperforming baselines and improving with more user interactions.

Conclusion: The proposed method enables effective real-time personalization for new users without model updates, addressing the cold-start problem in LLM personalization.

Abstract: Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are ill-suited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challenge, we introduce a new paradigm for real-time personalization by learning from online pairwise preference feedback collected during text generation. We propose T-POP (Test-Time Personalization with Online Preference Feedback}), a novel algorithm that synergistically combines test-time alignment with dueling bandits. Without updating the LLM parameters, T-POP steers the decoding process of a frozen LLM by learning a reward function online that captures user preferences. By leveraging dueling bandits, T-POP intelligently queries the user to efficiently balance between exploring their preferences and exploiting the learned knowledge to generate personalized text. Extensive experiments demonstrate that T-POP achieves rapid and data-efficient personalization, significantly outperforming existing baselines and showing consistent improvement with more user interactions.

[1354] FedPOB: Sample-Efficient Federated Prompt Optimization via Bandits

Pingchen Lu, Zhi Hong, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Yao Shu, Min Zhang, Shuang Qiu, Zhongxiang Dai

Main category: cs.LG

TL;DR: A federated prompt optimization framework using multi-armed bandits that addresses black-box LLM optimization, sample efficiency, and privacy-preserving collaboration among multiple users.

Details

Motivation: Real-world LLM applications face three major challenges: black-box nature of proprietary LLMs, high query costs requiring sample efficiency, and need for privacy-preserving collaboration among multiple users.

Method: Proposed FedPOB (federated variant of Linear UCB) and FedPOB-Pref (for comparative user feedback based on federated dueling bandits), where agents collaborate by sharing model parameters instead of raw data.

Result: Extensive experiments show both FedPOB and FedPOB-Pref significantly outperform existing baselines, with performance consistently improving as more agents participate in the collaboration.

Conclusion: The federated multi-armed bandit approach effectively addresses the three key challenges in prompt optimization while enabling collaborative learning with proven benefits from more participating agents.

Abstract: The performance of large language models (LLMs) is highly sensitive to the input prompt, making prompt optimization a critical task. However, real-world application is hindered by three major challenges: (1) the black-box nature of powerful proprietary LLMs, (2) the need for high sample efficiency due to query costs, and (3) the desire for privacy-preserving collaboration among multiple users. To address these challenges simultaneously, we introduce a novel framework for sample-efficient federated prompt optimization based on multi-armed bandits (MABs). The MAB framework is uniquely suited for this problem as it is (1) inherently a black-box optimization method, (2) practically sample-efficient, and (3) enables collaborative learning with theoretically guaranteed benefit from more participating agents. We first propose the Federated Prompt Optimization via Bandits (FedPOB) algorithm, a federated variant of the Linear UCB algorithm, where agents collaborate by sharing model parameters instead of raw data. We then extend our approach to the practical setting of comparative user feedback by introducing FedPOB with Preference Feedback (FedPOB-Pref), an efficient algorithm based on federated dueling bandits. Extensive experiments demonstrate that both FedPOB and FedPOB-Pref significantly outperform existing baselines and that their performance consistently improves as more agents participate in the collaboration, validating the effectiveness of our federated approach.

[1355] Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF

Jing Liu

Main category: cs.LG

TL;DR: Proposes a mechanistic interpretability framework to identify specialized neural circuits for rare-event processing in RLHF reward models, addressing systematic failures on longtail distributions through Circuit-Aware Reward Training (CART).

Details

Motivation: RLHF reward models exhibit systematic failures on longtail distributions, leading to reward hacking and misalignment, which motivates understanding and improving their rare-event processing capabilities.

Method: Uses mechanistic interpretability to identify specialized neural circuits for rare events, then applies Circuit-Aware Reward Training (CART) that leverages circuit analysis for data augmentation, regularization, and ensemble strategies.

Result: The framework provides both theoretical insights into reward model failures and practical interventions for improving longtail robustness through circuit-aware training methods.

Conclusion: The proposed approach connects circuit specialization with reward generalization bounds and offers a practical pathway to address longtail performance issues in RLHF reward models.

Abstract: Reinforcement Learning from Human Feedback (RLHF) reward models exhibit systematic failures on longtail distributions, leading to reward hacking and misalignment. We propose a mechanistic interpretability framework that identifies specialized neural circuits responsible for rare-event processing in reward models. Drawing from recent advances showing distributed specialization for rare tokens in language models\citep{liu2025no, liu2025emergent}, we hypothesize that reward models also develop functionally distinct circuits for longtail scenarios. Our theoretical framework establishes formal connections between circuit specialization, reward generalization bounds, and longtail performance. We introduce \textbf{Circuit-Aware Reward Training (CART)}, which uses circuit analysis to guide data augmentation, regularization, and ensemble strategies. This approach provides both theoretical insights into reward model failures and practical interventions for improving longtail robustness.

[1356] Discrete Variational Autoencoding via Policy Search

Michael Drolet, Firas Al-Hafez, Aditya Bhatt, Jan Peters, Oleg Arenz

Main category: cs.LG

TL;DR: Proposes a training framework for discrete VAEs using natural gradient optimization and automatic step size adaptation, achieving better image reconstruction than existing methods.

Details

Motivation: Discrete VAEs face challenges with differentiable parameterization, requiring approximations like Gumbel-Softmax or high-variance methods like REINFORCE that perform poorly on high-dimensional tasks like image reconstruction.

Method: Leverages natural gradient of a non-parametric encoder to update the parametric encoder without reparameterization, combined with automatic step size adaptation and transformer-based encoder.

Result: Outperforms approximate reparameterization methods and quantization-based discrete autoencoders, achieving 20% improvement on FID Score for ImageNet 256.

Conclusion: The proposed training framework effectively scales to challenging datasets and improves reconstruction quality from compact latent spaces.

Abstract: Discrete latent bottlenecks in variational autoencoders (VAEs) offer high bit efficiency and can be modeled with autoregressive discrete distributions, enabling parameter-efficient multimodal search with transformers. However, discrete random variables do not allow for exact differentiable parameterization; therefore, discrete VAEs typically rely on approximations, such as Gumbel-Softmax reparameterization or straight-through gradient estimates, or employ high-variance gradient-free methods such as REINFORCE that have had limited success on high-dimensional tasks such as image reconstruction. Inspired by popular techniques in policy search, we propose a training framework for discrete VAEs that leverages the natural gradient of a non-parametric encoder to update the parametric encoder without requiring reparameterization. Our method, combined with automatic step size adaptation and a transformer-based encoder, scales to challenging datasets such as ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces, achieving a 20% improvement on FID Score for ImageNet 256.

[1357] Q-Net: Transferable Queue Length Estimation via Kalman-based Neural Networks

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

Main category: cs.LG

TL;DR: Q-Net is a data-efficient framework for queue length estimation at signalized intersections using loop detector counts and aggregated floating car data, achieving over 60% RMSE improvement over baselines.

Details

Motivation: Estimating queue lengths under partially observed conditions is challenging in traffic management, especially when traffic conservation assumptions are violated and available data sources have different spatial/temporal resolutions.

Method: Q-Net integrates loop detector counts and aggregated FCD using a state-space model with AI-augmented Kalman filter (KalmanNet) that learns Kalman gain from data without requiring noise covariances or full system dynamics.

Result: Evaluation on Rotterdam roads shows Q-Net outperforms baseline methods by over 60% in RMSE, accurately tracks queue formation/dissipation, corrects FCD-induced delays, and demonstrates strong spatial/temporal transferability.

Conclusion: Q-Net provides an interpretable, physically-grounded solution for queue estimation that enables deployment without costly infrastructure and has potential for real-time integration into dynamic traffic control systems.

Abstract: Estimating queue lengths at signalized intersections remains a challenge in traffic management, especially under partially observed conditions where vehicle flows are not fully captured. This paper introduces Q-Net, a data-efficient and interpretable framework for queue length estimation that performs robustly even when traffic conservation assumptions are violated. Q-Net integrates two widely available and privacy-friendly data sources: (i) vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD), which divides each road section into segments and provides segment-wise average speed measurements. These data sources often differ in spatial and temporal resolution, creating fusion challenges. Q-Net addresses this by employing a tailored state-space model and an AI-augmented Kalman filter, KalmanNet, which learns the Kalman gain from data without requiring prior knowledge of noise covariances or full system dynamics. We build on the vanilla KalmanNet pipeline to decouple measurement dimensionality from section length, enabling spatial transferability across road segments. Unlike black-box models, Q-Net maintains physical interpretability, with internal variables linked to real-world traffic dynamics. Evaluations on main roads in Rotterdam, the Netherlands, demonstrate that Q-Net outperforms baseline methods by over 60% in Root Mean Square Error (RMSE), accurately tracking queue formation and dissipation while correcting aFCD-induced delays. Q-Net also demonstrates strong spatial and temporal transferability, enabling deployment without costly sensing infrastructure like cameras or radar. Additionally, we propose a real-time variant of Q-Net, highlighting its potential for integration into dynamic, queue-based traffic control systems.

[1358] Beyond Softmax: A Natural Parameterization for Categorical Random Variables

Alessandro Manenti, Cesare Alippi

Main category: cs.LG

TL;DR: The paper proposes replacing the softmax function with catnat, a hierarchical binary split function, to address gradient estimation challenges in discrete latent variables, showing improved learning efficiency and test performance across various applications.

Details

Motivation: Discrete latent variables in deep learning face gradient estimation challenges during training, and the ubiquitous softmax function has limitations from an information-geometric perspective.

Method: Replace softmax with catnat function - a hierarchical binary split sequence that creates a diagonal Fisher Information Matrix, improving gradient descent efficiency.

Result: Experiments in graph structure learning, variational autoencoders, and reinforcement learning show catnat improves learning efficiency and consistently yields higher test performance.

Conclusion: Catnat is a simple, compatible alternative to softmax that offers better gradient descent properties and can be easily integrated into existing codebases while maintaining compatibility with standard training stabilization techniques.

Abstract: Latent categorical variables are frequently found in deep learning architectures. They can model actions in discrete reinforcement-learning environments, represent categories in latent-variable models, or express relations in graph neural networks. Despite their widespread use, their discrete nature poses significant challenges to gradient-descent learning algorithms. While a substantial body of work has offered improved gradient estimation techniques, we take a complementary approach. Specifically, we: 1) revisit the ubiquitous $\textit{softmax}$ function and demonstrate its limitations from an information-geometric perspective; 2) replace the $\textit{softmax}$ with the $\textit{catnat}$ function, a function composed of a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. A rich set of experiments - including graph structure learning, variational autoencoders, and reinforcement learning - empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance. $\textit{Catnat}$ is simple to implement and seamlessly integrates into existing codebases. Moreover, it remains compatible with standard training stabilization techniques and, as such, offers a better alternative to the $\textit{softmax}$ function.

[1359] Who invented deep residual learning?

Juergen Schmidhuber

Main category: cs.LG

TL;DR: The paper presents a timeline of the evolution of deep residual learning, noting that as of 2025 the most cited scientific article of the 21st century is a neural network paper on deep residual learning with residual connections.

Details

Motivation: To trace the origins and development of deep residual learning, particularly since the most cited scientific article of the 21st century is about neural networks with residual connections.

Method: The authors present a timeline approach to document the evolution of deep residual learning.

Result: The paper establishes a historical timeline showing the progression and key developments in deep residual learning technology.

Conclusion: Deep residual learning has become a foundational concept in modern AI, with residual connections being central to the most influential neural network research of the 21st century.

Abstract: Modern AI is based on deep artificial neural networks (NNs). As of 2025, the most cited scientific article of the 21st century is an NN paper on deep residual learning with residual connections. Who invented this? We present a timeline of the evolution of deep residual learning.

[1360] A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity

Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello

Main category: cs.LG

TL;DR: TRIANGLE is a novel similarity measure that improves multimodal alignment by computing triangle-area similarity in high-dimensional embedding space, achieving state-of-the-art results in three-modal tasks.

Details

Motivation: Current multimodal models suffer from ineffective modality alignment, where some modalities may not be properly aligned, limiting the model's ability to exploit complementary information from multiple modalities in downstream tasks.

Method: TRIANGLE computes similarity directly in the high-dimensional space spanned by modality embeddings using triangle-area similarity, avoiding additional fusion layers or pairwise similarities. It replaces cosine similarity in contrastive losses.

Result: TRIANGLE significantly boosts multimodal modeling performance, achieving state-of-the-art results in three-modal tasks (video-text and audio-text retrieval, audio-video classification), improving cosine-based methods by up to 9 points in Recall@1.

Conclusion: TRIANGLE provides an effective approach for joint alignment of three modalities through geometric similarity measures, yielding both performance improvements and interpretable alignment rationales in multimodal learning.

Abstract: Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. Such methods may not provide indicators that all the involved modalities are effectively aligned. As a result, some modalities may not be aligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should provide additional information that the model fails to exploit. In this paper, we present TRIANGLE: TRI-modAl Neural Geometric LEarning, the novel proposed similarity measure that is directly computed in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers or pairwise similarities. When incorporated in contrastive losses replacing cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling, while yielding interpretable alignment rationales. Extensive evaluation in three-modal tasks such as video-text and audio-text retrieval or audio-video classification, demonstrates that TRIANGLE achieves state-of-the-art results across different datasets improving the performance of cosine-based methods up to 9 points of Recall@1.

[1361] Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

Longxiang He, Deheng Ye, Junbo Tan, Xueqian Wang, Li Shen

Main category: cs.LG

TL;DR: RPEX is a robust offline-to-online RL method that addresses data corruption by incorporating Inverse Probability Weighted (IPW) into online exploration to alleviate heavy-tailed policy behavior caused by corrupted data.

Details

Motivation: Existing O2O RL methods focus on mitigating offline policy conservatism but ignore robustness under data corruption (states, actions, rewards, dynamics), which severely degrades performance by inducing heavy-tailed policy behavior.

Method: Proposes RPEX (Robust Policy Expansion) that incorporates Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailedness induced by data corruption.

Result: Extensive experiments on D4RL datasets show RPEX achieves state-of-the-art O2O performance across various data corruption scenarios.

Conclusion: RPEX is a simple yet effective method that successfully addresses data corruption challenges in offline-to-online RL, demonstrating superior robustness and performance.

Abstract: Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including states, actions, rewards, and dynamics, is still unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed $\textbf{RPEX}$: $\textbf{R}$obust $\textbf{P}$olicy $\textbf{EX}$pansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios. Code is available at $\href{https://github.com/felix-thu/RPEX}{https://github.com/felix-thu/RPEX}$.

[1362] In-Context Learning of Temporal Point Processes with Foundation Inference Models

David Berghaus, Patrick Seifner, Kostadin Cvejoski, César Ojeda, Ramsés J. Sánchez

Main category: cs.LG

TL;DR: The paper introduces FIM-PP, a foundation model for marked temporal point processes that uses amortized inference and in-context learning to estimate conditional intensity functions from event sequences without specialized training for each system.

Details

Motivation: Current neural MTPP approaches require training separate models for each target system, which is inefficient. The authors aim to develop a general-purpose model that can infer MTPPs across different systems without retraining.

Method: Pretrain a deep neural network on a large synthetic dataset of Hawkes processes using amortized inference and in-context learning. The model learns to infer conditional intensity functions from context sets of event sequences.

Result: FIM-PP matches the performance of specialized models on next-event prediction across common benchmark datasets without additional training, and can be rapidly finetuned when needed.

Conclusion: The amortized inference approach enables a single foundation model to effectively handle MTPP inference across diverse systems, reducing the need for specialized training while maintaining competitive performance.

Abstract: Modeling event sequences of multiple event types with marked temporal point processes (MTPPs) provides a principled way to uncover governing dynamical rules and predict future events. Current neural network approaches to MTPP inference rely on training separate, specialized models for each target system. We pursue a radically different approach: drawing on amortized inference and in-context learning, we pretrain a deep neural network to infer, in-context, the conditional intensity functions of event histories from a context defined by sets of event sequences. Pretraining is performed on a large synthetic dataset of MTPPs sampled from a broad distribution of Hawkes processes. Once pretrained, our Foundation Inference Model for Point Processes (FIM-PP) can estimate MTPPs from real-world data without any additional training, or be rapidly finetuned to target systems. Experiments show that this amortized approach matches the performance of specialized models on next-event prediction across common benchmark datasets. Our pretrained model, repository and tutorials will soon be available online

[1363] Neural Message-Passing on Attention Graphs for Hallucination Detection

Fabrizio Frasca, Guy Bar-Shalom, Yftah Ziser, Haggai Maron

Main category: cs.LG

TL;DR: CHARM is a graph learning approach that detects hallucinations in LLMs by representing attention maps and activations as attributed graphs and applying GNNs, outperforming existing methods.

Details

Motivation: LLMs often generate incorrect or unsupported content (hallucinations), and existing detection methods rely on heuristics or simple models over isolated computational traces.

Method: Represent computational traces as attributed graphs where tokens are nodes, edges follow attentional flows, and both carry features from attention scores and activations. Apply GNNs over these graphs for hallucination detection.

Result: CHARM consistently outperforms other leading approaches across diverse benchmarks, shows promising zero-shot performance on cross-dataset transfer, and provably subsumes prior attention-based heuristics.

Conclusion: The graph structure plays a relevant role in hallucination detection, and combining computational traces provides benefits for detecting LLM hallucinations.

Abstract: Large Language Models (LLMs) often generate incorrect or unsupported content, known as hallucinations. Existing detection methods rely on heuristics or simple models over isolated computational traces such as activations, or attention maps. We unify these signals by representing them as attributed graphs, where tokens are nodes, edges follow attentional flows, and both carry features from attention scores and activations. Our approach, CHARM, casts hallucination detection as a graph learning task and tackles it by applying GNNs over the above attributed graphs. We show that CHARM provably subsumes prior attention-based heuristics and, experimentally, it consistently outperforms other leading approaches across diverse benchmarks. Our results shed light on the relevant role played by the graph structure and on the benefits of combining computational traces, whilst showing CHARM exhibits promising zero-shot performance on cross-dataset transfer.

[1364] MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models

Kacper Kapuśniak, Cristian Gabellini, Michael Bronstein, Prudencio Tossou, Francesco Di Giovanni

Main category: cs.LG

TL;DR: MarS-FM is a new generative model that learns to sample transitions across discrete states in Markov State Models, achieving 100x+ speedup over MD simulations while better reproducing protein dynamics statistics.

Details

Motivation: Molecular Dynamics is computationally expensive for studying protein functions due to fine-grained integration and long timescales. Existing generative models learn fixed-lag transition densities dominated by uninformative transitions.

Method: MSM Emulators learn transitions across discrete states defined by Markov State Models. MarS-FM (Markov Space Flow Matching) is an instantiation that samples transitions with significant speedup.

Result: MarS-FM achieves >2 orders of magnitude speedup compared to implicit- or explicit-solvent MD. It outperforms existing methods across all metrics (RMSD, radius of gyration, secondary structure) on diverse protein domains up to 500 residues.

Conclusion: MarS-FM represents a new class of generative models that better captures protein dynamics while providing substantial computational speedup, demonstrating strong generalization across chemically and structurally diverse proteins.

Abstract: Molecular Dynamics (MD) is a powerful computational microscope for probing protein functions. However, the need for fine-grained integration and the long timescales of biomolecular events make MD computationally expensive. To address this, several generative models have been proposed to generate surrogate trajectories at lower cost. Yet, these models typically learn a fixed-lag transition density, causing the training signal to be dominated by frequent but uninformative transitions. We introduce a new class of generative models, MSM Emulators, which instead learn to sample transitions across discrete states defined by an underlying Markov State Model (MSM). We instantiate this class with Markov Space Flow Matching (MarS-FM), whose sampling offers more than two orders of magnitude speedup compared to implicit- or explicit-solvent MD simulations. We benchmark Mars-FM ability to reproduce MD statistics through structural observables such as RMSD, radius of gyration, and secondary structure content. Our evaluation spans protein domains (up to 500 residues) with significant chemical and structural diversity, including unfolding events, and enforces strict sequence dissimilarity between training and test sets to assess generalization. Across all metrics, MarS-FM outperforms existing methods, often by a substantial margin.

[1365] Quantifying Generalisation in Imitation Learning

Nathan Gavenski, Odinaldo Rodrigues

Main category: cs.LG

TL;DR: Labyrinth is a new benchmarking environment for imitation learning that enables precise control over training and evaluation settings to test generalization capabilities with verifiable distinctness.

Details

Motivation: Current imitation learning benchmarks lack sufficient variation between training and evaluation, limiting meaningful assessment of generalization capabilities.

Method: Labyrinth provides a discrete, fully observable state space with known optimal actions, offering flexible setup with controlled structure, start/goal positions, and task complexity. It includes variants like partial observability, key-and-door tasks, and ice-floor hazards.

Result: The environment enables verifiably distinct training, evaluation, and test settings, supporting interpretability and fine-grained evaluation of generalization factors.

Conclusion: Labyrinth advances the evaluation of generalization in imitation learning by enabling controlled, reproducible experiments and provides a valuable tool for developing more robust agents.

Abstract: Imitation learning benchmarks often lack sufficient variation between training and evaluation, limiting meaningful generalisation assessment. We introduce Labyrinth, a benchmarking environment designed to test generalisation with precise control over structure, start and goal positions, and task complexity. It enables verifiably distinct training, evaluation, and test settings. Labyrinth provides a discrete, fully observable state space and known optimal actions, supporting interpretability and fine-grained evaluation. Its flexible setup allows targeted testing of generalisation factors and includes variants like partial observability, key-and-door tasks, and ice-floor hazards. By enabling controlled, reproducible experiments, Labyrinth advances the evaluation of generalisation in imitation learning and provides a valuable tool for developing more robust agents.

[1366] Assessing the risk of future Dunkelflaute events for Germany using generative deep learning

Felix Strnad, Jonathan Schmidt, Fabian Mockert, Philipp Hennig, Nicole Ludwig

Main category: cs.LG

TL;DR: Study analyzes Dunkelflaute events (low wind/solar periods) in Germany’s future electricity grid using deep learning downscaling of CMIP6 climate simulations, finding frequency and duration remain largely unchanged under SSP2-4.5 and SSP5-8.5 scenarios.

Details

Motivation: The transition to renewable energy sources creates grid stability challenges due to weather dependency, particularly Dunkelflaute events that can cause electricity supply shortages.

Method: Adapted generative deep learning framework to downscale CMIP6 climate simulations, compared to historical ERA5 data, and assessed Dunkelflaute events under SSP2-4.5 and SSP5-8.5 emission scenarios.

Result: Both frequency and duration of Dunkelflaute events in Germany are projected to remain largely unchanged compared to historical period in the ensemble mean.

Conclusion: Under considered climate scenarios, the risk associated with Dunkelflaute events is expected to remain stable throughout the century.

Abstract: The European electricity power grid is transitioning towards renewable energy sources, characterized by an increasing share of off- and onshore wind and solar power. However, the weather dependency of these energy sources poses a challenge to grid stability, with so-called Dunkelflaute events – periods of low wind and solar power generation – being of particular concern due to their potential to cause electricity supply shortages. In this study, we investigate the impact of these events on the German electricity production in the years and decades to come. For this purpose, we adapt a recently developed generative deep learning framework to downscale climate simulations from the CMIP6 ensemble. We first compare their statistics to the historical record taken from ERA5 data. Next, we use these downscaled simulations to assess plausible future occurrences of Dunkelflaute events in Germany under the optimistic low (SSP2-4.5) and high (SSP5-8.5) emission scenarios. Our analysis indicates that both the frequency and duration of Dunkelflaute events in Germany in the ensemble mean are projected to remain largely unchanged compared to the historical period. This suggests that, under the considered climate scenarios, the associated risk is expected to remain stable throughout the century.

[1367] Fidel-TS: A High-Fidelity Benchmark for Multimodal Time Series Forecasting

Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, Qiang Xu

Main category: cs.LG

TL;DR: Fidel-TS is a new time series forecasting benchmark built from live APIs to address data contamination and leakage issues in existing benchmarks, showing that causal relevance of textual information is key for multimodal forecasting performance.

Details

Motivation: Existing time series forecasting benchmarks suffer from data contamination, causal leakage, and description leakage, creating an illusion of progress in the field.

Method: Formalized core principles of high-fidelity benchmarking (data sourcing integrity, strict causal soundness, structural clarity) and built Fidel-TS benchmark from live APIs following these principles.

Result: Experiments exposed critical biases and design limitations in prior benchmarks, and demonstrated that causal relevance of textual information is the key factor for genuine performance gains in multimodal forecasting.

Conclusion: High-fidelity benchmarking principles are essential for reliable evaluation, and causal relevance of textual information is crucial for effective multimodal time series forecasting.

Abstract: The evaluation of time series forecasting models is hindered by a critical lack of high-quality benchmarks, leading to a potential illusion of progress. Existing datasets suffer from issues ranging from pre-training data contamination in the age of LLMs to the causal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, strict causal soundness, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from the ground up on these principles by sourcing data from live APIs. Our extensive experiments validate this approach by exposing the critical biases and design limitations of prior benchmarks. Furthermore, we conclusively demonstrate that the causal relevance of textual information is the key factor in unlocking genuine performance gains in multimodal forecasting.

[1368] DSAT-HD: Dual-Stream Adaptive Transformer with Hybrid Decomposition for Multivariate Time Series Forecasting

Zixu Wang, Hongbin Dong, Xiaoping Zhang

Main category: cs.LG

TL;DR: DSAT-HD is a novel time series forecasting model that uses hybrid decomposition and dual-stream adaptive transformers to better capture multi-scale temporal patterns and handle complex seasonality-trend relationships.

Details

Motivation: Existing transformer-based time series forecasting methods struggle with limited time series modeling, fixed scales, and inability to handle complex seasonality-trend decomposition without pre-specified seasonal periods.

Method: Proposes DSAT-HD with three innovations: 1) Hybrid decomposition combining EMA and Fourier decomposition with RevIN normalization and noise Top-k gating; 2) Multi-scale adaptive pathway with sparse allocator routing features to four parallel Transformer layers; 3) Dual-stream residual learning with CNN and MLP branches processing seasonal/trend components separately.

Result: Extensive experiments on nine datasets show DSAT-HD outperforms existing methods overall and achieves state-of-the-art performance on some datasets, with stronger generalization capabilities across various transfer scenarios.

Conclusion: DSAT-HD effectively addresses limitations of current time series forecasting methods by integrating hybrid decomposition, multi-scale adaptive pathways, and dual-stream learning, demonstrating superior performance and generalization.

Abstract: Time series forecasting is crucial for various applications, such as weather, traffic, electricity, and energy predictions. Currently, common time series forecasting methods are based on Transformers. However, existing approaches primarily model limited time series or fixed scales, making it more challenging to capture diverse features cross different ranges. Additionally, traditional methods like STL for complex seasonality-trend decomposition require pre-specified seasonal periods and typically handle only single, fixed seasonality. We propose the Hybrid Decomposition Dual-Stream Adaptive Transformer (DSAT-HD), which integrates three key innovations to address the limitations of existing methods: 1) A hybrid decomposition mechanism combining EMA and Fourier decomposition with RevIN normalization, dynamically balancing seasonal and trend components through noise Top-k gating; 2) A multi-scale adaptive pathway leveraging a sparse allocator to route features to four parallel Transformer layers, followed by feature merging via a sparse combiner, enhanced by hybrid attention combining local CNNs and global interactions; 3) A dual-stream residual learning framework where CNN and MLP branches separately process seasonal and trend components, coordinated by a balanced loss function minimizing expert collaboration variance. Extensive experiments on nine datasets demonstrate that DSAT-HD outperforms existing methods overall and achieves state-of-the-art performance on some datasets. Notably, it also exhibits stronger generalization capabilities across various transfer scenarios.

[1369] Physics-informed learning under mixing: How physical knowledge speeds up learning

Anna Scampicchio, Leonardo F. Toso, Rahel Rickenbach, James Anderson, Melanie N. Zeilinger

Main category: cs.LG

TL;DR: Physics-informed regularization improves learning rates from slow Sobolev minimax to fast optimal i.i.d. rates when physical priors are aligned, eliminating sample-size deflation from data dependence.

Details

Motivation: To understand how domain knowledge incorporation affects learning rates with dependent data in physics-informed machine learning.

Method: Empirical risk minimization with physics-informed regularization, deriving complexity-dependent bounds on excess risk.

Result: When physical prior information is aligned, learning rate improves from Sobolev minimax rate to optimal i.i.d. rate without sample-size deflation.

Conclusion: Properly aligned physics-informed regularization can overcome data dependence limitations and achieve optimal learning rates.

Abstract: A major challenge in physics-informed machine learning is to understand how the incorporation of prior domain knowledge affects learning rates when data are dependent. Focusing on empirical risk minimization with physics-informed regularization, we derive complexity-dependent bounds on the excess risk in probability and in expectation. We prove that, when the physical prior information is aligned, the learning rate improves from the (slow) Sobolev minimax rate to the (fast) optimal i.i.d. one without any sample-size deflation due to data dependence.

[1370] DyMoDreamer: World Modeling with Dynamic Modulation

Boxuan Zhang, Runqing Wang, Wei Xiao, Weipu Zhang, Jian Sun, Gao Huang, Jie Chen, Gang Wang

Main category: cs.LG

TL;DR: DyMoDreamer is a novel model-based reinforcement learning algorithm that uses dynamic modulation to improve sample efficiency by focusing on dynamic objects and temporal features rather than processing observations holistically.

Details

Motivation: Address sample inefficiency in deep reinforcement learning by improving world models to better handle dynamic objects and temporal features, which are crucial for visual tasks where dynamic elements significantly impact rewards and decision-making.

Method: Uses dynamic modulation mechanism with differential observations from inter-frame differencing mask to encode object-level motion cues. Models dynamic modulation as stochastic categorical distributions integrated into recurrent state-space model (RSSM) to focus on reward-relevant dynamics.

Result: Achieves state-of-the-art performance: 156.6% mean human-normalized score on Atari 100k benchmark, new record of 832 on DeepMind Visual Control Suite, and 9.5% performance improvement after 1M steps on Crafter benchmark.

Conclusion: DyMoDreamer effectively addresses sample inefficiency in visual reinforcement learning tasks by explicitly modeling dynamic features and temporal information, demonstrating superior performance across multiple benchmarks.

Abstract: A critical bottleneck in deep reinforcement learning (DRL) is sample inefficiency, as training high-performance agents often demands extensive environmental interactions. Model-based reinforcement learning (MBRL) mitigates this by building world models that simulate environmental dynamics and generate synthetic experience, improving sample efficiency. However, conventional world models process observations holistically, failing to decouple dynamic objects and temporal features from static backgrounds. This approach is computationally inefficient, especially for visual tasks where dynamic objects significantly influence rewards and decision-making performance. To address this, we introduce DyMoDreamer, a novel MBRL algorithm that incorporates a dynamic modulation mechanism to improve the extraction of dynamic features and enrich the temporal information. DyMoDreamer employs differential observations derived from a novel inter-frame differencing mask, explicitly encoding object-level motion cues and temporal dynamics. Dynamic modulation is modeled as stochastic categorical distributions and integrated into a recurrent state-space model (RSSM), enhancing the model’s focus on reward-relevant dynamics. Experiments demonstrate that DyMoDreamer sets a new state-of-the-art on the Atari $100$k benchmark with a $156.6$% mean human-normalized score, establishes a new record of $832$ on the DeepMind Visual Control Suite, and gains a $9.5$% performance improvement after $1$M steps on the Crafter benchmark. Our code is released at https://github.com/Ultraman-Tiga1/DyMoDreamer.

[1371] Putnam-like dataset summary: LLMs as mathematical competition contestants

Bartosz Bieganowski, Daniel Strzelecki, Robert Skiba, Mateusz Topolewski

Main category: cs.LG

TL;DR: Analysis of LLM performance on Putnam-like mathematical competition problems

Details

Motivation: To evaluate the capability of large language models in solving complex mathematical problems similar to those in the Putnam Competition

Method: Used a benchmark dataset of 96 original Putnam-style problems and 576 LLM solutions, analyzing model performance on this specialized mathematical problem set

Result: The paper presents performance analysis results of various LLMs on mathematical contest problems

Conclusion: The study provides insights into LLMs’ mathematical reasoning abilities through systematic evaluation on Putnam-like problems

Abstract: In this paper we summarize the results of the Putnam-like benchmark published by Google DeepMind. This dataset consists of 96 original problems in the spirit of the Putnam Competition and 576 solutions of LLMs. We analyse the performance of models on this set of problems to verify their ability to solve problems from mathematical contests.

[1372] Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data

Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, Michalis Vazirgiannis

Main category: cs.LG

TL;DR: Cell2Text is a multimodal framework that translates single-cell RNA sequencing profiles into natural language descriptions, improving interpretability and performance over traditional discrete classification methods.

Details

Motivation: Current single-cell foundation models use discrete prediction heads that collapse cellular complexity into predefined labels, failing to capture richer contextual explanations needed by biologists.

Method: Integrates gene-level embeddings from single-cell foundation models with pretrained large language models to generate structured natural language descriptions from scRNA-seq profiles.

Result: Outperforms baselines on classification accuracy, shows strong ontological consistency using PageRank-based metrics, and achieves high semantic fidelity in text generation.

Conclusion: Coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, enabling scalable label-efficient characterization of unseen cells.

Abstract: Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information for cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, generalizing to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency using PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.

[1373] Uncertainty-Guided Expert-AI Collaboration for Efficient Soil Horizon Annotation

Teodor Chiaburu, Vipin Singh, Frank Haußer, Felix Bießmann

Main category: cs.LG

TL;DR: Conformal prediction applied to SoilNet improves annotation efficiency in regression tasks and maintains performance in classification tasks under limited expert annotation budgets.

Details

Motivation: Uncertainty quantification is crucial for human-machine collaboration, as humans adjust decisions based on machine confidence. Reliable uncertainty calibration enables better collaboration, targeted expert intervention, and responsible ML usage.

Method: Applied conformal prediction to SoilNet (multimodal multitask soil profile model) and designed a simulated human-in-the-loop annotation pipeline where domain experts provide ground truth annotations only when model uncertainty is high.

Result: Conformalized SoilNet leads to more efficient annotation in regression tasks and comparable performance scores in classification tasks under the same annotation budget compared to non-conformal counterpart.

Conclusion: Conformal prediction effectively improves annotation efficiency in human-in-the-loop systems for soil profile analysis, particularly benefiting regression tasks while maintaining classification performance.

Abstract: Uncertainty quantification is essential in human-machine collaboration, as human agents tend to adjust their decisions based on the confidence of the machine counterpart. Reliably calibrated model uncertainties, hence, enable more effective collaboration, targeted expert intervention and more responsible usage of Machine Learning (ML) systems. Conformal prediction has become a well established model-agnostic framework for uncertainty calibration of ML models, offering statistically valid confidence estimates for both regression and classification tasks. In this work, we apply conformal prediction to $\textit{SoilNet}$, a multimodal multitask model for describing soil profiles. We design a simulated human-in-the-loop (HIL) annotation pipeline, where a limited budget for obtaining ground truth annotations from domain experts is available when model uncertainty is high. Our experiments show that conformalizing SoilNet leads to more efficient annotation in regression tasks and comparable performance scores in classification tasks under the same annotation budget when tested against its non-conformal counterpart. All code and experiments can be found in our repository: https://github.com/calgo-lab/BGR

[1374] Beyond the Hook: Predicting Billboard Hot 100 Chart Inclusion with Machine Learning from Streaming, Audio Signals, and Perceptual Features

Christos Mountzouris

Main category: cs.LG

TL;DR: This paper analyzes factors predicting Billboard Hot 100 chart inclusion using streaming popularity, audio features, and listening indicators, achieving ~90% accuracy with various machine learning models.

Details

Motivation: Digital streaming platforms provide structured data that enables new research on music popularity dynamics and mainstream success determinants.

Method: Used Logistic Regression, Random Forest, and Gradient Boosting (XGBoost) to predict Billboard Hot 100 inclusion based on streaming popularity, audio signal attributes, and listening indicators.

Result: All models achieved ~90% accuracy: Logistic Regression (90.0%), Random Forest (90.4%), XGBoost (90.3%). Popularity was the strongest predictor, followed by instrumentalness, valence, duration, and speechiness.

Conclusion: Machine learning models can effectively predict chart success with high accuracy, with streaming popularity being the most influential factor for Billboard Hot 100 inclusion.

Abstract: The advent of digital streaming platforms have recently revolutionized the landscape of music industry, with the ensuing digitalization providing structured data collections that open new research avenues for investigating popularity dynamics and mainstream success. The present work explored which determinants hold the strongest predictive influence for a track’s inclusion in the Billboard Hot 100 charts, including streaming popularity, measurable audio signal attributes, and probabilistic indicators of human listening. The analysis revealed that popularity was by far the most decisive predictor of Billboard Hot 100 inclusion, with considerable contribution from instrumentalness, valence, duration and speechiness. Logistic Regression achieved 90.0% accuracy, with very high recall for charting singles (0.986) but lower recall for non-charting ones (0.813), yielding balanced F1-scores around 0.90. Random Forest slightly improved performance to 90.4% accuracy, maintaining near-perfect precision for non-charting singles (0.990) and high recall for charting ones (0.992), with F1-scores up to 0.91. Gradient Boosting (XGBoost) reached 90.3% accuracy, delivering a more balanced trade-off by improving recall for non-charting singles (0.837) while sustaining high recall for charting ones (0.969), resulting in F1-scores comparable to the other models.

[1375] DRIFT-Net: A Spectral–Coupled Neural Operator for PDEs Learning

Jiayi Li, Flora D. Salim

Main category: cs.LG

TL;DR: DRIFT-Net is a dual-branch neural PDE solver that combines spectral and image branches to capture global low-frequency information and local details, achieving better accuracy and efficiency than attention-based methods.

Details

Motivation: Existing foundation models for PDEs using multi-scale windowed self-attention suffer from weak global coupling, leading to error accumulation and drift during closed-loop rollouts due to their locality.

Method: Dual-branch design with spectral branch for global low-frequency information and image branch for local details. Uses controlled lightweight mixing in low-frequency range, bandwise weighting fusion between branches, and spatial domain transformation.

Result: Achieves 7%-54% lower relative L1 error, 15% fewer parameters, and higher throughput than scOT baseline on Navier-Stokes benchmarks. Better stability and effectiveness demonstrated through ablation studies.

Conclusion: DRIFT-Net effectively addresses global coupling issues in PDE neural solvers by combining spectral and spatial approaches, achieving superior performance with fewer parameters and maintaining computational efficiency.

Abstract: Learning PDE dynamics with neural solvers can significantly improve wall-clock efficiency and accuracy compared with classical numerical solvers. In recent years, foundation models for PDEs have largely adopted multi-scale windowed self-attention, with the scOT backbone in \textsc{Poseidon} serving as a representative example. However, because of their locality, truly globally consistent spectral coupling can only be propagated gradually through deep stacking and window shifting. This weakens global coupling and leads to error accumulation and drift during closed-loop rollouts. To address this, we propose \textbf{DRIFT-Net}. It employs a dual-branch design comprising a spectral branch and an image branch. The spectral branch is responsible for capturing global, large-scale low-frequency information, whereas the image branch focuses on local details and nonstationary structures. Specifically, we first perform controlled, lightweight mixing within the low-frequency range. Then we fuse the spectral and image paths at each layer via bandwise weighting, which avoids the width inflation and training instability caused by naive concatenation. The fused result is transformed back into the spatial domain and added to the image branch, thereby preserving both global structure and high-frequency details across scales. Compared with strong attention-based baselines, DRIFT-Net achieves lower error and higher throughput with fewer parameters under identical training settings and budget. On Navier–Stokes benchmarks, the relative $L_{1}$ error is reduced by 7%–54%, the parameter count decreases by about 15%, and the throughput remains higher than scOT. Ablation studies and theoretical analyses further demonstrate the stability and effectiveness of this design. The code is available at https://github.com/cruiseresearchgroup/DRIFT-Net.

[1376] Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime

Leonardo Defilippis, Yizhou Xu, Julius Girardin, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová, Bruno Loureiro, Florent Krzakala

Main category: cs.LG

TL;DR: Theoretical analysis of neural scaling laws for quadratic and diagonal networks in feature learning regime, revealing phase transitions and connecting weight spectrum properties to generalization performance.

Details

Motivation: Neural scaling laws drive deep learning advances but lack theoretical understanding beyond linear models. This work aims to systematically analyze scaling laws for nonlinear networks in feature learning regimes.

Method: Leveraging connections with matrix compressed sensing and LASSO, the authors derive phase diagrams for scaling exponents as functions of sample complexity and weight decay, analyzing quadratic and diagonal neural networks.

Result: The analysis uncovers crossovers between distinct scaling regimes and plateau behaviors that mirror empirical observations. It establishes a precise link between scaling regimes and spectral properties of trained network weights.

Conclusion: The work provides theoretical validation for empirical observations connecting power-law tails in weight spectrum with generalization performance, offering interpretation from first principles.

Abstract: Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.

[1377] Adaptive Canonicalization with Application to Invariant Anisotropic Geometric Networks

Ya-Wei Eileen Lin, Ron Levie

Main category: cs.LG

TL;DR: Adaptive canonicalization is introduced as a framework where canonicalization depends on both input and network, addressing discontinuities in traditional canonicalization while maintaining symmetry and enabling universal approximation.

Details

Motivation: Traditional canonicalization in equivariant machine learning introduces discontinuities that affect training stability, limit generalization, and complicate universal approximation theorems.

Method: Proposes adaptive canonicalization based on prior maximization, where the standard form of input is chosen to maximize the network’s predictive confidence. Applied to resolve eigenbasis ambiguities in spectral graph neural networks and rotational symmetries in point clouds.

Result: Empirical validation on molecular and protein classification, and point cloud classification tasks shows that adaptive canonicalization outperforms data augmentation, standard canonicalization, and equivariant architectures.

Conclusion: The adaptive canonicalization framework yields continuous and symmetry-respecting models with universal approximation properties, providing superior performance over existing equivariant machine learning approaches.

Abstract: Canonicalization is a widely used strategy in equivariant machine learning, enforcing symmetry in neural networks by mapping each input to a standard form. Yet, it often introduces discontinuities that can affect stability during training, limit generalization, and complicate universal approximation theorems. In this paper, we address this by introducing \emph{adaptive canonicalization}, a general framework in which the canonicalization depends both on the input and the network. Specifically, we present the adaptive canonicalization based on prior maximization, where the standard form of the input is chosen to maximize the predictive confidence of the network. We prove that this construction yields continuous and symmetry-respecting models that admit universal approximation properties. We propose two applications of our setting: (i) resolving eigenbasis ambiguities in spectral graph neural networks, and (ii) handling rotational symmetries in point clouds. We empirically validate our methods on molecular and protein classification, as well as point cloud classification tasks. Our adaptive canonicalization outperforms the three other common solutions to equivariant machine learning: data augmentation, standard canonicalization, and equivariant architectures.

[1378] Towards Understanding the Shape of Representations in Protein Language Models

Kosio Beshkov, Anders Malthe-Sørenssen

Main category: cs.LG

TL;DR: This paper analyzes how protein language models (PLMs) transform sequence spaces and encode structural information using SRV representations and graph filtrations to understand the transformed space of protein sequences.

Details

Motivation: Current interpretability tools for PLMs focus on individual sequences, but the transformation of the entire sequence space and relationships between sequences remains unknown. The authors aim to understand this transformed space.

Method: Used square-root velocity (SRV) representations and graph filtrations to create metric spaces for comparing proteins. Analyzed SCOP dataset proteins, computed Karcher means and effective dimensions, and studied context lengths at which structural features are encoded.

Result: Found that PLMs preferentially encode immediate and local residue relations, with performance degrading for larger context lengths. The most structurally faithful encoding occurs close to but before the last layer of models.

Conclusion: Training folding models on layers just before the final layer might improve protein folding performance, as these layers encode structural information most faithfully.

Abstract: While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square-root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other. We analyze different types of proteins from the SCOP dataset and show that the Karcher mean and effective dimension of the SRV shape space follow a non-linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between residues, but start to degrade for larger context lengths. The most structurally faithful encoding tends to occur close to, but before the last layer of the models, indicating that training a folding model ontop of these layers might lead to improved folding performance.

[1379] Is Sequence Information All You Need for Bayesian Optimization of Antibodies?

Sebastian W. Ober, Calvin McCarter, Aniruddh Raghu, Yucen Lily Li, Alan N. Amin, Andrew Gordon Wilson, Hunter Elliott

Main category: cs.LG

TL;DR: Bayesian optimization for antibody engineering with structural information and protein language model constraints improves early data efficiency for stability but shows equivalent peak performance, questioning the necessity of structure.

Details

Motivation: Antibody therapeutic engineering is iterative and expensive, but optimal surrogate models for Bayesian optimization over structured antibody space are unclear, and no prior works incorporated structural information.

Method: Explored structural information incorporation in Bayesian optimization, compared sequence-only approaches on binding affinity and stability, and proposed protein language model-based soft constraints to guide optimization.

Result: Structural information improved early data efficiency for stability but had equivalent peak performance. With protein language model constraints, sequence-only methods matched structure-based performance, eliminating the efficiency gap.

Conclusion: Structure may not be necessary for antibody Bayesian optimization when using protein language model constraints, as sequence-only methods can achieve equivalent performance.

Abstract: Bayesian optimization is a natural candidate for the engineering of antibody therapeutic properties, which is often iterative and expensive. However, finding the optimal choice of surrogate model for optimization over the highly structured antibody space is difficult, and may differ depending on the property being optimized. Moreover, to the best of our knowledge, no prior works have attempted to incorporate structural information into antibody Bayesian optimization. In this work, we explore different approaches to incorporating structural information into Bayesian optimization, and compare them to a variety of sequence-only approaches on two different antibody properties, binding affinity and stability. In addition, we propose the use of a protein language model-based ``soft constraint,’’ which helps guide the optimization to promising regions of the space. We find that certain types of structural information improve data efficiency in early optimization rounds for stability, but have equivalent peak performance. Moreover, when incorporating the protein language model soft constraint we find that the data efficiency gap is diminished for affinity and eliminated for stability, resulting in sequence-only methods that match the performance of structure-based methods, raising questions about the necessity of structure in Bayesian optimization for antibodies.

[1380] OAT-FM: Optimal Acceleration Transport for Improved Flow Matching

Angxiao Yue, Anqi Dong, Hongteng Xu

Main category: cs.LG

TL;DR: This paper bridges Flow Matching (FM) with Optimal Acceleration Transport (OAT) theory, developing OAT-FM method that optimizes acceleration transport instead of constant velocity, leading to improved generative modeling performance.

Details

Motivation: To improve Flow Matching methods by connecting them with Optimal Acceleration Transport theory, addressing limitations of existing OT-based FM approaches and developing a more efficient acceleration-based optimization.

Method: Developed OAT-FM method that optimizes acceleration transport in product space of sample and velocity, with efficient algorithm design. Also proposed two-phase FM paradigm: first train with any FM method, then fine-tune with OAT-FM.

Result: OAT-FM consistently improves model performance across various generative tasks, eliminates data distribution drift risk, and avoids need for large noise data pairs generation.

Conclusion: OAT-FM provides theoretical foundation for flow straightness and practical benefits for generative modeling, establishing a new paradigm for improving existing FM methods through acceleration-based fine-tuning.

Abstract: As a powerful technique in generative modeling, Flow Matching (FM) aims to learn velocity fields from noise to data, which is often explained and implemented as solving Optimal Transport (OT) problems. In this study, we bridge FM and the recent theory of Optimal Acceleration Transport (OAT), developing an improved FM method called OAT-FM and exploring its benefits in both theory and practice. In particular, we demonstrate that the straightening objective hidden in existing OT-based FM methods is mathematically equivalent to minimizing the physical action associated with acceleration defined by OAT. Accordingly, instead of enforcing constant velocity, OAT-FM optimizes the acceleration transport in the product space of sample and velocity, whose objective corresponds to a necessary and sufficient condition of flow straightness. An efficient algorithm is designed to achieve OAT-FM with low complexity. OAT-FM motivates a new two-phase FM paradigm: Given a generative model trained by an arbitrary FM method, whose velocity information has been relatively reliable, we can fine-tune and improve it via OAT-FM. This paradigm eliminates the risk of data distribution drift and the need to generate a large number of noise data pairs, which consistently improves model performance in various generative tasks. Code is available at: https://github.com/AngxiaoYue/OAT-FM

[1381] Learning Distinguishable Representations in Deep Q-Networks for Linear Transfer

Sooraj Sathish, Keshav Goyal, Raghuram Bharadwaj Diddigi

Main category: cs.LG

TL;DR: This paper proposes a novel deep Q-learning method with regularization to reduce feature correlations, enabling more effective transfer learning using linear function approximation.

Details

Motivation: Deep RL models require extensive hyperparameter tuning and high computational costs. Transfer learning can reuse knowledge from previous tasks to avoid retraining from scratch, but standard deep RL representations are highly correlated, limiting their effectiveness with linear function approximation.

Method: Proposes a deep Q-learning approach with a regularization term to reduce positive correlations between state feature representations, enabling more effective use of linear function approximation in transfer learning.

Result: Experiments on standard RL benchmarks and MinAtar games demonstrate improved transfer learning performance and reduced computational overhead through the use of reduced-correlation features.

Conclusion: The proposed regularization approach successfully mitigates feature correlation issues in deep RL representations, enabling more effective transfer learning with linear function approximation while reducing computational costs.

Abstract: Deep Reinforcement Learning (RL) has demonstrated success in solving complex sequential decision-making problems by integrating neural networks with the RL framework. However, training deep RL models poses several challenges, such as the need for extensive hyperparameter tuning and high computational costs. Transfer learning has emerged as a promising strategy to address these challenges by enabling the reuse of knowledge from previously learned tasks for new, related tasks. This avoids the need for retraining models entirely from scratch. A commonly used approach for transfer learning in RL is to leverage the internal representations learned by the neural network during training. Specifically, the activations from the last hidden layer can be viewed as refined state representations that encapsulate the essential features of the input. In this work, we investigate whether these representations can be used as input for training simpler models, such as linear function approximators, on new tasks. We observe that the representations learned by standard deep RL models can be highly correlated, which limits their effectiveness when used with linear function approximation. To mitigate this problem, we propose a novel deep Q-learning approach that introduces a regularization term to reduce positive correlations between feature representation of states. By leveraging these reduced correlated features, we enable more effective use of linear function approximation in transfer learning. Through experiments and ablation studies on standard RL benchmarks and MinAtar games, we demonstrate the efficacy of our approach in improving transfer learning performance and thereby reducing computational overhead.

[1382] Intra-request branch orchestration for efficient LLM reasoning

Weifan Jiang, Rana Shahout, Yilun Du, Michael Mitzenmacher, Minlan Yu

Main category: cs.LG

TL;DR: DUCHESS is an LLM serving system that reduces token usage and latency for reasoning tasks without sacrificing accuracy, using branch orchestration guided by correctness predictions.

Details

Motivation: Current inference-time reasoning methods like chain-of-thought and multi-branch reasoning significantly increase token usage and latency, with prior work focusing mainly on token reduction at the expense of accuracy.

Method: DUCHESS uses lightweight linear probing on LLM layer activations to predict branch correctness, and orchestrates branches by deciding whether to terminate, duplicate, or continue them. It also prioritizes easier reasoning tasks when handling multiple requests.

Result: Experiments show 42-63% token reduction at matched accuracy compared to self-consistency, and latency reductions of 57-81% (mean), 58-85% (median), and 52-84% (tail) with FCFS scheduling.

Conclusion: DUCHESS effectively improves the token-accuracy Pareto frontier and significantly reduces latency while maintaining accuracy in LLM reasoning tasks.

Abstract: Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms such as chain-of-thought and multi-branch reasoning to improve accuracy on complex tasks. These methods, however, substantially increase token usage and per-request latency. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions. DUCHESS employs a lightweight linear probing model over LLM layer activations to estimate branch correctness, and its orchestration policy decides whether to terminate, duplicate, or continue a branch. When handling multiple requests, DUCHESS further reduces latency by prioritizing easier reasoning tasks when complexity can be estimated from the prompt. Experiments on three reasoning benchmarks show that DUCHESS consistently improves the token-accuracy Pareto frontier, reducing token usage by 42-63% at matched accuracy compared to self-consistency. In serving with vLLM, DUCHESS reduces mean, median, and tail latencies by 57-81%, 58-85%, and 52-84% with First-Come-First-Served scheduling, and achieves additional gains under difficulty-aware scheduling at higher request rates.

[1383] Overlap-Adaptive Regularization for Conditional Average Treatment Effect Estimation

Valentyn Melnychuk, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel

Main category: cs.LG

TL;DR: OAR improves CATE estimation in low-overlap regions by regularizing models proportionally to overlap weights, with higher regularization in low-overlap areas.

Details

Motivation: Existing meta-learners for CATE estimation perform poorly in low-overlap regions, limiting their effectiveness in personalized medicine.

Method: Introduces Overlap-Adaptive Regularization (OAR) that applies regularization proportional to overlap weights, with debiased versions preserving Neyman-orthogonality.

Result: OAR significantly improves CATE estimation in low-overlap settings compared to constant regularization in (semi-)synthetic experiments.

Conclusion: OAR is a flexible approach that enhances existing meta-learners’ performance in low-overlap regions while maintaining robust inference properties.

Abstract: The conditional average treatment effect (CATE) is widely used in personalized medicine to inform therapeutic decisions. However, state-of-the-art methods for CATE estimation (so-called meta-learners) often perform poorly in the presence of low overlap. In this work, we introduce a new approach to tackle this issue and improve the performance of existing meta-learners in the low-overlap regions. Specifically, we introduce Overlap-Adaptive Regularization (OAR) that regularizes target models proportionally to overlap weights so that, informally, the regularization is higher in regions with low overlap. To the best of our knowledge, our OAR is the first approach to leverage overlap weights in the regularization terms of the meta-learners. Our OAR approach is flexible and works with any existing CATE meta-learner: we demonstrate how OAR can be applied to both parametric and non-parametric second-stage models. Furthermore, we propose debiased versions of our OAR that preserve the Neyman-orthogonality of existing meta-learners and thus ensure more robust inference. Through a series of (semi-)synthetic experiments, we demonstrate that our OAR significantly improves CATE estimation in low-overlap settings in comparison to constant regularization.

[1384] Double Descent as a Lens for Sample Efficiency in Autoregressive vs. Discrete Diffusion Models

Ahmad Fraij, Sam Dauncey

Main category: cs.LG

TL;DR: Discrete diffusion models require more capacity and training to match autoregressive models’ sample efficiency on small datasets, only becoming competitive with sufficient compute.

Details

Motivation: Address data scarcity by comparing sample efficiency of discrete diffusion vs autoregressive models using double descent phenomenon.

Method: Use double descent phenomenon to holistically compare sample efficiency, analyzing underparameterized and overparameterized regimes across model sizes.

Result: Discrete diffusion needs larger capacity and more epochs to reach interpolation threshold; both show similar behavior in overparameterized regime without pronounced second descent.

Conclusion: Autoregressive models are more sample-efficient on small datasets; discrete diffusion only competitive with sufficient capacity and compute.

Abstract: Data scarcity drives the need for more sample-efficient large language models. In this work, we use the double descent phenomenon to holistically compare the sample efficiency of discrete diffusion and autoregressive models. We show that discrete diffusion models require larger capacity and more training epochs to escape their underparameterized regime and reach the interpolation threshold. In the strongly overparameterized regime, both models exhibit similar behavior, with neither exhibiting a pronounced second descent in test loss across a large range of model sizes. Overall, our results indicate that autoregressive models are more sample-efficient on small-scale datasets, while discrete diffusion models only become competitive when given sufficient capacity and compute.

[1385] Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

Haoran He, Yuxiao Ye, Qingpeng Cai, Chen Hu, Binxing Jiao, Daxin Jiang, Ling Pan

Main category: cs.LG

TL;DR: RLVR improves LLM reasoning but suffers from instability and diversity collapse. ROVER simplifies RL by using Q-values from a fixed random policy instead of policy iteration, achieving better performance and diversity in math reasoning.

Details

Motivation: Current RLVR methods using PPO/GRPO suffer from training instability and diversity collapse, requiring complex heuristics. The paper aims to simplify RL for math reasoning by leveraging the problem's special structure.

Method: ROVER uses Q-values from a fixed uniformly random policy to determine optimal actions, bypassing the policy iteration loop. It samples actions from softmax over these uniform-policy Q-values.

Result: ROVER achieves superior performance: +8.2 on pass@1, +16.8 on pass@256, and +17.6% diversity across multiple base models and math reasoning benchmarks.

Conclusion: ROVER demonstrates that simplified RL methods can outperform complex existing approaches in math reasoning, preserving diversity while achieving better quality results.

Abstract: RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy’s value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both \textbf{quality} (\textbf{+8.2} on pass@1, \textbf{+16.8} on pass@256) and \textbf{diversity} (\textbf{+17.6%}), despite its radical simplification compared to strong, complicated existing methods.

[1386] Sampling Complexity of TD and PPO in RKHS

Lu Zou, Wendi Ren, Weizhong Zhang, Liang Ding, Shuang Li

Main category: cs.LG

TL;DR: This paper provides a function-space analysis of PPO using reproducing kernel Hilbert spaces, decoupling policy evaluation with kernelized TD critics and policy improvement with KL-regularized natural gradient updates.

Details

Motivation: To place PPO on a firmer theoretical footing beyond finite-dimensional assumptions and clarify when RKHS-proximal updates with kernel-TD critics yield global policy improvement with practical efficiency.

Method: Decouples policy evaluation and improvement in RKHS: (i) kernelized TD critic performs efficient RKHS-gradient updates using one-step samples, (ii) KL-regularized natural-gradient policy step exponentiates action-value to recover PPO/TRPO-style proximal updates in continuous spaces.

Result: Provides non-asymptotic, instance-adaptive guarantees with rates depending on RKHS entropy, unifying various kernel regimes. Derives optimal k^{-1/2} convergence rate sampling rule. Empirically improves stability and sample efficiency on control tasks, with favorable throughput versus GAE baseline.

Conclusion: The analysis places PPO on stronger theoretical foundations and demonstrates practical benefits of RKHS-proximal updates with kernel-TD critics for global policy improvement.

Abstract: We revisit Proximal Policy Optimization (PPO) from a function-space perspective. Our analysis decouples policy evaluation and improvement in a reproducing kernel Hilbert space (RKHS): (i) A kernelized temporal-difference (TD) critic performs efficient RKHS-gradient updates using only one-step state-action transition samples; (ii) a KL-regularized, natural-gradient policy step exponentiates the evaluated action-value, recovering a PPO/TRPO-style proximal update in continuous state-action spaces. We provide non-asymptotic, instance-adaptive guarantees whose rates depend on RKHS entropy, unifying tabular, linear, Sobolev, Gaussian, and Neural Tangent Kernel (NTK) regimes, and we derive a sampling rule for the proximal update that ensures the optimal $k^{-1/2}$ convergence rate for stochastic optimization. Empirically, the theory-aligned schedule improves stability and sample efficiency on common control tasks (e.g., CartPole, Acrobot), while our TD-based critic attains favorable throughput versus a GAE baseline. Altogether, our results place PPO on a firmer theoretical footing beyond finite-dimensional assumptions and clarify when RKHS-proximal updates with kernel-TD critics yield global policy improvement with practical efficiency.

[1387] Score-based Membership Inference on Diffusion Models

Mingxing Rao, Bowen Qu, Daniel Moyer

Main category: cs.LG

TL;DR: The paper presents SimA, a single-query membership inference attack for diffusion models that uses predicted noise vector norms to detect training data membership, showing latent diffusion models are less vulnerable than pixel-space models due to VAE information bottlenecks.

Details

Motivation: Membership inference attacks against diffusion models pose serious privacy risks by revealing whether specific samples were in the training data, requiring efficient detection methods.

Method: Proposed SimA attack that analyzes predicted noise vector norms, showing they encode proximity to training data. Compared pixel-space DDPM and latent diffusion models (LDM) with different VAE regularization parameters.

Result: SimA achieves strong performance across diffusion model variants. Latent diffusion models are significantly less vulnerable than pixel-space models due to VAE information bottlenecks. Varying VAE regularization affects vulnerability.

Conclusion: Score-based MIAs are theoretically grounded, and latent diffusion models’ robustness comes from VAE bottlenecks. Future work should focus on understanding VAE inversion rather than just diffusion process inversion for better privacy protection.

Abstract: Membership inference attacks (MIAs) against diffusion models have emerged as a pressing privacy concern, as these models may inadvertently reveal whether a given sample was part of their training set. We present a theoretical and empirical study of score-based MIAs, focusing on the predicted noise vectors that diffusion models learn to approximate. We show that the expected denoiser output points toward a kernel-weighted local mean of nearby training samples, such that its norm encodes proximity to the training set and thereby reveals membership. Building on this observation, we propose SimA, a single-query attack that provides a principled, efficient alternative to existing multi-query methods. SimA achieves consistently strong performance across variants of DDPM, Latent Diffusion Model (LDM). Notably, we find that Latent Diffusion Models are surprisingly less vulnerable than pixel-space models, due to the strong information bottleneck imposed by their latent auto-encoder. We further investigate this by differing the regularization hyperparameters ($\beta$ in $\beta$-VAE) in latent channel and suggest a strategy to make LDM training more robust to MIA. Our results solidify the theory of score-based MIAs, while highlighting that Latent Diffusion class of methods requires better understanding of inversion for VAE, and not simply inversion of the Diffusion process

[1388] Uncertainty-Aware Deep Learning for Wildfire Danger Forecasting

Spyros Kondylatos, Gustau Camps-Valls, Ioannis Papoutsis

Main category: cs.LG

TL;DR: An uncertainty-aware deep learning framework that jointly models epistemic and aleatoric uncertainty for short-term wildfire danger forecasting, improving both accuracy and reliability.

Details

Motivation: Wildfires pose severe threats but current deep learning models lack reliability due to insufficient uncertainty quantification, hindering adoption for decision-making.

Method: Developed an uncertainty-aware DL framework that captures both epistemic (model) and aleatoric (data) uncertainty for wildfire danger forecasting, with applications to next-day and extended (up to 10-day) predictions.

Result: In next-day forecasting: improved F1 Score by 2.3% and reduced Expected Calibration Error by 2.1% compared to deterministic baseline. Uncertainty estimates proved reliable for decision support, including threshold-based prediction rejection and uncertainty-aware danger maps.

Conclusion: Joint modeling of epistemic and aleatoric uncertainty significantly enhances wildfire danger forecasting accuracy and reliability, providing complementary insights under challenging conditions and advancing trustworthy DL systems for wildfire prediction.

Abstract: Wildfires are among the most severe natural hazards, posing a significant threat to both humans and natural ecosystems. The growing risk of wildfires increases the demand for forecasting models that are not only accurate but also reliable. Deep Learning (DL) has shown promise in predicting wildfire danger; however, its adoption is hindered by concerns over the reliability of its predictions, some of which stem from the lack of uncertainty quantification. To address this challenge, we present an uncertainty-aware DL framework that jointly captures epistemic (model) and aleatoric (data) uncertainty to enhance short-term wildfire danger forecasting. In the next-day forecasting, our best-performing model improves the F1 Score by 2.3% and reduces the Expected Calibration Error by 2.1% compared to a deterministic baseline, enhancing both predictive skill and calibration. Our experiments confirm the reliability of the uncertainty estimates and illustrate their practical utility for decision support, including the identification of uncertainty thresholds for rejecting low-confidence predictions and the generation of well-calibrated wildfire danger maps with accompanying uncertainty layers. Extending the forecast horizon up to ten days, we observe that aleatoric uncertainty increases with time, showing greater variability in environmental conditions, while epistemic uncertainty remains stable. Finally, we show that although the two uncertainty types may be redundant in low-uncertainty cases, they provide complementary insights under more challenging conditions, underscoring the value of their joint modeling for robust wildfire danger prediction. In summary, our approach significantly improves the accuracy and reliability of wildfire danger forecasting, advancing the development of trustworthy wildfire DL systems.

[1389] MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts

Jiayu Liu, Zhenya Huang, Anya Sims, Enhong Chen, Yee Whye Teh, Ning Miao

Main category: cs.LG

TL;DR: MARCOS introduces a new reasoning paradigm that models reasoning as a hidden Markov chain of continuous thoughts instead of discrete token generation, achieving comparable performance to chain-of-thought with significant speedup.

Details

Motivation: Current chain-of-thought reasoning has three main drawbacks: slow and expensive autoregressive token generation, constrained reasoning in discrete token space, and entangled reasoning with token generation that causes short-sighted reasoning.

Method: Models reasoning as a hidden Markov chain of continuous, high-dimensional thoughts where explicit reasoning steps serve as observable variables. Uses a two-phase variational training scheme since the latent process is incompatible with standard supervised learning.

Result: Outperforms existing continuous reasoning methods and achieves performance comparable to token-based CoT, surpassing it by 4.7% on GSM8K with up to 15.7x speedup in inference.

Conclusion: MARCOS provides a superior reasoning paradigm that decouples reasoning from token generation, enables step-level control over randomness, and opens opportunities for reinforcement learning and reasoning in LLMs.

Abstract: The current paradigm for reasoning in large language models (LLMs) involves models “thinking out loud” via a sequence of tokens, known as chain-of-thought (CoT). This approach, while effective, has several significant drawbacks. Firstly, inference requires autoregressive generation of often thousands of CoT tokens, which is slow and computationally expensive. Secondly, it constrains reasoning to the discrete space of tokens, creating an information bottleneck across reasoning steps. Thirdly, it fundamentally entangles reasoning with token generation, forcing LLMs to “think while speaking,” which causes potentially short-sighted reasoning. In light of these limitations, we re-imagine reasoning in LLMs and present a new paradigm: MARCOS. In our approach, rather than autoregressively generating tokens, we model reasoning as a hidden Markov chain of continuous, high-dimensional “thoughts”. Each reasoning step involves a transition of the internal thoughts, where explicit reasoning steps (which may consist of hundreds of tokens) serve as observable variables, which are windows to peek into the implicit thoughts. Since this latent process is incompatible with the standard supervised learning, we further propose a two-phase variational training scheme. Our experiments on three benchmarks demonstrate that MARCOS outperforms existing continuous reasoning methods and, for the first time, achieves performance comparable to token-based CoT, even surpassing it by 4.7% on GSM8K with up to 15.7x speedup in inference. Beyond this, MARCOS offers additional advantages, such as step-level instead of token-level control over randomness, opening significant opportunities for reinforcement learning and reasoning in LLMs.

[1390] Bayesian Surrogates for Risk-Aware Pre-Assessment of Aging Bridge Portfolios

Sophia V. Kuhn, Rafael Bischof, Marius Weber, Antoine Binggeli, Michael A. Kraus, Walter Kaufmann, Fernando Pérez-Cruz

Main category: cs.LG

TL;DR: BNN surrogates enable fast, uncertainty-aware structural assessment of bridges, reducing costs and emissions by prioritizing critical structures for detailed analysis.

Details

Motivation: Aging infrastructure requires efficient resource allocation, balancing cheap conservative methods vs accurate but costly simulations that don't scale portfolio-wide.

Method: Bayesian neural network surrogates trained on large-scale database of non-linear finite element analyses from parametric pipeline based on Swiss Federal Railway’s bridge portfolio.

Result: Models accurately estimate high-fidelity structural analysis results by predicting code compliance factors with calibrated epistemic uncertainty, enabling fast triage of critical structures.

Conclusion: The framework significantly reduces costs and emissions by avoiding unnecessary analyses and physical interventions across infrastructure portfolios, demonstrated in real-world railway case study.

Abstract: Aging infrastructure portfolios pose a critical resource allocation challenge: deciding which structures require intervention and which can safely remain in service. Structural assessments must balance the trade-off between cheaper, conservative analysis methods and accurate but costly simulations that do not scale portfolio-wide. We propose Bayesian neural network (BNN) surrogates for rapid structural pre-assessment of worldwide common bridge types, such as reinforced concrete frame bridges. Trained on a large-scale database of non-linear finite element analyses generated via a parametric pipeline and developed based on the Swiss Federal Railway’s bridge portfolio, the models accurately and efficiently estimate high-fidelity structural analysis results by predicting code compliance factors with calibrated epistemic uncertainty. Our BNN surrogate enables fast, uncertainty-aware triage: flagging likely critical structures and providing guidance where refined analysis is pertinent. We demonstrate the framework’s effectiveness in a real-world case study of a railway underpass, showing its potential to significantly reduce costs and emissions by avoiding unnecessary analyses and physical interventions across entire infrastructure portfolios.

[1391] A multiscale analysis of mean-field transformers in the moderate interaction regime

Giuseppe Bruno, Federico Pasqualotto, Andrea Agazzi

Main category: cs.LG

TL;DR: The paper models token evolution in encoder-only transformers as interacting particles in a mean-field system, analyzing dynamics in the moderate interaction regime where token count and inverse temperature scale together.

Details

Motivation: To understand how tokens evolve through transformer model depth by modeling them as interacting particles, particularly in the moderate interaction regime where system dynamics exhibit multiscale behavior.

Method: Model tokens as particles in a mean-field system, study dynamics in moderate interaction regime (large N tokens, β scaling with N), analyze three-phase multiscale behavior: fast collapse to low-dimensional space, intermediate clustering, and slow cluster merging.

Result: Rigorous characterization of limiting dynamics in each phase (fast collapse, intermediate clustering, slow merging), proof of convergence in the limit, with supporting simulations.

Conclusion: The mean-field particle model successfully captures multiscale token evolution dynamics in transformers, providing mathematical foundation for understanding token behavior through model depth.

Abstract: In this paper, we study the evolution of tokens through the depth of encoder-only transformer models at inference time by modeling them as a system of particles interacting in a mean-field way and studying the corresponding dynamics. More specifically, we consider this problem in the moderate interaction regime, where the number $N$ of tokens is large and the inverse temperature parameter $\beta$ of the model scales together with $N$. In this regime, the dynamics of the system displays a multiscale behavior: a fast phase, where the token empirical measure collapses on a low-dimensional space, an intermediate phase, where the measure further collapses into clusters, and a slow one, where such clusters sequentially merge into a single one. We provide a rigorous characterization of the limiting dynamics in each of these phases and prove convergence in the above mentioned limit, exemplifying our results with some simulations.

[1392] Efficient Hyperparameter Tuning via Trajectory Invariance Principle

Bingrui Li, Jiaxin Wen, Zhanpeng Zhou, Jun Zhu, Jianfei Chen

Main category: cs.LG

TL;DR: The paper identifies trajectory invariance in hyperparameter tuning, where loss curves, gradient noise, and gradient norm remain invariant when learning rate and weight decay are combined in a specific way, reducing the 2D tuning space to 1D for more efficient optimization.

Details

Motivation: As hyperparameter tuning becomes increasingly costly at scale, there is a need for efficient tuning methods and guiding principles, since current principles for hyperparameter tuning remain limited.

Method: The authors identify a phenomenon called trajectory invariance, where pre-training loss curves, gradient noise, and gradient norm exhibit invariance with respect to a combined quantity of learning rate and weight decay, effectively reducing the hyperparameter space from two dimensions to one.

Result: The trajectory invariance phenomenon enables an efficient tuning rule: follow the salient direction revealed by the invariance. The work also refines previous scaling laws and challenges several existing viewpoints on hyperparameter tuning.

Conclusion: The paper proposes new principles for efficient hyperparameter tuning and inspires future research on scaling laws by establishing trajectory invariance as a guiding phenomenon for optimization.

Abstract: As hyperparameter tuning becomes increasingly costly at scale, efficient tuning methods are essential. Yet principles for guiding hyperparameter tuning remain limited. In this work, we seek to establish such principles by considering a broad range of hyperparameters, including batch size, learning rate, and weight decay. We identify a phenomenon we call trajectory invariance, where pre-training loss curves, gradient noise, and gradient norm exhibit invariance–closely overlapping–with respect to a quantity that combines learning rate and weight decay. This phenomenon effectively reduces the original two-dimensional hyperparameter space to one dimension, yielding an efficient tuning rule: follow the salient direction revealed by trajectory invariance. Furthermore, we refine previous scaling laws and challenge several existing viewpoints. Overall, our work proposes new principles for efficient tuning and inspires future research on scaling laws.

[1393] Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma

Main category: cs.LG

TL;DR: AWM is a new RL method for diffusion models that uses the same score/flow-matching loss as pretraining, reducing variance and accelerating convergence compared to DDPO.

Details

Motivation: Current RL approaches for diffusion models like DDPO optimize objectives different from pretraining, increasing variance and slowing convergence. The authors aim to unify pretraining and RL objectives.

Method: AWM uses score/flow-matching loss identical to pretraining and reweights samples by their advantage, amplifying high-reward samples while suppressing low-reward ones.

Result: AWM achieves up to 24x speedup over Flow-GRPO (based on DDPO) on GenEval, OCR, and PickScore benchmarks with Stable Diffusion 3.5 Medium and FLUX, without quality loss.

Conclusion: AWM unifies pretraining and RL conceptually and practically, reduces variance, enables faster convergence, and maintains generation quality while being theoretically consistent with policy-gradient theory.

Abstract: Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objectives–score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce \textbf{Advantage Weighted Matching (AWM)}, a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on GenEval, OCR, and PickScore benchmarks, AWM delivers up to a $24\times$ speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.

[1394] Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

Bogdan Raonić, Siddhartha Mishra, Samuel Lanthaler

Main category: cs.LG

TL;DR: A new OOD detection method for regression tasks using score-based diffusion models to estimate joint likelihoods of inputs and predictions, providing task-aware reliability scores that correlate with prediction errors across scientific datasets.

Details

Motivation: Data-driven models in critical scientific fields can fail on out-of-distribution data, but detecting such failures in regression tasks remains challenging.

Method: Propose OOD detection based on estimating joint likelihoods using a score-based diffusion model, considering both input and regression model’s prediction to provide task-aware reliability scores.

Result: The likelihood strongly correlates with prediction error across various scientific datasets including PDE datasets, satellite imagery, and brain tumor segmentation.

Conclusion: Provides a foundational step towards building verifiable ‘certificates of trust’ for assessing trustworthiness of AI-based scientific predictions.

Abstract: Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model’s prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable ‘certificate of trust’, thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions. Our code is publicly available at https://github.com/bogdanraonic3/OOD_Detection_ScientificML

[1395] Towards generalizable deep ptychography neural networks

Albert Vong, Steven Henke, Oliver Hoidn, Hanna Ruth, Junjing Deng, Alexander Hexemer, Apurva Mehta, Arianna Gleason, Levi Hancock, Nicholas Schwarz

Main category: cs.LG

TL;DR: Unsupervised training workflow for X-ray ptychography using probe learning with synthetic objects enables robust neural network reconstruction across multiple beamlines.

Details

Motivation: Need for real-time feedback in X-ray ptychography under accelerated acquisition rates, with existing deep learning approaches lacking robustness across diverse experimental conditions.

Method: Probe-centric unsupervised training combining experimentally-measured probes with procedurally generated synthetic objects, emphasizing probe learning over in-distribution learning.

Result: Single physics-informed neural network achieves multi-probe generalization across multiple beamlines, with reconstruction fidelity comparable to models trained exclusively on experimental data.

Conclusion: Probe learning is equally important as in-distribution learning, enabling training of experiment-steering models that provide real-time feedback under dynamic experimental conditions.

Abstract: X-ray ptychography is a data-intensive imaging technique expected to become ubiquitous at next-generation light sources delivering many-fold increases in coherent flux. The need for real-time feedback under accelerated acquisition rates motivates surrogate reconstruction models like deep neural networks, which offer orders-of-magnitude speedup over conventional methods. However, existing deep learning approaches lack robustness across diverse experimental conditions. We propose an unsupervised training workflow emphasizing probe learning by combining experimentally-measured probes with synthetic, procedurally generated objects. This probe-centric approach enables a single physics-informed neural network to reconstruct unseen experiments across multiple beamlines; among the first demonstrations of multi-probe generalization. We find probe learning is equally important as in-distribution learning; models trained using this synthetic workflow achieve reconstruction fidelity comparable to those trained exclusively on experimental data, even when changing the type of synthetic training object. The proposed approach enables training of experiment-steering models that provide real-time feedback under dynamic experimental conditions.

[1396] Learning in an Echo Chamber: Online Learning with Replay Adversary

Daniil Dmitriev, Harald Eskelund Franck, Carolin Heinzler, Amartya Sanyal

Main category: cs.LG

TL;DR: The paper introduces a learning-theoretic framework called Online Learning in the Replay Setting to model how machine learning systems can reinforce errors when training on self-annotated data. It defines the Extended Threshold dimension as the exact measure of learnability in this model and shows it’s provably harder than classical mistake-bound learning.

Details

Motivation: Machine learning systems increasingly train on self-annotated data, which risks creating echo chambers where models reinforce their own errors. This paper aims to formally model and analyze this phenomenon.

Method: The authors introduce the Online Learning in the Replay Setting framework where in each round, the learner outputs a hypothesis and the adversary reveals either true labels or replayed labels from earlier rounds. They define the Extended Threshold dimension (ExThD) and prove matching upper and lower bounds using closure-based learners.

Result: The Extended Threshold dimension is proven to be the exact measure of learnability in the replay setting. A closure-based learner makes at most ExThD(H) mistakes against any adaptive adversary, and no algorithm can perform better. The replay setting is provably harder than classical mistake-bound learning.

Conclusion: The replay setting presents fundamental challenges distinct from classical online learning. Proper learning is only possible for (almost) intersection-closed classes, while improper learning can achieve the ExThD bound. These results provide the first tight analysis of learning against replay adversaries.

Abstract: As machine learning systems increasingly train on self-annotated data, they risk reinforcing errors and becoming echo chambers of their own beliefs. We model this phenomenon by introducing a learning-theoretic framework: Online Learning in the Replay Setting. In round $t$, the learner outputs a hypothesis $\hat{h}_t$; the adversary then reveals either the true label $f^\ast(x_t)$ or a replayed label $\hat{h}_i(x_t)$ from an earlier round $i < t$. A mistake is counted only when the true label is shown, yet classical algorithms such as the SOA or the halving algorithm are easily misled by the replayed errors. We introduce the Extended Threshold dimension, $\mathrm{ExThD}(\mathcal{H})$, and prove matching upper and lower bounds that make $\mathrm{ExThD}(\mathcal{H})$ the exact measure of learnability in this model. A closure-based learner makes at most $\mathrm{ExThD}(\mathcal{H})$ mistakes against any adaptive adversary, and no algorithm can perform better. For stochastic adversaries, we prove a similar bound for every intersection-closed class. The replay setting is provably harder than the classical mistake bound setting: some classes have constant Littlestone dimension but arbitrarily large $\mathrm{ExThD}(\mathcal{H})$. Proper learning exhibits an even sharper separation: a class is properly learnable under replay if and only if it is (almost) intersection-closed. Otherwise, every proper learner suffers $\Omega(T)$ errors, whereas our improper algorithm still achieves the $\mathrm{ExThD}(\mathcal{H})$ bound. These results give the first tight analysis of learning against replay adversaries, based on new results for closure-type algorithms.

[1397] Chance-constrained Flow Matching for High-Fidelity Constraint-aware Generation

Jinhao Liang, Yixuan Sun, Anirban Samaddar, Sandeep Madireddy, Ferdinando Fioretto

Main category: cs.LG

TL;DR: CCFM is a training-free method that integrates stochastic optimization into sampling to enforce hard constraints while maintaining high-fidelity generation, outperforming current constrained generative models.

Details

Motivation: Generative models often violate hard constraints from physical laws or specifications, and existing projection methods can distort distributions or increase complexity.

Method: Chance-constrained Flow Matching (CCFM) integrates stochastic optimization into sampling to enforce constraints while operating directly on noisy intermediate samples.

Result: CCFM outperforms state-of-the-art constrained generative models in physical systems and molecular docking, achieving higher feasibility and fidelity.

Conclusion: CCFM effectively enforces hard constraints while maintaining sample quality, providing theoretical guarantees and practical improvements over existing methods.

Abstract: Generative models excel at synthesizing high-fidelity samples from complex data distributions, but they often violate hard constraints arising from physical laws or task specifications. A common remedy is to project intermediate samples onto the feasible set; however, repeated projection can distort the learned distribution and induce a mismatch with the data manifold. Thus, recent multi-stage procedures attempt to defer projection to clean samples during sampling, but they increase algorithmic complexity and accumulate errors across steps. This paper addresses these challenges by proposing a novel training-free method, Chance-constrained Flow Matching (CCFM), that integrates stochastic optimization into the sampling process, enabling effective enforcement of hard constraints while maintaining high-fidelity sample generation. Importantly, CCFM guarantees feasibility in the same manner as conventional repeated projection, yet, despite operating directly on noisy intermediate samples, it is theoretically equivalent to projecting onto the feasible set defined by clean samples. This yields a sampler that mitigates distributional distortion. Empirical experiments show that CCFM outperforms current state-of-the-art constrained generative models in modeling complex physical systems governed by partial differential equations and molecular docking problems, delivering higher feasibility and fidelity.

[1398] BALF: Budgeted Activation-Aware Low-Rank Factorization for Fine-Tuning-Free Model Compression

David González Martínez

Main category: cs.LG

TL;DR: BALF is a fine-tuning-free neural network compression framework that uses activation-aware factorization and a scalable budgeted rank allocator to achieve efficient compression across various models and scales.

Details

Motivation: Traditional neural network compression methods require expensive fine-tuning or search procedures, making them impractical on commodity hardware. The goal is to develop an efficient compression pipeline that works without fine-tuning.

Method: The method combines an activation-aware factorization framework applicable to various layers with a scalable budgeted rank allocator that enables flexible control over compression targets without overhead.

Result: BALF achieves excellent results without fine-tuning, reducing FLOPs on ResNeXt-101 by 45% with only a 1-percentage-point top-1 accuracy drop. It demonstrates effectiveness across multiple scales and architectures including ResNet-20, ResNeXt-101, and vision transformers.

Conclusion: BALF provides an efficient fine-tuning-free compression pipeline that achieves significant computational savings with minimal accuracy loss across diverse neural network architectures.

Abstract: Neural network compression techniques typically require expensive fine-tuning or search procedures, rendering them impractical on commodity hardware. Inspired by recent LLM compression research, we present a general activation-aware factorization framework that can be applied to a broad range of layers. Moreover, we introduce a scalable budgeted rank allocator that allows flexible control over compression targets (e.g., retaining 50% of parameters) with no overhead. Together, these components form BALF, an efficient pipeline for compressing models without fine-tuning. We demonstrate its effectiveness across multiple scales and architectures, from ResNet-20 on CIFAR-10 to ResNeXt-101 and vision transformers on ImageNet, and show that it achieves excellent results in the fine-tuning-free regime. For instance, BALF reduces FLOPs on ResNeXt-101 by 45% with only a 1-percentage-point top-1 accuracy drop.

[1399] GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models

Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, Brian Karrer

Main category: cs.LG

TL;DR: GLASS Flows introduces a new sampling paradigm that enables efficient Markov transitions in flow matching models by simulating a “flow matching model within a flow matching model,” eliminating the trade-off between stochastic evolution and efficiency.

Details

Motivation: Current reward alignment algorithms for flow matching and diffusion models suffer from efficiency limitations due to their reliance on SDE sampling, which is slower and often less performant than ODE sampling.

Method: GLASS Flows retrieves an “inner” flow matching model from pre-trained models without retraining, combining ODE efficiency with SDE stochastic evolution through a nested flow matching approach.

Result: On large-scale text-to-image models, GLASS Flows eliminate the efficiency-stochasticity trade-off and, when combined with Feynman-Kac Steering, achieve state-of-the-art performance in text-to-image generation.

Conclusion: GLASS Flows provides a simple, drop-in solution for efficient inference-time scaling of flow and diffusion models, enabling both stochastic evolution and high efficiency simultaneously.

Abstract: The performance of flow matching and diffusion models can be greatly improved at inference time using reward alignment algorithms, yet efficiency remains a major limitation. While several algorithms were proposed, we demonstrate that a common bottleneck is the sampling method these algorithms rely on: many algorithms require to sample Markov transitions via SDE sampling, which is significantly less efficient and often less performant than ODE sampling. To remove this bottleneck, we introduce GLASS Flows, a new sampling paradigm that simulates a “flow matching model within a flow matching model” to sample Markov transitions. As we show in this work, this “inner” flow matching model can be retrieved from a pre-trained model without any re-training, combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, we show that GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. Combined with Feynman-Kac Steering, GLASS Flows improve state-of-the-art performance in text-to-image generation, making it a simple, drop-in solution for inference-time scaling of flow and diffusion models.

[1400] High-Dimensional Analysis of Single-Layer Attention for Sparse-Token Classification

Nicholas Barnfield, Hugo Cui, Yue M. Lu

Main category: cs.LG

TL;DR: Attention mechanisms can learn to selectively attend to informative tokens with weak signals, achieving better performance than linear classifiers with logarithmic vs linear signal strength requirements.

Details

Motivation: To understand when and how attention mechanisms can learn to selectively focus on informative tokens, especially for detecting weak, rare, and sparsely located features in sequences.

Method: Theoretical analysis of a sparse-token classification model using a single-layer attention classifier, studying both representational power and learnability in high-dimensional regimes with finite sequence lengths.

Result: Attention classifiers achieve vanishing test error with logarithmic signal strength scaling, while linear classifiers require square root scaling. Two gradient updates suffice for query weights to align with hidden signals, enabling selective amplification of informative tokens.

Conclusion: Attention mechanisms provide significant advantages over non-adaptive linear baselines through adaptive token selection, explaining their superior capacity and performance in detecting sparse, weak features.

Abstract: When and how can an attention mechanism learn to selectively attend to informative tokens, thereby enabling detection of weak, rare, and sparsely located features? We address these questions theoretically in a sparse-token classification model in which positive samples embed a weak signal vector in a randomly chosen subset of tokens, whereas negative samples are pure noise. In the long-sequence limit, we show that a simple single-layer attention classifier can in principle achieve vanishing test error when the signal strength grows only logarithmically in the sequence length $L$, whereas linear classifiers require $\sqrt{L}$ scaling. Moving from representational power to learnability, we study training at finite $L$ in a high-dimensional regime, where sample size and embedding dimension grow proportionally. We prove that just two gradient updates suffice for the query weight vector of the attention classifier to acquire a nontrivial alignment with the hidden signal, inducing an attention map that selectively amplifies informative tokens. We further derive an exact asymptotic expression for the test error and training loss of the trained attention-based classifier, and quantify its capacity – the largest dataset size that is typically perfectly separable – thereby explaining the advantage of adaptive token selection over nonadaptive linear baselines.

[1401] XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning

Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, Jan Peters

Main category: cs.LG

TL;DR: XQC is a sample-efficient deep actor-critic algorithm that improves optimization landscape through batch normalization, weight normalization, and distributional cross-entropy loss, achieving state-of-the-art performance with fewer parameters.

Details

Motivation: To improve sample efficiency in deep reinforcement learning through principled optimization landscape analysis rather than empirical complexity additions.

Method: Systematic analysis of critic’s Hessian eigenspectrum and condition number, combining batch normalization, weight normalization, and distributional cross-entropy loss to create XQC algorithm based on soft actor-critic.

Result: Achieved state-of-the-art sample efficiency across 55 proprioception and 15 vision-based continuous control tasks with significantly fewer parameters than competing methods.

Conclusion: Optimization-aware architectural choices can dramatically improve sample efficiency and training stability in deep reinforcement learning.

Abstract: Sample efficiency is a central property of effective deep reinforcement learning algorithms. Recent work has improved this through added complexity, such as larger models, exotic network architectures, and more complex algorithms, which are typically motivated purely by empirical performance. We take a more principled approach by focusing on the optimization landscape of the critic network. Using the eigenspectrum and condition number of the critic’s Hessian, we systematically investigate the impact of common architectural design decisions on training dynamics. Our analysis reveals that a novel combination of batch normalization (BN), weight normalization (WN), and a distributional cross-entropy (CE) loss produces condition numbers orders of magnitude smaller than baselines. This combination also naturally bounds gradient norms, a property critical for maintaining a stable effective learning rate under non-stationary targets and bootstrapping. Based on these insights, we introduce XQC: a well-motivated, sample-efficient deep actor-critic algorithm built upon soft actor-critic that embodies these optimization-aware principles. We achieve state-of-the-art sample efficiency across 55 proprioception and 15 vision-based continuous control tasks, all while using significantly fewer parameters than competing methods.

[1402] Physics-Informed Inductive Biases for Voltage Prediction in Distribution Grids

Ehimare Okoyomon, Arbel Yaniv, Christoph Goebel

Main category: cs.LG

TL;DR: The paper investigates how physics-informed inductive biases improve voltage prediction in distribution grids using Graph Neural Networks, evaluating three strategies: power-flow-constrained loss functions, complex-valued networks, and residual-based task reformulation.

Details

Motivation: Machine learning approaches like GNNs offer speedups for voltage prediction but suffer from poor generalization with limited data. Physics-informed inductive biases can improve model reliability and learning of power flow dynamics.

Method: Systematically evaluated three physics-informed strategies: (1) power-flow-constrained loss functions, (2) complex-valued neural networks, and (3) residual-based task reformulation. Used ENGAGE dataset spanning multiple grid configurations and conducted controlled experiments to isolate each bias effect.

Result: The study provides practical insights into which model assumptions most effectively guide learning for reliable voltage prediction, assessing both standard performance and out-of-distribution generalization.

Conclusion: Physics-informed inductive biases significantly improve the reliability and efficiency of voltage prediction in distribution networks, with the study identifying the most effective strategies for guiding learning in power flow applications.

Abstract: Voltage prediction in distribution grids is a critical yet difficult task for maintaining power system stability. Machine learning approaches, particularly Graph Neural Networks (GNNs), offer significant speedups but suffer from poor generalization when trained on limited or incomplete data. In this work, we systematically investigate the role of inductive biases in improving a model’s ability to reliably learn power flow. Specifically, we evaluate three physics-informed strategies: (i) power-flow-constrained loss functions, (ii) complex-valued neural networks, and (iii) residual-based task reformulation. Using the ENGAGE dataset, which spans multiple low- and medium-voltage grid configurations, we conduct controlled experiments to isolate the effect of each inductive bias and assess both standard predictive performance and out-of-distribution generalization. Our study provides practical insights into which model assumptions most effectively guide learning for reliable and efficient voltage prediction in modern distribution networks.

[1403] TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion

Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee

Main category: cs.LG

TL;DR: TR2-D2 is a framework that uses tree search to guide fine-tuning of discrete diffusion models, addressing the limitation of reinforcement learning approaches that can reinforce sub-optimal trajectories.

Details

Motivation: Existing reinforcement learning methods for diffusion fine-tuning are susceptible to reinforcing sub-optimal trajectories that yield poor rewards, which limits their effectiveness.

Method: The framework uses Monte Carlo Tree Search (MCTS) to construct replay buffers for trajectory-aware fine-tuning, then fine-tunes a pre-trained discrete diffusion model under a stochastic optimal control objective.

Result: TR2-D2 demonstrates effectiveness for reliable reward-guided fine-tuning in discrete sequence generation, validated on single- and multi-objective fine-tuning of biological sequence diffusion models.

Conclusion: The proposed TR2-D2 framework provides an effective solution for optimizing reward-guided discrete diffusion trajectories while avoiding reinforcement of sub-optimal paths.

Abstract: Reinforcement learning with stochastic optimal control offers a promising framework for diffusion fine-tuning, where a pre-trained diffusion model is optimized to generate paths that lead to a reward-tilted distribution. While these approaches enable optimization without access to explicit samples from the optimal distribution, they require training on rollouts under the current fine-tuned model, making them susceptible to reinforcing sub-optimal trajectories that yield poor rewards. To overcome this challenge, we introduce TRee Search Guided TRajectory-Aware Fine-Tuning for Discrete Diffusion (TR2-D2), a novel framework that optimizes reward-guided discrete diffusion trajectories with tree search to construct replay buffers for trajectory-aware fine-tuning. These buffers are generated using Monte Carlo Tree Search (MCTS) and subsequently used to fine-tune a pre-trained discrete diffusion model under a stochastic optimal control objective. We validate our framework on single- and multi-objective fine-tuning of biological sequence diffusion models, highlighting the overall effectiveness of TR2-D2 for reliable reward-guided fine-tuning in discrete sequence generation.

[1404] Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective

Hoang Phan, Lam Tran, Quyen Tran, Ngoc N. Tran, Tuan Truong, Qi Lei, Nhat Ho, Dinh Phung, Trung Le

Main category: cs.LG

TL;DR: A novel multi-task learning framework that uses adaptive weight perturbation to regulate gradient norms, reducing conflicts and improving generalization across tasks.

Details

Motivation: Conventional multi-task learning with empirical loss minimization is vulnerable to overfitting and gradient conflicts, limiting model robustness and performance.

Method: Leverages weight perturbation to regulate gradient norms, adaptively modulating perturbations to harmonize task-specific gradients and reduce conflicts.

Result: Extensive experiments show the method significantly outperforms existing gradient-based MTL techniques in task performance and model robustness.

Conclusion: Controlling gradient norms through weight perturbation directly contributes to better generalization in multi-task learning, providing a more robust framework.

Abstract: Multi-task learning (MTL) trains deep neural networks to optimize several objectives simultaneously using a shared backbone, which leads to reduced computational costs, improved data efficiency, and enhanced performance through cross-task knowledge sharing. Although recent gradient manipulation techniques aim to find a common descent direction that benefits all tasks, conventional empirical loss minimization still leaves models vulnerable to overfitting and gradient conflicts. To address this, we introduce a novel MTL framework that leverages weight perturbation to regulate gradient norms, thus improving generalization. By adaptively modulating weight perturbations, our approach harmonizes task-specific gradients, reducing conflicts and encouraging more robust learning across tasks. Theoretical insights reveal that controlling the gradient norm through weight perturbation directly contributes to better generalization. Extensive experiments across diverse applications demonstrate that our method significantly outperforms existing gradient-based MTL techniques in terms of task performance and overall model robustness.

[1405] Symbolic Imitation Learning: From Black-Box to Explainable Driving Policies

Iman Sharifi, Mustafa Yildirim, Saber Fallah

Main category: cs.LG

TL;DR: SIL uses Inductive Logic Programming to create explainable driving policies from synthetic data, outperforming neural methods in transparency while maintaining performance.

Details

Motivation: Current DNN-based imitation learning lacks interpretability and generalizability, which are critical for safety in autonomous driving.

Method: Symbolic Imitation Learning (SIL) framework using Inductive Logic Programming (ILP) to derive policies from synthetic datasets.

Result: SIL significantly enhances policy transparency while maintaining strong performance on collision rate, lane change efficiency, and average speed metrics.

Conclusion: ILP integration in imitation learning can promote safer and more reliable autonomous systems through explainable policies.

Abstract: Current imitation learning approaches, predominantly based on deep neural networks (DNNs), offer efficient mechanisms for learning driving policies from real-world datasets. However, they suffer from inherent limitations in interpretability and generalizability–issues of critical importance in safety-critical domains such as autonomous driving. In this paper, we introduce Symbolic Imitation Learning (SIL), a novel framework that leverages Inductive Logic Programming (ILP) to derive explainable and generalizable driving policies from synthetic datasets. We evaluate SIL on real-world HighD and NGSim datasets, comparing its performance with state-of-the-art neural imitation learning methods using metrics such as collision rate, lane change efficiency, and average speed. The results indicate that SIL significantly enhances policy transparency while maintaining strong performance across varied driving conditions. These findings highlight the potential of integrating ILP into imitation learning to promote safer and more reliable autonomous systems.

[1406] Federated Learning Resilient to Byzantine Attacks and Data Heterogeneity

Shiyuan Zuo, Xingrun Yan, Rongfei Fan, Han Hu, Hangguan Shan, Tony Q. S. Quek, Puning Zhao

Main category: cs.LG

TL;DR: RAGA is a robust federated learning algorithm that uses geometric median aggregation and flexible local updates to defend against Byzantine attacks on heterogeneous datasets, achieving convergence for both strongly-convex and non-convex loss functions.

Details

Motivation: Address the challenge of federated learning under malicious Byzantine attacks and data heterogeneity, where existing resilient approaches often assume strongly-convex loss functions or homogeneous datasets.

Method: Propose Robust Average Gradient Algorithm (RAGA) using geometric median for aggregation and allowing flexible round numbers for local updates. Conduct convergence analysis for both strongly-convex and non-convex loss functions over heterogeneous datasets.

Result: RAGA achieves convergence at rate O(1/T^{2/3-δ}) for non-convex functions and linear convergence for strongly-convex functions when malicious users are less than half. Experimental results validate robustness against Byzantine attacks and superior convergence performance.

Conclusion: RAGA effectively handles Byzantine attacks on heterogeneous datasets, achieving convergence with theoretical guarantees and practical performance improvements over baselines.

Abstract: This paper addresses federated learning (FL) in the context of malicious Byzantine attacks and data heterogeneity. We introduce a novel Robust Average Gradient Algorithm (RAGA), which uses the geometric median for aggregation and {allows flexible round number for local updates.} Unlike most existing resilient approaches, which base their convergence analysis on strongly-convex loss functions or homogeneously distributed datasets, this work conducts convergence analysis for both strongly-convex and non-convex loss functions over heterogeneous datasets. The theoretical analysis indicates that as long as the fraction of the {data} from malicious users is less than half, RAGA can achieve convergence at a rate of $\mathcal{O}({1}/{T^{2/3- \delta}})$ for non-convex loss functions, where $T$ is the iteration number and $\delta \in (0, 2/3)$. For strongly-convex loss functions, the convergence rate is linear. Furthermore, the stationary point or global optimal solution is shown to be attainable as data heterogeneity diminishes. Experimental results validate the robustness of RAGA against Byzantine attacks and demonstrate its superior convergence performance compared to baselines under varying intensities of Byzantine attacks on heterogeneous datasets.

[1407] PLEIADES: Building Temporal Kernels with Orthogonal Polynomials

Yan Ru Pei, Olivier Coenen

Main category: cs.LG

TL;DR: PLEIADES networks use orthogonal polynomial basis functions for temporal convolution kernels, enabling efficient online spatiotemporal classification and detection with event-based data while maintaining performance across varying sample rates without retraining.

Details

Motivation: To develop neural networks that can efficiently process event-based data with low latency for real-time spatiotemporal classification and detection tasks, while being adaptable to different data sampling rates.

Method: Use temporal convolution kernels generated from orthogonal polynomial basis functions, allowing flexible sample rate variation and discretization step-size changes without additional finetuning.

Result: Achieved state-of-the-art results on three event-based benchmarks: 99.59% accuracy on DVS128 hand gesture recognition (192K params), 99.58% accuracy on AIS 2024 eye tracking challenge (277K params), and 0.556 mAP on PROPHESEE 1 Megapixel Automotive Detection Dataset (576K params).

Conclusion: PLEIADES networks demonstrate superior performance with significantly smaller memory and compute costs compared to existing methods, making them highly efficient for event-based spatiotemporal processing tasks.

Abstract: We introduce a class of neural networks named PLEIADES (PoLynomial Expansion In Adaptive Distributed Event-based Systems), which contains temporal convolution kernels generated from orthogonal polynomial basis functions. We focus on interfacing these networks with event-based data to perform online spatiotemporal classification and detection with low latency. By virtue of using structured temporal kernels and event-based data, we have the freedom to vary the sample rate of the data along with the discretization step-size of the network without additional finetuning. We experimented with three event-based benchmarks and obtained state-of-the-art results on all three by large margins with significantly smaller memory and compute costs. We achieved: 1) 99.59% accuracy with 192K parameters on the DVS128 hand gesture recognition dataset and 100% with a small additional output filter; 2) 99.58% test accuracy with 277K parameters on the AIS 2024 eye tracking challenge; and 3) 0.556 mAP with 576k parameters on the PROPHESEE 1 Megapixel Automotive Detection Dataset.

[1408] A Comprehensive Graph Pooling Benchmark: Effectiveness, Robustness and Generalizability

Pengyun Wang, Junyu Luo, Yanxin Shen, Ming Zhang, Shaoen Qin, Siyu Heng, Xiao Luo

Main category: cs.LG

TL;DR: A comprehensive benchmark for evaluating 17 graph pooling methods across 28 datasets, assessing effectiveness, robustness, and generalizability in various graph learning tasks.

Details

Motivation: There is a lack of standardized experimental settings and fair benchmarks to evaluate the performance of graph pooling methods despite their growing importance in graph representation learning.

Method: Constructed a benchmark with 17 graph pooling methods and 28 graph datasets, systematically evaluating performance in three dimensions: effectiveness, robustness, and generalizability across graph classification, regression, and node classification tasks.

Result: Extensive experiments validate the strong capability and applicability of graph pooling approaches in various scenarios, providing detailed efficiency analysis, backbone analysis, parameter analysis and visualization.

Conclusion: The benchmark provides valuable insights and guidance for deep geometric learning research, demonstrating the effectiveness of graph pooling methods across different tasks and scenarios.

Abstract: Graph pooling has gained attention for its ability to obtain effective node and graph representations for various downstream tasks. Despite the recent surge in graph pooling approaches, there is a lack of standardized experimental settings and fair benchmarks to evaluate their performance. To address this issue, we have constructed a comprehensive benchmark that includes 17 graph pooling methods and 28 different graph datasets. This benchmark systematically assesses the performance of graph pooling methods in three dimensions, i.e., effectiveness, robustness, and generalizability. We first evaluate the performance of these graph pooling approaches across different tasks including graph classification, graph regression and node classification. Then, we investigate their performance under potential noise attacks and out-of-distribution shifts in real-world scenarios. We also involve detailed efficiency analysis, backbone analysis, parameter analysis and visualization to provide more evidence. Extensive experiments validate the strong capability and applicability of graph pooling approaches in various scenarios, which can provide valuable insights and guidance for deep geometric learning research. The source code of our benchmark is available at https://github.com/goose315/Graph_Pooling_Benchmark.

[1409] Understanding Transformer Architecture through Continuous Dynamics: A Partial Differential Equation Perspective

Yukun Zhang, Xueqing Zhou

Main category: cs.LG

TL;DR: This paper analyzes Transformers as continuous spatiotemporal dynamical systems governed by PDEs, revealing that residual connections and layer normalization are essential mathematical stabilizers rather than heuristic tricks.

Details

Motivation: To develop a principled theoretical understanding of Transformers' internal mechanisms, which currently lack rigorous mathematical foundations despite their revolutionary impact on AI.

Method: Introduces an analytical framework that maps Transformers to continuous PDE systems: self-attention as non-local interaction, feed-forward networks as local reaction, and residual connections/layer normalization as stabilization mechanisms. Compares standard Transformers with PDE simulators lacking explicit stabilizers.

Result: Empirical evidence shows that without residual connections, the system suffers catastrophic representational drift, and without layer normalization, training becomes unstable and explosive. These components are mathematically necessary for stability.

Conclusion: Residual connections and layer normalization are fundamental mathematical stabilizers required to tame an otherwise unstable continuous system, providing a first-principles explanation for Transformer design and establishing a new paradigm for analyzing neural networks through continuous dynamics.

Abstract: The Transformer architecture has revolutionized artificial intelligence, yet a principled theoretical understanding of its internal mechanisms remains elusive. This paper introduces a novel analytical framework that reconceptualizes the Transformer’s discrete, layered structure as a continuous spatiotemporal dynamical system governed by a master Partial Differential Equation (PDE). Within this paradigm, we map core architectural components to distinct mathematical operators: self-attention as a non-local interaction, the feed-forward network as a local reaction, and, critically, residual connections and layer normalization as indispensable stabilization mechanisms. We do not propose a new model, but rather employ the PDE system as a theoretical probe to analyze the mathematical necessity of these components. By comparing a standard Transformer with a PDE simulator that lacks explicit stabilizers, our experiments provide compelling empirical evidence for our central thesis. We demonstrate that without residual connections, the system suffers from catastrophic representational drift, while the absence of layer normalization leads to unstable, explosive training dynamics. Our findings reveal that these seemingly heuristic “tricks” are, in fact, fundamental mathematical stabilizers required to tame an otherwise powerful but inherently unstable continuous system. This work offers a first-principles explanation for the Transformer’s design and establishes a new paradigm for analyzing deep neural networks through the lens of continuous dynamics.

[1410] A GREAT Architecture for Edge-Based Graph Problems Like TSP

Attila Lischka, Filip Rydin, Jiaming Wu, Morteza Haghir Chehreghani, Balázs Kulcsár

Main category: cs.LG

TL;DR: Proposes GREAT, a graph edge attention network for routing problems with non-Euclidean and asymmetric costs, achieving competitive results in learning-based benchmarks.

Details

Motivation: Existing GNN-based approaches are ill-suited for real-world routing problems with non-Euclidean and asymmetric edge costs.

Method: Developed Graph Edge Attention Network (GREAT) as encoder in reinforcement learning framework for vehicle routing problems.

Result: Achieves competitive results among learning-based benchmarks for both Euclidean and non-Euclidean variants of routing problems.

Conclusion: GREAT framework successfully handles non-Euclidean routing problems and is among the first learning-based approaches for such variants.

Abstract: In the last years, an increasing number of learning-based approaches have been proposed to tackle combinatorial optimization problems such as routing problems. Many of these approaches are based on graph neural networks (GNNs) or related transformers, operating on the Euclidean coordinates representing the routing problems. However, such models are ill-suited for a wide range of real-world problems that feature non-Euclidean and asymmetric edge costs. To overcome this limitation, we propose a novel GNN-based and edge-focused neural model called Graph Edge Attention Network (GREAT). Using GREAT as an encoder to capture the properties of a routing problem instance, we build a reinforcement learning framework which we apply to both Euclidean and non-Euclidean variants of vehicle routing problems such as Traveling Salesman Problem, Capacitated Vehicle Routing Problem and Orienteering Problem. Our framework is among the first to tackle non-Euclidean variants of these problems and achieves competitive results among learning-based benchmarks.

[1411] Extracting Moore Machines from Transformers using Queries and Counterexamples

Rik Adriaensen, Jaron Maene

Main category: cs.LG

TL;DR: The paper constructs finite state automata as abstractions of transformers trained on regular languages using queries and counterexamples, focusing on Moore machines to study positive-only learning and sequence accuracy.

Details

Motivation: To address the difficulty in comparing existing results on what formal languages transformers can learn due to methodological differences.

Method: Extract Moore machines from transformers trained on regular languages using queries and counterexamples as high-level abstractions.

Result: Demonstrated the approach’s usefulness by studying positive-only learning and sequence accuracy in detail.

Conclusion: The proposed method provides a standardized way to analyze transformer capabilities in learning formal languages, facilitating better comparison across studies.

Abstract: Fuelled by the popularity of the transformer architecture in deep learning, several works have investigated what formal languages a transformer can learn from data. Nonetheless, existing results remain hard to compare due to methodological differences. To address this, we construct finite state automata as high-level abstractions of transformers trained on regular languages using queries and counterexamples. Concretely, we extract Moore machines, as many training tasks used in literature can be mapped onto them. We demonstrate the usefulness of this approach by studying positive-only learning and the sequence accuracy measure in detail.

[1412] Deeper Insights into Deep Graph Convolutional Networks: Stability and Generalization

Guangrui Yang, Ming Li, Han Feng, Xiaosheng Zhuang

Main category: cs.LG

TL;DR: This paper provides theoretical analysis of stability and generalization properties of deep graph convolutional networks (GCNs), establishing upper bounds influenced by graph filter eigenvalues and network depth.

Details

Motivation: While GCNs show empirical success, there's limited theoretical understanding of deep GCNs' stability and generalization properties, with existing research focusing mainly on single-layer networks.

Method: The authors conduct theoretical analysis to characterize upper bounds for stability and generalization of deep GCNs, examining key factors like graph filter operators’ eigenvalues and network depth.

Result: Theoretical results show that deep GCNs’ stability and generalization are affected by maximum absolute eigenvalues of graph filter operators and network depth, with established upper bounds.

Conclusion: This theoretical study enhances understanding of deep GCNs’ properties and could lead to development of more reliable and better-performing graph learning models.

Abstract: Graph convolutional networks (GCNs) have emerged as powerful models for graph learning tasks, exhibiting promising performance in various domains. While their empirical success is evident, there is a growing need to understand their essential ability from a theoretical perspective. Existing theoretical research has primarily focused on the analysis of single-layer GCNs, while a comprehensive theoretical exploration of the stability and generalization of deep GCNs remains limited. In this paper, we bridge this gap by delving into the stability and generalization properties of deep GCNs, aiming to provide valuable insights by characterizing rigorously the associated upper bounds. Our theoretical results reveal that the stability and generalization of deep GCNs are influenced by certain key factors, such as the maximum absolute eigenvalue of the graph filter operators and the depth of the network. Our theoretical studies contribute to a deeper understanding of the stability and generalization properties of deep GCNs, potentially paving the way for developing more reliable and well-performing models.

[1413] NextLocLLM: Location Semantics Modeling and Coordinate-Based Next Location Prediction with LLMs

Shuai Liu, Ning Cao, Yile Chen, Yue Jiang, George Rosario Jagadeesh, Gao Cong

Main category: cs.LG

TL;DR: NextLocLLM reformulates next-location prediction as coordinate regression using LLMs for semantic encoding and prediction, outperforming existing methods in supervised and zero-shot settings.

Details

Motivation: Existing methods treat location prediction as classification with discrete IDs, which limits spatial continuity modeling and generalization to new cities.

Method: Uses LLMs to create enhanced POI embeddings from textual descriptions, combines with spatiotemporal trajectory data, and adds a regression head for coordinate prediction with post-prediction retrieval for top-k locations.

Result: Outperforms existing baselines across diverse cities in both supervised and zero-shot settings.

Conclusion: NextLocLLM successfully integrates LLMs for unified semantic and predictive modeling, enabling better spatial continuity and cross-city generalization.

Abstract: Next location prediction is a critical task in human mobility analysis.Existing methods typically formulate it as a classification task based on discrete location IDs, which hinders spatial continuity modeling and limits generalization to new cities. In this paper, we propose NextLocLLM, a novel framework that reformulates next-location prediction as coordinate regression and integrates LLMs for both location semantics encoding and coordinate-level prediction. To model location functional semantics, it constructs LLM-enhanced POI embeddings by leveraging language understanding capabilities of LLMs to extract functional semantics from textual descriptions of POI categories. These POI embeddings are combined with spatiotemporal trajectory representation and fed into the same LLM, enabling unified semantic and predictive modeling. A lightweight regression head generates coordinate outputs, which are mapped to top-k candidate locations via post-prediction retrieval module, ensuring structured outputs. Experiments across diverse cities show that NextLocLLM outperforms existing baselines in both supervised and zero-shot settings. Code is available at: https://github.com/liuwj2000/NexelocLLM.

[1414] Gradient-Free Training of Quantized Neural Networks

Noa Cohen, Omkar Joglekar, Dotan Di Castro, Vladimir Tchuiev, Shir Kozlovsky, Michal Moshkovitz

Main category: cs.LG

TL;DR: Proposes eliminating gradients in neural network training by using a novel heuristic optimization framework that avoids full weight updates, achieving comparable performance to gradient-based methods with significantly less energy and fewer parameter updates.

Details

Motivation: Current neural network training methods are computationally expensive and energy-intensive, even with optimizations like mixed-precision and quantization-aware training. The authors aim to eliminate the dependency on gradient-based optimization entirely.

Method: Introduces a novel heuristic optimization framework that avoids full weight updates and gradient computation, operating in quantized weight spaces to improve efficiency.

Result: Empirically achieves performance comparable to full-precision gradient-based training on standard datasets and architectures, while using up to 3x less energy and requiring up to 5x fewer parameter updates.

Conclusion: Demonstrates that gradient-free training in quantized spaces is feasible and can significantly reduce computational costs and energy consumption while maintaining competitive performance compared to traditional gradient-based methods.

Abstract: Training neural networks requires significant computational resources and energy. Methods like mixed-precision and quantization-aware training reduce bit usage, yet they still depend heavily on computationally expensive gradient-based optimization. In this work, we propose a paradigm shift: eliminate gradients altogether. One might hope that, in a finite quantized space, finding optimal weights with out gradients would be easier but we theoretically prove that this problem is NP-hard even in simple settings where the continuous case is efficiently solvable. To address this, we introduce a novel heuristic optimization framework that avoids full weight updates and significantly improves efficiency. Empirically, our method achieves performance comparable to that of full-precision gradient-based training on standard datasets and architectures, while using up to 3x less energy and requiring up to 5x fewer parameter updates.

[1415] Self-Normalized Resets for Plasticity in Continual Learning

Vivek F. Farias, Adam D. Jozefiak

Main category: cs.LG

TL;DR: SNR is an adaptive algorithm that mitigates plasticity loss in neural networks by resetting neurons when their firing rates drop to zero, achieving superior performance in continual learning tasks.

Details

Motivation: Plasticity loss - the diminishing ability of neural networks to adapt to new tasks during continual training - is an important problem that needs addressing.

Method: Self-Normalized Resets (SNR) algorithm that resets neuron weights when evidence suggests their firing rate has effectively dropped to zero, based on a hypothesis test.

Result: SNR consistently outperforms competitor algorithms across various continual learning problems and architectures, and is robust to its hyperparameter while competitors show sensitivity.

Conclusion: SNR effectively mitigates plasticity loss through its threshold-based reset mechanism, and theoretical analysis shows it can learn target ReLUs even with adversarial initialization, unlike regularization approaches.

Abstract: Plasticity Loss is an increasingly important phenomenon that refers to the empirical observation that as a neural network is continually trained on a sequence of changing tasks, its ability to adapt to a new task diminishes over time. We introduce Self-Normalized Resets (SNR), a simple adaptive algorithm that mitigates plasticity loss by resetting a neuron’s weights when evidence suggests its firing rate has effectively dropped to zero. Across a battery of continual learning problems and network architectures, we demonstrate that SNR consistently attains superior performance compared to its competitor algorithms. We also demonstrate that SNR is robust to its sole hyperparameter, its rejection percentile threshold, while competitor algorithms show significant sensitivity. SNR’s threshold-based reset mechanism is motivated by a simple hypothesis test that we derive. Seen through the lens of this hypothesis test, competing reset proposals yield suboptimal error rates in correctly detecting inactive neurons, potentially explaining our experimental observations. We also conduct a theoretical investigation of the optimization landscape for the problem of learning a single ReLU. We show that even when initialized adversarially, an idealized version of SNR learns the target ReLU, while regularization based approaches can fail to learn.

[1416] Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes

Qi Lei, Sai Ganesh Nagarajan, Ioannis Panageas, Xiao Wang

Main category: cs.LG

TL;DR: Optimistic Multiplicative-Weights Update (OMWU) exhibits last iterate convergence locally for convex-concave games, extending previous results from bilinear cases to more general settings.

Details

Motivation: Previous work established last iterate convergence for variants of Gradient Descent/Ascent and Mirror Descent in convex-concave games, but OMWU's convergence was only proven for bilinear cases. This work aims to generalize these results.

Method: Uses Optimistic Multiplicative-Weights Update (OMWU) which follows the no-regret online learning framework, analyzing its local convergence properties for convex-concave games.

Result: Shows that OMWU exhibits last iterate convergence locally for convex-concave games, and experiments indicate fast convergence of the method.

Conclusion: OMWU provides last iterate convergence for general convex-concave games, extending beyond the previously known bilinear case, with empirical evidence supporting its fast convergence.

Abstract: In a recent series of papers it has been established that variants of Gradient Descent/Ascent and Mirror Descent exhibit last iterate convergence in convex-concave zero-sum games. Specifically, \cite{DISZ17, LiangS18} show last iterate convergence of the so called “Optimistic Gradient Descent/Ascent” for the case of \textit{unconstrained} min-max optimization. Moreover, in \cite{Metal} the authors show that Mirror Descent with an extra gradient step displays last iterate convergence for convex-concave problems (both constrained and unconstrained), though their algorithm does not follow the online learning framework; it uses extra information rather than \textit{only} the history to compute the next iteration. In this work, we show that “Optimistic Multiplicative-Weights Update (OMWU)” which follows the no-regret online learning framework, exhibits last iterate convergence locally for convex-concave games, generalizing the results of \cite{DP19} where last iterate convergence of OMWU was shown only for the \textit{bilinear case}. We complement our results with experiments that indicate fast convergence of the method.

[1417] CRAUM-Net: Contextual Recursive Attention with Uncertainty Modeling for Salient Object Detection

Abhinav Sagar

Main category: cs.LG

TL;DR: A novel Salient Object Detection framework that integrates multi-scale context aggregation, attention mechanisms, and uncertainty-aware modules for improved accuracy and boundary delineation.

Details

Motivation: To achieve accurate localization and precise boundary delineation of salient regions in computer vision applications, addressing the need for reliable saliency maps with quantified prediction confidence.

Method: Uses Adaptive Cross-Scale Context Module with Recursive Channel Spatial Attention and Convolutional Block Attention, edge-aware decoder with Edge Extractor, Monte Carlo Dropout for uncertainty estimation, and boundary-sensitive loss functions including Boundary IoU, Focal Tversky, and Topological Saliency losses.

Result: Superior performance demonstrated through evaluation metrics including uncertainty-calibrated error, Boundary F1 score, and standard SOD metrics, producing accurate and reliable saliency maps with fine-grained details.

Conclusion: The approach advances state-of-the-art in salient object detection by effectively capturing fine-grained details while quantifying prediction confidence through extensive experimental validation.

Abstract: Salient Object Detection (SOD) plays a crucial role in many computer vision applications, requiring accurate localization and precise boundary delineation of salient regions. In this work, we present a novel framework that integrates multi-scale context aggregation, advanced attention mechanisms, and an uncertainty-aware module for improved SOD performance. Our Adaptive Cross-Scale Context Module effectively fuses features from multiple levels, leveraging Recursive Channel Spatial Attention and Convolutional Block Attention to enhance salient feature representation. We further introduce an edge-aware decoder that incorporates a dedicated Edge Extractor for boundary refinement, complemented by Monte Carlo Dropout to estimate uncertainty in predictions. To train our network robustly, we employ a combination of boundary-sensitive and topology-preserving loss functions, including Boundary IoU, Focal Tversky, and Topological Saliency losses. Evaluation metrics such as uncertainty-calibrated error and Boundary F1 score, along with the standard SOD metrics, demonstrate our method’s superior ability to produce accurate and reliable saliency maps. Extensive experiments validate the effectiveness of our approach in capturing fine-grained details while quantifying prediction confidence, advancing the state-of-the-art in salient object detection.

[1418] Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

Dongmin Park, Sebin Kim, Taehong Moon, Minkyu Kim, Kangwook Lee, Jaewoong Cho

Main category: cs.LG

TL;DR: R2F is a training-free approach that uses LLM guidance to enhance diffusion models’ ability to generate rare concept compositions by exposing relevant frequent concepts during diffusion sampling.

Details

Motivation: State-of-the-art text-to-image diffusion models struggle with generating rare compositions of concepts, such as objects with unusual attributes.

Method: Proposes R2F framework that leverages LLMs to plan and execute rare-to-frequent concept guidance throughout diffusion inference, exposing relevant frequent concepts during sampling without requiring training.

Result: R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1% in T2I alignment on benchmarks including the newly proposed RareBench.

Conclusion: The approach effectively enhances compositional generation power for rare concepts and is flexible across pre-trained diffusion models and LLMs, integrable with region-guided diffusion approaches.

Abstract: State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by the Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with the region-guided diffusion approaches. Extensive experiments on three datasets, including our newly proposed benchmark, RareBench, containing various prompts with rare compositions of concepts, R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1%p in T2I alignment. Code is available at https://github.com/krafton-ai/Rare-to-Frequent.

[1419] Generalized Tangent Kernel: A Unified Geometric Foundation for Natural Gradient and Standard Gradient

Qinxun Bai, Steven Rosenberg, Wei Xu

Main category: cs.LG

TL;DR: The paper addresses the theoretical gap in natural gradients on function spaces, introduces Generalized Tangent Kernel (GTK) to unify natural and standard gradients, and shows both capture intrinsic structure equally well under fixed parameterization.

Details

Motivation: To resolve the fundamental theoretical issue regarding the existence of natural gradients on function space and provide a complete geometric framework for studying both natural and standard gradients.

Method: Develops a geometric perspective using Generalized Tangent Kernel (GTK) that unifies natural and standard gradients, leveraging RKHS theory and orthonormality properties to establish Riemannian metrics on function space.

Result: Shows that for fixed parameterization, GTK determines a Riemannian metric making standard gradient as “natural” as natural gradient in capturing intrinsic structure of parameterized function space.

Conclusion: The framework provides new solutions for non-immersion/degenerate cases of natural gradient and leads to new families of natural/standard gradient descent methods, bridging theoretical gaps with practical applications.

Abstract: Natural gradients have been widely studied from both theoretical and empirical perspectives, and it is commonly believed that natural gradients have advantages over standard (Euclidean) gradients in capturing the intrinsic geometric structure of the underlying function space and being invariant under reparameterization. However, for function optimization, a fundamental theoretical issue regarding the existence of natural gradients on the function space remains underexplored. We address this issue by providing a geometric perspective and mathematical framework for studying both natural gradient and standard gradient that is more complete than existing studies. The key tool that unifies natural gradient and standard gradient is a generalized form of the Neural Tangent Kernel (NTK), which we name the Generalized Tangent Kernel (GTK). Using a novel orthonormality property of GTK, we show that for a fixed parameterization, GTK determines a Riemannian metric on the entire function space which makes the standard gradient as “natural” as the natural gradient in capturing the intrinsic structure of the parameterized function space. Many aspects of this approach relate to RKHS theory. For the practical side of this theory paper, we showcase that our framework motivates new solutions to the non-immersion/degenerate case of natural gradient and leads to new families of natural/standard gradient descent methods.

[1420] TabText: Language-Based Representations of Tabular Health Data for Predictive Modelling

Kimberly Villalobos Carballo, Liangyuan Na, Yu Ma, Léonard Boussioux, Cynthia Zeng, Luis R. Soenksen, Dimitris Bertsimas

Main category: cs.LG

TL;DR: TabText is a preprocessing method that converts tabular medical data into contextual language and uses LLMs to generate task-independent embeddings, improving prediction performance for challenging healthcare tasks.

Details

Motivation: Traditional tabular data preprocessing ignores contextual information and requires extensive manual cleaning, creating bottlenecks for healthcare ML applications.

Method: Convert tables into contextual language, apply pretrained LLMs to generate fixed embeddings, use these as input for predictive tasks across multiple healthcare datasets.

Result: Achieved AUC 0.75-0.94 on inpatient flow predictions, improved out-of-sample AUC by up to 4 percentage points for challenging tasks like ICU transfer and cancer recurrence, and demonstrated good generalization across hospitals.

Conclusion: TabText effectively leverages contextual information to enhance predictive performance in healthcare, particularly for challenging tasks, while reducing manual preprocessing burdens.

Abstract: Tabular medical records remain the most readily available data format for applying machine learning in healthcare. However, traditional data preprocessing ignores valuable contextual information in tables and requires substantial manual cleaning and harmonisation, creating a bottleneck for model development. We introduce TabText, a preprocessing and feature extraction method that leverages contextual information and streamlines the curation of tabular medical data. This method converts tables into contextual language and applies pretrained large language models (LLMs) to generate task-independent numerical representations. These fixed embeddings are then used as input for various predictive tasks. TabText was evaluated on nine inpatient flow prediction tasks (e.g., ICU admission, discharge, mortality) using electronic medical records across six hospitals from a US health system, and on nine publicly available datasets from the UCI Machine Learning Repository, covering tasks such as cancer diagnosis, recurrence, and survival. TabText models trained on unprocessed data from a single hospital (572,964 patient-days, Jan 2018-Dec 2020) achieved accurate performance (AUC 0.75-0.94) when tested prospectively on 265,917 patient-days from Jan 2021-Apr 2022, and generalised well to five additional hospitals not used for training. When augmenting preprocessed tabular records with these contextual embeddings, out-of-sample AUC improved by up to 4 additive percentage points in challenging tasks such as ICU transfer and breast cancer recurrence, while providing little to no benefit for already high-performing tasks. Findings were consistent across both private and public datasets.

[1421] AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions

Michael A. Alcorn

Main category: cs.LG

TL;DR: AQuaMaM is a neural network that models complex distributions on SO(3) rotation manifold using autoregressive modeling of unit quaternions, achieving faster inference and higher accuracy than IPDF.

Details

Motivation: IPDF requires N forward passes for inference which is computationally expensive, motivating the need for a method that can calculate exact likelihoods in a single forward pass.

Method: Autoregressively models projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain.

Result: AQuaMaM achieves 14% higher test log-likelihood than IPDF, uses 24% fewer parameters, has 52x faster prediction throughput on single GPU, and converges similarly during training.

Conclusion: AQuaMaM provides a more efficient and accurate alternative to IPDF for modeling distributions on SO(3) rotation manifold with single-pass inference.

Abstract: Accurately modeling complex, multimodal distributions for rotations in three-dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network’s final multilayer perceptron (where $N$ places an upper bound on the likelihood that can be calculated by the model), which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain of values. When trained on an “infinite” toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. When trained on a constructed dataset of 500,000 renders of a die in different rotations, AQuaMaM reaches a test log-likelihood 14% higher than IPDF. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training.

[1422] Double Machine Learning Based Structure Identification from Temporal Data

Emmanouil Angelis, Francesco Quinzan, Ashkan Soleymani, Patrick Jaillet, Stefan Bauer

Main category: cs.LG

TL;DR: Proposes DR-SIT, a double machine learning method for causal structure identification from time-series data that handles confounding, cycles, and correlated causes with theoretical guarantees.

Details

Motivation: Existing vector auto-regression methods don't account for unknown confounding between potential causes, leading to bias in settings with many noisy causes and correlated variables.

Method: Uses double machine learning approach for structure identification from temporal data (DR-SIT) that can handle cycles and confounding between causes.

Result: Method asymptotically recovers true underlying causal structure even when causes have cycles or are confounded, with superior performance shown in extensive experiments.

Conclusion: DR-SIT provides a robust solution for causal discovery in time-series data with theoretical guarantees and practical effectiveness in complex settings with confounding and cycles.

Abstract: Learning the causes of time-series data is a fundamental task in many applications, spanning from finance to earth sciences or bio-medical applications. Common approaches for this task are based on vector auto-regression, and they do not take into account unknown confounding between potential causes. However, in settings with many potential causes and noisy data, these approaches may be substantially biased. Furthermore, potential causes may be correlated in practical applications or even contain cycles. To address these challenges, we propose a new double machine learning based method for structure identification from temporal data (DR-SIT). We provide theoretical guarantees, showing that our method asymptotically recovers the true underlying causal structure. Our analysis extends to cases where the potential causes have cycles, and they may even be confounded. We further perform extensive experiments to showcase the superior performance of our method. Code: https://github.com/sdi1100041/TMLR_submission_DR_SIT

[1423] EUGENE: Explainable Structure-aware Graph Edit Distance Estimation with Generalized Edit Costs

Aditya Bommakanti, Harshith Reddy Vonteri, Sayan Ranu, Panagiotis Karras

Main category: cs.LG

TL;DR: EUGENE is an efficient, algebraic, and structure-aware optimization method that estimates Graph Edit Distance (GED) and provides corresponding edit paths, achieving state-of-the-art performance without requiring ground-truth GEDs for training.

Details

Motivation: Current neural methods for GED approximation lack explanatory edit paths, require NP-hard ground-truth generation for training, and need separate training on each dataset, limiting their practical applicability.

Method: EUGENE uses an efficient, algebraic, and structure-aware optimization approach that estimates GED while also providing the corresponding edit paths that explain the distance.

Result: Extensive experiments show EUGENE achieves state-of-the-art GED estimation with superior scalability across diverse datasets and generalized cost settings.

Conclusion: EUGENE provides an effective solution for GED estimation that overcomes limitations of neural methods by being efficient, providing explanatory edit paths, and not requiring ground-truth training data.

Abstract: The need to identify graphs with small structural distances from a query arises in domains such as biology, chemistry, recommender systems, and social network analysis. Among several methods for measuring inter-graph distance, Graph Edit Distance (GED) is preferred for its comprehensibility, though its computation is hindered by NP-hardness. Optimization based heuristic methods often face challenges in providing accurate approximations. State-of-the-art GED approximations predominantly utilize neural methods, which, however: (i) lack an explanatory edit path corresponding to the approximated GED; (ii) require the NP-hard generation of ground-truth GEDs for training; and (iii) necessitate separate training on each dataset. In this paper, we propose EUGENE, an efficient, algebraic, and structure-aware optimization based method that estimates GED and also provides edit paths corresponding to the estimated cost. Extensive experimental evaluation demonstrates that EUGENE achieves state-of-the-art GED estimation with superior scalability across diverse datasets and generalized cost settings.

[1424] The Clever Hans Mirage: A Comprehensive Survey on Spurious Correlations in Machine Learning

Wenqian Ye, Luyang Jiang, Eric Xie, Guangtao Zheng, Yunsheng Ma, Xu Cao, Dongliang Guo, Daiqing Qi, Zeyu He, Yijun Tian, Megan Coffee, Zhe Zeng, Sheng Li, Ting-hao, Huang, Ziran Wang, James M. Rehg, Henry Kautz, Andrew Gordon Wilson, Aidong Zhang

Main category: cs.LG

TL;DR: A comprehensive survey on spurious correlations in machine learning models, covering taxonomy of existing methods, datasets, benchmarks, and future challenges in the generative AI era.

Details

Motivation: Modern ML models are sensitive to spurious correlations between non-essential input features and labels, similar to Clever Hans horse phenomenon, which negatively impacts generalization and robustness when data distributions shift.

Method: Provides a systematic survey with fine-grained taxonomy of state-of-the-art methods for addressing spurious correlations, including datasets, benchmarks, and metrics for evaluation.

Result: Organizes existing research into a comprehensive framework that facilitates understanding of different approaches to mitigate spurious correlation issues in machine learning.

Conclusion: Discusses broader impacts, recent advancements, and future challenges in addressing spurious correlations, particularly in the context of generative AI, providing valuable insights for ML researchers.

Abstract: Back in the early 20th century, a horse named Hans appeared to perform arithmetic and other intellectual tasks during exhibitions in Germany, while it actually relied solely on involuntary cues in the body language from the human trainer. Modern machine learning models are no different. These models are known to be sensitive to spurious correlations between non-essential features of the inputs (e.g., background, texture, and secondary objects) and the corresponding labels. Such features and their correlations with the labels are known as “spurious” because they tend to change with shifts in real-world data distributions, which can negatively impact the model’s generalization and robustness. In this paper, we provide a comprehensive survey of this emerging issue, along with a fine-grained taxonomy of existing state-of-the-art methods for addressing spurious correlations in machine learning models. Additionally, we summarize existing datasets, benchmarks, and metrics to facilitate future research. The paper concludes with a discussion of the broader impacts, the recent advancements, and future challenges in the era of generative AI, aiming to provide valuable insights for researchers in the related domains of the machine learning community.

[1425] Principal Components for Neural Network Initialization

Nhan Phan, Thu Nguyen, Uyen Dang, Pål Halvorsen, Michael A. Riegler

Main category: cs.LG

TL;DR: PCsInit incorporates PCA into neural network initialization rather than preprocessing, improving XAI explanations and training performance.

Details

Motivation: PCA preprocessing complicates XAI explanations; need more direct integration of PCA into neural networks.

Method: Initialize first layer with principal components (PCsInit), with variants PCsInit-Act and PCsInit-Sub.

Result: Simpler, more direct XAI explanations; theoretical properties; improved training via backpropagation.

Conclusion: PCsInit strategies better integrate PCA into neural networks, enhancing explainability and training performance.

Abstract: Principal Component Analysis (PCA) is a commonly used tool for dimension reduction and denoising. Therefore, it is also widely used on the data prior to training a neural network. However, this approach can complicate the explanation of eXplainable Artificial Intelligence (XAI) methods for the decision of the model. In this work, we analyze the potential issues with this approach and propose Principal Components-based Initialization (PCsInit), a strategy to incorporate PCA into the first layer of a neural network via initialization of the first layer in the network with the principal components, and its two variants PCsInit-Act and PCsInit-Sub. We will show that explanations using these strategies are more simple, direct and straightforward than using PCA prior to training a neural network on the principal components. We also show that the proposed techniques possess desirable theoretical properties. Moreover, as will be illustrated in the experiments, such training strategies can also allow further improvement of training via backpropagation compared to training neural networks on principal components.

[1426] Data Imputation by Pursuing Better Classification: A Supervised Kernel-Based Method

Ruikai Yang, Fan He, Mingzhen He, Kaijie Wang, Xiaolin Huang

Main category: cs.LG

TL;DR: A two-stage framework that uses label information to guide data imputation for better classification performance, with kernel matrix optimization and robust regression.

Details

Motivation: Existing methods use labels simplistically for data imputation, lacking flexibility and relying on strict assumptions. Better utilization of supervision information can improve imputation quality for classification tasks.

Method: Two-stage approach: 1) Use labels to supervise kernel matrix optimization for better classification, with perturbation for robustness. 2) Use learned kernel matrix as supervision to guide data imputation via regression using block coordinate descent.

Result: Significantly outperforms state-of-the-art imputation methods on four real-world datasets, especially when data has high missing rates (over 60% missing features).

Conclusion: The proposed framework effectively leverages supervision information for data imputation in a way that enhances classification performance, demonstrating superior performance particularly in high-missing-rate scenarios.

Abstract: Data imputation, the process of filling in missing feature elements for incomplete data sets, plays a crucial role in data-driven learning. A fundamental belief is that data imputation is helpful for learning performance, and it follows that the pursuit of better classification can guide the data imputation process. While some works consider using label information to assist in this task, their simplistic utilization of labels lacks flexibility and may rely on strict assumptions. In this paper, we propose a new framework that effectively leverages supervision information to complete missing data in a manner conducive to classification. Specifically, this framework operates in two stages. Firstly, it leverages labels to supervise the optimization of similarity relationships among data, represented by the kernel matrix, with the goal of enhancing classification accuracy. To mitigate overfitting that may occur during this process, a perturbation variable is introduced to improve the robustness of the framework. Secondly, the learned kernel matrix serves as additional supervision information to guide data imputation through regression, utilizing the block coordinate descent method. The superiority of the proposed method is evaluated on four real-world data sets by comparing it with state-of-the-art imputation methods. Remarkably, our algorithm significantly outperforms other methods when the data is missing more than 60% of the features

[1427] Vintix: Action Model via In-Context Reinforcement Learning

Andrey Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, Vladislav Kurenkov

Main category: cs.LG

TL;DR: This paper introduces a scalable approach to In-Context Reinforcement Learning (ICRL) using Algorithm Distillation, demonstrating its potential as a competitive alternative to expert distillation for building generalist decision-making agents.

Details

Motivation: The motivation is to address the scalability challenge of ICRL beyond toy tasks and single-domain settings, aiming to develop generalist agents that can learn through trial-and-error interactions at inference time.

Method: The method involves using Algorithm Distillation framework to create a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning.

Result: The results demonstrate that Algorithm Distillation offers a compelling and competitive alternative to expert distillation for constructing versatile action models in ICRL.

Conclusion: The findings highlight the potential of ICRL as a scalable approach for developing generalist decision-making systems, representing the first steps toward scaling ICRL effectively.

Abstract: In-Context Reinforcement Learning (ICRL) represents a promising paradigm for developing generalist agents that learn at inference time through trial-and-error interactions, analogous to how large language models adapt contextually, but with a focus on reward maximization. However, the scalability of ICRL beyond toy tasks and single-domain settings remains an open challenge. In this work, we present the first steps toward scaling ICRL by introducing a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning. Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems. Code released at https://github.com/dunnolab/vintix

[1428] Differential Encoding for Improved Representation Learning over Graphs

Haimin Zhang, Jiahao Xia, Min Xu

Main category: cs.LG

TL;DR: The paper presents a differential encoding method to address information loss in graph neural networks by encoding the difference between neighbor/global information and node self-information, improving node embedding quality.

Details

Motivation: Current graph learning methods using message-passing and global attention suffer from information loss when aggregating from neighborhoods or the whole graph, as it's unclear whether dominant information comes from the node itself or its neighbors, leading to accumulated information loss across layers.

Method: Proposes a differential encoding method that computes the representation difference between information from neighbors/global nodes and the node itself, then combines this differential encoding with the original aggregated representation to generate updated node embeddings.

Result: Empirical evaluation on seven benchmark datasets shows the method improves both message-passing and global attention updates, advancing state-of-the-art performance for graph representation learning.

Conclusion: The differential encoding method is a general approach that enhances representational ability of node embeddings by addressing information loss, effectively improving graph learning performance across various tasks.

Abstract: Combining the message-passing paradigm with the global attention mechanism has emerged as an effective framework for learning over graphs. The message-passing paradigm and the global attention mechanism fundamentally generate node embeddings based on information aggregated from a node’s local neighborhood or from the whole graph. The most basic and commonly used aggregation approach is to take the sum of information from a node’s local neighbourhood or from the whole graph. However, it is unknown if the dominant information is from a node itself or from the node’s neighbours (or the rest of the graph nodes). Therefore, there exists information lost at each layer of embedding generation, and this information lost could be accumulated and become more serious when more layers are used in the model. In this paper, we present a differential encoding method to address the issue of information lost. The idea of our method is to encode the differential representation between the information from a node’s neighbours (or the rest of the graph nodes) and that from the node itself. The obtained differential encoding is then combined with the original aggregated local or global representation to generate the updated node embedding. By integrating differential encodings, the representational ability of generated node embeddings is improved. The differential encoding method is empirically evaluated on different graph tasks on seven benchmark datasets. The results show that it is a general method that improves the message-passing update and the global attention update, advancing the state-of-the-art performance for graph representation learning on these datasets.

[1429] Deep Time Series Models: A Comprehensive Survey and Benchmark

Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Chen Wang, Mingsheng Long, Jianmin Wang

Main category: cs.LG

TL;DR: This paper provides a comprehensive review of deep time series models, categorizing them by basic modules and architectures, and introduces TSLib - a benchmark library with 30 models, 30 datasets, and 5 analysis tasks for fair evaluation.

Details

Motivation: Time series data presents unique challenges due to its complex, dynamic nature with nonlinear patterns and time-variant trends. While deep learning has revolutionized time series analysis, there's a need for systematic evaluation and fair benchmarking across different models and tasks.

Method: The authors develop TSLib (Time Series Library) as a comprehensive benchmark that implements 30 prominent deep time series models, covers 30 datasets from various domains, and supports five prevalent analysis tasks. They conduct thorough empirical evaluation of 13 advanced models across diverse tasks.

Result: Empirical evaluation reveals that models with specific structures are well-suited for distinct analytical tasks. The findings provide valuable insights for research and practical adoption of deep time series models.

Conclusion: The paper establishes TSLib as a standardized benchmark for deep time series analysis and demonstrates that model architecture selection should be task-specific, offering guidance for researchers and practitioners in choosing appropriate models for different time series analysis scenarios.

Abstract: Time series, characterized by a sequence of data points organized in a discrete-time order, are ubiquitous in real-world scenarios. Unlike other data modalities, time series present unique challenges due to their intricate and dynamic nature, including the entanglement of nonlinear patterns and time-variant trends. Analyzing such data is of great significance in practical applications and has been extensively studied for centuries. Recent years have witnessed remarkable breakthroughs in the time series community, with techniques shifting from traditional statistical methods to contemporary deep learning models. In this paper, we delve into the design of deep time series models across various analysis tasks and review the existing literature from two perspectives: basic modules and model architectures. Further, we develop and release Time Series Library (TSLib) as a fair benchmark of deep time series models for diverse analysis tasks. TSLib implements 30 prominent models, covers 30 datasets from different domains, and supports five prevalent analysis tasks. Based on TSLib, we thoroughly evaluate 13 advanced deep time series models across diverse tasks. Empirical results indicate that models with specific structures are well-suited for distinct analytical tasks, providing insights for research and adoption of deep time series models. Code and datasets are available at https://github.com/thuml/Time-Series-Library.

[1430] A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers

Roman Tarasov, Petr Mokrov, Milena Gazdieva, Evgeny Burnaev, Alexander Korotin

Main category: cs.LG

TL;DR: This paper establishes theoretical generalization bounds for neural network-based optimal transport methods, specifically focusing on minimax quadratic OT solvers, filling a gap in statistical learning theory for these approaches.

Details

Motivation: Neural network-based optimal transport has shown promise in various applications but lacks theoretical investigation from a statistical learning perspective, particularly for adversarial minimax solvers based on semi-dual formulations.

Method: The authors establish upper bounds on the generalization error of approximate OT maps recovered by minimax quadratic OT solvers, analyzing the statistical and mathematical properties of neural network functional classes.

Result: The paper derives generalization bounds that depend solely on standard statistical and mathematical properties of the considered neural network functional classes, providing theoretical foundations for quadratic OT methods.

Conclusion: While focused on quadratic OT, the analysis suggests similar bounds could be derived for general optimal transport cases, opening promising directions for future theoretical research in neural network-based OT methods.

Abstract: Neural network-based optimal transport (OT) is a recent and fruitful direction in the generative modeling community. It finds its applications in various fields such as domain translation, image super-resolution, computational biology and others. Among the existing OT approaches, of considerable interest are adversarial minimax solvers based on semi-dual formulations of OT problems. While promising, these methods lack theoretical investigation from a statistical learning perspective. Our work fills this gap by establishing upper bounds on the generalization error of an approximate OT map recovered by the minimax quadratic OT solver. Importantly, the bounds we derive depend solely on some standard statistical and mathematical properties of the considered functional classes (neural nets). While our analysis focuses on the quadratic OT, we believe that similar bounds could be derived for general OT case, paving the promising direction for future research.

[1431] Can DPO Learn Diverse Human Values? A Theoretical Scaling Law

Shawn Im, Yixuan Li

Main category: cs.LG

TL;DR: This paper analyzes how generalization in LLMs scales with value diversity and sample quantity in direct preference optimization, showing challenges in learning diverse values.

Details

Motivation: LLMs often struggle to align with human preferences and need to account for diverse human values. Understanding how generalization scales with value diversity is crucial for ensuring models align with all people.

Method: Introduces a theoretical framework analyzing reward margin trajectories during finite gradient steps in direct preference optimization training, providing generalization error bounds.

Result: The analysis reveals challenges in effectively learning wide sets of concepts or values, with empirical validation on contemporary LLMs confirming the practical relevance.

Conclusion: Generalization in preference learning faces significant challenges when dealing with diverse values, highlighting the need for approaches that can better accommodate value diversity in LLM alignment.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. An essential part of ensuring that LLMs are aligned for all people is accounting for a diverse set of values. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity in models trained with direct preference optimization. Our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we provide a bound on the generalization error that demonstrates the challenges of effectively learning a wide set of concepts or values. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.

[1432] LEAD: Large Foundation Model for EEG-Based Alzheimer’s Disease Detection

Yihe Wang, Nan Huang, Nadia Mammone, Marco Cecchi, Xiang Zhang

Main category: cs.LG

TL;DR: LEAD is the first large-scale foundation model for EEG-based Alzheimer’s disease detection, achieving 90.91% sensitivity on subject-level detection using a novel transformer architecture trained on the world’s largest EEG-AD corpus.

Details

Motivation: Existing EEG-based AD detection methods face two major challenges: lack of large-scale datasets for robust learning and absence of dedicated pipelines for clinically meaningful subject-level detection (vs sample-level).

Method: Proposed LEAD framework includes: 1) comprehensive preprocessing with multi-scale segmentation, 2) subject-regularized spatio-temporal transformer with novel subject-level cross-entropy loss and indices group-shuffling, 3) AD-guided contrastive pre-training. Pre-trained on 12 datasets and fine-tuned on 4 AD datasets.

Result: LEAD consistently outperforms 10 baselines in subject-level detection under subject-independent cross-validation. Achieves 90.91% subject-level sensitivity on ADFTD dataset under leave-one-subject-out setting.

Conclusion: The method effectively addresses real-world EEG-based AD detection challenges, validating the approach through superior performance on the largest EEG-AD corpus to date.

Abstract: Electroencephalography (EEG) provides a non-invasive, highly accessible, and cost-effective approach for detecting Alzheimer’s disease (AD). However, existing methods, whether based on handcrafted feature engineering or standard deep learning, face two major challenges: 1) the lack of large-scale EEG-AD datasets for robust representation learning, and 2) the absence of a dedicated deep learning pipeline for subject-level detection, which is more clinically meaningful than the commonly used sample-level detection. To address these gaps, we have curated the world’s largest EEG-AD corpus to date, comprising 2,255 subjects. Leveraging this unique data corpus, we propose LEAD, the first large-scale foundation model for EEG analysis in dementia. Our approach provides an innovative framework for subject-level AD detection, including: 1) a comprehensive preprocessing pipeline such as artifact removal, resampling, and filtering, and a newly proposed multi-scale segmentation strategy, 2) a subject-regularized spatio-temporal transformer trained with a novel subject-level cross-entropy loss and an indices group-shuffling algorithm, and 3) AD-guided contrastive pre-training. We pre-train on 12 datasets (3 AD-related and 9 non-AD) and fine-tune/test on 4 AD datasets. Compared with 10 baselines, LEAD consistently obtains superior subject-level detection performance under the challenging subject-independent cross-validation protocol. On the benchmark ADFTD dataset, our model achieves an impressive subject-level Sensitivity of 90.91% under the leave-one-subject-out (LOSO) setting. These results strongly validate the effectiveness of our method for real-world EEG-based AD detection. Source code: https://github.com/DL4mHealth/LEAD

[1433] Efficient Federated Learning against Byzantine Attacks and Data Heterogeneity via Aggregating Normalized Gradients

Shiyuan Zuo, Xingrun Yan, Rongfei Fan, Li Shen, Puning Zhao, Jie Xu, Han Hu

Main category: cs.LG

TL;DR: Fed-NGA is a simple federated learning algorithm that uses normalized gradients for aggregation, achieving Byzantine robustness and handling data heterogeneity with low computational complexity O(pM).

Details

Motivation: Existing Byzantine-robust FL methods suffer from high computational overhead during gradient aggregation, slowing down training despite handling data heterogeneity.

Method: Fed-NGA performs aggregation by computing weighted mean of normalized gradients from each client, providing O(pM) time complexity.

Result: Fed-NGA achieves convergence to stationary points for non-convex functions, with zero optimality gap under mild conditions, and shows superior time efficiency in experiments.

Conclusion: Fed-NGA is an effective Byzantine-robust FL method that handles data heterogeneity with low computational overhead and strong convergence guarantees.

Abstract: Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but is vulnerable to Byzantine attacks and data heterogeneity, which can severely degrade performance. Existing Byzantine-robust approaches tackle data heterogeneity, but incur high computational overhead during gradient aggregation, thereby slowing down the training process. To address this issue, we propose a simple yet effective Federated Normalized Gradients Algorithm (Fed-NGA), which performs aggregation by merely computing the weighted mean of the normalized gradients from each client. This approach yields a favorable time complexity of $\mathcal{O}(pM)$, where $p$ is the model dimension and $M$ is the number of clients. We rigorously prove that Fed-NGA is robust to both Byzantine faults and data heterogeneity. For non-convex loss functions, Fed-NGA achieves convergence to a neighborhood of stationary points under general assumptions, and further attains zero optimality gap under some mild conditions, which is an outcome rarely achieved in existing literature. In both cases, the convergence rate is $\mathcal{O}(1/T^{\frac{1}{2} - \delta})$, where $T$ denotes the number of iterations and $\delta \in (0, 1/2)$. Experimental results on benchmark datasets confirm the superior time efficiency and convergence performance of Fed-NGA over existing methods.

[1434] Sparse Covariance Neural Networks

Andrea Cavallo, Zhan Gao, Elvin Isufi

Main category: cs.LG

TL;DR: Sparse coVariance Neural Networks (S-VNNs) improve VNNs by applying sparsification techniques to sample covariance matrices, enhancing performance, computational efficiency, and stability against finite-sample estimation errors.

Details

Motivation: Standard VNNs suffer from degraded performance and computational efficiency due to spurious correlations in empirical covariance matrices, creating a mismatch with actual covariance structures.

Method: Proposed S-VNNs apply sparsification techniques: hard/soft thresholding for sparse true covariance matrices, and stochastic sparsification (dropping correlations probabilistically) for dense covariance matrices.

Result: S-VNNs show improved task performance, enhanced stability to finite-sample covariance estimations, reduced computational time, and outperform alternatives across brain data, human action recognition, and other applications.

Conclusion: S-VNNs effectively address covariance estimation issues in VNNs through principled sparsification, achieving better performance, efficiency, and stability while being adaptable to both sparse and dense covariance structures.

Abstract: Covariance Neural Networks (VNNs) perform graph convolutions on the covariance matrix of input data to leverage correlation information as pairwise connections. They have achieved success in a multitude of applications such as neuroscience, financial forecasting, and sensor networks. However, the empirical covariance matrix on which VNNs operate typically contains spurious correlations, creating a mismatch with the actual covariance matrix that degrades VNNs’ performance and computational efficiency. To tackle this issue, we put forth Sparse coVariance Neural Networks (S-VNNs), a framework that applies sparsification techniques on the sample covariance matrix and incorporates the latter into the VNN architecture. We investigate the S-VNN when the underlying data covariance matrix is both sparse and dense. When the true covariance matrix is sparse, we propose hard and soft thresholding to improve the covariance estimation and reduce the computational cost. Instead, when the true covariance is dense, we propose a stochastic sparsification where data correlations are dropped in probability according to principled strategies. Besides performance and computation improvements, we show that S-VNNs are more stable to finite-sample covariance estimations than nominal VNNs and the analogous sparse principal component analysis. By analyzing the impact of sparsification on their behavior, we tie the S-VNN stability to the data distribution and sparsification approach. We support our theoretical findings with experimental results on a variety of application scenarios, ranging from brain data to human action recognition, and show an improved task performance, improved stability, and reduced computational time compared to alternatives.

[1435] Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds

Fan Wang, Pengtao Shao, Yiming Zhang, Bo Yu, Shaoshan Liu, Ning Ding, Yang Cao, Yu Kang, Haifeng Wang

Main category: cs.LG

TL;DR: Proposes AnyMDP for scalable procedurally generated tabular MDPs to address the lack of scalable task collections in In-Context Reinforcement Learning, enabling large-scale meta-training and generalization to unseen tasks.

Details

Motivation: Address the major challenge in scaling up In-Context Reinforcement Learning (ICRL) - the lack of scalable task collections that can support large-scale meta-training.

Method: Introduces AnyMDP for procedurally generated tabular MDPs with careful randomization to reduce structural biases, plus decoupled policy distillation and prior information induction in the ICRL framework.

Result: With large-scale AnyMDP tasks, the model can generalize to unseen tasks through versatile in-context learning paradigms. Enables empirical investigation of data distribution-ICRL performance relationship.

Conclusion: ICRL generalization comes at the cost of increased task diversity and longer adaptation periods, highlighting the need for diverse task design and prioritizing asymptotic performance over few-shot adaptation.

Abstract: In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.

[1436] DeepONet for Solving Nonlinear Partial Differential Equations with Physics-Informed Training

Yahong Yang

Main category: cs.LG

TL;DR: DeepONet operator learning for nonlinear PDEs enables generalization across different PDEs without retraining, with complex branch networks providing performance gains and simple trunk networks being most effective.

Details

Motivation: To investigate operator learning (DeepONet) for solving nonlinear PDEs, enabling generalization across different PDEs without retraining, unlike conventional methods that require separate networks for each PDE.

Method: Physics-informed training of DeepONet, examining approximation capabilities of branch and trunk networks, and analyzing generalization error in Sobolev norms using Rademacher complexity and pseudo-dimension analysis.

Result: Complex branch networks provide substantial performance gains, while trunk networks are most effective when kept relatively simple. A bound on generalization error for nonlinear PDEs is derived.

Conclusion: This work bridges a critical theoretical gap by providing rigorous error estimates for physics-informed machine learning models and applications.

Abstract: In this paper, we investigate the applications of operator learning, specifically DeepONet, for solving nonlinear partial differential equations (PDEs). Unlike conventional function learning methods that require training separate neural networks for each PDE, operator learning enables generalization across different PDEs without retraining. This study examines the performance of DeepONet in physics-informed training, focusing on two key aspects: (1) the approximation capabilities of deep branch and trunk networks, and (2) the generalization error in Sobolev norms. Our results show that complex branch networks provide substantial performance gains, while trunk networks are most effective when kept relatively simple. Furthermore, we derive a bound on the generalization error of DeepONet for solving nonlinear PDEs by analyzing the Rademacher complexity of its derivatives in terms of pseudo-dimension. This work bridges a critical theoretical gap by delivering rigorous error estimates. This paper fills a theoretical gap by providing error estimates for a wide range of physics-informed machine learning models and applications.

[1437] Functional-level Uncertainty Quantification for Calibrated Fine-tuning on LLMs

Ruijia Niu, Dongxia Wu, Rose Yu, Yi-An Ma

Main category: cs.LG

TL;DR: UQ4CT is a method that captures and calibrates uncertainty in fine-tuned LLMs by modeling the functional space of input-output mappings during fine-tuning, achieving significant calibration improvements while maintaining accuracy.

Details

Motivation: Fine-tuned LLMs often exhibit overconfidence in uncertain predictions due to limited generalization with sparse data, and existing PEFT uncertainty methods fail to address the core issue of limited adapter specialization for task-specific relationships.

Method: UQ4CT implements functional-level uncertainty quantification during fine-tuning via a mixture-of-experts framework that hierarchically decomposes the functional space of input-output mappings.

Result: Empirical results show over 25% reduction in Expected Calibration Error (ECE) while preserving high accuracy across five benchmarks, with maintained superior ECE performance and high accuracy even under distribution shift.

Conclusion: UQ4CT effectively addresses the core limitation of PEFT adapters by capturing task-specific functional relationships during fine-tuning, leading to improved uncertainty quantification and generalizability.

Abstract: Accurate uncertainty quantification in large language models (LLMs) is essential for providing credible confidence estimates over their outputs. However, fine-tuned LLMs often exhibit overconfidence in uncertain predictions, which stems from their limited ability to generalize with sparse data. Existing parameter efficient fine-tuning (PEFT) uncertainty quantification methods for LLMs focus on post fine-tuning stage, and thus fail to address the core issue: limited specialization of PEFT adapters to accurately capture task-specific input-output relationships. To address these limitations, we propose Functional-Level Uncertainty Quantification for Calibrated Fine-Tuning (UQ4CT), which captures and calibrates uncertainty over the space of functions that map input prompts to outputs. We implement UQ4CT during the fine-tuning stage via a mixture-of-experts framework that hierarchically decomposes the functional space. Empirically, UQ4CT achieves over $25%$ reduction in Expected Calibration Error (ECE) while preserving high accuracy across five benchmarks. Even under distribution shift, UQ4CT maintains superior ECE performance with high accuracy, showcasing improved generalizability.

[1438] Comprehensive Review of Neural Differential Equations for Time Series Analysis

YongKyung Oh, Seungsu Kam, Jonghun Lee, Dong-Young Lim, Sungil Kim, Alex Bui

Main category: cs.LG

TL;DR: A comprehensive review of Neural Differential Equations (NDEs) for time series analysis, covering neural ODEs, controlled differential equations, and stochastic differential equations, highlighting their advantages over traditional methods for continuous-time dynamics.

Details

Motivation: Conventional time series methods like RNNs and Transformers struggle with continuous dynamics and irregular sampling patterns in real-world data, creating a need for approaches that can handle these challenges.

Method: Survey and analysis of NDE-based methods including mathematical formulations, numerical methods, and applications of neural ODEs, neural controlled differential equations, and neural stochastic differential equations.

Result: NDEs provide a paradigm shift by combining neural network flexibility with differential equation rigor, enabling effective modeling of continuous-time dynamics in irregularly sampled time series.

Conclusion: NDEs serve as a foundation for advanced time series analysis, with identified challenges and future research directions for further development in this field.

Abstract: Time series modeling and analysis have become critical in various domains. Conventional methods such as RNNs and Transformers, while effective for discrete-time and regularly sampled data, face significant challenges in capturing the continuous dynamics and irregular sampling patterns inherent in real-world scenarios. Neural Differential Equations (NDEs) represent a paradigm shift by combining the flexibility of neural networks with the mathematical rigor of differential equations. This paper presents a comprehensive review of NDE-based methods for time series analysis, including neural ordinary differential equations, neural controlled differential equations, and neural stochastic differential equations. We provide a detailed discussion of their mathematical formulations, numerical methods, and applications, highlighting their ability to model continuous-time dynamics. Furthermore, we address key challenges and future research directions. This survey serves as a foundation for researchers and practitioners seeking to leverage NDEs for advanced time series analysis.

[1439] Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning

Guangming Huang, Yunfei Long, Cunjin Luo

Main category: cs.LG

TL;DR: Proposes a novel Similarity-Dissimilarity Loss for multi-label supervised contrastive learning that dynamically re-weights samples and provides theoretical proofs, achieving state-of-the-art performance.

Details

Motivation: Addresses the challenge of defining positive samples and contrastive loss functions in multi-label supervised contrastive learning, where multi-label relations are not fully defined.

Method: Systematically formulates multi-label relations, proposes Similarity-Dissimilarity Loss with dynamic re-weighting based on similarity and dissimilarity factors, and provides theoretical proofs.

Result: Consistently outperforms baselines in comprehensive evaluations across image, text, and medical domains, achieving state-of-the-art performance on MIMIC-III-Full.

Conclusion: The method effectively addresses multi-label contrastive learning challenges and provides a unified paradigm for both single-label and multi-label scenarios.

Abstract: Supervised contrastive learning has achieved remarkable success by leveraging label information; however, determining positive samples in multi-label scenarios remains a critical challenge. In multi-label supervised contrastive learning (MSCL), multi-label relations are not yet fully defined, leading to ambiguity in identifying positive samples and formulating contrastive loss functions to construct the representation space. To address these challenges, we: (i) systematically formulate multi-label relations in MSCL, (ii) propose a novel Similarity-Dissimilarity Loss, which dynamically re-weights samples based on similarity and dissimilarity factors, (iii) further provide theoretical grounded proofs for our method through rigorous mathematical analysis that supports the formulation and effectiveness, and (iv) offer a unified form and paradigm for both single-label and multi-label supervised contrastive loss. We conduct experiments on both image and text modalities and further extend the evaluation to the medical domain. The results show that our method consistently outperforms baselines in comprehensive evaluations, demonstrating its effectiveness and robustness. Moreover, the proposed approach achieves state-of-the-art performance on MIMIC-III-Full.

[1440] Collaborative Deterministic-Probabilistic Forecasting for Diverse Spatiotemporal Systems

Zhi Sheng, Yuan Yuan, Yudi Zhang, Jingtao Ding, Yong Li

Main category: cs.LG

TL;DR: CoST is a collaborative forecasting framework that combines deterministic and diffusion models for spatiotemporal systems, using mean-residual decomposition to improve accuracy and efficiency while handling spatial heterogeneity.

Details

Motivation: Probabilistic forecasting is crucial for risk-aware decision-making in spatiotemporal systems, but existing diffusion models face challenges with complex dynamics and high computational demands.

Method: CoST uses mean-residual decomposition: a deterministic model captures conditional mean, while a lightweight diffusion model learns residual uncertainties. It includes a scale-aware diffusion mechanism to handle spatial heterogeneity.

Result: Extensive experiments across 10 real-world datasets show CoST achieves 25% performance gains over state-of-the-art baselines while significantly reducing computational cost.

Conclusion: CoST provides an effective and efficient framework for probabilistic spatiotemporal forecasting that generalizes well across diverse systems.

Abstract: Probabilistic forecasting is crucial for real-world spatiotemporal systems, such as climate, energy, and urban environments, where quantifying uncertainty is essential for informed, risk-aware decision-making. While diffusion models have shown promise in capturing complex data distributions, their application to spatiotemporal forecasting remains limited due to complex spatiotemporal dynamics and high computational demands. we propose CoST, a general forecasting framework that collaborates deterministic and diffusion models for diverse spatiotemporal systems. CoST formulates a mean-residual decomposition strategy: it leverages a powerful deterministic model to capture the conditional mean and a lightweight diffusion model to learn residual uncertainties. This collaborative formulation simplifies learning objectives, improves accuracy and efficiency, and generalizes across diverse spatiotemporal systems. To address spatial heterogeneity, we further design a scale-aware diffusion mechanism to guide the diffusion process. Extensive experiments across ten real-world datasets from climate, energy, communication, and urban systems show that CoST achieves 25% performance gains over state-of-the-art baselines, while significantly reducing computational cost.

[1441] A Predictive Approach To Enhance Time-Series Forecasting

Skye Gunasekaran, Assel Kembay, Hugo Ladret, Rui-Jie Zhu, Laurent Perrinet, Omid Kavehei, Jason Eshraghian

Main category: cs.LG

TL;DR: Future-Guided Learning improves time-series forecasting using a dual-model approach with predictive feedback, achieving significant performance gains in seizure prediction and nonlinear systems forecasting.

Details

Motivation: Deep learning models struggle with long-term dependencies and distribution shifts in time-series data, requiring better methods to capture temporal patterns and adapt to changing data distributions.

Method: Uses two models: a detection model that analyzes future data to identify critical events, and a forecasting model that predicts events from current data. Implements dynamic feedback where discrepancies between models trigger significant updates to the forecasting model, minimizing surprise through predictive coding principles.

Result: Achieved 44.8% increase in AUC-ROC for seizure prediction using EEG data, and 23.4% reduction in MSE for forecasting in nonlinear dynamical systems (excluding outliers).

Conclusion: Future-Guided Learning with predictive feedback mechanism significantly advances deep learning applications for time-series forecasting by enabling dynamic parameter adjustment and better handling of temporal dependencies.

Abstract: Accurate time-series forecasting is crucial in various scientific and industrial domains, yet deep learning models often struggle to capture long-term dependencies and adapt to data distribution shifts over time. We introduce Future-Guided Learning, an approach that enhances time-series event forecasting through a dynamic feedback mechanism inspired by predictive coding. Our method involves two models: a detection model that analyzes future data to identify critical events and a forecasting model that predicts these events based on current data. When discrepancies occur between the forecasting and detection models, a more significant update is applied to the forecasting model, effectively minimizing surprise, allowing the forecasting model to dynamically adjust its parameters. We validate our approach on a variety of tasks, demonstrating a 44.8% increase in AUC-ROC for seizure prediction using EEG data, and a 23.4% reduction in MSE for forecasting in nonlinear dynamical systems (outlier excluded).By incorporating a predictive feedback mechanism, Future-Guided Learning advances how deep learning is applied to time-series forecasting.

[1442] Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization

Tian Qin, Naomi Saphra, David Alvarez-Melis

Main category: cs.LG

TL;DR: Language models transition from n-gram to hierarchical generalization when trained on complex data with center-embedded clauses, and learn stable rules when data is diverse with many distinct syntax trees.

Details

Motivation: To understand how language models shift from n-gram-like behavior to hierarchical syntactic generalization during training, and identify the specific data characteristics that drive this transition.

Method: Used controlled grammar-learning tasks (question formation and tense inflection) with varying data complexity (presence of center-embedded clauses) and diversity (number of distinct syntax trees) to study model learning dynamics.

Result: Complex data drives hierarchical rule learning, diverse data promotes stable rule learning, while intermediate complexity/diversity creates unstable oscillatory learning with inconsistent behaviors across random seeds.

Conclusion: Training data characteristics (complexity and diversity) fundamentally shape how language models generalize, explaining why different training strategies can lead to unstable outcomes and highlighting the importance of data design for stable hierarchical generalization.

Abstract: Early in training, LMs can behave like n-gram models, but eventually they often learn tree-based syntactic rules and generalize hierarchically out of distribution (OOD). We study this shift using controlled grammar-learning tasks: question formation and tense inflection. We find that a model learns to generalize hierarchically if its training data is complex-in particular, if it includes center-embedded clauses, a special syntactic structure. Under this definition, complex data drives hierarchical rules, while less complex data encourages shortcut learning in the form of n-gram-like linear rules. Furthermore, we find that a model uses rules to generalize, whether hierarchical or linear, if its training data is diverse-in particular, if it includes many distinct syntax trees in the training set. Under this definition, diverse data promotes stable rule learning, whereas less diverse data promotes memorization of individual syntactic sequences. Finally, intermediate diversity and intermediate complexity form an unstable regime, which is characterized by oscillatory learning dynamics and inconsistent behaviors across random seeds. These results highlight the central role of training data in shaping generalization and explain why competing strategies can lead to unstable outcomes.

[1443] Benchmarking Computational Methods for Emerging Drug-Drug Interaction Prediction

Zhenqian Shen, Mingyang Zhou, Yongqi Zhang, Quanming Yao

Main category: cs.LG

TL;DR: DDI-Ben is a benchmarking framework for emerging drug-drug interaction prediction that addresses distribution changes between known and new drugs, showing most existing methods suffer performance degradation under such changes.

Details

Motivation: Emerging DDI prediction is crucial for new drugs but hindered by distribution changes between known and new drugs in real-world scenarios, with current evaluation often neglecting these changes.

Method: Proposes DDI-Ben framework with distribution change simulation that uses distribution changes between drug sets as surrogate for real-world DDI distribution changes, compatible with various drug split strategies.

Result: Benchmarking on ten representative methods shows most approaches suffer substantial performance degradation under distribution changes. LLM-based methods and integration of drug-related textual information show promising robustness.

Conclusion: DDI-Ben highlights the importance of explicitly addressing distribution changes and provides foundation for developing more resilient methods for emerging DDI prediction, with benchmark datasets released for future research.

Abstract: Motivation: Emerging drug-drug interaction (DDI) prediction is crucial for new drugs but is hindered by distribution changes between known and new drugs in real-world scenarios. Current evaluation often neglects these changes, relying on unrealistic i.i.d. split due to the absence of drug approval data. Results: We propose DDI-Ben, a benchmarking framework for emerging DDI prediction under distribution changes. DDI-Ben introduces a distribution change simulation framework that leverages distribution changes between drug sets as a surrogate for real-world distribution changes of DDIs, and is compatible with various drug split strategies. Through extensive benchmarking on ten representative methods, we show that most existing approaches suffer substantial performance degradation under distribution changes. Our analysis further indicates that large language model (LLM) based methods and the integration of drug-related textual information offer promising robustness against such degradation. To support future research, we release the benchmark datasets with simulated distribution changes. Overall, DDI-Ben highlights the importance of explicitly addressing distribution changes and provides a foundation for developing more resilient methods for emerging DDI prediction. Availability and implementation: Our code and data are available at https://github.com/LARS-research/DDI-Bench.

[1444] vCache: Verified Semantic Prompt Caching

Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu, Mark Zhao, Stephan Krusche, Alfons Kemper, Ion Stoica, Matei Zaharia, Joseph E. Gonzalez

Main category: cs.LG

TL;DR: vCache is a verified semantic cache system that provides user-defined error rate guarantees by dynamically learning optimal similarity thresholds for each cached prompt, outperforming static-threshold approaches.

Details

Motivation: Existing semantic caches use static similarity thresholds which lack formal correctness guarantees, lead to unexpected error rates, and result in suboptimal cache hit rates.

Method: vCache employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training.

Result: Experiments show vCache consistently meets specified error bounds while outperforming state-of-the-art static-threshold and fine-tuned embedding baselines.

Conclusion: vCache provides the first verified semantic cache with formal error rate guarantees and releases implementation with three benchmarks for future research.

Abstract: Semantic caches return cached responses for semantically similar prompts to reduce LLM inference latency and cost. They embed cached prompts and store them alongside their response in a vector database. Embedding similarity metrics assign a numerical score to quantify the similarity between a request and its nearest neighbor prompt from the cache. Existing systems use the same static similarity threshold across all requests to determine whether two prompts can share similar responses. However, we observe that static thresholds do not give formal correctness guarantees, can result in unexpected error rates, and lead to suboptimal cache hit rates. This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees. It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training. Our experiments show that vCache consistently meets the specified error bounds while outperforming state-of-the-art static-threshold and fine-tuned embedding baselines. We release the vCache implementation and three benchmarks to support future research.

[1445] Haar-Laplacian for directed graphs

Theodor-Adrian Badea, Bogdan Dumitrescu

Main category: cs.LG

TL;DR: Introduces a novel Hermitian Laplacian matrix for directed graphs inspired by Haar-like transformation, enabling spectral convolutional networks and graph signal processing with better performance in weight prediction and denoising.

Details

Motivation: To enable spectral convolutional networks and extend signal processing applications for directed graphs, as existing methods may not preserve direction and weight information effectively.

Method: Proposes a Haar-inspired Hermitian Laplacian matrix that preserves direction and weight information, then builds HaarNet spectral graph convolutional network and applies it to graph signal processing tasks.

Result: The approach shows better results in weight prediction and denoising on directed graphs compared to existing methods.

Conclusion: The novel Haar-Laplacian matrix successfully enables spectral methods for directed graphs with desirable properties and improved performance in practical applications.

Abstract: This paper introduces a novel Laplacian matrix aiming to enable the construction of spectral convolutional networks and to extend the signal processing applications for directed graphs. Our proposal is inspired by a Haar-like transformation and produces a Hermitian matrix which is not only in one-to-one relation with the adjacency matrix, preserving both direction and weight information, but also enjoys desirable additional properties like scaling robustness, sensitivity, continuity, and directionality. We take a theoretical standpoint and support the conformity of our approach with the spectral graph theory. Then, we address two use-cases: graph learning (by introducing HaarNet, a spectral graph convolutional network built with our Haar-Laplacian) and graph signal processing. We show that our approach gives better results in applications like weight prediction and denoising on directed graphs.

[1446] SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Bohan Lyu, Siqiao Huang, Zichen Liang, Qi-An Sun, Jiaming Zhang

Main category: cs.LG

TL;DR: SURGE is a comprehensive benchmark for evaluating whether large language models (LLMs) can serve as surrogate models for code execution prediction, covering 1160 problems across 8 key programming aspects.

Details

Motivation: To systematically investigate whether LLMs can serve as surrogate models for code execution prediction, an important but underexplored question despite LLMs' demonstrated capabilities in code-related tasks.

Method: Introduced SURGE benchmark with 1160 problems covering 8 key aspects: multi-language programming, competition-level problems, repository-level analysis, scientific computing, time-complexity algorithms, buggy code analysis, compiler/environment-dependent programs, and mathematical proof verification. Evaluated 21 open-source and proprietary LLMs.

Result: Through extensive analysis, the study reveals important insights about the feasibility of LLMs as efficient surrogates for computational processes, examining scaling laws, data efficiency, and predictive accuracy.

Conclusion: LLMs show potential as surrogate models for code execution prediction, with the SURGE benchmark providing a comprehensive framework for systematic evaluation across diverse programming scenarios.

Abstract: Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To systematically investigate it, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at https://github.com/Imbernoulli/SURGE.

[1447] Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures

Adrien Bolland, Gaspard Lambrechts, Damien Ernst

Main category: cs.LG

TL;DR: The paper proposes a novel maximum entropy RL approach using relative entropy of future state-action distributions as intrinsic rewards, which maximizes a lower bound on Q-values and enables efficient off-policy learning.

Details

Motivation: To improve exploration in reinforcement learning by integrating intrinsic rewards based on the entropy of future state-action distributions, providing better state-action space coverage and performance.

Method: Uses relative entropy of discounted future state-action distributions as intrinsic rewards, which can be learned off-policy using existing algorithms due to the distribution being a fixed point of a contraction operator.

Result: The proposed approach achieves good state-action space coverage and high-performance control in reinforcement learning tasks.

Conclusion: The relative entropy-based intrinsic reward framework effectively balances exploration and exploitation, leading to improved policy learning and performance in reinforcement learning.

Abstract: Maximum entropy reinforcement learning integrates exploration into policy learning by providing additional intrinsic rewards proportional to the entropy of some distribution. In this paper, we propose a novel approach in which the intrinsic reward function is the relative entropy of the discounted distribution of states and actions (or features derived from these states and actions) visited during future time steps. This approach is motivated by two results. First, a policy maximizing the expected discounted sum of intrinsic rewards also maximizes a lower bound on the state-action value function of the decision process. Second, the distribution used in the intrinsic reward definition is the fixed point of a contraction operator. Existing algorithms can therefore be adapted to learn this fixed point off-policy and to compute the intrinsic rewards. We finally introduce an algorithm maximizing our new objective, and we show that resulting policies have good state-action space coverage and achieve high-performance control.

[1448] What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

Main category: cs.LG

TL;DR: Reward model accuracy alone is insufficient for effective RLHF; reward variance is crucial for optimization efficiency, as low variance creates flat objective landscapes that slow learning.

Details

Motivation: Current RLHF evaluation focuses primarily on reward model accuracy, but it's unclear if this fully captures what makes a reward model effective for guiding language model optimization.

Method: Analyzed RLHF from an optimization perspective, proving mathematically that low reward variance creates flat objective landscapes regardless of accuracy. Conducted experiments with models up to 8B parameters to validate theory.

Result: Perfectly accurate reward models can lead to extremely slow optimization if they induce low reward variance, while less accurate models with higher variance can optimize faster. Reward models effective for one language model may perform poorly for another due to variance differences.

Conclusion: Reward models must be evaluated beyond just accuracy - they need to induce sufficient variance for efficient optimization. The interplay between reward variance, accuracy, and maximization rate is crucial for effective RLHF.

Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient~optimization.

[1449] Euclidean Fast Attention – Machine Learning Global Atomic Representations at Linear Cost

J. Thorben Frank, Stefan Chmiela, Klaus-Robert Müller, Oliver T. Unke

Main category: cs.LG

TL;DR: EFA is a linear-scaling attention mechanism for Euclidean data that captures long-range correlations efficiently, addressing quadratic complexity limitations of self-attention in computational chemistry applications.

Details

Motivation: Self-attention's quadratic complexity limits its practical use in computational chemistry, especially for machine learning force fields (MLFFs) that require efficient modeling of long-range interactions.

Method: Introduces Euclidean fast attention (EFA) with novel Euclidean rotary positional encodings (ERoPE) that encode spatial information while respecting physical symmetries, enabling linear-scaling attention for Euclidean data.

Result: EFA effectively captures diverse long-range effects and enables MLFFs to describe challenging chemical interactions that conventional MLFFs fail to model correctly.

Conclusion: EFA provides an efficient solution for capturing long-range correlations in Euclidean data, overcoming computational limitations of standard attention mechanisms in computational chemistry applications.

Abstract: Long-range correlations are essential across numerous machine learning tasks, especially for data embedded in Euclidean space, where the relative positions and orientations of distant components are often critical for accurate predictions. Self-attention offers a compelling mechanism for capturing these global effects, but its quadratic complexity presents a significant practical limitation. This problem is particularly pronounced in computational chemistry, where the stringent efficiency requirements of machine learning force fields (MLFFs) often preclude accurately modeling long-range interactions. To address this, we introduce Euclidean fast attention (EFA), a linear-scaling attention-like mechanism designed for Euclidean data, which can be easily incorporated into existing model architectures. A core component of EFA are novel Euclidean rotary positional encodings (ERoPE), which enable efficient encoding of spatial information while respecting essential physical symmetries. We empirically demonstrate that EFA effectively captures diverse long-range effects, enabling EFA-equipped MLFFs to describe challenging chemical interactions for which conventional MLFFs yield incorrect results.

[1450] Reasoning to Learn from Latent Thoughts

Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto

Main category: cs.LG

TL;DR: The paper proposes modeling latent thoughts underlying text generation to improve data efficiency in language model pretraining, showing significant gains through synthetic data and bootstrapping methods.

Details

Motivation: As compute scaling outpaces human-written text growth, data is becoming the bottleneck for language model scaling. The authors aim to address this data-constrained regime by leveraging the latent thoughts that underlie text generation.

Method: The approach infers latent thoughts from text, viewing web text as compressed outcomes of verbose human thought processes. Uses synthetic data generation and an EM algorithm where the LM bootstraps its own performance by iteratively improving model capability and thought-augmented pretraining data quality.

Result: Experiments show synthetic data approaches significantly improve data efficiency over raw data training. A 1B LM successfully bootstraps performance across multiple EM iterations, outperforming raw data baselines with increasing gains from additional inference compute.

Conclusion: Latent thought inference provides new opportunities for scaling data-constrained pretraining, with gains from both inference scaling and EM iterations suggesting a promising direction for overcoming data bottlenecks.

Abstract: Compute scaling for language model (LM) pretraining has outpaced the growth of human-written texts, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the \emph{latent thoughts} that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process and that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency over training on the same amount of raw data. Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM \emph{bootstraps its own performance} by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.

[1451] Learning Randomized Reductions

Ferhat Erata, Orr Paradise, Thanos Typaldos, Timos Antonopoulos, ThanhVu Nguyen, Shafi Goldwasser, Ruzica Piskac

Main category: cs.LG

TL;DR: Bitween is an automated method for learning randomized self-reductions (RSRs) for mathematical functions, with two versions: vanilla Bitween using linear regression outperforms existing symbolic methods, and Agentic Bitween uses LLMs to dynamically discover novel query functions for RSR discovery.

Details

Motivation: Manual derivation of randomized self-reductions (RSRs) by experts is time-consuming and limited. There's a need for automated methods to discover RSRs, which enable self-correction capabilities for functions and have applications in complexity theory and cryptography.

Method: Two approaches: 1) Vanilla Bitween uses linear regression learning framework to discover RSRs from correlated samples. 2) Agentic Bitween employs neuro-symbolic approach where large language models dynamically discover novel query functions, using vanilla Bitween for inference and verification.

Result: On RSR-Bench (80 scientific and ML functions), vanilla Bitween surpasses existing symbolic methods (genetic algorithms, symbolic regression, MILP). Agentic Bitween discovers new RSR properties using frontier models to uncover query functions beyond fixed ones like x+r, x-r, x·r.

Conclusion: Bitween provides an effective automated approach for learning randomized self-reductions, with vanilla version outperforming existing methods and Agentic version enabling discovery of novel RSR properties through dynamic query function generation.

Abstract: A self-corrector for a function $f$ takes a black-box oracle computing $f$ that is correct on most inputs and turns it into one that is correct on every input with high probability. Self-correctors exist for any function that is randomly self-reducible (RSR), where the value $f$ at a given point $x$ can be recovered by computing $f$ on random correlated points. While RSRs enable powerful self-correction capabilities and have applications in complexity theory and cryptography, their discovery has traditionally required manual derivation by experts. We present Bitween, a method and tool for automated learning of randomized self-reductions for mathematical functions. We make two key contributions: First, we demonstrate that our learning framework based on linear regression outperforms sophisticated methods including genetic algorithms, symbolic regression, and mixed-integer linear programming for discovering RSRs from correlated samples. Second, we introduce Agentic Bitween, a neuro-symbolic approach where large language models dynamically discover novel query functions for RSR property discovery, leveraging vanilla Bitween as a tool for inference and verification, moving beyond the fixed query functions ($x+r$, $x-r$, $x \cdot r$, $x$, $r$) previously used in the literature. On RSR-Bench, our benchmark suite of 80 scientific and machine learning functions, vanilla Bitween surpasses existing symbolic methods, while Agentic Bitween discovers new RSR properties using frontier models to uncover query functions.

[1452] Toward Model-centric Heterogeneous Federated Graph Learning: A Knowledge-driven Approach

Zhengyu Wu, Guang Zeng, Huilin Lai, Daohan Su, Jishuo Jia, Yinlin Zhu, Xunkai Li, Rong-Hua Li, Guoren Wang, Chenghu Zhou

Main category: cs.LG

TL;DR: FedGKC addresses model-centric heterogeneous federated graph learning by enabling knowledge sharing between clients with different model architectures through self-mutual knowledge distillation and server-side knowledge-aware aggregation.

Details

Motivation: Existing federated graph learning methods overlook the model-centric heterogeneous problem where clients have varying model architectures and scales, which hampers effective collaboration and representation learning.

Method: Proposes FedGKC framework with two components: Client-side Self-Mutual Knowledge Distillation using copilot models for knowledge sharing, and Server-side Knowledge-Aware Model Aggregation for enhanced model integration.

Result: Achieves average accuracy improvement of 3.74% over baseline models on eight benchmark datasets in heterogeneous settings, while maintaining excellent performance in homogeneous scenarios.

Conclusion: FedGKC effectively addresses the model-centric heterogeneous federated graph learning problem and enables successful knowledge collaboration across diverse client architectures.

Abstract: Federated graph learning (FGL) has emerged as a promising paradigm for collaborative machine learning, enabling multiple parties to jointly train models while preserving the privacy of raw graph data. However, existing FGL methods often overlook the model-centric heterogeneous FGL (MHtFGL) problem, which arises in real-world applications, such as the aggregation of models from different companies with varying scales and architectures. MHtFGL presents an additional challenge: the diversity of client model architectures hampers common learning and integration of graph representations. To address this issue, we propose the Federated Graph Knowledge Collaboration (FedGKC) framework, comprising two key components: Client-side Self-Mutual Knowledge Distillation, which fosters effective knowledge sharing among clients through copilot models; and Server-side Knowledge-Aware Model Aggregation, which enhances model integration by accounting for the knowledge acquired by clients. Experiments on eight benchmark datasets demonstrate that FedGKC achieves an average accuracy improvement of 3.74% over baseline models in MHtFGL scenarios, while also maintaining excellent performance in homogeneous settings.

[1453] Visual Planning: Let’s Think Only with Images

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić

Main category: cs.LG

TL;DR: Visual Planning paradigm uses purely visual representations for reasoning in spatial tasks, outperforming text-only reasoning methods through reinforcement learning.

Details

Motivation: Language may not be the most effective modality for reasoning in tasks involving spatial and geometrical information, particularly for 'vision-first' tasks where visual representations could be more natural.

Method: Proposed Visual Planning via Reinforcement Learning (VPRL) framework using GRPO for post-training large vision models, enabling planning through sequences of images that encode step-by-step inference.

Result: Substantial improvements in planning performance across visual navigation tasks (FrozenLake, Maze, and MiniBehavior), outperforming all text-only reasoning variants.

Conclusion: Visual Planning establishes itself as a viable and promising supplement to language-based reasoning, opening new avenues for tasks benefiting from intuitive, image-based inference.

Abstract: Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these “vision-first” tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.

[1454] On The Sample Complexity Bounds In Bilevel Reinforcement Learning

Mudit Gaur, Utsav Singh, Amrit Singh Bedi, Raghu Pasupathu, Vaneet Aggarwal

Main category: cs.LG

TL;DR: First sample complexity bound for bilevel reinforcement learning (BRL) established at O(ε⁻³) in continuous spaces, improving from previous O(ε⁻⁶) bounds, with a Hessian-free algorithm for large-scale problems.

Details

Motivation: BRL is powerful for aligning generative models but lacks theoretical foundations, especially sample complexity bounds. Traditional MDP analysis fails due to BRL's nested structure and non-convex lower-level problems.

Method: Leveraged Polyak-Łojasiewicz (PL) condition and MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Proposed a fully first-order, Hessian-free algorithm for hypergradient estimation.

Result: Achieved sample complexity bound of O(ε⁻³) for BRL in continuous state-action spaces, and extended to general bi-level optimization with non-convex lower levels, improving from existing O(ε⁻⁶) bounds.

Conclusion: Established first theoretical sample complexity foundation for BRL, developed efficient algorithms, and extended results to broader bi-level optimization settings with significant improvements over existing bounds.

Abstract: Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-{\L}ojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bi-level optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of $\mathcal{O}(\epsilon^{-3})$ improving upon existing bounds of $\mathcal{O}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.

[1455] Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size

Kanata Oowada, Hideaki Iiduka

Main category: cs.LG

TL;DR: Increasing batch size in Riemannian SGD improves convergence rate from O(T^{-1}+C) to O(T^{-1}) and reduces stochastic first-order oracle complexity, offering benefits of both small and large constant batch sizes.

Details

Motivation: To analyze the convergence behavior of Riemannian stochastic gradient descent and investigate how batch size strategies affect convergence rates and computational efficiency.

Method: Theoretical analysis of RSGD convergence, combined with numerical experiments using principal component analysis and low-rank matrix completion to study computational time via stochastic first-order oracle complexity.

Result: Increasing batch size leads to faster convergence than constant batch size, improves convergence rate, and reduces SFO complexity while combining advantages of both small and large constant batch sizes.

Conclusion: Increasing batch size strategy in RSGD provides superior convergence performance and computational efficiency compared to constant batch size approaches.

Abstract: We theoretically analyzed the convergence behavior of Riemannian stochastic gradient descent (RSGD) and found that using an increasing batch size leads to faster convergence than using a constant batch size, not only with a constant learning rate but also with a decaying learning rate, such as cosine annealing decay and polynomial decay. The convergence rate improves from $O(T^{-1}+C)$ with a constant batch size to $O(T^{-1})$ with an increasing batch size, where $T$ denotes the total number of iterations and $C$ is a constant. Using principal component analysis and low-rank matrix completion, we investigated, both theoretically and numerically, how an increasing batch size affects computational time as quantified by stochastic first-order oracle (SFO) complexity. An increasing batch size was found to reduce the SFO complexity of RSGD. Furthermore, an increasing batch size was found to offer the advantages of both small and large constant batch sizes.

[1456] Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin

Main category: cs.LG

TL;DR: SGPO addresses GRPO’s limitation of discarding learning signals from all-negative-sample groups by incorporating response diversity using step-wise judge models, accelerating learning dynamics and improving performance across various model sizes and benchmarks.

Details

Motivation: GRPO fails to update policies when all responses in a group are incorrect, missing valuable learning opportunities from mistakes that humans naturally utilize. This creates a gap between artificial and human intelligence in reinforcement learning.

Method: Proposed stepwise guided policy optimization (SGPO) incorporates response diversity within groups using step-wise judge models, which can be trained directly or adapted from existing LLMs. This diversification helps mitigate the all-negative-sample issue.

Result: SGPO demonstrates consistent gains across model sizes (7B, 14B, 32B) in offline and online training on 9 benchmarks, outperforming GRPO especially in early and mid-training stages where all-negative-sample groups are prevalent.

Conclusion: SGPO effectively addresses GRPO’s limitation by enabling learning from all-negative-sample groups without requiring judge models to generate correct answers, bridging a key gap between artificial and human learning capabilities.

Abstract: Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training DeepSeek-R1. However, GRPO fails to update the policy when all responses within a group are incorrect (i.e., \emph{all-negative-sample} groups). This limitation underscores a key gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these signals. Our first contribution is to introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups using a \textit{step-wise} judge model, which can be either directly trained or adapted from existing LLMs. We prove that this diversification can accelerate GRPO’s learning dynamics in a simplified setting. We also empirically validate the proposed stepwise guided policy optimization (SGPO) method, demonstrating consistent gains across model sizes (7B, 14B, 32B) in offline and online training on 9 benchmarks, including base and distilled variants. Our results highlight two advantages: (i) SGPO surpasses GRPO, especially in the early and mid-training stages where all-negative-sample groups are prevalent; and (ii) SGPO does not require judge models to generate correct answers, differentiating it from knowledge distillation methods.

[1457] Norm-Bounded Low-Rank Adaptation

Ruigang Wang, Krishnamurthy Dvijotham, Ian R. Manchester

Main category: cs.LG

TL;DR: NB-LoRA is a parameter-efficient fine-tuning method that provides explicit bounds on singular values of adaptation matrices, offering better hyperparameter robustness and performance compared to existing LoRA methods.

Details

Motivation: To address the need for parameter-efficient fine-tuning methods that can explicitly control norm bounds on weight adaptations and improve hyperparameter robustness.

Method: Proposes norm-bounded low-rank adaptation (NB-LoRA) - a novel parameterization of low-rank weight adaptations that admits explicit bounds on each singular value, making it unconstrained, smooth, and complete.

Result: NB-LoRA matches or surpasses competing LoRA methods in natural language generation, exhibits stronger hyperparameter robustness, and avoids catastrophic forgetting in vision fine-tuning with better adaptation performance.

Conclusion: NB-LoRA provides an effective parameter-efficient fine-tuning approach with explicit norm control and superior hyperparameter robustness compared to existing methods.

Abstract: In this work, we propose norm-bounded low-rank adaptation (NB-LoRA) for parameter-efficient fine tuning. NB-LoRA is a novel parameterization of low-rank weight adaptations that admits explicit bounds on each singular value of the adaptation matrix, which can thereby satisfy any prescribed unitarily invariant norm bound, including the Schatten norms (e.g., nuclear, Frobenius, spectral norm). The proposed parameterization is unconstrained, smooth, and complete, i.e. it covers all matrices satisfying the prescribed rank and singular-value bounds. Natural language generation experiments show that NB-LoRA matches or surpasses performance of competing LoRA methods, while exhibiting stronger hyper-parameter robustness. Vision fine-tuning experiments show that NB-LoRA can avoid model catastrophic forgetting without minor cost on adaptation performance, and compared to existing approaches it is substantially more robust to a hyper-parameters such as including adaptation rank, learning rate and number of training epochs.

[1458] AdaRank: Adaptive Rank Pruning for Enhanced Model Merging

Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong

Main category: cs.LG

TL;DR: AdaRank is a novel model merging framework that adaptively selects beneficial singular directions of task vectors to merge multiple models, addressing cross-task interference in SVD-based methods through dynamic rank pruning via entropy minimization.

Details

Motivation: Existing SVD-based model merging techniques rely on manually designed rank selection, which often leads to cross-task interference and suboptimal performance in multi-task learning scenarios.

Method: AdaRank dynamically prunes singular components that cause interference by learning to prune ranks during test-time via entropy minimization, adaptively selecting the most beneficial singular directions of task vectors for merging.

Result: AdaRank consistently achieves state-of-the-art performance with various backbones and number of tasks, reducing the performance gap between fine-tuned models to nearly 1% while mitigating detrimental overlaps among tasks.

Conclusion: The proposed AdaRank framework effectively addresses cross-task interference in model merging through adaptive rank selection, demonstrating superior performance compared to existing SVD-based methods across diverse experimental settings.

Abstract: Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on such manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks during test-time via entropy minimization. Our analysis demonstrates that such method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance with various backbones and number of tasks, reducing the performance gap between fine-tuned models to nearly 1%.

[1459] Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs

Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Seyyedali Hosseinalipour, Christopher G. Brinton

Main category: cs.LG

TL;DR: FSLoRA enables federated fine-tuning of LLMs by using sketching mechanisms to adapt to client resource constraints while maintaining performance.

Details

Motivation: Address the challenge of fine-tuning LLMs on resource-constrained clients, where existing approaches lack analytical justification or impose computational overhead.

Method: Propose federated sketching LoRA (FSLoRA) that uses sketching mechanisms to allow clients to selectively update submatrices of global LoRA modules, with adjustable sketching ratios to adapt to client constraints.

Result: Comprehensive experiments on multiple datasets and LLM models show FSLoRA’s performance improvements over various baselines.

Conclusion: FSLoRA provides an efficient and theoretically-grounded solution for federated fine-tuning of LLMs under client resource heterogeneity.

Abstract: Fine-tuning large language models (LLMs) on resource-constrained clients remains a challenging problem. Recent works have fused low-rank adaptation (LoRA) techniques with federated fine-tuning to mitigate challenges associated with client model sizes and data scarcity. Still, the heterogeneity of resources remains a critical bottleneck: while higher-rank modules generally enhance performance, varying client capabilities constrain LoRA’s feasible rank range. Existing approaches attempting to resolve this issue either lack analytical justification or impose additional computational overhead, leaving a wide gap for efficient and theoretically-grounded solutions. To address these challenges, we propose federated sketching LoRA (FSLoRA), which leverages a sketching mechanism to enable clients to selectively update submatrices of global LoRA modules maintained by the server. By adjusting the sketching ratios, which determine the ranks of the submatrices on the clients, FSLoRA flexibly adapts to client-specific communication and computational constraints. We provide a rigorous convergence analysis of FSLoRA that characterizes how the sketching ratios affect the convergence rate. Through comprehensive experiments on multiple datasets and LLM models, we demonstrate FSLoRA’s performance improvements compared to various baselines.

[1460] Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning?

Roman Kochnev, Arash Torabi Goodarzi, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

Main category: cs.LG

TL;DR: LLM-based hyperparameter optimization using fine-tuned Code Llama with LoRA achieves competitive performance to traditional methods while significantly reducing computational costs for computer vision tasks.

Details

Motivation: Optimal hyperparameter selection is critical for neural network performance, especially with complex architectures. Traditional methods like Optuna are resource-intensive.

Method: Fine-tuned a parameter-efficient version of Code Llama using LoRA for hyperparameter optimization across various vision architectures including classification, detection, and segmentation tasks.

Result: Achieves competitive or superior RMSE compared to traditional methods while substantially reducing computational overhead. Rivals established Bayesian methods like TPE and accelerates tuning for real-world applications.

Conclusion: LLM-based optimization is an efficient alternative to traditional hyperparameter tuning methods, providing practical benefits for image manipulation systems with lower computational requirements.

Abstract: Optimal hyperparameter selection is critical for maximizing the performance of neural networks in computer vision, particularly as architectures become more complex. This work explores the use of large language models (LLMs) for hyperparameter optimization by fine-tuning a parameter-efficient version of Code Llama using LoRA. The resulting model produces accurate and computationally efficient hyperparameter recommendations across a wide range of vision architectures. Unlike traditional methods such as Optuna, which rely on resource-intensive trial-and-error procedures, our approach achieves competitive or superior Root Mean Square Error (RMSE) while substantially reducing computational overhead. Importantly, the models evaluated span image-centric tasks such as classification, detection, and segmentation, fundamental components in many image manipulation pipelines including enhancement, restoration, and style transfer. Our results demonstrate that LLM-based optimization not only rivals established Bayesian methods like Tree-structured Parzen Estimators (TPE), but also accelerates tuning for real-world applications requiring perceptual quality and low-latency processing. All generated configurations are publicly available in the LEMUR Neural Network Dataset (https://github.com/ABrain-One/nn-dataset), which serves as an open source benchmark for hyperparameter optimization research and provides a practical resource to improve training efficiency in image manipulation systems.

[1461] DAL: A Practical Prior-Free Black-Box Framework for Non-Stationary Bandits

Argyrios Gerogiannis, Yu-Han Huang, Subhonmesh Bose, Venugopal V. Veeravalli

Main category: cs.LG

TL;DR: Detection Augmented Learning (DAL) is a black-box framework that enhances any stationary bandit algorithm with change detection to handle non-stationary environments without prior knowledge of non-stationarity.

Details

Motivation: To address the challenge of non-stationary bandits where the reward distributions change over time, without requiring prior knowledge about the nature or timing of these changes.

Method: DAL augments any existing stationary bandit algorithm with a change detector, creating a black-box framework that can adapt to non-stationary environments by detecting and responding to changes.

Result: Extensive experiments show DAL consistently outperforms state-of-the-art methods across various non-stationary scenarios, including synthetic benchmarks and real-world datasets.

Conclusion: DAL provides a versatile and scalable solution for non-stationary bandits that works with any stationary algorithm, demonstrating strong empirical performance backed by theoretical insights.

Abstract: We introduce a practical, black-box framework termed Detection Augmented Learning (DAL) for the problem of non-stationary bandits without prior knowledge of the underlying non-stationarity. DAL accepts any stationary bandit algorithm as input and augments it with a change detector, enabling applicability to all common bandit variants. Extensive experimentation demonstrates that DAL consistently surpasses current state-of-the-art methods across diverse non-stationary scenarios, including synthetic benchmarks and real-world datasets, underscoring its versatility and scalability. We provide theoretical insights into DAL’s strong empirical performance, complemented by thorough experimental validation.

[1462] InfoBridge: Mutual Information estimation via Bridge Matching

Sergei Kholkin, Ivan Butakov, Evgeny Burnaev, Nikita Gushchin, Alexander Korotin

Main category: cs.LG

TL;DR: This paper proposes a novel mutual information estimator using diffusion bridge models, framing MI estimation as a domain transfer problem.

Details

Motivation: To address the challenge of estimating mutual information between random variables, especially for data that poses difficulties for conventional MI estimators.

Method: Leveraging diffusion bridge models to frame MI estimation as a domain transfer problem and constructing an unbiased estimator.

Result: The estimator demonstrates strong performance on three standard MI benchmarks (low-dimensional, image-based, high MI) and real-world protein language model embeddings.

Conclusion: Diffusion bridge models provide an effective approach for mutual information estimation across various data types and challenging scenarios.

Abstract: Diffusion bridge models have recently become a powerful tool in the field of generative modeling. In this work, we leverage their power to address another important problem in machine learning and information theory, the estimation of the mutual information (MI) between two random variables. Neatly framing MI estimation as a domain transfer problem, we construct an unbiased estimator for data posing difficulties for conventional MI estimators. We showcase the performance of our estimator on three standard MI estimation benchmarks, i.e., low-dimensional, image-based and high MI, and on real-world data, i.e., protein language model embeddings.

[1463] AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

Woosung Koh, Wonbeen Oh, Jaein Jang, MinHyung Lee, Hyeongjin Kim, Ah Yeon Kim, Joonkee Kim, Junghyun Lee, Taehyeon Kim, Se-Young Yun

Main category: cs.LG

TL;DR: AdaSTaR improves self-improving language models by using adaptive sampling for diversity and curriculum learning, achieving better accuracy with 58.6% fewer training FLOPs.

Details

Motivation: Standard self-improving methods like STaR/RFT use random sampling, causing training imbalance - over-training on easy examples while under-training on challenging ones.

Method: AdaSTaR integrates two adaptive sampling principles: (1) Adaptive Sampling for Diversity to balance training across observations, and (2) Adaptive Sampling for Curriculum to dynamically adjust data difficulty matching model’s evolving capabilities.

Result: Across six benchmarks, AdaSTaR achieved best test accuracy in all instances (6/6) and reduced training FLOPs by an average of 58.6% compared to baselines. Improvements generalize to different pre-trained LMs and larger models.

Conclusion: AdaSTaR enables more efficient and effective self-improving language models by addressing training imbalance through adaptive sampling strategies.

Abstract: Self-Taught Reasoners (STaR), synonymously known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in trained observation imbalance; inefficiently over-training on solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model’s evolving strength. Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.

[1464] Efficient Generative Model Training via Embedded Representation Warmup

Deyuan Liu, Peng Sun, Xufeng Li, Tao Lin

Main category: cs.LG

TL;DR: ERW is a two-phase training framework that decouples semantic learning from synthesis details in generative models, achieving 11.5x speedup in training.

Details

Motivation: Conventional end-to-end training entangles semantic learning and synthesis details, leading to inefficient optimization. Decoupling these conflicting objectives can improve generative modeling efficiency.

Method: Two-phase framework: 1) Build semantic foundation by aligning early diffusion model layers with pretrained encoder, 2) Generative full training with alignment loss to refine representation for high-fidelity synthesis.

Result: Achieves 11.5x speedup in 350 epochs to reach FID=1.41 compared to single-phase methods like REPA. Early layers become functionally specialized for representation.

Conclusion: Explicitly decoupling semantic learning from synthesis details through two-phase training enables more effective and efficient generative modeling.

Abstract: Generative models face a fundamental challenge: they must simultaneously learn high-level semantic concepts (what to generate) and low-level synthesis details (how to generate it). Conventional end-to-end training entangles these distinct, and often conflicting objectives, leading to a complex and inefficient optimization process. We argue that explicitly decoupling these tasks is key to unlocking more effective and efficient generative modeling. To this end, we propose Embedded Representation Warmup (ERW), a principled two-phase training framework. The first phase is dedicated to building a robust semantic foundation by aligning the early layers of a diffusion model with a powerful pretrained encoder. This provides a strong representational prior, allowing the second phase – generative full training with alignment loss to refine the representation – to focus its resources on high-fidelity synthesis. Our analysis confirms that this efficacy stems from functionally specializing the model’s early layers for representation. Empirically, our framework achieves a 11.5$\times$ speedup in 350 epochs to reach FID=1.41 compared to single-phase methods like REPA. Code is available at https://github.com/LINs-lab/ERW.

[1465] Progressive Binarization with Semi-Structured Pruning for LLMs

Xianglong Yan, Tianao Zhang, Zhiteng Li, Haotong Qin, Yulun Zhang

Main category: cs.LG

TL;DR: PBS$^2$P is a post-training framework that combines binarization and semi-structured pruning to compress LLMs, achieving better performance than state-of-the-art binary quantization methods.

Details

Motivation: LLMs have high computational and memory costs that limit deployment on resource-constrained devices. While binarization is an extreme quantization form, binarized models still contain redundancy that pruning can remove, but naive combination causes performance degradation.

Method: Progressive Binarization with Semi-Structured Pruning (PBS$^2$P) uses Stepwise semi-structured Pruning with Binarization Optimization (SPBO) to progressively introduce sparsity while optimizing binarization parameters, and Coarse-to-Fine Search (CFS) to first allocate pruning ratios then refine element selection.

Result: Extensive experiments across multiple LLM families show PBS$^2$P consistently outperforms state-of-the-art binary post-training quantization methods in both perplexity and downstream accuracy.

Conclusion: PBS$^2$P provides an effective framework for compressing LLMs through joint binarization and pruning, achieving superior performance compared to existing binary quantization approaches.

Abstract: Large language models (LLMs) have achieved remarkable progress in natural language processing, but their high computational and memory costs hinder deployment on resource-constrained devices. Binarization represents the most extreme form of quantization, yet binarized models still contain redundancy that can be further removed. Pruning provides a natural way to eliminate such redundancy, but na"ive combination with binarization often results in severe performance degradation. In this paper, we propose Progressive Binarization with Semi-Structured Pruning (PBS$^2$P), a novel post-training framework that seamlessly integrates binarization and semi-structured pruning. We first propose Stepwise semi-structured Pruning with Binarization Optimization (SPBO), which progressively introduces sparsity while optimizing binarization parameters to jointly reduce pruning and quantization error, yielding more stable and accurate compression. Additionally, we propose a Coarse-to-Fine Search (CFS) that first allocates pruning ratios and then refines element selection, further enhancing overall performance. Extensive experiments across multiple LLM families show that PBS$^2$P consistently outperforms state-of-the-art (SOTA) binary post-training quantization methods in both perplexity and downstream accuracy. The code and models will be available at https://github.com/XIANGLONGYAN/PBS2P.

[1466] Pre-training Epidemic Time Series Forecasters with Compartmental Prototypes

Zewen Liu, Juntong Ni, Max S. Y. Lau, Wei Jin

Main category: cs.LG

TL;DR: CAPE is the first open-source pre-trained model for epidemic forecasting that learns transferable knowledge from diverse disease surveillance data, outperforming baselines in zero-shot, few-shot, and full-shot forecasting.

Details

Motivation: Existing epidemic forecasting models are brittle and struggle with data scarcity during new outbreaks and distribution shifts, despite decades of available surveillance data from diverse diseases.

Method: CAPE models epidemic dynamics as mixtures of latent population states (compartmental prototypes) discovered from surveillance data, combining self-supervised pre-training with epidemic-aware regularizers for robust generalization.

Result: On a comprehensive benchmark spanning 17 diseases and 50+ regions, CAPE significantly outperforms strong baselines in zero-shot, few-shot, and full-shot forecasting scenarios.

Conclusion: CAPE represents a principled step toward pre-trained epidemic models that are both transferable and epidemiologically grounded, enabling better outbreak preparedness.

Abstract: Accurate epidemic forecasting is crucial for outbreak preparedness, but existing data-driven models are often brittle. Typically trained on a single pathogen, they struggle with data scarcity during new outbreaks and fail under distribution shifts caused by viral evolution or interventions. However, decades of surveillance data from diverse diseases offer an untapped source of transferable knowledge. To leverage the collective lessons from history, we propose CAPE, the first open-source pre-trained model for epidemic forecasting. Unlike existing time series foundation models that overlook epidemiological challenges, CAPE models epidemic dynamics as mixtures of latent population states, termed compartmental prototypes. It discovers a flexible dictionary of compartment prototypes directly from surveillance data, enabling each outbreak to be expressed as a time-varying mixture that links observed infections to latent population states. To promote robust generalization, CAPE combines self-supervised pre-training objectives with lightweight epidemic-aware regularizers that align the learned prototypes with epidemiological semantics. On a comprehensive benchmark spanning 17 diseases and 50+ regions, CAPE significantly outperforms strong baselines in zero-shot, few-shot, and full-shot forecasting. This work represents a principled step toward pre-trained epidemic models that are both transferable and epidemiologically grounded.

[1467] On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

Main category: cs.LG

TL;DR: This paper introduces Regularized Policy Gradient (RPG), a unified framework that clarifies KL regularization variants in policy gradient algorithms for LLMs, identifies and fixes off-policy weighting issues, and enables stable large-scale training with improved performance on mathematical reasoning tasks.

Details

Motivation: KL regularization is widely used in policy gradient algorithms for LLMs, but the design choices (KL direction, normalization, estimators) are scattered across literature and intertwined with off-policy estimation. The paper aims to systematically understand what weighting is needed for each KL variant to ensure the surrogate optimization yields the exact gradient of the intended KL-regularized objective.

Method: The authors develop Regularized Policy Gradient (RPG), a unified derivation that: (i) unifies normalized and unnormalized KL variants, (ii) specifies conditions for gradient-equivalent surrogates, (iii) identifies and corrects off-policy importance-weighting mismatch in GRPO’s KL term, and (iv) introduces RPG-Style Clip for stable off-policy training.

Result: On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to +6 absolute percentage points over DAPO. The method provides stable and scalable RL training for LLM reasoning.

Conclusion: RPG provides a comprehensive framework that unifies KL regularization variants, enables stable off-policy training through proper weighting and truncated importance sampling, and demonstrates significant performance improvements on reasoning tasks, establishing it as an effective RL algorithm for LLM reasoning.

Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$), is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO’s KL term; and (iv) introduces RPG-Style Clip, a truncated-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) truncated importance sampling, and (c) an iterative reference-policy update scheme.

[1468] Sharpness-Aware Minimization with Z-Score Gradient Filtering

Vincent-Daniel Yun

Main category: cs.LG

TL;DR: Z-Score Filtered Sharpness-Aware Minimization improves generalization by applying Z-score based filtering to gradients, keeping only top percentile components with largest absolute Z-scores to focus perturbation on significant directions.

Details

Motivation: Standard Sharpness-Aware Minimization uses entire gradient vectors, allowing small or noisy components to affect the ascent step and potentially miss optimal solutions.

Method: Constructs a mask to retain only top percentile gradient components with largest absolute Z-scores in each layer, focusing ascent step on directions that stand out compared to layer average.

Result: Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet, VGG, and Vision Transformers show consistent test accuracy improvements over Sharpness-Aware Minimization and its variants.

Conclusion: Selective perturbation using Z-score filtering refines search toward flatter minima while reducing influence of less significant gradients, improving generalization performance.

Abstract: Deep neural networks achieve high performance across many domains but can still face challenges in generalization when optimization is influenced by small or noisy gradient components. Sharpness-Aware Minimization improves generalization by perturbing parameters toward directions of high curvature, but it uses the entire gradient vector, which means that small or noisy components may affect the ascent step and cause the optimizer to miss optimal solutions. We propose Z-Score Filtered Sharpness-Aware Minimization, which applies Z-score based filtering to gradients in each layer. Instead of using all gradient components, a mask is constructed to retain only the top percentile with the largest absolute Z-scores. The percentile threshold $Q_p$ determines how many components are kept, so that the ascent step focuses on directions that stand out most compared to the average of the layer. This selective perturbation refines the search toward flatter minima while reducing the influence of less significant gradients. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with architectures including ResNet, VGG, and Vision Transformers show that the proposed method consistently improves test accuracy compared to Sharpness-Aware Minimization and its variants. The code repository is available at: https://github.com/YUNBLAK/Sharpness-Aware-Minimization-with-Z-Score-Gradient-Filtering

[1469] Functional Complexity-adaptive Temporal Tensor Decomposition

Panqi Chen, Lei Cheng, Jianlong Li, Weichang Li, Weiqing Liu, Jiang Bian, Shikai Fang

Main category: cs.LG

TL;DR: The paper proposes CATTE, a functional complexity-adaptive temporal tensor decomposition method that handles continuous indexes in multiple modes and automatically adapts model complexity using sparsity-inducing priors.

Details

Motivation: Existing temporal tensor decomposition methods struggle with general tensor data containing continuous indexes beyond just temporal modes (e.g., spatial coordinates in climate data) and lack self-adapting model complexity mechanisms.

Method: Encodes continuous spatial indexes as learnable Fourier features, uses neural ODEs in latent space for temporal trajectories, and introduces sparsity-inducing priors over factor trajectories for automatic complexity adaptation. Employs efficient variational inference with analytical ELBO.

Result: CATTE successfully reveals underlying ranks of functional temporal tensors and significantly outperforms existing methods in prediction performance and noise robustness across synthetic and real-world datasets.

Conclusion: The proposed CATTE method effectively addresses limitations of existing approaches by handling continuous indexes in multiple modes and enabling automatic model complexity adaptation through sparsity-inducing priors.

Abstract: Tensor decomposition is a fundamental tool for analyzing multi-dimensional data by learning low-rank factors to represent high-order interactions. While recent works on temporal tensor decomposition have made significant progress by incorporating continuous timestamps in latent factors, they still struggle with general tensor data with continuous indexes not only in the temporal mode but also in other modes, such as spatial coordinates in climate data. Moreover, the challenge of self-adapting model complexity is largely unexplored in functional temporal tensor models, with existing methods being inapplicable in this setting. To address these limitations, we propose functional \underline{C}omplexity-\underline{A}daptive \underline{T}emporal \underline{T}ensor d\underline{E}composition (\textsc{Catte}). Our approach encodes continuous spatial indexes as learnable Fourier features and employs neural ODEs in latent space to learn the temporal trajectories of factors. To enable automatic adaptation of model complexity, we introduce a sparsity-inducing prior over the factor trajectories. We develop an efficient variational inference scheme with an analytical evidence lower bound, enabling sampling-free optimization. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that \textsc{Catte} not only reveals the underlying ranks of functional temporal tensors but also significantly outperforms existing methods in prediction performance and robustness against noise.

[1470] Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models

Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou

Main category: cs.LG

TL;DR: Trinity-RFT is a unified framework for reinforcement fine-tuning of large language models with modular design supporting various RFT modes and efficient agent-environment interaction.

Details

Motivation: To create a general-purpose, easy-to-use framework that unifies different reinforcement fine-tuning approaches for large language models and serves as a development platform for advanced RL research.

Method: Built with modular design: (1) RFT-core unifying synchronous/asynchronous, on-policy/off-policy, and online/offline RFT modes; (2) efficient agent-environment interaction; (3) optimized data pipelines for RFT.

Result: A flexible framework that can be adapted for diverse application scenarios and serves as a unified platform for RL development at both macroscopic and microscopic levels.

Conclusion: Trinity-RFT provides a comprehensive solution for reinforcement fine-tuning of LLMs with demonstrated functionality and user-friendliness through extensive examples and experiments.

Abstract: Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.

[1471] Recurrent Memory for Online Interdomain Gaussian Processes

Wenlong Chen, Naoki Kiyohara, Harrison Bo Hua Zhu, Jacob Curran-Sebastian, Samir Bhatt, Yingzhen Li

Main category: cs.LG

TL;DR: OHSVGP is an online Gaussian process model that captures long-term memory using HiPPO framework, enabling efficient online learning with superior predictive performance.

Details

Motivation: To develop an online GP model that can effectively capture long-term memory in sequential data, addressing limitations of existing online GP methods in preserving historical information.

Method: Integrates HiPPO framework with sparse variational Gaussian processes, interpreting HiPPO projections as inducing variables with time-dependent orthogonal polynomial basis functions, allowing online kernel updates via recurrence relations.

Result: Outperforms existing online GP methods in predictive performance, long-term memory preservation, and computational efficiency across time series prediction, continual learning, and deep generative modeling tasks.

Conclusion: The HiPPO framework naturally fits interdomain GP framework, enabling effective long-term memory modeling in online learning settings with improved computational efficiency.

Abstract: We propose a novel online Gaussian process (GP) model that is capable of capturing long-term memory in sequential data in an online learning setting. Our model, Online HiPPO Sparse Variational Gaussian Process (OHSVGP), leverages the HiPPO (High-order Polynomial Projection Operators) framework, which is popularized in the RNN domain due to its long-range memory modeling capabilities. We interpret the HiPPO time-varying orthogonal projections as inducing variables with time-dependent orthogonal polynomial basis functions, which allows the SVGP inducing variables to memorize the process history. We show that the HiPPO framework fits naturally into the interdomain GP framework and demonstrate that the kernel matrices can also be updated online in a recurrence form based on the ODE evolution of HiPPO. We evaluate OHSVGP with online prediction for 1D time series, continual learning in discriminative GP model for data with multidimensional inputs, and deep generative modeling with sparse Gaussian process variational autoencoder, showing that it outperforms existing online GP methods in terms of predictive performance, long-term memory preservation, and computational efficiency.

[1472] Reward Model Overoptimisation in Iterated RLHF

Lorenz Wolf, Robert Kirk, Mirco Musolesi

Main category: cs.LG

TL;DR: This paper studies overoptimisation in iterated RLHF, finding that overoptimisation decreases over iterations but performance gains diminish, with policy reinitialisation strategies affecting recovery from early overoptimisation.

Details

Motivation: RLHF suffers from reward model overoptimisation where models overfit to reward functions, resulting in non-generalisable policies. Iterated RLHF is commonly used but its dynamics remain poorly understood.

Method: Comprehensive study using AlpacaFarm benchmark, systematically analyzing key design choices: reward model training data transfer across iterations, reward function selection, and policy initialisation strategies.

Result: Overoptimisation decreases over successive iterations as reward models better approximate ground-truth preferences. Performance gains diminish over time. Reinitialising from base policy is robust but limits flexibility, while other strategies often fail to recover from early overoptimisation.

Conclusion: The findings provide actionable insights for building more stable and generalisable RLHF pipelines by understanding the dynamics of overoptimisation in iterated settings.

Abstract: Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices - how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised. Using the controlled AlpacaFarm benchmark, we observe that overoptimisation tends to decrease over successive iterations, as reward models increasingly approximate ground-truth preferences. However, performance gains diminish over time, and while reinitialising from the base policy is robust, it limits optimisation flexibility. Other initialisation strategies often fail to recover from early overoptimisation. These findings offer actionable insights for building more stable and generalisable RLHF pipelines.

[1473] Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps

Ningyuan Yang, Jiaxuan Gao, Feng Gao, Yi Wu, Chao Yu

Main category: cs.LG

TL;DR: NCDPO is a novel framework that reformulates Diffusion Policies as noise-conditioned deterministic policies, enabling efficient RL fine-tuning with PPO by making likelihood evaluation tractable and allowing gradient backpropagation through all diffusion timesteps.

Details

Motivation: Diffusion policies can learn diverse skills from demonstrations but suffer from sub-optimal performance due to limited demonstration data coverage. Existing RL fine-tuning approaches struggle to adapt PPO to diffusion models because of computational intractability in action likelihood estimation during denoising.

Method: NCDPO reformulates Diffusion Policy as a noise-conditioned deterministic policy, treating each denoising step as a differentiable transformation conditioned on pre-sampled noise. This enables tractable likelihood evaluation and gradient backpropagation through all diffusion timesteps.

Result: NCDPO achieves sample efficiency comparable to MLP+PPO when training from scratch, outperforming existing methods in both sample efficiency and final performance across continuous robot control and multi-agent game benchmarks. The method is also robust to the number of denoising timesteps.

Conclusion: NCDPO successfully addresses the computational challenges of applying PPO to diffusion policies, enabling efficient RL fine-tuning while maintaining the representation power of diffusion models, making it a promising approach for improving diffusion policies in decision-making scenarios.

Abstract: Diffusion policies, widely adopted in decision-making scenarios such as robotics, gaming and autonomous driving, are capable of learning diverse skills from demonstration data due to their high representation power. However, the sub-optimal and limited coverage of demonstration data could lead to diffusion policies that generate sub-optimal trajectories and even catastrophic failures. While reinforcement learning (RL)-based fine-tuning has emerged as a promising solution to address these limitations, existing approaches struggle to effectively adapt Proximal Policy Optimization (PPO) to diffusion models. This challenge stems from the computational intractability of action likelihood estimation during the denoising process, which leads to complicated optimization objectives. In our experiments starting from randomly initialized policies, we find that online tuning of Diffusion Policies demonstrates much lower sample efficiency compared to directly applying PPO on MLP policies (MLP+PPO). To address these challenges, we introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise-conditioned deterministic policy. By treating each denoising step as a differentiable transformation conditioned on pre-sampled noise, NCDPO enables tractable likelihood evaluation and gradient backpropagation through all diffusion timesteps. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to MLP+PPO when training from scratch, outperforming existing methods in both sample efficiency and final performance across diverse benchmarks, including continuous robot control and multi-agent game scenarios. Furthermore, our experimental results show that our method is robust to the number denoising timesteps in the Diffusion Policy.

[1474] The Accuracy Cost of Weakness: A Theoretical Analysis of Fixed-Segment Weak Labeling for Events in Time

John Martinsson, Tuomas Virtanen, Maria Sandsten, Olof Mogren

Main category: cs.LG

TL;DR: This paper analyzes how segment length affects label accuracy and annotation cost in weak labeling processes, comparing fixed-length labeling with an oracle method that uses true event activations.

Details

Motivation: Accurate labels are critical for robust machine learning models, and understanding how labeling processes affect label quality and cost is essential for optimizing sequence labeling tasks.

Method: Modeled the accuracy and cost of a weak labeling process where annotators assign presence/absence labels to fixed-length data segments, comparing this with an oracle method that uses true event activations to construct segments.

Result: The oracle method outperforms fixed-length labeling in both accuracy and cost in most realistic scenarios, revealing a performance gap between the two approaches.

Conclusion: The findings justify adaptive weak labeling strategies that mimic the oracle process and provide a foundation for optimizing weak labeling in sequence labeling tasks.

Abstract: Accurate labels are critical for deriving robust machine learning models. Labels are used to train supervised learning models and to evaluate most machine learning paradigms. In this paper, we model the accuracy and cost of a common weak labeling process where annotators assign presence or absence labels to fixed-length data segments for a given event class. The annotator labels a segment as “present” if it sufficiently covers an event from that class, e.g., a birdsong sound event in audio data. We analyze how the segment length affects the label accuracy and the required number of annotations, and compare this fixed-length labeling approach with an oracle method that uses the true event activations to construct the segments. Furthermore, we quantify the gap between these methods and verify that in most realistic scenarios the oracle method is better than the fixed-length labeling method in both accuracy and cost. Our findings provide a theoretical justification for adaptive weak labeling strategies that mimic the oracle process, and a foundation for optimizing weak labeling processes in sequence labeling tasks.

[1475] Continuous Optimization for Feature Selection with Permutation-Invariant Embedding and Policy-Guided Search

Rui Liu, Rui Xie, Zijun Yao, Yanjie Fu, Dongjie Wang

Main category: cs.LG

TL;DR: A new feature selection framework that uses encoder-decoder with permutation invariance and reinforcement learning to overcome limitations of existing methods in handling feature interactions and non-convex search spaces.

Details

Motivation: Existing feature selection methods struggle with complex feature interactions and adaptation to diverse scenarios. Current generative approaches face challenges with permutation sensitivity in embedding feature subsets and ineffective gradient-based search due to non-convexity assumptions.

Method: Proposed framework uses: 1) encoder-decoder paradigm with pairwise feature relationships for permutation-invariant embeddings, enhanced by inducing point mechanism for efficiency; 2) policy-based reinforcement learning to explore embedding space without convexity assumptions, adaptively prioritizing high-potential regions.

Result: Extensive experiments demonstrate the model’s effectiveness, efficiency, robustness, and explicitness in feature selection tasks.

Conclusion: The proposed framework successfully addresses key limitations in feature selection by ensuring permutation invariance in embeddings and enabling effective exploration of non-convex search spaces through reinforcement learning.

Abstract: Feature selection removes redundant features to enhanc performance and computational efficiency in downstream tasks. Existing works often struggle to capture complex feature interactions and adapt to diverse scenarios. Recent advances in this domain have incorporated generative intelligence to address these drawbacks by uncovering intricate relationships between features. However, two key limitations remain: 1) embedding feature subsets in a continuous space is challenging due to permutation sensitivity, as changes in feature order can introduce biases and weaken the embedding learning process; 2) gradient-based search in the embedding space assumes convexity, which is rarely guaranteed, leading to reduced search effectiveness and suboptimal subsets. To address these limitations, we propose a new framework that can: 1) preserve feature subset knowledge in a continuous embedding space while ensuring permutation invariance; 2) effectively explore the embedding space without relying on strong convex assumptions. For the first objective, we develop an encoder-decoder paradigm to preserve feature selection knowledge into a continuous embedding space. This paradigm captures feature interactions through pairwise relationships within the subset, removing the influence of feature order on the embedding. Moreover, an inducing point mechanism is introduced to accelerate pairwise relationship computations. For the second objective, we employ a policy-based reinforcement learning (RL) approach to guide the exploration of the embedding space. The RL agent effectively navigates the space by balancing multiple objectives. By prioritizing high-potential regions adaptively and eliminating the reliance on convexity assumptions, the RL agent effectively reduces the risk of converging to local optima. Extensive experiments demonstrate the effectiveness, efficiency, robustness and explicitness of our model.

[1476] Learning to Explain Air Traffic Situation

Hong-ah Chai, Seokbin Yoon, Keumjin Lee

Main category: cs.LG

TL;DR: A Transformer-based multi-agent framework that explains air traffic situations by quantifying aircraft influence through attention scores, trained on real-world data from Incheon Airport.

Details

Motivation: Current models focus on specific tasks or pairwise interactions, failing to capture comprehensive air traffic dynamics needed to understand controllers' mental picture of complex situations.

Method: Transformer-based multi-agent trajectory model that captures spatio-temporal aircraft movements and social interactions, using attention scores to quantify individual aircraft influence.

Result: The framework effectively explicates air traffic situations from real-world surveillance data, providing explainable insights into traffic dynamics.

Conclusion: The approach can potentially enhance air traffic controllers’ decision-making and situational awareness by providing interpretable understanding of complex traffic situations.

Abstract: Understanding how air traffic controllers construct a mental ‘picture’ of complex air traffic situations is crucial but remains a challenge due to the inherently intricate, high-dimensional interactions between aircraft, pilots, and controllers. Previous work on modeling the strategies of air traffic controllers and their mental image of traffic situations often centers on specific air traffic control tasks or pairwise interactions between aircraft, neglecting to capture the comprehensive dynamics of an air traffic situation. To address this issue, we propose a machine learning-based framework for explaining air traffic situations. Specifically, we employ a Transformer-based multi-agent trajectory model that encapsulates both the spatio-temporal movement of aircraft and social interaction between them. By deriving attention scores from the model, we can quantify the influence of individual aircraft on overall traffic dynamics. This provides explainable insights into how air traffic controllers perceive and understand the traffic situation. Trained on real-world air traffic surveillance data collected from the terminal airspace around Incheon International Airport in South Korea, our framework effectively explicates air traffic situations. This could potentially support and enhance the decision-making and situational awareness of air traffic controllers.

[1477] TGT: A Temporal Gating Transformer for Smartphone App Usage Prediction

Longlong Li, Cunquan Qu, Guanghui Wang

Main category: cs.LG

TL;DR: TGT is a Transformer framework with temporal gating that improves smartphone app usage prediction by modeling fine-grained temporal dynamics, outperforming 15 baselines and showing robustness in cold-start scenarios.

Details

Motivation: Existing approaches struggle with sparse and irregular user behavior in app usage prediction, especially under cold-start conditions, and lack fine-grained temporal modeling capabilities.

Method: TGT uses a Transformer with temporal gating module that conditions hidden representations on hour-of-day, adaptively rescaling features in time-aware manner, plus a context-aware encoder integrating session sequences and user profiles.

Result: TGT significantly outperforms 15 competitive baselines on Tsinghua App Usage and LSApp datasets, achieving notable gains in HR@1 and maintaining robustness under cold-start scenarios.

Conclusion: TGT establishes itself as both powerful and interpretable framework for time-aware app usage prediction, learning human-consistent patterns of app behavior through interpretable gating vectors.

Abstract: Accurately predicting smartphone app usage is challenging due to the sparsity and irregularity of user behavior, especially under cold-start and low-activity conditions. Existing approaches mostly rely on static or attention-only architectures, which struggle to model fine-grained temporal dynamics. We propose TGT, a Transformer framework equipped with a temporal gating module that conditions hidden representations on the hour-of-day. Unlike conventional time embeddings, temporal gating adaptively rescales feature dimensions in a time-aware manner, working orthogonally to self-attention and strengthening temporal sensitivity. TGT further incorporates a context-aware encoder that integrates session sequences and user profiles into a unified representation. Experiments on two real-world datasets, Tsinghua App Usage and LSApp, demonstrate that TGT significantly outperforms 15 competitive baselines, achieving notable gains in HR@1 and maintaining robustness under cold-start scenarios. Beyond accuracy, analysis of gating vectors uncovers interpretable daily usage rhythms, showing that TGT learns human-consistent patterns of app behavior. These results establish TGT as both a powerful and interpretable framework for time-aware app usage prediction.

[1478] Comba: Improving Bilinear RNNs with Closed-loop Control

Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, Weigao Sun

Main category: cs.LG

TL;DR: The paper introduces Bilinear RNNs and proposes Comba, a novel variant with scalar-plus-low-rank state transition and feedback corrections, achieving superior performance in language and vision modeling.

Details

Motivation: Recent sequence modeling methods like Gated DeltaNet, TTT, and RWKV-7 have shown improvements through Delta learning rule but have limitations in recurrent memory management. The authors aim to address these limitations by introducing a more effective bilinear RNN architecture.

Method: The authors first analyze Bilinear RNNs, then propose Comba - a novel variant based on closed-loop control theory. Comba uses scalar-plus-low-rank state transition with state and output feedback corrections. They implement a hardware-efficient chunk-wise parallel kernel in Triton and train 340M/1.3B parameter models.

Result: Comba demonstrates superior performance and computation efficiency in both language and vision modeling tasks compared to previous methods.

Conclusion: The proposed Comba model, with its bilinear RNN architecture and feedback mechanisms, provides an effective and efficient solution for sequence modeling that outperforms existing approaches in both language and vision domains.

Abstract: Recent efficient sequence modeling methods such as Gated DeltaNet, TTT, and RWKV-7 have achieved performance improvements by supervising the recurrent memory management through Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attentions (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, structurally resembling bilinear systems. In this paper, we first introduce the concept of Bilinear RNNs with a comprehensive analysis on the advantages and limitations of these models. Then, based on closed-loop control theory, we propose a novel Bilinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on large-scale corpus. Comba demonstrates superior performance and computation efficiency in both language and vision modeling.

[1479] PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration

Yingming Pu, Tao Lin, Hongyu Chen

Main category: cs.LG

TL;DR: PiFlow is an information-theoretical framework that treats automated scientific discovery as a structured uncertainty reduction problem guided by scientific principles, significantly improving discovery efficiency and solution quality.

Details

Motivation: Existing LLM-based multi-agent systems for scientific discovery lack rationality constraints, leading to aimless hypothesizing and failure to link hypotheses with evidence, hindering systematic uncertainty reduction.

Method: Introduces PiFlow framework that treats scientific discovery as structured uncertainty reduction guided by principles like scientific laws, using information theory to constrain exploration.

Result: Achieved 73.55% increase in AUC of property values vs exploration steps and 94.06% improvement in solution quality compared to vanilla agent systems across three scientific domains.

Conclusion: PiFlow establishes a novel paradigm for highly efficient automated scientific discovery, serving as a plug-and-play method for more robust and accelerated AI-driven research.

Abstract: Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering the systematic reduction of uncertainty. Overcoming these limitations fundamentally requires a principled approach to exploration. We introduce PiFlow, an information-theoretical framework, treating automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). In evaluations across three distinct scientific domains – discovering nanomaterial structures, bio-molecules, and superconductor candidates with targeted properties – our method significantly improves discovery efficiency, reflected by a 73.55% increase in the Area Under the Curve (AUC) of property values versus exploration steps, and enhances solution quality by 94.06% compared to a vanilla agent system. Overall, PiFlow serves as a Plug-and-Play method, establishing a novel paradigm shift in highly efficient automated scientific discovery, paving the way for more robust and accelerated AI-driven research. Code is publicly available at our \href{https://github.com/amair-lab/PiFlow}{GitHub}.

[1480] Joint Value Estimation and Bidding in Repeated First-Price Auctions

Yuxiao Wen, Yanjun Han, Zhengyuan Zhou

Main category: cs.LG

TL;DR: Regret minimization in repeated first-price auctions with partial feedback (win/loss outcomes only), addressing online advertising scenarios where value depends on treatment effects between won vs lost auctions.

Details

Motivation: Practical online advertising scenarios where impression value depends on treatment effects (e.g., click/conversion differences between won and lost auctions), with only win/loss feedback available.

Method: Proposed algorithms for three outcome models: (1) adversarial outcomes without features, (2) linear potential outcomes with features, and (3) linear treatment effects in features. Jointly estimate private values and optimize bidding strategies.

Result: Achieved near-optimal regret bounds. Framework eliminates need for overlap condition in causal inference since treatments are actively chosen.

Conclusion: Effective regret minimization framework for first-price auctions with partial feedback, handling various outcome models while avoiding traditional causal inference assumptions through active treatment selection.

Abstract: We study regret minimization in repeated first-price auctions (FPAs), where a bidder observes only the realized outcome after each auction – win or loss. This setup reflects practical scenarios in online display advertising where the actual value of an impression depends on the difference between two potential outcomes, such as clicks or conversion rates, when the auction is won versus lost. We analyze three outcome models: (1) adversarial outcomes without features, (2) linear potential outcomes with features, and (3) linear treatment effects in features. For each setting, we propose algorithms that jointly estimate private values and optimize bidding strategies, achieving near-optimal regret bounds. Notably, our framework enjoys a unique feature that the treatments are also actively chosen, and hence eliminates the need for the overlap condition commonly required in causal inference.

[1481] Scaling Diffusion Transformers Efficiently via $μ$P

Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li

Main category: cs.LG

TL;DR: The paper generalizes Maximal Update Parametrization (μP) to diffusion Transformers, enabling stable hyperparameter transfer from small to large models, reducing tuning costs by up to 96.5% while improving performance.

Details

Motivation: Diffusion Transformers face scalability limitations due to high hyperparameter tuning costs at large scales, while μP has shown success in vanilla Transformers but its applicability to diffusion Transformers was unclear due to architectural and objective differences.

Method: The authors rigorously prove that μP aligns with mainstream diffusion Transformers (DiT, U-ViT, PixArt-α, MMDiT) and systematically validate hyperparameter transferability through large-scale experiments.

Result: DiT-XL-2-μP achieved 2.9x faster convergence than original DiT-XL-2. Scaling PixArt-α from 0.04B to 0.61B and MMDiT from 0.18B to 18B under μP outperformed baselines with only 5.5% and 3% of typical tuning costs respectively.

Conclusion: μP establishes a principled and efficient framework for scaling diffusion Transformers, enabling stable hyperparameter transfer and dramatically reducing tuning costs while maintaining or improving performance.

Abstract: Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.

[1482] Meta-Learning to Explore via Memory Density Feedback

Kevin McKee, Eric Alt, Andrew Grebenisan, Mick van Gelderen, Gary Miguel

Main category: cs.LG

TL;DR: The paper proposes a meta-learning approach for reinforcement learning exploration that learns to maximize exploration progress within episodes by minimizing observation probability density relative to memories.

Details

Motivation: Traditional exploration algorithms use intrinsic rewards to seek unseen states, but this work aims to enable agents to maximize exploration progress even in novel states through meta-learning.

Method: The agent learns a policy that minimizes probability density of new observations relative to all memories, using recurrent networks to remember density trajectories and feedback evaluations.

Result: The approach allows agents to navigate complex familiarity landscapes in real-time and maximize exploration progress in novel states without prior training.

Conclusion: Meta-learning enables more effective exploration by learning to learn exploration strategies that work across different environment states and training epochs.

Abstract: Exploration algorithms for reinforcement learning typically replace or augment the reward function with an additional ``intrinsic’’ reward that trains the agent to seek previously unseen states of the environment. Here, we consider an exploration algorithm that exploits meta-learning, or learning to learn, such that the agent learns to maximize its exploration progress within a single episode, even between epochs of training. The agent learns a policy that aims to minimize the probability density of new observations with respect to all of its memories. In addition, it receives as feedback evaluations of the current observation density and retains that feedback in a recurrent network. By remembering trajectories of density, the agent learns to navigate a complex and growing landscape of familiarity in real-time, allowing it to maximize its exploration progress even in completely novel states of the environment for which its policy has not been trained.

[1483] InstructPro: Natural Language Guided Ligand-Binding Protein Design

Zhenqiao Song, Ramith Hettiarachchi, Chuan Li, Jianwen Xie, Lei Li

Main category: cs.LG

TL;DR: InstructPro is a family of generative models that designs proteins from natural language instructions and ligand formulas, overcoming data scarcity limitations in protein-ligand complex data.

Details

Motivation: Existing AI approaches for designing ligand-binding proteins are limited by scarce protein-ligand complex data, while abundant text descriptions of protein-ligand interactions remain underutilized.

Method: Developed InstructPro models that generate protein sequences from natural language instructions and ligand formulas, trained on InstructProBench dataset containing 9.6 million (function description, ligand, protein) triples.

Result: InstructPro-1B achieved design success rates of 2.46% (seen ligands) and 3.14% (zero-shot), while InstructPro-3B reached 5.06% and 3.93% respectively, substantially outperforming strong baselines.

Conclusion: Natural language-guided generative modeling has the potential to expand protein design capabilities beyond traditional data limitations.

Abstract: Designing ligand-binding proteins with precise functions is fundamental to advances in biology and chemistry, yet existing AI approaches are limited by scarce protein-ligand complex data. Meanwhile, abundant text descriptions of protein-ligand interactions remain underutilized. We introduce InstructPro, a family of generative models that design proteins from natural language instructions and ligand formulas. InstructPro produces protein sequences consistent with specified functional descriptions and ligand targets. To enable training and evaluation, we develop InstructProBench, a large-scale dataset of 9.6 million (function description, ligand, protein) triples. We train two model variants: InstructPro-1B and InstructPro-3B, which substantially outperform strong baselines. InstructPro-1B achieves design success rates of 2.46% (seen ligands) and 3.14% (zero-shot), while InstructPro-3B reaches 5.06% and 3.93%, respectively. These results demonstrate the potential of natural language-guided generative modeling to expand protein design capabilities beyond traditional data limitations.

[1484] Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei

Main category: cs.LG

TL;DR: This paper analyzes local routing consistency in Mixture-of-Experts (MoE) models, proposing two metrics (SRP and SCH) to measure how consecutive tokens activate similar experts, and finds that models with MoE on every layer and no shared experts have highest consistency.

Details

Motivation: To enable efficient deployment of large MoE models on memory-constrained devices through expert offloading, by understanding and exploiting the locality of expert activations where consecutive tokens activate similar experts.

Method: Proposed two metrics: Segment Routing Best Performance (SRP) to evaluate how well fixed expert groups cover token segments, and Segment Cache Best Hit Rate (SCH) to measure optimal segment-level cache hit rates. Analyzed 20 diverse MoE LLMs with different sizes and architectures.

Result: Found that models applying MoE on every layer without shared experts exhibit highest local routing consistency. Domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones. Most models balance cache effectiveness and efficiency with cache sizes ~2x active experts.

Conclusion: These findings enable memory-efficient MoE design and deployment without compromising inference speed, paving the way for better expert offloading strategies.

Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce expert offloading that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this local routing consistency varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) Segment Routing Best Performance (SRP), which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) Segment Cache Best Hit Rate (SCH), which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc .

[1485] Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns

Dong Tian, Onur Celik, Gerhard Neumann

Main category: cs.LG

TL;DR: Introduces a sequence-conditioned critic for SAC that uses a lightweight Transformer to model trajectory context and trains on aggregated N-step targets, improving performance on long-horizon tasks without importance sampling.

Details

Motivation: To address limitations of prior approaches that either score state-action pairs in isolation or rely on actor-side action chunking for long horizons, by strengthening the critic itself through sequence modeling.

Method: Uses a 2-layer Transformer with 128-256 hidden units to condition the critic on short trajectory segments, integrates multi-step returns without importance sampling, and freezes critic parameters for stability without target networks.

Result: Consistently outperforms standard SAC and strong off-policy baselines, with particularly large gains on long-trajectory control tasks, while maintaining stable training with maximum UTD ratio of 1.

Conclusion: Sequence modeling and N-step bootstrapping on the critic side are valuable for long-horizon reinforcement learning, enabling better temporal structure capture without complex importance sampling.

Abstract: We introduce a sequence-conditioned critic for Soft Actor–Critic (SAC) that models trajectory context with a lightweight Transformer and trains on aggregated $N$-step targets. Unlike prior approaches that (i) score state–action pairs in isolation or (ii) rely on actor-side action chunking to handle long horizons, our method strengthens the critic itself by conditioning on short trajectory segments and integrating multi-step returns – without importance sampling (IS). The resulting sequence-aware value estimates capture the critical temporal structure for extended-horizon and sparse-reward problems. On local-motion benchmarks, we further show that freezing critic parameters for several steps makes our update compatible with CrossQ’s core idea, enabling stable training \emph{without} a target network. Despite its simplicity – a 2-layer Transformer with 128-256 hidden units and a maximum update-to-data ratio (UTD) of $1$ – the approach consistently outperforms standard SAC and strong off-policy baselines, with particularly large gains on long-trajectory control. These results highlight the value of sequence modeling and $N$-step bootstrapping on the critic side for long-horizon reinforcement learning.

[1486] Scalable Graph Generative Modeling via Substructure Sequences

Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye

Main category: cs.LG

TL;DR: G2PM is a generative Transformer pre-training framework for graphs that represents graph instances as sequences of substructures, addressing limitations of traditional message-passing GNNs.

Details

Motivation: Message-passing GNNs suffer from constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies, which hinder scalability.

Method: Represent graph instances as sequences of substructures and employ generative pre-training over these sequences using Transformer architecture.

Result: G2PM demonstrates strong scalability, improving with model sizes up to 60M parameters and outperforming prior approaches that plateau at smaller scales. It achieves state-of-the-art performance across diverse graph tasks.

Conclusion: G2PM establishes a compelling foundation for scalable graph learning, consistently outperforming strong baselines across node/link/graph classification, transfer learning, and cross-graph pretraining tasks.

Abstract: Graph neural networks (GNNs) have been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations – including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance. To this end, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable and transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks – including node/link/graph classification, transfer learning, and cross-graph pretraining – G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at https://github.com/Zehong-Wang/G2PM.

[1487] Finite-Time Bounds for Two-Time-Scale Stochastic Approximation with Arbitrary Norm Contractions and Markovian Noise

Siddharth Chandak, Shaan Ul Haque, Nicholas Bambos

Main category: cs.LG

TL;DR: This paper provides the first mean square bound analysis for non-linear two-time-scale stochastic approximation with arbitrary norm contractive mappings and Markovian noise, achieving O(1/n^{2/3}) convergence rate in general and O(1/n) in special cases.

Details

Motivation: Motivated by applications in reinforcement learning, particularly for analyzing algorithms like Q-Learning and learning Generalized Nash Equilibria, where prior analysis was limited to Euclidean norm contractions and fixed point iterations.

Method: Uses generalized Moreau envelope to handle arbitrary norm contractions and solutions of Poisson equation to deal with Markovian noise in two-time-scale stochastic approximation algorithms.

Result: Shows mean square error decays at O(1/n^{2/3}) for general case and O(1/n) for special case with noiseless slower timescale. Applies analysis to SSP Q-Learning (first O(1/n) bound for asynchronous MDP control), Q-Learning with Polyak-averaging, and GNE learning for strongly monotone games.

Conclusion: Provides first comprehensive convergence analysis for non-linear two-time-scale SA with arbitrary norms and Markovian noise, with applications yielding improved convergence rates for several reinforcement learning and game theory algorithms.

Abstract: Two-time-scale Stochastic Approximation (SA) is an iterative algorithm with applications in reinforcement learning and optimization. Prior finite time analysis of such algorithms has focused on fixed point iterations with mappings contractive under Euclidean norm. Motivated by applications in reinforcement learning, we give the first mean square bound on non linear two-time-scale SA where the iterations have arbitrary norm contractive mappings and Markovian noise. We show that the mean square error decays at a rate of $O(1/n^{2/3})$ in the general case, and at a rate of $O(1/n)$ in a special case where the slower timescale is noiseless. Our analysis uses the generalized Moreau envelope to handle the arbitrary norm contractions and solutions of Poisson equation to deal with the Markovian noise. By analyzing the SSP Q-Learning algorithm, we give the first $O(1/n)$ bound for an algorithm for asynchronous control of MDPs under the average reward criterion. We also obtain a rate of $O(1/n)$ for Q-Learning with Polyak-averaging and provide an algorithm for learning Generalized Nash Equilibrium (GNE) for strongly monotone games which converges at a rate of $O(1/n^{2/3})$.

[1488] Enhancing Delta Compression in LLMs via SVD-based Quantization Error Minimization

Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen

Main category: cs.LG

TL;DR: DeltaMix is an adaptive mixed-precision delta-compression framework that minimizes quantization error in SVD space for compressing fine-tuned LLMs, outperforming existing methods at high compression ratios.

Details

Motivation: Current delta-compression methods for multi-tenant LLM serving exhibit inadequate performance at high compression ratios due to their empirical nature, lacking theoretical justification.

Method: DeltaMix provides theoretical justification for mixed-precision compression and solves a 0/1 linear integer programming problem with reconstruction target correction to minimize quantization error in SVD space.

Result: DeltaMix consistently outperforms all baseline methods across multiple models and benchmarks, exceeding Delta-CoMe by 22.3% on AIME2024 and 6.1% on GQA for 7B parameter models.

Conclusion: DeltaMix offers an effective delta-compression framework with theoretical foundations that achieves superior performance in compressing fine-tuned LLMs for multi-tenant serving scenarios.

Abstract: Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, like multi-tenant serving, a large number of LLMs finetuned from the same base model are deployed to meet complex requirements for users. Recent works explore delta-compression approaches to quantize and compress the delta weights between the customized LLM and the corresponding base model. However, they exhibit inadequate performance at high compression ratios due to their empirical nature. In this work, we introduce DeltaMix, an adaptive mixed-precision delta-compression framework designed to minimize quantization error in the singular value decomposition (SVD) space without imposing additional assumptions. DeltaMix provides a theoretical justification for the necessity of mixed-precision compression and presents a practical quantization solution that involves solving a 0/1 linear integer programming problem alongside a reconstruction target correction method. Experimental results across multiple models and benchmarks illustrate that DeltaMix consistently outperforms all baseline methods. Notably, on tasks such as AIME2024 and GQA, DeltaMix exceeds the performance of the best baseline, Delta-CoMe, by 22.3% and 6.1% for 7B parameter models, respectively.

[1489] Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model

Kotaro Ikeda, Masanori Koyama, Jinzhe Zhang, Kohei Hayashi, Kenji Fukumizu

Main category: cs.LG

TL;DR: A flow-based method for learning all-to-all transfer maps among conditional distributions that approximates pairwise optimal transport, handling continuous conditions with sparse observations.

Details

Motivation: To address the challenge of learning optimal transport maps among conditional distributions with continuous conditions, where empirical observations per condition are sparse.

Method: Proposes a flow-based approach with a novel cost function for simultaneous learning of optimal transports for all conditional distribution pairs, supported by theoretical convergence guarantees.

Result: The method effectively learns transport maps for coupling data points in conditional flow matching, demonstrated on synthetic, benchmark, and chemical datasets with continuous physical properties as conditions.

Conclusion: The proposed flow-based method successfully approximates pairwise optimal transport for conditional distributions with continuous conditions, providing a practical solution for sparse observation scenarios.

Abstract: In this paper, we propose a flow-based method for learning all-to-all transfer maps among conditional distributions that approximates pairwise optimal transport. The proposed method addresses the challenge of handling the case of continuous conditions, which often involve a large set of conditions with sparse empirical observations per condition. We introduce a novel cost function that enables simultaneous learning of optimal transports for all pairs of conditional distributions. Our method is supported by a theoretical guarantee that, in the limit, it converges to the pairwise optimal transports among infinite pairs of conditional distributions. The learned transport maps are subsequently used to couple data points in conditional flow matching. We demonstrate the effectiveness of this method on synthetic and benchmark datasets, as well as on chemical datasets in which continuous physical properties are defined as conditions.

[1490] Runtime Adaptive Pruning for LLM Inference

Huanrong Liu, Chunlin Tian, Xuyang Wei, Qingbiao Li, Li Li

Main category: cs.LG

TL;DR: RAP is an elastic pruning framework using reinforcement learning to dynamically compress LLMs by jointly optimizing model weights and KV-cache allocation based on runtime memory constraints.

Details

Motivation: Large language models have high computational and memory requirements that hinder deployment. Existing compression methods use fixed heuristics and fail to adapt to runtime memory variations and heterogeneous KV-cache demands from diverse user requests.

Method: RAP uses reinforcement learning to dynamically track the ratio between model parameters and KV-cache during execution. It selectively retains components that maximize utility within current memory budget, considering FFNs for parameters and attention layers for KV-cache.

Result: Extensive experiments show RAP outperforms state-of-the-art baselines, marking the first approach to jointly consider model weights and KV-cache dynamically during runtime.

Conclusion: RAP successfully addresses LLM deployment constraints through elastic, runtime-aware compression that adapts to memory variations and workload demands.

Abstract: Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter -light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experiments results demonstrate that RAP outperforms state-of-the-art baselines, marking the first time to jointly consider model weights and KV-cache on the fly.

[1491] A Model Zoo on Phase Transitions in Neural Networks

Konstantin Schürholt, Léo Meynent, Yefan Zhou, Haiquan Lu, Yaoqing Yang, Damian Borth

Main category: cs.LG

TL;DR: This paper introduces structured model zoos that systematically cover different phases of neural network loss landscapes, providing controlled diversity for weight space learning research.

Details

Motivation: Existing model zoos lack structured diversity, while statistical physics research has identified distinct phases in neural networks. Combining these concepts can create better datasets for weight space learning.

Method: Created 12 large-scale model zoos that systematically cover known phases, varying model architecture, size, and datasets across different modalities (CV, NLP, scientific ML). Computed loss landscape metrics and validated phase coverage.

Result: Developed comprehensive datasets that provide full coverage of neural network phases, with evidence suggesting phase information impacts applications like training, analysis, and sparsification.

Conclusion: The structured model zoos serve as valuable resources for weight space learning and other applications, demonstrating the importance of phase information in downstream methods like transfer learning and model averaging.

Abstract: Using the weights of trained Neural Network (NN) models as data modality has recently gained traction as a research field - dubbed Weight Space Learning (WSL). Multiple recent works propose WSL methods to analyze models, evaluate methods, or synthesize weights. Weight space learning methods require populations of trained models as datasets for development and evaluation. However, existing collections of models - called model zoos' - are unstructured or follow a rudimentary definition of diversity. In parallel, work rooted in statistical physics has identified phases and phase transitions in NN models. Models are homogeneous within the same phase but qualitatively differ from one phase to another. We combine the idea of model zoos’ with phase information to create a controlled notion of diversity in populations. We introduce 12 large-scale zoos that systematically cover known phases and vary over model architecture, size, and datasets. These datasets cover different modalities, such as computer vision, natural language processing, and scientific ML. For every model, we compute loss landscape metrics and validate full coverage of the phases. With this dataset, we provide the community with a resource with a wide range of potential applications for WSL and beyond. Evidence suggests the loss landscape phase plays a role in applications such as model training, analysis, or sparsification. We demonstrate this in an exploratory study of the downstream methods like transfer learning or model weights averaging.

[1492] Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention

Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang

Main category: cs.LG

TL;DR: SSMs struggle with long-context modeling despite sub-quadratic complexity. The paper introduces joint recall as a better synthetic task, proves SSMs cannot solve it efficiently, and proposes HAX (SSM + CDSA) which outperforms baselines on benchmarks.

Details

Motivation: Address the limitations of state-space models (SSMs) in capturing long-range dependencies effectively, despite their sub-quadratic time complexity advantage over Transformers.

Method: Propose joint recall as a more realistic synthetic task, theoretically analyze SSM limitations, and develop HAX - a hybrid approach combining SSMs with Context-Dependent Sparse Attention using locality-sensitive hashing.

Result: HAX consistently outperforms SSM baselines and SSMs with context-independent sparse attention on both synthetic and real-world long-context benchmarks.

Conclusion: The proposed HAX framework effectively bridges the gap between theoretical analysis and practical applications, providing an efficient solution for long-context modeling that overcomes SSM limitations.

Abstract: Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominant Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend the associative recall to a novel synthetic task, \emph{joint recall}, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).

[1493] Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Mudit Gaur, Prashant Trivedi, Sasidhar Kunapuli, Amrit Singh Bedi, Vaneet Aggarwal

Main category: cs.LG

TL;DR: This paper provides the first theoretical analysis of discrete-state diffusion models, achieving a state-of-the-art sample complexity bound of O~(ε⁻⁴) without assuming access to an empirical risk minimizer.

Details

Motivation: Discrete-state diffusion models are crucial for text, sequences, and combinatorial structures but remain poorly understood theoretically compared to continuous-state models. Existing analyses make unrealistic assumptions about access to empirical risk minimizers.

Method: The authors develop a principled theoretical framework that decomposes score estimation error into statistical and optimization components, providing insights into efficient training of diffusion models.

Result: The analysis achieves a sample complexity bound of O~(ε⁻⁴), which is state-of-the-art for discrete-state diffusion models and addresses fundamental gaps in existing literature.

Conclusion: This work establishes the theoretical tractability and practical relevance of diffusion models by providing rigorous analysis that bridges the gap between empirical success and theoretical understanding, particularly for discrete-state applications.

Abstract: Diffusion models have demonstrated remarkable performance in generating high-dimensional samples across domains such as vision, language, and the sciences. Although continuous-state diffusion models have been extensively studied both empirically and theoretically, discrete-state diffusion models, essential for applications involving text, sequences, and combinatorial structures, they remain significantly less understood from a theoretical standpoint. In particular, all existing analyses of discrete-state models assume access to an empirical risk minimizer. In this work, we present a principled theoretical framework analyzing diffusion models, providing a state-of-the-art sample complexity bound of $\widetilde{\mathcal{O}}(\epsilon^{-4})$. Our structured decomposition of the score estimation error into statistical and optimization components offers critical insights into how diffusion models can be trained efficiently. This analysis addresses a fundamental gap in the literature and establishes the theoretical tractability and practical relevance of diffusion models.

[1494] Sparsity Forcing: Reinforcing Token Sparsity of MLLMs

Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu

Main category: cs.LG

TL;DR: Sparsity Forcing is an RL-based post-training framework that explicitly enforces token sparsity in MLLMs, achieving 20-75% token reduction with minimal accuracy loss while speeding up decoding 3.3x.

Details

Motivation: Existing sparse attention methods either exploit inherent sparsity (plateauing at ~50% reduction) or use rigid patterns/regularizers without direct budget control, lacking ability to push sparsity further without accuracy loss.

Method: Uses RL-based post-training with multiple rollouts at different token budgets, formulating efficiency (token reduction) and performance (answer correctness) as joint rewards. Contrasts rollouts to reward efficient/correct answers and penalize inefficient/incorrect ones.

Result: Achieves 20-75% token reduction on Qwen2-VL/Qwen2.5-VL across 13 benchmarks with minimal accuracy decline, reduces long-context inference memory by 3x, and speeds up decoding by 3.3x.

Conclusion: Sparsity Forcing successfully turns token saving into an end-to-end, inference-consistent optimization objective, enabling significant efficiency gains while maintaining accuracy.

Abstract: Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model’s inherent sparsity and thus plateau at moderate budgets (about 50% token reduction), with little headroom to push budget lower without hurting accuracy. Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets. In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named \textit{Sparsity Forcing}. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards. By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective. Across thirteen image and video benchmarks, Sparsity Forcing raises token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20% to 75% with minimal accuracy decline, significantly reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.

[1495] One Token to Fool LLM-as-a-Judge

Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu

Main category: cs.LG

TL;DR: LLMs used as automated judges are vulnerable to reward hacking through superficial inputs called “master keys” (e.g., symbols like ‘:’ or generic reasoning phrases), which can consistently trigger false positive rewards without substantive reasoning.

Details

Motivation: To uncover critical vulnerabilities in LLM-based reward models used for evaluation and training, particularly in reference-based settings like RLVR, where models are increasingly trusted as automated judges.

Method: Systematic evaluation of vulnerability across diverse models, followed by proposing a data augmentation strategy using truncated model outputs as adversarial negative examples to create robust Master Reward Models (Master-RMs).

Result: Found widespread susceptibility to “master key” attacks across leading proprietary models (GPT-o1, Claude-4), with simple inputs consistently eliciting false rewards. Master-RMs achieved state-of-the-art robustness against these attacks while maintaining high standard evaluation performance.

Conclusion: LLM judges are not as robust as assumed, posing significant reliability threats. The proposed Master-RM approach effectively mitigates these vulnerabilities and provides insights for future research on robust LLM evaluation.

Abstract: Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term ‘‘master keys’’ such as non-word symbols (e.g., ‘’:’’ or ‘’.’’) or generic reasoning openers (e.g., ‘‘Thought process:’’ or ‘‘Let’s solve this problem step by step.’’), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these ‘‘master key’’ attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

[1496] ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation

Jian Liang, Wenke Huang, Xianda Guo, Guancheng Wan, Bo Du, Mang Ye

Main category: cs.LG

TL;DR: ThanoRA is a multi-task adaptation framework that allocates varying subspace dimensions based on task heterogeneity and uses diversity-preserving regularization to prevent task interference, achieving better performance than separate task-specific fine-tuning without additional inference overhead.

Details

Motivation: Existing multi-task adaptation methods either underperform compared to multi-task training (Model Merging with LoRA) or introduce inference overhead and mergeability issues (MoE-based LoRA approaches), limiting real-world deployment practicality.

Method: ThanoRA performs multi-task learning by tailoring subspace allocation at initialization based on inter-task heterogeneity, and enforces diversity preservation throughout training to mitigate task interference and subspace collapse.

Result: Extensive experiments across multimodal and text-only benchmarks show ThanoRA consistently outperforms strong baselines, even surpassing separate task-specific fine-tuning, while introducing no additional structures or inference overhead.

Conclusion: ThanoRA enables effective, efficient and unified multi-task downstream adaptation without introducing additional structure, making it practical for real-world deployment.

Abstract: Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models due to its efficiency and zero additional inference cost. Many real-world applications require foundation models to specialize in several specific tasks simultaneously, motivating the need for efficient multi-task downstream adaptation. To address this need, existing studies have primarily explored two directions: Model Merging with LoRA, which shows advantages in training-free scenarios but still lags behind multi-task training in overall performance; and MoE-based LoRA approaches, which improve multi-task learning performance but introduce routers that hinder the mergeability of LoRA parameters and incur considerable inference overhead, thereby limiting real-world deployment practicality. To this end, we propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework that enables effective, efficient and unified multi-task downstream adaptation without introducing additional structure. ThanoRA performs multi-task learning by tailoring subspace allocation at initialization and enforcing diversity preservation throughout training: it allocates varying dimensions to construct task-specific low-rank subspaces driven by inter-task heterogeneity, enabling fine-grained knowledge injection, while diversity-preserving regularization mitigates task interference and subspace collapse, thereby fully exploiting the low-rank capacity. Extensive experiments across multimodal and text-only benchmarks under varying multi-task mixtures demonstrate that ThanoRA consistently outperforms strong baselines, surpassing even separate task-specific fine-tuning, while introducing no additional structures or inference overhead. Our code will be publicly available at: https://github.com/LiangJian24/ThanoRA.

[1497] A Unified MDL-based Binning and Tensor Factorization Framework for PDF Estimation

Mustafa Musab, Joseph K. Chege, Arie Yeredor, Martin Haardt

Main category: cs.LG

TL;DR: A novel non-parametric density estimation method using MDL-based binning with quantile cuts and tensor factorization via canonical polyadic decomposition.

Details

Motivation: Conventional histogram-based density estimators fail to capture local variations in multimodal distributions and have discontinuities that hinder gradient-based optimization and other applications requiring smooth derivatives.

Method: Uses minimum description length (MDL)-based binning with quantile cuts and tensor factorization through canonical polyadic decomposition (CPD) of joint probability tensors.

Result: Demonstrated effectiveness on synthetic data and a challenging real dry bean classification dataset.

Conclusion: The proposed approach provides a reliable non-parametric density estimation method that overcomes limitations of traditional histograms for complex multimodal distributions.

Abstract: Reliable density estimation is fundamental for numerous applications in statistics and machine learning. In many practical scenarios, data are best modeled as mixtures of component densities that capture complex and multimodal patterns. However, conventional density estimators based on uniform histograms often fail to capture local variations, especially when the underlying distribution is highly nonuniform. Furthermore, the inherent discontinuity of histograms poses challenges for tasks requiring smooth derivatives, such as gradient-based optimization, clustering, and nonparametric discriminant analysis. In this work, we present a novel non-parametric approach for multivariate probability density function (PDF) estimation that utilizes minimum description length (MDL)-based binning with quantile cuts. Our approach builds upon tensor factorization techniques, leveraging the canonical polyadic decomposition (CPD) of a joint probability tensor. We demonstrate the effectiveness of our method on synthetic data and a challenging real dry bean classification dataset.

[1498] LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning

Junyu Chen, Junzhuo Li, Zhen Peng, Wenjie Wang, Yuxiang Ren, Long Shi, Xuming Hu

Main category: cs.LG

TL;DR: LoTA-QAF is a novel fine-tuning method for quantized LLMs that enables lossless merging of ternary adaptation weights into quantized weights and adjusts all quantized weights, overcoming challenges in quantization-aware fine-tuning.

Details

Motivation: Address challenges in fine-tuning quantized LLMs including data type mismatch between low-precision quantized weights and high-precision adaptation weights, accuracy degradation during merging, and lack of lossless merging methods.

Method: Uses ternary adaptation aligned with quantization grid, TA-based lossless merging mechanism, and ternary signed gradient descent (t-SignSGD) for updating TA weights.

Result: On MMLU benchmark, recovers performance for quantized models and surpasses 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, outperforms other methods while 16-bit LoRA achieves superior results.

Conclusion: LoTA-QAF effectively addresses quantization-aware fine-tuning challenges and enables efficient deployment of LLMs on resource-constrained edge devices.

Abstract: Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from: First, the mismatch in data types between the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit). This mismatch limits the computational efficiency advantage offered by quantized weights during inference. Second, potential accuracy degradation when merging these high-precision adaptation weights into the low-precision quantized weights, as the adaptation weights often necessitate approximation or truncation. Third, as far as we know, no existing methods support the lossless merging of adaptation while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for quantization-aware fine-tuning (LoTA-QAF). This is a novel fine-tuning method specifically designed for quantized LLMs, enabling the lossless merging of ternary adaptation weights into quantized weights and the adjustment of all quantized weights. LoTA-QAF operates through a combination of: i) A custom-designed ternary adaptation (TA) that aligns ternary weights with the quantization grid and uses these ternary weights to adjust quantized weights. ii) A TA-based mechanism that enables the lossless merging of adaptation weights. iii) Ternary signed gradient descent (t-SignSGD) for updating the TA weights. We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods. Code: github.com/KingdalfGoodman/LoTA-QAF.

[1499] Localized Diffusion Models

Georg A. Gottwald, Shuigen Liu, Youssef Marzouk, Sebastian Reich, Xin T. Tong

Main category: cs.LG

TL;DR: Localized diffusion models exploit locality structure in target distributions to reduce dimensionality, enabling more efficient training with lower sample complexity while maintaining performance.

Details

Motivation: Diffusion models suffer from the curse of dimensionality when estimating high-dimensional score functions, but many target distributions have sparse conditional dependencies (locality structure) that can be exploited for more efficient training.

Method: Proposes localized diffusion models that use a localized score matching loss to train score functions within a localized hypothesis space, leveraging locality structure to reduce dimensionality.

Result: Theoretical analysis shows localized diffusion models can circumvent the curse of dimensionality with moderate localization radius balancing statistical and localization errors. Enables parallel training for large-scale applications.

Conclusion: Localized diffusion models effectively exploit locality structure in target distributions to achieve better performance with reduced sample complexity, making them more efficient for large-scale generative tasks.

Abstract: Diffusion models are state-of-the-art tools for various generative tasks. Yet training these models involves estimating high-dimensional score functions, which in principle suffers from the curse of dimensionality. It is therefore important to understand how low-dimensional structure in the target distribution can be exploited in these models. Here we consider locality structure, which describes certain sparse conditional dependencies among the target random variables. Given some locality structure, the score function is effectively low-dimensional, so that it can be estimated by a localized neural network with significantly reduced sample complexity. This observation motivates the localized diffusion model, where a localized score matching loss is used to train the score function within a localized hypothesis space. We prove that such localization enables diffusion models to circumvent the curse of dimensionality, at the price of additional localization error. Under realistic sample size scaling, we then show both theoretically and numerically that a moderate localization radius can balance the statistical and localization errors, yielding better overall performance. Localized structure also facilitates parallel training, making localized diffusion models potentially more efficient for large-scale applications.

[1500] Probabilistic Soundness Guarantees in LLM Reasoning Chains

Weiqiu You, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong

Main category: cs.LG

TL;DR: ARES is a probabilistic framework that detects propagated errors in LLM reasoning chains by evaluating each step based only on previously-verified premises, providing certified statistical guarantees.

Details

Motivation: Initial errors in LLM reasoning chains often propagate and undermine final conclusions, and current error detection methods fail to detect these propagated errors because earlier errors corrupt downstream reasoning judgments.

Method: Autoregressive Reasoning Entailment Stability (ARES) framework that uses an inductive approach to evaluate each reasoning step based solely on previously-verified premises, providing nuanced scores with certified statistical guarantees.

Result: ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, excelling at detecting propagated errors (90.3% F1, +27.6 points).

Conclusion: ARES provides an effective probabilistic framework for detecting propagated errors in LLM reasoning chains with certified guarantees, significantly outperforming existing methods, especially on long reasoning chains.

Abstract: In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because earlier errors can corrupt judgments of downstream reasoning. To better detect such errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a probabilistic framework that evaluates each reasoning step based solely on previously-verified premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).

[1501] Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao

Main category: cs.LG

TL;DR: MLLMs struggle with distinguishing relevant from irrelevant cross-modal signals, leading to performance degradation in unimodal tasks due to modality interference. The paper proposes a perturbation-based framework with data augmentation and consistency regularization to improve robustness.

Details

Motivation: Multimodal LLMs have difficulty fairly evaluating all modalities and are vulnerable to spurious inputs from irrelevant modalities, especially in modality-specific tasks where they should rely on only one modality.

Method: Proposes a perturbation-based framework including: 1) data augmentations with heuristic and adversarial perturbations, and 2) consistency regularization applied to model outputs with original and perturbed inputs.

Result: Experiments on multiple benchmark datasets and model families show significant improvements in robustness and cross-modality competency, enhancing both unimodal reasoning and multimodal task performance.

Conclusion: The proposed framework effectively mitigates modality interference in MLLMs, improving their ability to handle unimodal tasks while maintaining strong multimodal performance.

Abstract: Multimodal Large Language Models have demonstrated impressive capabilities across tasks, yet they often exhibit difficulty in distinguishing task-relevant from irrelevant signals – particularly in tasks like Visual Question Answering – which can lead to susceptibility to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem – the model’s inability to fairly evaluate all modalities. This vulnerability becomes more evident in modality-specific tasks – such as image classification or pure text question answering – where models are expected to rely solely on one modality. In such tasks, spurious information from irrelevant modalities often leads to significant performance degradation. We refer to this failure as Modality Interference, which serves as a concrete and measurable instance of the cross-modality competency problem, and we further design a perturbation-based causal diagnostic experiment to verify and quantify this problem. To mitigate modality interference, we propose a novel framework to finetune MLLMs, including perturbation-based data augmentations with both heuristic perturbations and adversarial perturbations, and a consistency regularization strategy applying on model outputs with original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy and multimodal tasks) and multiple model families with different scales demonstrate significant improvements in robustness and cross-modality competency, indicating our method’s effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks.

[1502] Generating Full-field Evolution of Physical Dynamics from Irregular Sparse Observations

Panqi Chen, Yifan Sun, Lei Cheng, Yang Yang, Weichang Li, Yang Liu, Weiqing Liu, Jiang Bian, Shikai Fang

Main category: cs.LG

TL;DR: SDIFT is a sequential diffusion framework that generates full-field physical dynamics from sparse, irregular observations using functional Tucker space representation and message-passing posterior sampling.

Details

Motivation: Current diffusion-based methods struggle with sparse, off-grid observations of continuous physical dynamics, requiring a framework that can handle irregular data across different scales.

Method: Uses functional Tucker model as latent space representer, sequential diffusion with temporally augmented UNet in functional Tucker space, and message-passing posterior sampling for conditional generation.

Result: Validated on three physical systems (supernova explosions, ocean sound speed fields, organic liquid), showing significant improvements in reconstruction accuracy and computational efficiency over state-of-the-art approaches.

Conclusion: SDIFT effectively bridges the gap between sparse observations and full-field physical dynamics reconstruction across multiple domains and scales.

Abstract: Modeling and reconstructing multidimensional physical dynamics from sparse and off-grid observations presents a fundamental challenge in scientific research. Recently, diffusion-based generative modeling shows promising potential for physical simulation. However, current approaches typically operate on on-grid data with preset spatiotemporal resolution, but struggle with the sparsely observed and continuous nature of real-world physical dynamics. To fill the gaps, we present SDIFT, Sequential DIffusion in Functional Tucker space, a novel framework that generates full-field evolution of physical dynamics from irregular sparse observations. SDIFT leverages the functional Tucker model as the latent space representer with proven universal approximation property, and represents observations as latent functions and Tucker core sequences. We then construct a sequential diffusion model with temporally augmented UNet in the functional Tucker space, denoising noise drawn from a Gaussian process to generate the sequence of core tensors. At the posterior sampling stage, we propose a Message-Passing Posterior Sampling mechanism, enabling conditional generation of the entire sequence guided by observations at limited time steps. We validate SDIFT on three physical systems spanning astronomical (supernova explosions, light-year scale), environmental (ocean sound speed fields, kilometer scale), and molecular (organic liquid, millimeter scale) domains, demonstrating significant improvements in both reconstruction accuracy and computational efficiency compared to state-of-the-art approaches.

[1503] A Markov Categorical Framework for Language Modeling

Yifan Zhang

Main category: cs.LG

TL;DR: A new compositional framework using Markov categories to unify understanding of language models’ training objectives, representation geometry, and practical capabilities, revealing how NLL training induces spectral alignment and explains multi-token prediction success.

Details

Motivation: To develop a unified theory explaining autoregressive language models' internal mechanisms, how training shapes representations, and enables complex behaviors, which remains elusive despite remarkable performance.

Method: Introduces a compositional framework modeling single-step generation as information-processing stages using Markov categories, connecting training objective, representation geometry, and model capabilities through categorical entropy and linear-softmax analysis.

Result: Provides information-theoretic rationale for multi-token prediction success, shows NLL training compels learning of data’s conditional uncertainty, and proves spectral alignment where representation space aligns with eigenspectrum of predictive similarity operator under bounded features.

Conclusion: The framework offers a powerful new lens for understanding information flow through models and how training objectives shape internal geometry, unifying previously isolated aspects of language modeling.

Abstract: Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes their representations, and enables complex behaviors, remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective provides a unified mathematical language to connect three critical aspects of language modeling that are typically studied in isolation: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework provides a precise information-theoretic rationale for the success of multi-token prediction methods like speculative decoding, quantifying the information surplus a model’s hidden state contains about tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective compels the model to learn not just the next word, but also the data’s intrinsic conditional uncertainty, a process we formalize using categorical entropy. Our central result shows that, under a linear-softmax head with bounded features, minimizing NLL induces spectral alignment: the learned representation space aligns with the eigenspectrum of a predictive similarity operator. This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.

[1504] ePC: Overcoming Exponential Signal Decay in Deep Predictive Coding Networks

Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester

Main category: cs.LG

TL;DR: This paper introduces error-based Predictive Coding (ePC), a novel reparameterization that solves the signal decay problem in state-based PC, enabling efficient training of deep neural networks that matches backpropagation performance.

Details

Motivation: Predictive Coding offers a biologically plausible alternative to backpropagation but struggles with deeper architectures due to inherent signal decay problems that scale exponentially with depth.

Method: The authors introduce error-based PC (ePC) which optimizes over prediction errors rather than states, allowing signals to reach all layers simultaneously and unattenuated without suffering from signal decay.

Result: Experiments show ePC converges orders of magnitude faster than state-based PC and matches backpropagation’s performance even for deeper models where traditional PC struggles.

Conclusion: The work provides theoretical insight into PC dynamics and establishes a foundation for scaling bio-inspired learning to deeper architectures on digital hardware.

Abstract: Predictive Coding (PC) offers a biologically plausible alternative to backpropagation for neural network training, yet struggles with deeper architectures. This paper identifies the root cause and provides a principled solution. We uncover that the canonical state-based formulation of PC (sPC) is, by design, deeply inefficient on digital hardware, due to an inherent signal decay problem that scales exponentially with depth. To address this fundamental limitation, we introduce a novel reparameterization of PC, named error-based PC (ePC), which does not suffer from signal decay. By optimizing over prediction errors rather than states, ePC enables signals to reach all layers simultaneously and unattenuated, converging orders of magnitude faster than sPC. Experiments across multiple architectures and datasets demonstrate that ePC matches backpropagation’s performance even for deeper models where sPC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling bio-inspired learning to deeper architectures on digital hardware and beyond.

[1505] RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours

Rafael Pablos Sarabia, Joachim Nyborg, Morten Birk, Jeppe Liborius Sjørup, Anders Lillevang Vesterholt, Ira Assent

Main category: cs.LG

TL;DR: A deep learning model for high-resolution probabilistic precipitation forecasting in Europe that integrates radar, satellite, and NWP data to produce accurate 8-hour forecasts with uncertainty quantification.

Details

Motivation: Overcome limitations of radar-only deep learning models with short forecast lead times and improve precipitation forecasting accuracy.

Method: Efficiently integrates multiple data sources (radar, satellite, NWP) using a compact architecture that captures long-range interactions and enables probabilistic forecasting.

Result: Surpasses current operational NWP systems, extrapolation-based methods, and deep-learning nowcasting models, setting new standards for European precipitation forecasting.

Conclusion: The model achieves a balance between accuracy, interpretability, and computational efficiency for high-resolution precipitation forecasting.

Abstract: We present a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe, overcoming the limitations of radar-only deep learning models with short forecast lead times. Our model efficiently integrates multiple data sources - including radar, satellite, and physics-based numerical weather prediction (NWP) - while capturing long-range interactions, resulting in accurate forecasts with robust uncertainty quantification through consistent probabilistic maps. Featuring a compact architecture, it enables more efficient training and faster inference than existing models. Extensive experiments demonstrate that our model surpasses current operational NWP systems, extrapolation-based methods, and deep-learning nowcasting models, setting a new standard for high-resolution precipitation forecasting in Europe, ensuring a balance between accuracy, interpretability, and computational efficiency.

[1506] Can Language Models Discover Scaling Laws?

Haowei Lin, Haotian Ye, Wenzheng Feng, Quzhe Huang, Yujun Li, Hubert Lim, Zhengrui Li, Xiangyu Wang, Jianzhu Ma, James Zou, Yitao Liang

Main category: cs.LG

TL;DR: SLDAgent is an evolution-based AI agent that automatically discovers scaling laws for predicting model performance, outperforming human-derived counterparts across diverse tasks.

Details

Motivation: Discovering scaling laws for model performance prediction currently relies on slow human experimentation, creating a need for automated approaches using LLMs.

Method: SLDAgent uses an evolution-based approach that co-optimizes scaling law models and parameters, enabling autonomous exploration of complex variable relationships.

Result: SLDAgent discovered scaling laws that consistently provide more accurate extrapolation than established human-derived counterparts across all seven diverse tasks.

Conclusion: This establishes a new paradigm for agentic scientific discovery, showing AI systems can understand their own scaling behavior and contribute novel practical knowledge.

Abstract: Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate seven diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimize the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrates that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior, and can contribute novel and practical knowledge back to the research community.

[1507] Variational Deep Learning via Implicit Regularization

Jonathan Wenger, Beau Coker, Juraj Marusic, John P. Cunningham

Main category: cs.LG

TL;DR: Proposes a method to regularize variational neural networks using the implicit bias of gradient descent, achieving strong in- and out-of-distribution performance without additional hyperparameter tuning.

Details

Motivation: Deep neural networks generalize well in-distribution but are non-robust out-of-distribution. Bayesian deep learning addresses this but requires significant resources and careful priors. The authors aim to leverage implicit regularization instead.

Method: Regularize variational neural networks using the implicit bias of (stochastic) gradient descent. Characterize this inductive bias theoretically in overparametrized linear models as generalized variational inference.

Result: The approach demonstrates strong in- and out-of-distribution performance without additional hyperparameter tuning and with minimal computational overhead.

Conclusion: Implicit bias of gradient descent can effectively regularize variational neural networks, providing robust performance without the computational burden of traditional Bayesian methods.

Abstract: Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Instead, current theory credits implicit regularization imposed by the choice of architecture, hyperparameters and optimization procedure. However, deep neural networks can be surprisingly non-robust, resulting in overconfident predictions and poor out-of-distribution generalization. Bayesian deep learning addresses this via model averaging, but typically requires significant computational resources as well as carefully elicited priors to avoid overriding the benefits of implicit regularization. Instead, in this work, we propose to regularize variational neural networks solely by relying on the implicit bias of (stochastic) gradient descent. We theoretically characterize this inductive bias in overparametrized linear models as generalized variational inference and demonstrate the importance of the choice of parametrization. Empirically, our approach demonstrates strong in- and out-of-distribution performance without additional hyperparameter tuning and with minimal computational overhead.

[1508] The Final Layer Holds the Key: A Unified and Efficient GNN Calibration Framework

Jincheng Huang, Jie Xu, Xiaoshuang Shi, Ping Hu, Lei Feng, Xiaofeng Zhu

Main category: cs.LG

TL;DR: A simple graph calibration method that addresses GNN under-confidence by reducing final-layer weight decay for class-centroid-level calibration and node-level calibration to improve prediction reliability.

Details

Motivation: GNNs often exhibit miscalibrated predictive confidence (typically under-confidence), which harms decision reliability. Existing methods add extra components that don't capture intrinsic model-confidence relationships, leading to limited guarantees and higher computational costs.

Method: Proposes a unified theoretical framework showing model confidence is governed by class-centroid-level and node-level calibration. Reduces final-layer weight decay for class-centroid calibration and uses node-level calibration to bring test nodes closer to predicted class centroids in final-layer representations.

Result: Extensive experiments validate the method’s superiority in addressing GNN under-confidence and improving calibration.

Conclusion: The proposed simple yet efficient graph calibration method effectively addresses GNN under-confidence through theoretical insights about class-centroid and node-level calibration mechanisms.

Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness on graph-based tasks. However, their predictive confidence is often miscalibrated, typically exhibiting under-confidence, which harms the reliability of their decisions. Existing calibration methods for GNNs normally introduce additional calibration components, which fail to capture the intrinsic relationship between the model and the prediction confidence, resulting in limited theoretical guarantees and increased computational overhead. To address this issue, we propose a simple yet efficient graph calibration method. We establish a unified theoretical framework revealing that model confidence is jointly governed by class-centroid-level and node-level calibration at the final layer. Based on this insight, we theoretically show that reducing the weight decay of the final-layer parameters alleviates GNN under-confidence by acting on the class-centroid level, while node-level calibration acts as a finer-grained complement to class-centroid level calibration, which encourages each test node to be closer to its predicted class centroid at the final-layer representations. Extensive experiments validate the superiority of our method.

[1509] AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models

Xuan Lin, Long Chen, Yile Wang

Main category: cs.LG

TL;DR: AttriLens-Mol is an attribute-guided reinforcement learning framework that improves molecular property prediction by steering LLM reasoning through format, count, and rationality rewards to elicit relevant molecular attributes.

Details

Motivation: Current LLMs for molecular property prediction rely on human-crafted prompts and chain-of-thought templates, while advanced reasoning models like DeepSeek-R1 can produce verbose and irrelevant reasoning. There's a need to better guide LLM reasoning to extract relevant molecular attributes.

Method: Uses reinforcement learning with three rewards: format reward for structured attribute-based output, count reward to avoid irrelevant attributes, and rationality reward using advanced LLMs and RDKit to verify attribute relatedness. Trained on 4,000 samples with 7B-size models.

Result: Significantly boosts performance on both in-distribution and out-of-distribution datasets, achieving comparable or better results than supervised fine-tuning models and advanced models like GPT-3.5, GPT-4o, and DeepSeek variants. Extracted attributes also improve interpretable decision tree performance.

Conclusion: AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for molecular property prediction tasks.

Abstract: Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended ``thinking’’ process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model’s reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model’s inherent knowledge of relevant molecular attributes during reasoning, enables making predictions for the molecular property more effectively. Experiments on both in-distribution and out-of-distribution datasets show that, training both 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts the performance, getting comparable or better results than supervised fine-tuning models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code in https://github.com/szu-tera/AttriLens-Mol.

[1510] PDFBench: A Benchmark for De novo Protein Design from Function

Jiahao Kuang, Nuowei Liu, Jie Wang, Changzhi Sun, Tao Ji, Yuanbin Wu

Main category: cs.LG

TL;DR: PDFBench is the first comprehensive benchmark for function-guided protein design, evaluating 8 models on 16 metrics across description-guided and keyword-guided design settings.

Details

Motivation: The field lacks a unified evaluation framework, with current models assessed using inconsistent metrics that prevent fair comparison and understanding of evaluation criteria relationships.

Method: Systematically evaluates 8 state-of-the-art models on 16 metrics across two settings: description-guided design (using repurposed Mol-Instructions dataset) and keyword-guided design (using new SwissTest dataset with strict datetime cutoff).

Result: PDFBench enables more reliable model comparisons and provides key insights into metric correlations to guide future research.

Conclusion: PDFBench addresses the critical gap in protein design evaluation by providing the first comprehensive benchmark framework for fair model comparison and research guidance.

Abstract: Function-guided protein design is a crucial task with significant applications in drug discovery and enzyme engineering. However, the field lacks a unified and comprehensive evaluation framework. Current models are assessed using inconsistent and limited subsets of metrics, which prevents fair comparison and a clear understanding of the relationships between different evaluation criteria. To address this gap, we introduce PDFBench, the first comprehensive benchmark for function-guided denovo protein design. Our benchmark systematically evaluates eight state-of-the-art models on 16 metrics across two key settings: description-guided design, for which we repurpose the Mol-Instructions dataset, originally lacking quantitative benchmarking, and keyword-guided design, for which we introduce a new test set, SwissTest, created with a strict datetime cutoff to ensure data integrity. By benchmarking across a wide array of metrics and analyzing their correlations, PDFBench enables more reliable model comparisons and provides key insights to guide future research.

[1511] MINGLE: Mixture of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging

Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li

Main category: cs.LG

TL;DR: MINGLE is a novel framework for Test-Time Continual Model Merging that uses mixture-of-experts architecture with null-space constrained gating and adaptive relaxation to handle parameter interference and distribution shifts during inference.

Details

Motivation: Address challenges in continual model merging: parameter interference causing catastrophic forgetting and limited adaptability to evolving test distributions.

Method: Uses mixture-of-experts with low-rank experts, null-space constrained gating to restrict updates to orthogonal subspaces, and adaptive relaxation strategy to balance stability and adaptability.

Result: Achieves robust generalization, significantly reduces forgetting, and surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders.

Conclusion: MINGLE effectively addresses parameter conflicts and distribution shifts in continual model merging through test-time adaptation and constrained optimization.

Abstract: Continual model merging integrates independently fine-tuned models sequentially without access to the original training data, offering a scalable and efficient solution for continual learning. However, existing methods face two critical challenges: parameter interference among tasks, which leads to catastrophic forgetting, and limited adaptability to evolving test distributions. To address these issues, we introduce the task of Test-Time Continual Model Merging (TTCMM), which leverages a small set of unlabeled test samples during inference to alleviate parameter conflicts and handle distribution shifts. We propose MINGLE, a novel framework for TTCMM. MINGLE employs a mixture-of-experts architecture with parameter-efficient, low-rank experts, which enhances adaptability to evolving test distributions while dynamically merging models to mitigate conflicts. To further reduce forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations, thereby suppressing activations on old tasks and preserving past knowledge. We further introduce an Adaptive Relaxation Strategy that adjusts constraint strength dynamically based on interference signals observed during test-time adaptation, striking a balance between stability and adaptability. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, significantly reduces forgetting, and consistently surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders. Our code is available at: https://github.com/zihuanqiu/MINGLE

[1512] Attention Layers Add Into Low-Dimensional Residual Subspaces

Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu

Main category: cs.LG

TL;DR: Attention outputs in transformers are confined to surprisingly low-dimensional subspaces, causing dead feature problems in sparse dictionary learning. A subspace-constrained training method is proposed to reduce dead features from 87% to below 1%.

Details

Motivation: To understand why sparse dictionary learning methods suffer from dead features and to address the mismatch between randomly initialized features and the intrinsic low-dimensional geometry of attention activation spaces.

Method: Proposed a subspace-constrained training method for sparse autoencoders that initializes feature directions into the active subspace of activations, leveraging the discovered low-rank structure of attention outputs.

Result: The method dramatically reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features, and can be extended to other sparse dictionary learning methods.

Conclusion: The findings provide new insights into the geometry of attention mechanisms and practical tools for improving sparse dictionary learning in large language models.

Abstract: Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are confined to a surprisingly low-dimensional subspace, where about 60% of the directions account for 99% of the variance–a phenomenon that is consistently observed across diverse model families and datasets, and is induced by the attention output projection matrix. Critically, we find this low-rank structure as a key factor of the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs), initializing feature directions into the active subspace of activations. Our approach reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features, and can further extend to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.

[1513] Equivariant Spherical Transformer for Efficient Molecular Modeling

Junyi An, Xinyu Lu, Chao Qu, Yunfei Shi, Peijia Lin, Qianwei Tang, Licheng Xu, Fenglei Cao, Yuan Qi

Main category: cs.LG

TL;DR: The paper introduces Equivariant Spherical Transformer (EST), a novel framework that applies Transformer architecture to spherical Fourier domain to enhance expressiveness of equivariant GNNs while preserving equivariance.

Details

Motivation: Current equivariant GNNs using Clebsch-Gordan tensor product convolutions have limited expressiveness due to restricted non-linearity and low degree of group representations.

Method: EST applies Transformer-like architecture to Fourier spatial domain of group representations using uniform sampling strategy of spherical Fourier transforms.

Result: EST-based models achieve state-of-the-art performance on OC20 and QM9 benchmarks, with small EST models outperforming larger models and those using additional data on complex molecular systems.

Conclusion: EST provides strong expressiveness while preserving equivariance, paving the way for new research in equivariant graph neural networks.

Abstract: Equivariant Graph Neural Networks (GNNs) have significantly advanced the modeling of 3D molecular structure by leveraging group representations. However, their message passing, heavily relying on Clebsch-Gordan tensor product convolutions, suffers from restricted expressiveness due to the limited non-linearity and low degree of group representations. To overcome this, we introduce the Equivariant Spherical Transformer (EST), a novel plug-and-play framework that applies a Transformer-like architecture to the Fourier spatial domain of group representations. EST achieves higher expressiveness than conventional models while preserving the crucial equivariant inductive bias through a uniform sampling strategy of spherical Fourier transforms. As demonstrated by our experiments on challenging benchmarks like OC20 and QM9, EST-based models achieve state-of-the-art performance. For the complex molecular systems within OC20, small models empowered by EST can outperform some larger models and those using additional data. In addition to demonstrating such strong expressiveness,we provide both theoretical and experimental validation of EST’s equivariance as well, paving the way for new research in this area.

[1514] AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections

Xin Yu, Yujia Wang, Jinghui Chen, Lingzhou Xue

Main category: cs.LG

TL;DR: AltLoRA is an alternating projection method that improves upon LoRA by integrating momentum into low-rank adaptation without increasing memory complexity, achieving better performance while maintaining memory efficiency.

Details

Motivation: LoRA suffers from sub-optimal performance compared to full fine-tuning and recent variants like LoRA-Pro have issues with non-unique solutions and high memory costs when incorporating momentum optimization.

Method: AltLoRA uses an alternating projection method that avoids gradient approximation difficulties from joint update designs and integrates momentum without higher memory complexity.

Result: Extensive experiments show AltLoRA outperforms LoRA and its variants, narrowing the performance gap with full fine-tuning while preserving superior memory efficiency.

Conclusion: AltLoRA provides a more effective approach for low-rank adaptation with convergence guarantees, stable feature learning, and transformation invariance robustness.

Abstract: Low-Rank Adaptation (LoRA) has emerged as an effective technique for reducing memory overhead in fine-tuning large language models. However, it often suffers from sub-optimal performance compared with full fine-tuning since the update is constrained in the low-rank space. Recent variants such as LoRA-Pro attempt to mitigate this by adjusting the gradients of the low-rank matrices to approximate the full gradient. However, LoRA-Pro’s solution is not unique, and different solutions can lead to significantly varying performance in ablation studies. Besides, to incorporate momentum or adaptive optimization design, approaches like LoRA-Pro must first compute the equivalent gradient, causing a higher memory cost close to full fine-tuning. A key challenge remains in integrating momentum properly into the low-rank space with lower memory cost. In this work, we propose AltLoRA, an alternating projection method that avoids the difficulties in gradient approximation brought by the joint update design, meanwhile integrating momentum without higher memory complexity. Our theoretical analysis provides convergence guarantees and further shows that AltLoRA enables stable feature learning and robustness to transformation invariance. Extensive experiments across multiple tasks demonstrate that AltLoRA outperforms LoRA and its variants, narrowing the gap toward full fine-tuning while preserving superior memory efficiency.

[1515] TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

Dan Zhang, Min Cai, Jonathan Light, Ziniu Hu, Yisong Yue, Jie Tang

Main category: cs.LG

TL;DR: TDRM improves reward models by minimizing temporal differences, leading to more stable RL training and better performance in both inference-time verification and training-time reinforcement learning.

Details

Motivation: Existing reward models lack temporal consistency, causing ineffective policy updates and unstable RL training in language models.

Method: Introduces TDRM, a method that learns smoother reward models by minimizing temporal differences (TD) for both training-time RL and inference-time verification.

Result: TD-trained process reward models improve Best-of-N (up to 6.6%) and tree-search (up to 23.7%) performance. Combined with RLVR, they achieve comparable performance with 2.5k data vs 50.1k for baselines, and yield higher-quality policies across 8 model variants.

Conclusion: TDRM enables more data-efficient RL training and produces higher-quality language model policies through temporally consistent reward modeling.

Abstract: Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences (TD) for training-time reinforcement learning and inference-time verification. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL – achieving comparable performance with just 2.5k data to what baseline methods require 50.1k data to attain – and yield higher-quality language model policies in 8 model variants (5 series), e.g., Qwen2.5-(0.5B, 1,5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.

[1516] ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration

Xianglong Yan, Zhiteng Li, Tianao Zhang, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang

Main category: cs.LG

TL;DR: ReCalKV is a post-training KV cache compression method that uses head-wise similarity reordering for Keys and offline value calibration for Values to achieve efficient long-context inference with minimal performance loss.

Details

Motivation: Current KV cache compression methods neglect the distinct roles and varying importance of Keys and Values, leading to significant performance drops under high compression, which constrains long-context reasoning in LLMs.

Method: Proposes Head-wise Similarity aware Reordering (HSR) for Keys to cluster similar heads and use grouped SVD, and Offline Value Calibration (OVC) for Values to calibrate the value projection matrix using calibration data without training.

Result: Extensive experiments show ReCalKV consistently outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss.

Conclusion: ReCalKV effectively addresses KV cache compression by treating Keys and Values separately with tailored strategies, enabling efficient long-context inference in LLMs.

Abstract: Large language models (LLMs) have demonstrated remarkable performance, but their long-context reasoning remains constrained by the excessive memory required for the Key-Value (KV) cache. This makes KV cache compression a critical step toward efficient long-context inference. Recent methods have explored low-rank techniques to reduce the hidden size of the KV cache. However, they neglect the distinct roles and varying importance of Keys and Values, leading to significant performance drops under high compression. To address this, we propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation via grouped SVD. For Values, we propose Offline Value Calibration (OVC), which efficiently calibrates the value projection matrix using calibration data without training, ensuring an accurate representation of contextual information. Extensive experiments show that ReCalKV consistently outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. The code and models will be available at:https://github.com/XIANGLONGYAN/ReCalKV.

[1517] Hamiltonian Neural PDE Solvers through Functional Approximation

Anthony Zhou, Amir Barati Farimani

Main category: cs.LG

TL;DR: Hamiltonian Neural Solver (HNS) extends Hamiltonian mechanics to neural PDE solvers by representing Hamiltonian functionals as neural fields, enabling conservation of energy-like quantities and improved stability.

Details

Motivation: Many physical phenomena are governed by PDEs with Hamiltonian structure, but existing neural network approaches have been limited to discrete, analytically solvable systems rather than infinite-dimensional fields.

Method: Represent Hamiltonian functional as kernel integral parameterized by neural field, use automatic differentiation to calculate functional derivatives, and learn in the gradient domain.

Result: HNS shows improved stability and conservation of energy-like quantities across 1D and 2D PDEs, with better generalization to longer time horizons and unseen initial conditions.

Conclusion: Hamiltonian framework provides principled way to ensure conservation laws in neural PDE solvers, making them effective surrogate models for physical systems.

Abstract: Designing neural networks within a Hamiltonian framework offers a principled way to ensure that conservation laws are respected in physical systems. While promising, these capabilities have been largely limited to discrete, analytically solvable systems. In contrast, many physical phenomena are governed by PDEs, which govern infinite-dimensional fields through Hamiltonian functionals and their functional derivatives. Building on prior work, we represent the Hamiltonian functional as a kernel integral parameterized by a neural field, enabling learnable function-to-scalar mappings and the use of automatic differentiation to calculate functional derivatives. This allows for an extension of Hamiltonian mechanics to neural PDE solvers by predicting a functional and learning in the gradient domain. We show that the resulting Hamiltonian Neural Solver (HNS) can be an effective surrogate model through improved stability and conserving energy-like quantities across 1D and 2D PDEs. This ability to respect conservation laws also allows HNS models to better generalize to longer time horizons or unseen initial conditions.

[1518] Causes and Consequences of Representational Similarity in Machine Learning Models

Zeyu Michael Li, Hung Anh Vu, Damilola Awofisayo, Emily Wenger

Main category: cs.LG

TL;DR: This paper investigates how dataset overlap and task overlap influence model representational similarity across modalities and model sizes, finding both factors increase similarity and combining them has the strongest effect, with downstream consequences including increased vulnerability to transferable attacks.

Details

Motivation: To understand the causes of similarity in how machine learning models represent the world across modalities, as previous work has noted these similarities but little has explored their underlying causes.

Method: Conducted experiments across model sizes and modalities (from small classifiers to large language models) to evaluate the effects of dataset overlap and task overlap on downstream model similarity.

Result: Both task overlap and dataset overlap cause higher representational similarity, with the strongest effect occurring when both factors are combined. Greater similarity also increases vulnerability to transferable adversarial and jailbreak attacks.

Conclusion: Dataset and task overlap are key factors driving representational similarity across models, and this increased similarity has practical security implications by making models more vulnerable to transferable attacks.

Abstract: Numerous works have noted similarities in how machine learning models represent the world, even across modalities. Although much effort has been devoted to uncovering properties and metrics on which these models align, surprisingly little work has explored causes of this similarity. To advance this line of inquiry, this work explores how two factors - dataset overlap and task overlap - influence downstream model similarity. We evaluate the effects of both factors through experiments across model sizes and modalities, from small classifiers to large language models. We find that both task and dataset overlap cause higher representational similarity and that combining them provides the strongest effect. Finally, we consider downstream consequences of representational similarity, demonstrating how greater similarity increases vulnerability to transferable adversarial and jailbreak attacks.

[1519] VAMO: Efficient Zeroth-Order Variance Reduction for SGD with Faster Convergence

Jiahe Chen, Ziye Ma

Main category: cs.LG

TL;DR: VAMO is a hybrid optimizer combining first-order and zeroth-order methods to achieve dimension-agnostic convergence with reduced memory requirements, outperforming both FO and ZO baselines.

Details

Motivation: Address the trade-off between first-order methods' high computational/memory costs and zeroth-order methods' slow convergence in high-dimensional deep learning problems.

Method: Stochastic variance-reduced method extending mini-batch SGD with full-batch ZO gradients using SVRG framework, featuring hybrid design with two-point ZO estimator and multi-point variant.

Result: Achieves dimension-agnostic convergence rate of O(1/T + 1/b), surpassing SGD’s O(1/√T) and ZO methods’ dimension-dependent rates, with smaller memory footprint than FO baselines.

Conclusion: VAMO provides superior performance over established FO and ZO methods with reduced memory requirements, making it particularly suitable for edge deployment in large-scale optimization.

Abstract: Optimizing large-scale nonconvex problems, common in deep learning, demands balancing rapid convergence with computational efficiency. First-order (FO) optimizers, which serve as today’s baselines, provide fast convergence and good generalization but often incur high computation and memory costs due to the large size of modern models. Conversely, zeroth-order (ZO) algorithms reduce this burden using estimated gradients, yet their slow convergence in high-dimensional settings limits practicality. We introduce VAMO (VAriance-reduced Mixed-gradient Optimizer), a stochastic variance-reduced method that extends mini-batch SGD with full-batch ZO gradients under an SVRG-style framework. VAMO’s hybrid design utilizes a two-point ZO estimator to achieve a dimension-agnostic convergence rate of $\mathcal{O}(1/T + 1/b)$, where $T$ is the number of iterations and $b$ is the batch-size, surpassing the dimension-dependent slowdown of purely ZO methods and significantly improving over SGD’s $\mathcal{O}(1/\sqrt{T})$ rate. Additionally, we propose a multi-point variant that mitigates the $O(1/b)$ error by adjusting the number of estimation points to balance convergence and cost. Importantly, VAMO achieves these gains with smaller dynamic memory requirements than many FO baselines, making it particularly attractive for edge deployment. Experiments including traditional neural network training and LLM finetuning confirm that VAMO not only outperforms established FO and ZO methods, but also does so with a light memory footprint.

[1520] PiERN: Token-Level Routing for Integrating High-Precision Computation and Reasoning

Hengbo Xiao, Jingyuan Fan, Xin Tong, Jingzhao Zhang, Chao Lu, Guannan He

Main category: cs.LG

TL;DR: PiERN is a new architecture that integrates computational capabilities into neural networks through token-level routing, enabling efficient computation-reasoning alternation without the overhead of multi-agent systems.

Details

Motivation: Current LLMs cannot perform high-precision numerical computation as an intrinsic capability, and multi-agent approaches introduce communication overhead and scalability limitations.

Method: Separately train computational experts, a text-to-computation module, and a router, then integrate them endogenously. The router directs computation and reasoning at token level for iterative alternation within a single chain of thought.

Result: PiERN achieves higher accuracy than LLM finetuning and significant improvements in response latency, token usage, and GPU energy consumption compared to multi-agent approaches.

Conclusion: PiERN provides an efficient, interpretable, and scalable paradigm for interfacing language models with scientific systems.

Abstract: Tasks on complex systems require high-precision numerical computation to support decisions, but current large language models (LLMs) cannot integrate such computations as an intrinsic and interpretable capability with existing architectures. Multi-agent approaches can leverage external experts, but inevitably introduce communication overhead and suffer from inefficiency caused by limited scalability. To this end, we propose Physically-isolated Experts Routing Network (PiERN), an architecture for integrating computation and reasoning. Instead of the tool-use workflows or function-calling, PiERN endogenously integrates computational capabilities into neural networks after separately training experts, a text-to-computation module, and a router. At inference, the router directs computation and reasoning at the token level, thereby enabling iterative alternation within a single chain of thought. We evaluate PiERN on representative linear and nonlinear computation-reasoning tasks against LLM finetuning and the multi-agent system approaches. Results show that the PiERN architecture achieves not only higher accuracy than directly finetuning LLMs but also significant improvements in response latency, token usage, and GPU energy consumption compared with mainstream multi-agent approaches. PiERN offers an efficient, interpretable, and scalable paradigm for interfacing language models with scientific systems.

[1521] Angles Don’t Lie: Unlocking Training-Efficient RL Through the Model’s Own Signals

Qinsi Wang, Jinghan Ke, Hancheng Ye, Yueqian Lin, Yuzhe Fu, Jianyi Zhang, Kurt Keutzer, Chenfeng Xu, Yiran Chen

Main category: cs.LG

TL;DR: GAIN-RL is a new RL framework that uses angle concentration signals from LLM hidden states to dynamically select training data, achieving 2.5x training efficiency improvement and better performance with half the data.

Details

Motivation: Current RFT methods suffer from sample inefficiency due to redundant query exposure and uniform data sampling. Previous curriculum learning approaches using heuristic difficulty metrics neglect the model's intrinsic learning signals.

Method: Proposes GAIN-RL framework that leverages angle concentration - a model-inherent signal from token hidden state vectors - to dynamically select training data based on learning preference, ensuring impactful gradient updates.

Result: GAIN-RL achieves over 2.5x acceleration in training efficiency across mathematical and coding tasks, and achieves better performance with half the original data compared to vanilla GRPO.

Conclusion: The angle concentration signal effectively reflects LLM learning capacity, and GAIN-RL’s dynamic data selection based on this signal significantly enhances training efficiency and data utilization.

Abstract: Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from sample inefficiency due to the redundant exposure of identical queries under uniform data sampling. While previous work has explored curriculum learning via heuristic difficulty metrics, these strategies exhibit limitations by neglecting the intrinsic learning signals generated by the model itself, thus leading to suboptimal training regimes. In this paper, we identify a model-inherent signal termed angle concentration that effectively reflects an LLM’s capacity to learn from specific data. We theoretically and empirically demonstrate a correlation between the angular distribution of token hidden state vectors and the resulting gradient, revealing a learning preference for data exhibiting higher angle concentration. Inspired by this finding, we propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model’s intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates and thus significantly enhancing overall training efficiency. Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5x acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, GAIN-RL (GRPO)’s efficient sampling yields data-efficient training, achieving better performance with half the original data compared to vanilla GRPO with full training data. Code is realsed at https://github.com/wangqinsi1/GAINRL/tree/main.

[1522] Learning with Local Search MCMC Layers

Germain Vivier-Ardisson, Mathieu Blondel, Axel Parmentier

Main category: cs.LG

TL;DR: A theoretically-principled approach for learning with inexact combinatorial solvers by transforming local search neighborhood systems into proposal distributions for MCMC, enabling differentiable combinatorial layers.

Details

Motivation: Existing approaches for integrating combinatorial optimization into neural networks lack theoretical guarantees or perform poorly with inexact solvers, which are necessary for NP-hard problems that often require local search heuristics.

Method: Transform problem-specific neighborhood systems from local search heuristics into proposal distributions to implement MCMC on combinatorial spaces, creating differentiable combinatorial layers and loss functions.

Result: The approach strongly reduces computational burden compared to exact solvers while maintaining theoretical principles, demonstrated on a large-scale dynamic vehicle routing problem with time windows.

Conclusion: The proposed method provides a principled way to incorporate inexact combinatorial solvers into neural networks, making learning feasible for complex NP-hard optimization problems.

Abstract: Integrating combinatorial optimization layers into neural networks has recently attracted significant research interest. However, many existing approaches lack theoretical guarantees or fail to perform adequately when relying on inexact solvers. This is a critical limitation, as many operations research problems are NP-hard, often necessitating the use of neighborhood-based local search heuristics. These heuristics iteratively generate and evaluate candidate solutions based on an acceptance rule. In this paper, we introduce a theoretically-principled approach for learning with such inexact combinatorial solvers. Inspired by the connection between simulated annealing and Metropolis-Hastings, we propose to transform problem-specific neighborhood systems used in local search heuristics into proposal distributions, implementing MCMC on the combinatorial space of feasible solutions. This allows us to construct differentiable combinatorial layers and associated loss functions. Replacing an exact solver by a local search strongly reduces the computational burden of learning on many applications. We demonstrate our approach on a large-scale dynamic vehicle routing problem with time windows.

[1523] Interaction Field Matching: Overcoming Limitations of Electrostatic Models

Stepan I. Manukhov, Alexander Kolesov, Vladimir V. Palyulin, Alexander Korotin

Main category: cs.LG

TL;DR: Interaction Field Matching (IFM) generalizes Electrostatic Field Matching by using general interaction fields beyond electrostatic ones, solving problems in modeling electrostatic fields with neural networks.

Details

Motivation: Electrostatic Field Matching requires modeling complex electrostatic fields outside capacitor plates using neural networks, which is non-trivial. The paper aims to generalize this approach and solve the modeling difficulties.

Method: Proposes Interaction Field Matching (IFM) as a generalization of EFM, using general interaction fields. Specifically designs an interaction field inspired by strong interactions between quarks and antiquarks in physics to address electrostatic field modeling problems.

Result: The method is evaluated on toy and image data transfer problems, showing performance improvements over the original EFM approach.

Conclusion: IFM successfully generalizes EFM and solves the electrostatic field modeling problems by using more general interaction fields, particularly those inspired by quark-antiquark strong interactions.

Abstract: Electrostatic field matching (EFM) has recently appeared as a novel physics-inspired paradigm for data generation and transfer using the idea of an electric capacitor. However, it requires modeling electrostatic fields using neural networks, which is non-trivial because of the necessity to take into account the complex field outside the capacitor plates. In this paper, we propose Interaction Field Matching (IFM), a generalization of EFM which allows using general interaction fields beyond the electrostatic one. Furthermore, inspired by strong interactions between quarks and antiquarks in physics, we design a particular interaction field realization which solves the problems which arise when modeling electrostatic fields in EFM. We show the performance on a series of toy and image data transfer problems.

[1524] Certified Neural Approximations of Nonlinear Dynamics

Frederik Baymler Mathiesen, Nikolaus Vertovec, Francesco Fabiano, Luca Laurenti, Alessandro Abate

Main category: cs.LG

TL;DR: A novel adaptive verification method that provides formal error bounds for neural network approximations of dynamical systems, enabling safe use in safety-critical applications.

Details

Motivation: Neural networks can model nonlinear dynamical systems but require formal closeness bounds for safety-critical applications, which current methods lack.

Method: Proposes an adaptive, parallelizable verification method based on certified first-order models that interprets error bounds as bounded disturbances.

Result: Significantly outperforms state-of-the-art methods on established benchmarks and successfully handles previously intractable scenarios like neural network compression and Koopman operator learning.

Conclusion: The method provides effective and scalable formal verification for neural approximations of dynamical systems, enabling their safe deployment in safety-critical contexts.

Abstract: Neural networks hold great potential to act as approximate models of nonlinear dynamical systems, with the resulting neural approximations enabling verification and control of such systems. However, in safety-critical contexts, the use of neural approximations requires formal bounds on their closeness to the underlying system. To address this fundamental challenge, we propose a novel, adaptive, and parallelizable verification method based on certified first-order models. Our approach provides formal error bounds on the neural approximations of dynamical systems, allowing them to be safely employed as surrogates by interpreting the error bound as bounded disturbances acting on the approximated dynamics. We demonstrate the effectiveness and scalability of our method on a range of established benchmarks from the literature, showing that it significantly outperforms the state-of-the-art. Furthermore, we show that our framework can successfully address additional scenarios previously intractable for existing methods - neural network compression and an autoencoder-based deep learning architecture for learning Koopman operators for the purpose of trajectory prediction.

[1525] Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning

Théo Vincent, Yogesh Tripathi, Tim Faust, Yaniv Oren, Jan Peters, Carlo D’Eramo

Main category: cs.LG

TL;DR: Introduces iS-QL, a hybrid approach that uses only the last linear layer as target network while sharing other parameters, combining benefits of target-free and target-based methods with improved sample efficiency.

Details

Motivation: Target networks stabilize learning but require extra memory and delay updates, while target-free approaches are brittle. Need a solution that balances both advantages.

Method: Uses copy of last linear layer as target network while sharing remaining parameters with online network, combined with iterated Q-learning to learn consecutive Bellman updates in parallel.

Result: Bridges performance gap between target-free and target-based approaches across various problems while using single Q-network.

Conclusion: iS-QL is a step toward resource-efficient RL algorithms that maintains low-memory footprint while leveraging target-based stabilization benefits.

Abstract: The use of target networks in deep reinforcement learning is a widely popular solution to mitigate the brittleness of semi-gradient approaches and stabilize learning. However, target networks notoriously require additional memory and delay the propagation of Bellman updates compared to an ideal target-free approach. In this work, we step out of the binary choice between target-free and target-based algorithms. We introduce a new method that uses a copy of the last linear layer of the online network as a target network, while sharing the remaining parameters with the up-to-date online network. This simple modification enables us to keep the target-free’s low-memory footprint while leveraging the target-based literature. We find that combining our approach with the concept of iterated Q-learning, which consists of learning consecutive Bellman updates in parallel, helps improve the sample-efficiency of target-free approaches. Our proposed method, iterated Shared Q-Learning (iS-QL), bridges the performance gap between target-free and target-based approaches across various problems, while using a single Q-network, thus being a step forward towards resource-efficient reinforcement learning algorithms.

[1526] Towards Identifiability of Interventional Stochastic Differential Equations

Aaron Zweig, Zaikang Lin, Elham Azizi, David Knowles

Main category: cs.LG

TL;DR: The paper provides the first provable bounds for unique recovery of SDE parameters from stationary distributions under multiple interventions, with tight bounds for linear SDEs and upper bounds for nonlinear SDEs in small noise regimes.

Details

Motivation: To establish identifiability conditions for stochastic differential equations under interventions, enabling parameter recovery from observational data with provable guarantees.

Method: Theoretical analysis of identifiability bounds for linear and nonlinear SDEs under multiple interventions, with experimental validation on synthetic data and application to gene regulatory dynamics using learnable activation functions.

Result: Tight bounds on the number of necessary interventions for linear SDEs and upper bounds for nonlinear SDEs in small noise regimes, with successful parameter recovery demonstrated experimentally.

Conclusion: The study establishes fundamental identifiability results for SDEs under interventions, enabling practical parameter recovery and suggesting advantages of learnable activation functions in biological applications.

Abstract: We study identifiability of stochastic differential equations (SDE) under multiple interventions. Our results give the first provable bounds for unique recovery of SDE parameters given samples from their stationary distributions. We give tight bounds on the number of necessary interventions for linear SDEs, and upper bounds for nonlinear SDEs in the small noise regime. We experimentally validate the recovery of true parameters in synthetic data, and motivated by our theoretical results, demonstrate the advantage of parameterizations with learnable activation functions in application to gene regulatory dynamics.

[1527] Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova

Main category: cs.LG

TL;DR: This paper introduces asymptotically optimal description length objectives for neural networks like Transformers, grounded in Kolmogorov complexity theory, and shows they can achieve optimal compression up to an additive constant.

Details

Motivation: The MDL principle lacks principled measures for neural network complexity, making its application to Transformers challenging. The paper aims to provide theoretical foundations for compression and generalization in neural networks.

Method: Developed asymptotically optimal description length objectives based on Kolmogorov complexity, proved their existence for Transformers using computational universality, and constructed a tractable variational objective with adaptive Gaussian mixture prior.

Result: The variational objective selects low-complexity solutions with strong generalization on algorithmic tasks, but standard optimizers fail to find such solutions from random initialization, revealing optimization challenges.

Conclusion: The framework provides theoretical guarantees for description length objectives, outlining a path towards training neural networks with better compression and generalization capabilities.

Abstract: The Minimum Description Length (MDL) principle offers a formal framework for applying Occam’s razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.

[1528] Towards Better Generalization via Distributional Input Projection Network

Yifan Hao, Yanxin Lu, Hanning Zhang, Xinwei Shen, Tong Zhang

Main category: cs.LG

TL;DR: DIPNet projects inputs into learnable distributions at each layer to create smoother loss landscapes, improving generalization across various architectures and tasks.

Details

Motivation: Training loss alone provides limited insight into generalization for overparameterized models, and directly enforcing smoothness in neural networks is challenging.

Method: Introduces Distributional Input Projection Networks (DIPNet) that project inputs into learnable distributions at each layer, inducing smoother loss landscapes with respect to input.

Result: Empirically validated across Vision Transformers, LLMs, ResNet and MLPs, consistently enhancing test performance under standard settings, adversarial attacks, out-of-distribution inputs, and reasoning benchmarks.

Conclusion: DIPNet provides a general and effective approach for boosting generalization performance that can be seamlessly integrated into existing models.

Abstract: As overparameterized models become increasingly prevalent, training loss alone offers limited insight into generalization performance. While smoothness has been linked to improved generalization across various settings, directly enforcing smoothness in neural networks remains challenging. To address this, we introduce Distributional Input Projection Networks (DIPNet), a novel framework that projects inputs into learnable distributions at each layer. This distributional representation induces a smoother loss landscape with respect to the input, promoting better generalization. We provide theoretical analysis showing that DIPNet reduces both local smoothness measures and the Lipschitz constant of the network, contributing to improved generalization performance. Empirically, we validate DIPNet across a wide range of architectures and tasks, including Vision Transformers (ViTs), Large Language Models (LLMs), ResNet and MLPs. Our method consistently enhances test performance under standard settings, adversarial attacks, out-of-distribution inputs, and reasoning benchmarks. We demonstrate that the proposed input projection strategy can be seamlessly integrated into existing models, providing a general and effective approach for boosting generalization performance in modern deep learning.

[1529] ICYM2I: The illusion of multimodal informativeness under missingness

Young Sang Choi, Vincent Jeanselme, Pierre Elias, Shalmali Joshi

Main category: cs.LG

TL;DR: The paper addresses the problem of modality missingness in multimodal learning, where source and target environments have different missing modalities, and proposes ICYM2I framework to correct bias in information gain estimation using inverse probability weighting.

Details

Motivation: Multimodal learning faces challenges when modalities available in source environments differ from target environments due to cost, hardware failure, or perceived informativeness, leading to distribution shift and biased information gain estimates.

Method: The authors introduce ICYM2I (In Case You Multimodal Missed It), a framework that uses inverse probability weighting-based correction to evaluate predictive performance and information gain under missingness conditions.

Result: The proposed adjustment effectively corrects bias in information gain estimation under missingness, as demonstrated on synthetic, semi-synthetic, and real-world datasets.

Conclusion: Explicitly accounting for missingness processes is crucial for accurate estimation of modality value in target environments, and ICYM2I provides an effective solution to address distribution shift caused by modality missingness.

Abstract: Multimodal learning is of continued interest in artificial intelligence-based applications, motivated by the potential information gain from combining different types of data. However, modalities observed in the source environment may differ from the modalities observed in the target environment due to multiple factors, including cost, hardware failure, or the perceived informativeness of a given modality. This shift in missingness between the source and target environment has not been carefully studied. Naive estimation of the information gain associated with including an additional modality without accounting for missingness may result in improper estimates of that modality’s value in the target environment. We formalize the problem of missingness, demonstrate its ubiquity, and show that the subsequent distribution shift results in bias when the missingness process is not explicitly accounted for. To address this issue, we introduce ICYM2I (In Case You Multimodal Missed It), a framework for the evaluation of predictive performance and information gain under missingness through inverse probability weighting-based correction. We demonstrate the importance of the proposed adjustment to estimate information gain under missingness on synthetic, semi-synthetic, and real-world datasets.

[1530] Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics through Pattern Selection

Xingwu Chen, Tianle Li, Difan Zou

Main category: cs.LG

TL;DR: RL training in LLMs primarily optimizes critical tokens rather than reasoning patterns, reshaping reasoning distributions to improve performance. Theoretical analysis reveals convergence depends on base model quality, and internal rewards can initially help but may degrade performance with over-training.

Details

Motivation: To understand the unclear training dynamics of reinforcement learning in language models and explain how RL actually enhances reasoning capabilities.

Method: Empirical analysis through reasoning-pattern-level and token-level analysis across RL training, plus theoretical modeling of RL with verifiable reward (RLVR) and model’s internal feedback (RLIF).

Result: RL mainly optimizes sparse critical tokens, reshaping reasoning pattern distributions. For RLVR, convergence depends on base model reasoning quality. For RLIF, internal rewards initially improve performance but can degrade with continued training.

Conclusion: The study provides both empirical and theoretical understanding of RL training dynamics in LLMs, advancing theoretical foundations and practical applications of RL for language model enhancement.

Abstract: While reinforcement learning (RL) demonstrated remarkable success in enhancing the reasoning capabilities of language models, the training dynamics of RL in LLMs remain unclear. In this work, we provide an explanation of the RL training process through empirical analysis and rigorous theoretical modeling. First, through systematic reasoning-pattern-level and token-level analysis across the RL training process, we show that while different reasoning patterns exhibit relatively stable success rates during training, RL primarily optimizes a sparse subset of critical tokens, thereby reshaping reasoning pattern distributions to affect model performance. Building on these empirical insights, we develop a theoretical framework to understand the training dynamics of RL with two typical rewards: verifiable reward (RLVR) and model’s internal feedback (RLIF). For RLVR, we analyze the training dynamics under two special cases: one where models readily converge to optimal reasoning strategies, and another where optimization becomes challenging, revealing that the base model’s reasoning quality is crucial for determining convergence behavior. For RLIF, we examine how internal rewards initially improve model performance but can potentially lead to degradation with continued training. Extensive experiments validate our findings, advancing both theoretical understanding and practical applications of RL in language model enhancement.

[1531] Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration

Jingtong Gao, Ling Pan, Yejing Wang, Rui Zhong, Chi Lu, Qingpeng Cai, Peng Jiang, Xiangyu Zhao

Main category: cs.LG

TL;DR: i-MENTOR is a new RL method for LLM reasoning that addresses sparse reward limitations through dense rewards and enhanced exploration mechanisms, achieving significant performance improvements.

Details

Motivation: Existing RL approaches like PPO and GRPO suffer from sparse outcome-based rewards and inadequate exploration mechanisms, leading to inefficient guidance for reasoning tasks and systematic biases that hinder novel solution discovery.

Method: i-MENTOR introduces three key innovations: trajectory-aware exploration rewards to mitigate token-level bias, error-conditioned reward allocation for efficient exploration on challenging samples, and advantage-preserving integration to maintain advantage distribution integrity.

Result: Experiments across 4 public datasets show i-MENTOR’s effectiveness, achieving a 22.23% improvement on AIME 2024 compared to existing methods.

Conclusion: i-MENTOR successfully addresses critical limitations of current RL approaches for LLM reasoning by providing dense rewards and enhanced exploration, leading to substantial performance gains in complex reasoning tasks.

Abstract: Reinforcement Learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). However, prevalent RL approaches such as Proximal Policy Optimization (PPO) and Group-Regularized Policy Optimization (GRPO) face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. These limitations result in inefficient guidance for reasoning. Specifically, sparse rewards fail to deliver sufficient feedback, particularly for challenging problems. Furthermore, such rewards induce systematic biases that prioritize exploitation of familiar trajectories over novel solution discovery. These shortcomings critically hinder performance in complex reasoning tasks, which inherently demand iterative refinement across intermediate steps. To address these challenges, we propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), a method designed to deliver dense rewards and amplify exploration in the RL-based paradigm. i-MENTOR introduces three innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies while maintaining computational efficiency; error-conditioned reward allocation to ensure efficient exploration on challenging samples while intrinsically stabilizing training; and advantage-preserving integration that maintains advantage distribution integrity while incorporating exploratory guidance. Experiments across 4 public datasets demonstrate i-MENTOR’s effectiveness, achieving a 22.23% improvement on AIME 2024.

[1532] TreeRPO: Tree Relative Policy Optimization

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, Jing Tang

Main category: cs.LG

TL;DR: TreeRPO is a novel RLVR method that uses tree sampling to estimate step-level rewards, providing fine-grained guidance for LLM reasoning processes and significantly improving performance and efficiency.

Details

Motivation: Existing RLVR methods with trajectory-level rewards provide insufficient guidance for optimizing intermediate reasoning steps, limiting LLM reasoning capabilities.

Method: TreeRPO estimates mathematical expectations of rewards at various reasoning steps using tree sampling, computes step-level rewards based on groups generated during sampling, and builds on GRPO’s group-relative reward training mechanism.

Result: TreeRPO improves Qwen-2.5-Math’s average Pass@1 accuracy from 19.0% to 35.5%, outperforms GRPO by 2.9% while reducing average response length by 18.1%.

Conclusion: TreeRPO effectively provides dense, fine-grained reward signals that significantly enhance LLM reasoning performance and efficiency compared to existing methods.

Abstract: Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce \textbf{\name}, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, \name directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, \name innovatively computes rewards based on step-level groups generated during tree sampling. This advancement allows \name to produce fine-grained and dense reward signals, significantly enhancing the learning process and overall performance of LLMs. Experimental results demonstrate that our \name algorithm substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, increasing it from 19.0% to 35.5%. Furthermore, \name significantly outperforms GRPO by 2.9% in performance while simultaneously reducing the average response length by 18.1%, showcasing its effectiveness and efficiency. Our code will be available at \href{https://github.com/yangzhch6/TreeRPO}{https://github.com/yangzhch6/TreeRPO}.

[1533] What Do You Need for Diverse Trajectory Composition in Diffusion Planning?

Quentin Clark, Florian Shkurti

Main category: cs.LG

TL;DR: The paper identifies positional equivariance and local receptiveness as key properties enabling stitching in diffusion planners, showing these architectural properties can match performance of computationally expensive methods like replanning.

Details

Motivation: To understand why generative behavioral cloning methods can stitch sub-trajectories together, which is crucial for developing reliable stitching algorithms.

Method: Analyzed diffusion planners trained via behavioral cloning, focusing on architectural properties and comparing with methods like replanning, data augmentation, and scaling.

Result: Found both positional equivariance and local receptiveness are crucial for composition, with locality being more important; simple architectural choices can compete with expensive methods; inpainting-based guidance enables generalization.

Conclusion: Understanding and implementing positional equivariance and local receptiveness in diffusion planners enables effective stitching and can replace more computationally expensive approaches.

Abstract: In planning, stitching is an ability of algorithms to piece together sub-trajectories of data they are trained on to generate new and diverse behaviours. While stitching is historically a strength of offline reinforcement learning, recent generative behavioural cloning (BC) methods have also shown proficiency at stitching. However, the main factors behind this are poorly understood, hindering the development of new algorithms that can reliably stitch. Focusing on diffusion planners trained via BC, we find two properties are needed to compose: \emph{positional equivariance} and \emph{local receptiveness}. We use these two properties to explain architecture, data, and inference choices in existing generative BC methods based on diffusion planning, including replanning frequency, data augmentation, and data scaling. Experimental comparisions show that (1) while locality is more important than positional equivariance in creating a diffusion planner capable of composition, both are crucial (2) enabling these properties through relatively simple architecture choices can be competitive with more computationally expensive methods such as replanning or scaling data, and (3) simple inpainting-based guidance can guide architecturally compositional models to enable generalization in goal-conditioned settings.

[1534] InverseScope: Scalable Activation Inversion for Interpreting Large Language Models

Yifan Luo, Zhennan Zhou, Bin Dong

Main category: cs.LG

TL;DR: InverseScope is a scalable framework for interpreting LLM activations through input inversion, using conditional generation to efficiently sample inputs that produce similar activations and enabling quantitative analysis of internal representations.

Details

Motivation: Existing feature interpretability methods rely on strong assumptions about representation structure that may not hold in practice, creating a need for assumption-light approaches to understand LLM internal representations.

Method: Defines distribution over inputs generating similar target activations, uses novel conditional generation architecture for efficient sampling in high-dimensional spaces, and introduces quantitative evaluation protocol with feature consistency rate.

Result: Significantly improves sample efficiency compared to previous methods, scales inversion-based interpretability to larger models and practical tasks.

Conclusion: InverseScope enables systematic and quantitative analysis of internal representations in real-world LLMs through scalable input inversion.

Abstract: Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, we define a distribution over inputs that generate similar activations and analyze this distribution to infer the encoded information. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture that significantly improves sample efficiency compared to previous method. We further introduce a quantitative evaluation protocol that tests interpretability hypotheses using the feature consistency rate computed over the sampled inputs. InverseScope scales inversion-based interpretability methods to larger models and practical tasks, enabling systematic and quantitative analysis of internal representations in real-world LLMs.

[1535] Logic Gate Neural Networks are Good for Verification

Fabian Kresse, Emily Yu, Christoph H. Lampert, Thomas A. Henzinger

Main category: cs.LG

TL;DR: SAT encoding for verifying global robustness and fairness in Logic Gate Networks (LGNs), which are more verification-friendly than traditional neural networks while maintaining good performance.

Details

Motivation: Traditional neural networks are complex and difficult to formally verify, while LGNs offer a sparse, netlist-like architecture that is inherently more amenable to symbolic verification.

Method: Developed a SAT encoding approach for verifying global robustness and fairness properties in LGNs, which replace multiplications with Boolean logic gates.

Result: Evaluation on five benchmark datasets (including a new 5-class variant) shows LGNs are both verification-friendly and maintain strong predictive performance.

Conclusion: LGNs provide a promising alternative to traditional neural networks by being more amenable to formal verification while still delivering good performance.

Abstract: Learning-based systems are increasingly deployed across various domains, yet the complexity of traditional neural networks poses significant challenges for formal verification. Unlike conventional neural networks, learned Logic Gate Networks (LGNs) replace multiplications with Boolean logic gates, yielding a sparse, netlist-like architecture that is inherently more amenable to symbolic verification, while still delivering promising performance. In this paper, we introduce a SAT encoding for verifying global robustness and fairness in LGNs. We evaluate our method on five benchmark datasets, including a newly constructed 5-class variant, and find that LGNs are both verification-friendly and maintain strong predictive performance.

[1536] Stochastic Primal-Dual Double Block-Coordinate for Two-way Partial AUC Maximization

Linli Zhou, Bokun Wang, My T. Thai, Tianbao Yang

Main category: cs.LG

TL;DR: The paper introduces two stochastic primal-dual double block-coordinate algorithms for optimizing two-way partial AUC (TPAUC) in imbalanced binary classification, with theoretical convergence guarantees and superior experimental performance.

Details

Motivation: TPAUC is important for imbalanced data classification but existing stochastic optimization methods are limited - either using approximated loss functions or having sub-optimal computational complexities.

Method: Two stochastic primal-dual double block-coordinate algorithms that use block-coordinate updates for both primal and dual variables, applicable to both convex and non-convex settings.

Result: Theoretical convergence rate analyses show significant improvements over prior approaches. Experiments on benchmark datasets demonstrate faster convergence and better generalization.

Conclusion: The work advances TPAUC optimization state-of-the-art and provides practical tools for real-world machine learning applications with imbalanced data.

Abstract: Two-way partial AUC (TPAUC) is a critical performance metric for binary classification with imbalanced data, as it focuses on specific ranges of the true positive rate (TPR) and false positive rate (FPR). However, stochastic algorithms for TPAUC optimization remain under-explored, with existing methods either limited to approximated TPAUC loss functions or burdened by sub-optimal complexities. To overcome these limitations, we introduce two innovative stochastic primal-dual double block-coordinate algorithms for TPAUC maximization. These algorithms utilize stochastic block-coordinate updates for both the primal and dual variables, catering to both convex and non-convex settings. We provide theoretical convergence rate analyses, demonstrating significant improvements over prior approaches. Our experimental results, based on multiple benchmark datasets, validate the superior performance of our algorithms, showcasing faster convergence and better generalization. This work advances the state of the art in TPAUC optimization and offers practical tools for real-world machine learning applications.

[1537] Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness

Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim

Main category: cs.LG

TL;DR: ChannelTokenFormer is a Transformer-based framework that addresses three key challenges in multivariate time series forecasting: channel dependencies, asynchronous sampling, and missing values, outperforming existing methods in real-world scenarios.

Details

Motivation: Real-world time series data face three fundamental challenges: complex inter-channel dependencies, asynchronous channel sampling, and missing values. Existing methods only partially address these issues and rely on simplifying assumptions, leaving the combined challenges unresolved.

Method: Proposes ChannelTokenFormer, a Transformer-based forecasting framework with flexible architecture that explicitly captures cross-channel interactions, accommodates channel-wise asynchronous sampling, and effectively handles missing values.

Result: Extensive experiments on public benchmark datasets and a private industrial dataset demonstrate ChannelTokenFormer’s superior robustness and accuracy under challenging real-world conditions.

Conclusion: ChannelTokenFormer effectively bridges the gap in handling the combined challenges of asynchronous channel sampling, test-time missing blocks, and intricate inter-channel dependencies in multivariate time series forecasting.

Abstract: Real-world time series data are inherently multivariate, often exhibiting complex inter-channel dependencies. Each channel is typically sampled at its own period and is prone to missing values due to various practical and operational constraints. These characteristics pose three fundamental challenges involving channel dependency, sampling asynchrony, and missingness, all of which must be addressed simultaneously to enable robust and reliable forecasting in practical settings. However, existing architectures typically address only parts of these challenges in isolation and still rely on simplifying assumptions, leaving unresolved the combined challenges of asynchronous channel sampling, test-time missing blocks, and intricate inter-channel dependencies. To bridge this gap, we propose ChannelTokenFormer, a Transformer-based forecasting framework with a flexible architecture designed to explicitly capture cross-channel interactions, accommodate channel-wise asynchronous sampling, and effectively handle missing values. Extensive experiments on public benchmark datasets reflecting practical settings, along with one private real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world conditions.

[1538] Efficient AllReduce with Stragglers

Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh

Main category: cs.LG

TL;DR: StragglAR is a parallel AllReduce algorithm that accelerates distributed ML training and inference by exploiting GPU execution time variations, achieving 2x theoretical speedup over bandwidth-efficient algorithms and 25% speedup on 8-GPU systems.

Details

Motivation: Traditional AllReduce algorithms are delayed by stragglers (slowest GPUs), which limits performance in distributed ML workloads using data and tensor parallelism.

Method: StragglAR implements ReduceScatter among remaining GPUs during straggler delays, then executes a novel collective algorithm to complete AllReduce once the final GPU reaches the synchronization barrier.

Result: Achieves 2x theoretical speedup over bandwidth-efficient algorithms for large GPU clusters, surpassing the lower bound for bandwidth-optimal synchronous AllReduce. On 8-GPU servers, provides 25% speedup over state-of-the-art AllReduce algorithms.

Conclusion: StragglAR effectively mitigates straggler delays in distributed ML by leveraging natural GPU execution time variations, providing significant performance improvements over existing AllReduce approaches.

Abstract: Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, AllReduce algorithms are delayed by the slowest GPU to reach the synchronization barrier before the collective (i.e., the straggler). To address this challenge, we propose StragglAR: a parallel algorithm for AllReduce that accelerates distributed training and inference by exploiting natural variation in GPU execution times. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the final GPU reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient algorithms for large GPU clusters, surpassing the lower bound for bandwidth-optimal synchronous AllReduce by leveraging the asymmetry in when GPUs reach the synchronization barrier. On an 8-GPU server, StragglAR provides a 25% speedup over state-of-the-art AllReduce algorithms.

[1539] Meta Pruning via Graph Metanetworks : A Universal Meta Learning Framework for Network Pruning

Yewei Liu, Xiyuan Wang, Muhan Zhang

Main category: cs.LG

TL;DR: A new meta-learning framework for network pruning that uses a metanetwork to automatically learn pruning strategies, applicable to various network types without special training requirements.

Details

Motivation: Existing pruning methods either rely on hand-crafted criteria or require special training for each pruning task, lacking generality and transferability.

Method: Establish bijective mapping between neural networks and graphs, then use a graph neural network as metanetwork to automatically learn pruning strategies that transform hard-to-prune networks into easier-to-prune versions.

Result: Achieves outstanding results on popular pruning tasks for both CNNs and Transformers, enabling state-of-the-art pruning with just a feedforward pass through the metanetwork and standard finetuning.

Conclusion: The proposed meta-learning framework provides a general, transferable approach to network pruning that automatically learns complex pruning rules and works across different network architectures without special training requirements.

Abstract: We propose an entirely new meta-learning framework for network pruning. It is a general framework that can be theoretically applied to almost all types of networks with all kinds of pruning and has great generality and transferability. Experiments have shown that it can achieve outstanding results on many popular and representative pruning tasks (including both CNNs and Transformers). Unlike all prior works that either rely on fixed, hand-crafted criteria to prune in a coarse manner, or employ learning to prune ways that require special training during each pruning and lack generality. Our framework can learn complex pruning rules automatically via a neural network (metanetwork) and has great generality that can prune without any special training. More specifically, we introduce the newly developed idea of metanetwork from meta-learning into pruning. A metanetwork is a network that takes another network as input and produces a modified network as output. In this paper, we first establish a bijective mapping between neural networks and graphs, and then employ a graph neural network as our metanetwork. We train a metanetwork that learns the pruning strategy automatically and can transform a network that is hard to prune into another network that is much easier to prune. Once the metanetwork is trained, our pruning needs nothing more than a feedforward through the metanetwork and some standard finetuning to prune at state-of-the-art. Our code is available at https://github.com/Yewei-Liu/MetaPruning.

[1540] Continuous Chain of Thought Enables Parallel Exploration and Reasoning

Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak

Main category: cs.LG

TL;DR: CoT2 introduces continuous-valued tokens for chain-of-thought reasoning, enabling parallel tracking of multiple discrete traces with theoretical guarantees and improved inference efficiency.

Details

Motivation: To provide a richer and more expressive alternative to discrete chain-of-thought reasoning by using continuous-valued tokens, particularly for logical reasoning tasks requiring search capabilities.

Method: Developed theoretical guarantees for parallel trace tracking, introduced continuous supervision by matching model outputs to empirical token distributions, and created sampling strategies that compose K discrete tokens at each step to control parallelism.

Result: Experiments show optimal parallelism is governed by embedding dimension, continuous supervision outperforms alternatives, and policy optimization with CoT2 improves model performance beyond initial discrete or continuous supervision.

Conclusion: CoT2 with continuous-valued tokens offers a more expressive reasoning framework with theoretical benefits for parallel processing and practical improvements through novel supervision and sampling strategies.

Abstract: Modern language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work provides new theoretical guarantees and algorithms for CoT2, motivated by logical reasoning tasks that inherently require search capabilities. Theoretically, we establish how CoT2 facilitates the model to track multiple discrete traces in parallel; and quantify the level of achievable parallelism and its benefits for inference efficiency. We also provide a CoT2-based one-layer transformer construction that solves the combinatorial “subset sum problem” given a sufficient embedding dimension. These insights arise from a novel and effective supervision strategy where we match the language model outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization methods for CoT2. Our primary strategy samples and composes $K$ discrete tokens at each decoding step to control the level of parallelism. Experiments confirm that (i) the optimal level of parallelism is governed by the embedding dimension, (ii) our continuous supervision strategy can outperform alternative methods, and (iii) policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.

[1541] Adaptive Sample Scheduling for Direct Preference Optimization

Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang

Main category: cs.LG

TL;DR: DPO’s performance depends on human preference data quality. SamS algorithm dynamically schedules training samples based on model’s evolving states during DPO, improving performance across tasks with minimal computational overhead.

Details

Motivation: DPO alignment effectiveness is limited by human preference data quality, and existing data selection methods ignore the model's changing states during training.

Method: Proposed SamS algorithm for sample scheduling in DPO - adaptively selects training samples in each batch based on LLM’s learning feedback to maximize generalization.

Result: SamS significantly improves DPO performance across tasks without modifying core algorithm, with minimal additional computational cost.

Conclusion: Sample scheduling based on model states is a promising direction for improving LLM alignment through better utilization of fixed preference datasets.

Abstract: Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the DPO process. %including active querying, response pair selection, and data pre-selection. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model’s evolving states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM’s learning feedback to maximize the potential generalization performance. Notably, without modifying the core DPO algorithm, simply integrating SamS significantly improves performance across tasks, with minimal additional computational overhead. This work points to a promising new direction for improving LLM alignment through more effective utilization of fixed preference datasets.

[1542] Vision Language Models are Biased

An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim

Main category: cs.LG

TL;DR: VLMs are biased by prior knowledge about popular subjects, leading to poor performance on objective visual tasks like counting and identification, with accuracy dropping to 17.05% across diverse domains.

Details

Motivation: To investigate how VLMs' memorized knowledge from the Internet negatively impacts their performance on standard visual tasks, causing biased outputs.

Method: Tested state-of-the-art VLMs on counting and identification tasks across 7 domains (animals, logos, chess, etc.), analyzed reasoning patterns, and experimented with background removal to isolate contextual bias triggers.

Result: VLMs scored only 17.05% accuracy on counting tasks, but removing backgrounds nearly doubled accuracy (+21.09 percentage points). Counting accuracy initially improved with thinking tokens (~40%) before declining with excessive reasoning.

Conclusion: VLMs exhibit significant bias from prior knowledge, presenting an interesting failure mode that requires human-supervised automated testing frameworks to detect and mitigate.

Abstract: Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but also may notoriously sway their outputs towards wrong or biased answers. In this work, we test how the knowledge about popular subjects hurt the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize the 4th stripe has been added to a 3-stripe Adidas logo) scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, optical illusions, to patterned grids. Removing image backgrounds nearly doubles accuracy (21.09 percentage points), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs’ reasoning patterns shows that counting accuracy initially rises with thinking tokens, reaching ~40%, before declining with excessive reasoning. Our work presents an interesting failure mode in VLMs and a human-supervised automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.

[1543] On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective

Gengze Xu, Wei Yao, Ziqiao Wang, Yong Liu

Main category: cs.LG

TL;DR: The paper provides a theoretical analysis of weak-to-strong generalization (W2SG) using Bregman divergence, showing that W2SG emerges when students approximate posterior mean teachers rather than mimicking individual teachers, and that reverse cross-entropy loss improves performance.

Details

Motivation: To theoretically understand why strong student models outperform weak teachers when trained on teacher-labeled data, particularly removing restrictive assumptions from previous work and identifying key factors that enable W2SG.

Method: Theoretical analysis using generalized bias-variance decomposition of Bregman divergence, examining conditions for W2SG emergence, and empirical verification with reverse cross-entropy loss experiments.

Result: W2SG occurs when students approximate posterior mean teachers rather than individual teachers; sufficiently large student models can converge to posterior mean teachers; reverse cross-entropy loss is less sensitive to teacher uncertainty and consistently improves student performance.

Conclusion: W2SG emerges through posterior mean approximation, facilitated by avoiding teacher overfitting and reducing prediction entropy, with reverse cross-entropy loss providing practical benefits for weak-to-strong generalization.

Abstract: Weak-to-strong generalization (W2SG) refers to the phenomenon where a strong student model, trained on a dataset labeled by a weak teacher, ultimately outperforms the teacher on the target task. Recent studies attribute this performance gain to the prediction misfit between the student and teacher models. In this work, we theoretically investigate the emergence of W2SG through a generalized bias-variance decomposition of Bregman divergence. Specifically, we show that the expected population risk gap between the student and teacher is quantified by the expected misfit between the two models. While this aligns with previous results, our analysis removes several restrictive assumptions, most notably, the convexity of the student’s hypothesis class, required in earlier works. Moreover, we show that W2SG is more likely to emerge when the student model approximates its posterior mean teacher, rather than mimicking an individual teacher. Using a concrete example, we demonstrate that if the student model size is sufficiently large, it can indeed converge to the posterior mean teacher in expectation. Our analysis also suggests that avoiding overfitting to the teacher’s supervision and reducing the entropy of student’s prediction further facilitate W2SG. In addition, we show that the reverse cross-entropy loss, unlike the standard forward cross-entropy, is less sensitive to the predictive uncertainty of the teacher. Finally, we empirically verify our theoretical insights and demonstrate that incorporating the reverse cross-entropy loss consistently improves student performance.

[1544] Weight-Space Linear Recurrent Neural Networks

Roussel Desmond Nzoyem, Nawid Keshtmand, Enrique Crespo Fernandez, Idriss Tsayem, Raul Santos-Rodriguez, David A. W. Barton, Tom Deakin

Main category: cs.LG

TL;DR: WARP unifies weight-space learning with linear recurrence for sequence modeling, using hidden states as weights/biases of an auxiliary network and input differences for recurrence, enabling gradient-free adaptation and in-context learning.

Details

Motivation: To overcome limitations of conventional RNNs that collapse temporal dynamics into fixed hidden states, and to create a brain-inspired model that enables efficient adaptation and integration of domain knowledge.

Method: Parametrizes hidden state as weights/biases of auxiliary neural network, uses input differences to drive recurrence, supports gradient-free test-time adaptation and physics-informed variants.

Result: Matches or surpasses SOTA baselines on diverse classification tasks (top 3 in 5/6 datasets), excels in sequential image completion, time series forecasting, and dynamical system reconstruction. Physics-informed variant outperforms next best model by 10x+.

Conclusion: WARP solidifies weight-space linear RNNs as a transformative paradigm for adaptive machine intelligence, with proven architectural necessity and strong generalization capabilities.

Abstract: We introduce WARP (Weight-space Adaptive Recurrent Prediction), a simple yet powerful model that unifies weight-space learning with linear recurrence to redefine sequence modeling. Unlike conventional recurrent neural networks (RNNs) which collapse temporal dynamics into fixed-dimensional hidden states, WARP explicitly parametrizes its hidden state as the weights and biases of a distinct auxiliary neural network, and uses input differences to drive its recurrence. This brain-inspired formulation enables efficient gradient-free adaptation of the auxiliary network at test-time, in-context learning abilities, and seamless integration of domain-specific physical priors. Empirical validation shows that WARP matches or surpasses state-of-the-art baselines on diverse classification tasks, featuring in the top three in 5 out of 6 real-world challenging datasets. Furthermore, extensive experiments across sequential image completion, multivariate time series forecasting, and dynamical system reconstruction demonstrate its expressiveness and generalisation capabilities. Remarkably, a physics-informed variant of our model outperforms the next best model by more than 10x. Ablation studies confirm the architectural necessity of key components, solidifying weight-space linear RNNs as a transformative paradigm for adaptive machine intelligence.

[1545] Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime

Yuqing Wang, Shangding Gu

Main category: cs.LG

TL;DR: Selecting more uniformly distributed data improves training efficiency and performance in LLMs by increasing minimum pairwise distance between data points, which accelerates gradient descent convergence and reduces approximation error.

Details

Motivation: While data quality and diversity are well-studied, it's unclear if other general principles of data selection can consistently improve performance for complex tasks. The paper investigates whether data uniformity can serve as such a principle.

Method: Theoretical analysis showing that uniform data distribution increases minimum pairwise distance (h_min), which accelerates gradient descent training dynamics and reduces neural network approximation error. A convergence framework for GD beyond NTK regime is developed, applicable to transformers without Lipschitz smoothness requirements.

Result: Comprehensive experiments across various settings show that data selection by maximizing pairwise distance significantly accelerates training and achieves comparable or better performance in LLMs across diverse datasets.

Conclusion: Data uniformity is a quantitative and general principle for data selection that consistently improves training efficiency and performance in LLMs, with theoretical justification for its effectiveness.

Abstract: Data selection plays a crucial role in data-driven decision-making, including in large language models (LLMs), and is typically task-dependent. Properties such as data quality and diversity have been extensively studied and are known to enhance model performance. However, it remains unclear whether there exist other quantitative and general principles of data selection that can consistently improve performance, especially for complicated tasks. In this paper, we demonstrate that selecting more uniformly distributed data can improve training efficiency while enhancing performance. Specifically, we establish that more uniform (less biased) distribution leads to a larger minimum pairwise distance between data points, denoted by $h_{\min}$, and prove that a smaller $h_{\min}$ can slow down the training dynamics of gradient descent (GD). Moreover, we theoretically show that the approximation error of neural networks decreases as $h_{\min}$ increases. Our analysis introduces a convergence framework for GD beyond the Neural Tangent Kernel (NTK) regime, applicable to a broad class of architectures, including transformers, without requiring Lipschitz smoothness. This framework further provides theoretical justification for the use of residual connection and function composition in deep neural architectures. In the end, we conduct comprehensive experiments for supervised fine-tuning across various settings, including different optimization strategies, model sizes, and training datasets. The results consistently demonstrate that selecting data by maximizing pairwise distance significantly accelerates training and achieves comparable or better performance in LLMs across diverse datasets. Code and Datasets are available at the link: https://github.com/SafeRL-Lab/data-uniformity.

[1546] QKV Projections Require a Fraction of Their Memory

Malik Khalaf, Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster

Main category: cs.LG

TL;DR: PAMM reduces memory consumption of Q,K,V projections in attention layers by up to 512x while maintaining or improving perplexity, and works with efficient attention methods like FlashAttention.

Details

Motivation: Most attention efficiency works focus on approximating scaled dot product, but overlook the memory consumption of linear projections that compute Q, K, and V tensors from input.

Method: Proposed Point-Approximate Matrix Multiplication (PAMM), a tensor compression technique that compresses the Q,K,V projections in attention layers.

Result: Reduces memory consumption by up to 512x, effectively erasing their memory footprint while achieving similar or better final perplexity.

Conclusion: PAMM is a practical and complementary method for memory-efficient LLM training that is fully composable with efficient attention techniques like FlashAttention.

Abstract: The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training. While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the $Q$, $K$, and $V$ tensors from the input $x$ is often overlooked. To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that reduces memory consumption of the $Q,K,V$ projections in attention layers by a factor of up to $\times 512$, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.

[1547] Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

Yifan Sun, Yushan Liang, Zhen Zhang, Jiaye Teng

Main category: cs.LG

TL;DR: This paper proposes a theoretical framework to model LLM self-improvement dynamics using solver-verifier gap, showing how performance evolves and quantifying capability limits.

Details

Motivation: Self-improvement is important for LLMs but how performance evolves during this process remains underexplored theoretically.

Method: Theoretical modeling of training dynamics via solver-verifier gap concept, fitting theoretical model to experimental results to quantify capability limits.

Result: Empirical validation shows effectiveness across various LLMs and datasets. External data analysis reveals it can be used at any stage without affecting final performance in limited data regimes.

Conclusion: The solver-verifier gap framework successfully models self-improvement dynamics and provides insights into capability limits and external data utilization.

Abstract: Self-improvement is among the most prominent techniques within the realm of large language models (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between LLM’s solver capability and verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to the experiment results. We empirically validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.

[1548] Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence

Alexander Semenenko, Ivan Butakov, Alexey Frolov, Ivan Oseledets

Main category: cs.LG

TL;DR: Sliced Mutual Information (SMI) has serious flaws: it saturates easily, fails to detect increased statistical dependence, prioritizes redundancy over useful information, and can perform worse than simple correlation measures.

Details

Motivation: To investigate the limitations and counterintuitive behavior of SMI, which is widely used as a scalable alternative to mutual information for measuring non-linear statistical dependence.

Method: Extensive benchmarking and theoretical analysis to evaluate SMI’s performance and behavior under various conditions.

Result: SMI is highly susceptible to data manipulation, saturates easily, fails to detect increases in statistical dependence even under linear transformations, prioritizes redundancy over informative content, and sometimes performs worse than correlation coefficient.

Conclusion: SMI has significant limitations that undermine its reliability as a measure of statistical dependence, despite its advantages in scalability and convergence.

Abstract: Sliced Mutual Information (SMI) is widely used as a scalable alternative to mutual information for measuring non-linear statistical dependence. Despite its advantages, such as faster convergence, robustness to high dimensionality, and nullification only under statistical independence, we demonstrate that SMI is highly susceptible to data manipulation and exhibits counterintuitive behavior. Through extensive benchmarking and theoretical analysis, we show that SMI saturates easily, fails to detect increases in statistical dependence (even under linear transformations designed to enhance the extraction of information), prioritizes redundancy over informative content, and in some cases, performs worse than simpler dependence measures like the correlation coefficient.

[1549] Learning to Segment for Vehicle Routing Problems

Wenbin Ouyang, Sirui Li, Yining Ma, Cathy Wu

Main category: cs.LG

TL;DR: The paper introduces Learning-to-Segment (L2Seg), a neural framework that accelerates iterative VRP solvers by 2x-7x through intelligent decomposition of stable and unstable solution segments.

Details

Motivation: Iterative heuristics for VRPs suffer from redundant computations due to large portions of solutions remaining stable across iterations, particularly problematic for large-scale problems with long subtours.

Method: Proposes First-Segment-Then-Aggregate (FSTA) decomposition and L2Seg neural framework with three variants: non-autoregressive, autoregressive, and their synergy to identify stable/unstable segments for aggregation.

Result: Empirical results on CVRP and VRPTW show 2x to 7x acceleration of state-of-the-art solvers, with the synergistic approach achieving best performance.

Conclusion: L2Seg is a versatile framework compatible with traditional, learning-based, and hybrid solvers that significantly accelerates VRP solving through intelligent solution decomposition.

Abstract: Iterative heuristics are widely recognized as state-of-the-art for Vehicle Routing Problems (VRPs). In this work, we exploit a critical observation: a large portion of the solution remains stable, i.e., unchanged across search iterations, causing redundant computations, especially for large-scale VRPs with long subtours. To address this, we pioneer the formal study of the First-Segment-Then-Aggregate (FSTA) decomposition technique to accelerate iterative solvers. FSTA preserves stable solution segments during the search, aggregates nodes within each segment into fixed hypernodes, and focuses the search only on unstable portions. Yet, a key challenge lies in identifying which segments should be aggregated. To this end, we introduce Learning-to-Segment (L2Seg), a novel neural framework to intelligently differentiate potentially stable and unstable portions for FSTA decomposition. We present three L2Seg variants: non-autoregressive (globally comprehensive but locally indiscriminate), autoregressive (locally refined but globally deficient), and their synergy. Empirical results on CVRP and VRPTW show that L2Seg accelerates state-of-the-art solvers by 2x to 7x. We further provide in-depth analysis showing why synergy achieves the best performance. Notably, L2Seg is compatible with traditional, learning-based, and hybrid solvers, while supporting various VRPs.

[1550] Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: Proposes JAGUAR SignSGD and JAGUAR Muon - zero-order optimization methods for memory-efficient LLM fine-tuning that match first-order methods’ performance with significant memory reduction.

Details

Motivation: Traditional first-order optimizers like SGD and Adam are too memory-intensive for large LLMs, especially with parameter-efficient techniques like LoRA. Need compute-efficient alternatives.

Method: Developed two zero-order algorithms: JAGUAR SignSGD (momentum-based extension of ZO SignSGD) and JAGUAR Muon (novel ZO extension leveraging matrix structure). Both require same parameters as ZO SGD with O(1) function evaluations.

Result: Algorithms achieve convergence quality matching or exceeding standard first-order methods while significantly reducing memory usage. First study to establish rigorous convergence guarantees for stochastic ZO SignSGD.

Conclusion: Zero-order optimization methods are practical and theoretically grounded for resource-constrained LLM adaptation, offering memory-efficient alternatives to traditional optimizers.

Abstract: Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM

[1551] Flow-Attentional Graph Neural Networks

Pascal Plettenberg, Dominik Köhler, Bernhard Sick, Josephine M. Thomas

Main category: cs.LG

TL;DR: Flow attention adapts graph attention mechanisms to satisfy Kirchhoff’s first law, improving performance on flow graph datasets like electronic circuits and power grids.

Details

Motivation: Existing GNNs don't consider conservation laws inherent in graphs with physical resource flows (e.g., electrical current, traffic), which reduces model performance.

Method: Propose flow attention that modifies graph attention mechanisms to satisfy Kirchhoff’s first law, and analyze its influence on expressivity and graph discrimination capabilities.

Result: Flow attention enhances performance of attention-based GNNs on both graph-level classification and regression tasks across electronic circuits and power grids datasets.

Conclusion: Incorporating physical conservation laws like Kirchhoff’s first law into graph attention mechanisms improves GNN performance on flow graph applications.

Abstract: Graph Neural Networks (GNNs) have become essential for learning from graph-structured data. However, existing GNNs do not consider the conservation law inherent in graphs associated with a flow of physical resources, such as electrical current in power grids or traffic in transportation networks, which can lead to reduced model performance. To address this, we propose flow attention, which adapts existing graph attention mechanisms to satisfy Kirchhoff$\text{’}$s first law. Furthermore, we discuss how this modification influences the expressivity and identify sets of non-isomorphic graphs that can be discriminated by flow attention but not by standard attention. Through extensive experiments on two flow graph datasets (electronic circuits and power grids) we demonstrate that flow attention enhances the performance of attention-based GNNs on both graph-level classification and regression tasks.

[1552] Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

Main category: cs.LG

TL;DR: Proposes Partial Model Collapse (PMC), a novel machine unlearning method that removes private information without using unlearning targets in the objective, leveraging model collapse principles.

Details

Motivation: Current unlearning methods risk reinforcing exposure to sensitive data by incorporating it into fine-tuning data, contradicting the principle of minimizing sensitive data use.

Method: PMC deliberately triggers model collapse for data to be removed, training models on their own generations to cause distribution collapse and remove information from outputs.

Result: PMC effectively removes private information from model outputs while preserving general model utility, overcoming limitations of existing unlearning methods.

Conclusion: PMC represents an important step toward comprehensive unlearning that aligns with real-world privacy constraints without exposing sensitive data during unlearning.

Abstract: Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, it also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method-Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We theoretically analyze that our approach converges to the desired outcome, i.e. the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes three key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that aligns with real-world privacy constraints. Code available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.

[1553] Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?

Paulius Sasnauskas, Yiğit Yalın, Goran Radanović

Main category: cs.LG

TL;DR: Proposes Adversarially Trained Decision-Pretrained Transformer (AT-DPT) to defend against reward poisoning attacks in in-context reinforcement learning by training both attacker and defender simultaneously.

Details

Motivation: Address the vulnerability of Decision-Pretrained Transformer (DPT) to reward poisoning attacks that can manipulate environment rewards to minimize the true reward of the model.

Method: Simultaneous adversarial training framework where an attacker minimizes true reward by poisoning environment rewards while the DPT model learns to infer optimal actions from poisoned data.

Result: AT-DPT significantly outperforms standard bandit algorithms and robust baselines against reward contamination, with similar effectiveness against adaptive attackers, and generalizes robustness from bandit to MDP settings.

Conclusion: Adversarial training effectively enhances corruption-robustness of in-context reinforcement learning, making DPT models resilient to reward poisoning attacks across different environment complexities.

Abstract: We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained Decision-Pretrained Transformer (AT-DPT). Our method simultaneously trains an attacker to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that the proposed method significantly outperforms these baselines in bandit settings, under a learned attacker. We additionally evaluate AT-DPT on an adaptive attacker, and observe similar results. Furthermore, we extend our evaluation to the MDP setting, confirming that the robustness observed in bandit scenarios generalizes to more complex environments.

[1554] Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion

Alan N. Amin, Nate Gruver, Andrew Gordon Wilson

Main category: cs.LG

TL;DR: SCUD (schedule-conditioned discrete diffusion) models incorporate known jump time distributions into discrete diffusion, outperforming masking diffusion by leveraging fundamental differences between continuous and discrete Markov processes.

Details

Motivation: To explain why masking diffusion performs better than other discrete diffusion models and develop improved models that incorporate inductive biases while maintaining performance advantages.

Method: Propose SCUD models that bake in the known distribution of jump times from discrete Markov processes, generalizing classical discrete diffusion and masking diffusion across image, text, and protein data.

Result: SCUD models outperform masking diffusion when applied to models with inductive biases on various data types including images, text, and proteins.

Conclusion: By explicitly modeling jump time distributions in discrete diffusion processes, SCUD provides a unified framework that achieves superior performance over existing approaches while enabling better incorporation of domain-specific inductive biases.

Abstract: Discrete diffusion models, like continuous diffusion models, generate high-quality samples by gradually undoing noise applied to datapoints with a Markov process. Gradual generation in theory comes with many conceptual benefits; for example, inductive biases can be incorporated into the noising Markov process, and access to improved sampling algorithms. In practice, however, the consistently best performing discrete diffusion model is, surprisingly, masking diffusion, which does not denoise gradually. Here we explain the superior performance of masking diffusion by noting that it makes use of a fundamental difference between continuous and discrete Markov processes: discrete Markov processes evolve by discontinuous jumps at a fixed rate and, unlike other discrete diffusion models, masking diffusion builds in the known distribution of jump times and only learns where to jump to. We show that we can similarly bake in the known distribution of jump times into any discrete diffusion model. The resulting models - schedule-conditioned discrete diffusion (SCUD) - generalize classical discrete diffusion and masking diffusion. By applying SCUD to models with noising processes that incorporate inductive biases on images, text, and protein data, we build models that outperform masking.

[1555] Foundation Models for Causal Inference via Prior-Data Fitted Networks

Yuchen Ma, Dennis Frauen, Emil Javurek, Stefan Feuerriegel

Main category: cs.LG

TL;DR: CausalFM is a framework for training prior-data fitted networks (PFNs) as foundation models for Bayesian causal inference, enabling in-context learning across various causal settings including back-door, front-door, and instrumental variable adjustment.

Details

Motivation: To create foundation models for causal inference that can perform Bayesian inference through in-context learning, overcoming limitations of traditional causal inference methods and providing a more flexible approach for practitioners.

Method: Formalizes Bayesian priors based on structural causal models, proposes causality-inspired Bayesian neural networks as prior distributions, and trains PFN-based models for in-context learning in multiple causal inference settings.

Result: CausalFM achieves competitive in-context learning performance compared to task-specific baselines, demonstrating effectiveness across various causal inference scenarios.

Conclusion: CausalFM provides a general framework for training foundation models in causal inference, offering a novel paradigm that could fundamentally change how practitioners perform causal inference across multiple disciplines.

Abstract: Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including for back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and explicitly train models to perform in-context learning in these settings. We show that CausalFM achieves competitive in-context learning performance even when compared to baselines that are specifically trained for the task at hand. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.

[1556] Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, Jun Zhu

Main category: cs.LG

TL;DR: Vidar is a prior-driven, low-shot adaptation paradigm that uses transferable video priors to scale robot manipulation across different embodiments with minimal demonstration data.

Details

Motivation: Scaling general-purpose manipulation to new robot platforms is challenging due to the need for large, homogeneous demonstrations and the sensitivity of pixel-to-action pipelines to background and viewpoint shifts.

Method: Vidar combines an embodied video diffusion model pre-trained on Internet-scale videos with a masked inverse dynamics model (MIDM) adapter. The diffusion model captures temporal coherence and physical interactions, while MIDM learns action-relevant pixel masks to ground the prior into specific embodiment action spaces.

Result: With only 20 minutes of human demonstrations (1% of typical data), Vidar outperforms state-of-the-art VLA baselines and generalizes to unseen tasks, backgrounds, and camera layouts.

Conclusion: The approach enables scalable ‘one prior, many embodiments’ robotics by combining strong video priors with minimal on-robot alignment, shifting the challenge from data collection to efficient prior alignment.

Abstract: Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstrations, and pixel-to-action VLA pipelines typically degenerate under background and viewpoint shifts. In this paper, we present Vidar, a prior-driven, low-shot adaptation paradigm that replaces most embodiment-specific data with transferable video priors. Vidar consists of an embodied video diffusion model as the generalizable prior and a masked inverse dynamics model (MIDM) adapter based on a key decoupling of the policy. The embodied diffusion model is pre-trained on Internet-scale videos and then domain-adapted to 750K multi-view trajectories from three real-world robot platforms using a unified observation space encoding robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior into the target embodiment’s action space while suppressing distractors. Crucially, the generative video prior models the distribution of plausible, temporally coherent interactions, implicitly capturing affordances, contact dynamics, and physical consistency from massive unlabeled video. This shifts the challenge from collecting large amounts of new robot data to efficiently aligning a rich prior with a new embodiment. With only 20 minutes of human demonstrations on an unseen robot (1% of typical data), Vidar outperforms state-of-the-art VLA baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for “one prior, many embodiments”: strong, inexpensive video priors + minimal on-robot alignment.

[1557] Unlearning Isn’t Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu

Main category: cs.LG

TL;DR: Machine unlearning in LLMs leaves detectable “fingerprints” that can identify whether a model has undergone unlearning, even with forget-irrelevant inputs, posing privacy risks.

Details

Motivation: To investigate the security vulnerability of machine unlearning in LLMs, specifically the persistence of detectable traces that could reveal unlearning activities and potentially reverse-engineer forgotten information.

Method: Analyzed unlearning traces through model behavior and internal representations, using supervised classifiers on prediction logits and textual outputs, and examining intermediate activations and their propagation to final layers.

Result: Unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, with larger LLMs showing stronger detectability. Traces form low-dimensional, learnable manifolds in activation space.

Conclusion: Machine unlearning leaves measurable signatures that introduce new security risks, enabling detection of unlearning activities and potential reverse-engineering of forgotten information when models are identified as unlearned.

Abstract: Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent ‘‘fingerprints’’ in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we demonstrate that unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, and that larger LLMs exhibit stronger detectability. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned, given an input query.

[1558] Learning to summarize user information for personalized reinforcement learning from human feedback

Hyunji Nam, Yanming Wan, Mickel Liu, Jianxun Lian, Peter Ahnn, Natasha Jaques

Main category: cs.LG

TL;DR: PLUS is a reinforcement learning framework that learns to generate personalized user preference summaries to improve reward modeling for LLM personalization, achieving significant improvements over standard RLHF.

Details

Motivation: Standard RLHF assumes uniform user preferences across the entire population, but users have diverse preferences and goals that require personalized AI assistant responses.

Method: Uses reinforcement learning to train both a user-summarization model and reward model simultaneously, creating text-based summaries of each user’s preferences, characteristics, and conversation history that condition the reward model.

Result: Achieved 11-77% improvement in reward model accuracy, 25% improvement over best personalized RLHF technique, and 72% win rate vs 28% for default GPT-4o in zero-shot personalization.

Conclusion: PLUS enables effective personalization of LLM responses by learning interpretable user preference summaries, improving performance with new users and topics while providing greater transparency and user control.

Abstract: As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users’ preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model, meaning it assumes that everyone’s preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user’s preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. Both the user-summarization model and reward model are trained simultaneously, creating an online co-adaptation loop. We show that in contrast to the standard Bradley-Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving a 11-77% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25% improvement over the best personalized RLHF technique; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72% win rate compared to 28% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels, and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.

[1559] Muon Optimizes Under Spectral Norm Constraints

Lizhang Chen, Jonathan Li, Qiang Liu

Main category: cs.LG

TL;DR: This paper provides a theoretical analysis of the Muon optimizer by showing it belongs to the Lion-K family with nuclear norm, revealing its implicit spectral norm regularization.

Details

Motivation: To bridge the theoretical gap in understanding the Muon optimizer's empirical performance by placing it within the established Lion-K framework.

Method: Theoretical analysis showing Muon corresponds to Lion-K with nuclear norm, leveraging existing Lion-K theoretical results to establish implicit regularization properties.

Result: Demonstrated that Muon with decoupled weight decay implicitly solves optimization problems constraining the spectral norm of weight matrices.

Conclusion: This theoretical perspective demystifies Muon’s regularization effects and enables natural generalizations through different convex maps K, expanding the class of implicitly regularized optimization algorithms.

Abstract: The pursuit of faster optimization algorithms remains an active and important research direction in deep learning. Recently, the Muon optimizer [JJB+24] has demonstrated promising empirical performance, but its theoretical foundation remains less understood. In this paper, we bridge this gap and provide a theoretical analysis of Muon by placing it within the Lion-$\mathcal{K}$ family of optimizers [CLLL24]. Specifically, we show that Muon corresponds to Lion-$\mathcal{K}$ when equipped with the nuclear norm, and we leverage the theoretical results of Lion-$\mathcal{K}$ to establish that Muon (with decoupled weight decay) implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective not only demystifies the implicit regularization effects of Muon but also leads to natural generalizations through varying the choice of convex map $\mathcal{K}$, allowing for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.

[1560] GRID: Scalable Task-Agnostic Prompt-Based Continual Learning for Language Models

Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji

Main category: cs.LG

TL;DR: GRID is a unified framework for prompt-based continual learning that addresses performance degradation and scalability issues through enhanced decoding mechanisms and gradient-guided prompt compression.

Details

Motivation: Existing prompt-based continual learning methods face two major challenges: severe performance degradation on earlier tasks under task-agnostic inference, and limited scalability due to prompt memory accumulation as task sequences grow.

Method: GRID incorporates a decoding mechanism with representative inputs, automatic task identification, and constrained decoding for backward transfer enhancement. It also uses gradient-guided prompt selection to compress less informative prompts into a single aggregated representation.

Result: Extensive experiments show GRID improves average accuracy and backward transfer, achieves competitive forward transfer, and substantially reduces prompt memory usage on long-sequence and negative transfer benchmarks.

Conclusion: GRID provides a scalable and memory-efficient solution for prompt-based continual learning that addresses key limitations of existing methods while maintaining strong performance across task sequences.

Abstract: Prompt-based continual learning (CL) provides a parameter-efficient approach for adapting large language models (LLMs) across task sequences. However, most existing methods rely on task-aware inference and maintain a growing set of task-specific prompts, which introduces two major challenges: (1) severe performance degradation on earlier tasks under task-agnostic inference, and (2) limited scalability due to prompt memory accumulation as task sequences grow. In this paper, we present GRID, a unified framework designed to address these challenges. GRID incorporates a decoding mechanism that enhances backward transfer by leveraging representative inputs, automatic task identification, and constrained decoding. Furthermore, it employs a gradient-guided prompt selection strategy to compress less informative prompts into a single aggregated representation, ensuring scalable and memory-efficient continual learning. Extensive experiments on long-sequence and negative transfer benchmarks show that GRID improves average accuracy and backward transfer, achieves competitive forward transfer, and substantially reduces prompt memory usage.

[1561] Reward-Agnostic Prompt Optimization for Text-to-Image Diffusion Models

Semin Kim, Yeonwoo Cha, Jaehoon Yoo, Seunghoon Hong

Main category: cs.LG

TL;DR: RATTPO is a reward-agnostic test-time prompt optimization method that improves text-to-image generation prompts by iteratively searching with LLMs using optimization trajectory and reward-aware hints, without requiring reward-specific task descriptions.

Details

Motivation: Existing automated prompt engineering methods are specialized for specific reward configurations and perform poorly when applied to new scenarios with different reward models, creating a need for a flexible approach that works across various reward setups.

Method: RATTPO iteratively searches for optimized prompts by querying LLMs using optimization trajectory and a novel reward-aware feedback signal (hint) as context, without requiring reward-specific task descriptions.

Result: RATTPO effectively enhances prompts across diverse reward setups (aesthetics, human preference, spatial relationships) and runs 4.8x faster than naive baselines. With sufficient budget, it achieves comparable performance to learning-based methods that require fine-tuning.

Conclusion: RATTPO provides a versatile and efficient approach for test-time prompt optimization that works across various reward scenarios without modification, outperforming specialized methods in flexibility and search efficiency.

Abstract: We investigate a general approach for improving user prompts in text-to-image (T2I) diffusion models by finding prompts that maximize a reward function specified at test-time. Although diverse reward models are used for evaluating image generation, existing automated prompt engineering methods typically target specific reward configurations. Consequently, these specialized designs exhibit suboptimal performance when applied to new prompt engineering scenarios involving different reward models. To address this limitation, we introduce RATTPO (Reward-Agnostic Test-Time Prompt Optimization), a flexible test-time optimization method applicable across various reward scenarios without modification. RATTPO iteratively searches for optimized prompts by querying large language models (LLMs) \textit{without} requiring reward-specific task descriptions. Instead, it uses the optimization trajectory and a novel reward-aware feedback signal (termed a “hint”) as context. Empirical results demonstrate the versatility of RATTPO, effectively enhancing user prompts across diverse reward setups that assess various generation aspects, such as aesthetics, general human preference, or spatial relationships between objects. RATTPO surpasses other test-time search baselines in search efficiency, running 4.8 times faster than naive reward-agnostic test-time search baseline on average. Furthermore, with sufficient inference budget, it can achieve comparable performance to learning-based baselines that require reward-specific fine-tuning. The code is available at https://github.com/seminkim/RATTPO.

[1562] Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Derek Li, Jiaming Zhou, Leo Maxime Brunswic, Abbas Ghaddar, Qianyi Sun, Liheng Ma, Yu Luo, Dong Li, Mark Coates, Jianye Hao, Yingxue Zhang

Main category: cs.LG

TL;DR: Omni-Thinker is a unified RL framework that scales LLMs across diverse tasks using hybrid rewards and backward-transfer-guided scheduling, achieving significant performance gains over joint training and model merging.

Details

Motivation: To develop general-purpose AI that can handle both structured reasoning and open-ended generation by scaling LLMs across diverse tasks effectively.

Method: Combines hybrid rewards (rule-based verifiable signals + preference-based LLM-as-a-Judge evaluations) with backward-transfer-guided scheduling based on accuracy backward transfer (BWT).

Result: Achieved gains of 6.2% over joint training and 12.4% over model merging across four domains, with accurate predictions of curriculum outcomes using simple accuracy transfer assumptions.

Conclusion: BWT-aware scheduling and hybrid supervision are crucial for scaling RL-based post-training toward general-purpose LLMs.

Abstract: The pursuit of general-purpose artificial intelligence depends on large language models (LLMs) that can handle both structured reasoning and open-ended generation. We present Omni-Thinker, a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer-guided scheduling. Hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance. Experiments across four domains show gains of 6.2% over joint training and 12.4% over model merging. Moreover, we demonstrate that simple assumptions on accuracy transfer yield accurate predictions of curriculum outcomes, with entropy dynamics explaining deviations due to generative tasks. These findings underscore the importance of BWT-aware scheduling and hybrid supervision for scaling RL-based post-training toward general-purpose LLMs.

[1563] Origins of Creativity in Attention-Based Diffusion Models

Emma Finn, T. Anderson Keller, Manos Theodosis, Demba E. Ba

Main category: cs.LG

TL;DR: This paper extends diffusion model theory to include self-attention, showing how it enables globally consistent image generation beyond patch-level mosaics.

Details

Motivation: To understand how creativity emerges in diffusion models, particularly the role of self-attention in generating globally coherent images rather than just patch-wise mosaics.

Method: Extend existing diffusion theory to models with CNN+self-attention architecture, analyze how self-attention affects score matching, and empirically validate on a crafted dataset.

Result: The theory suggests self-attention induces globally consistent arrangement of local features in generated samples, which is verified empirically.

Conclusion: Self-attention in diffusion models enables global image consistency beyond the patch-level mosaics produced by CNN-only architectures, providing a theoretical foundation for understanding creativity in diffusion.

Abstract: As diffusion models have become the tool of choice for image generation and as the quality of the images continues to improve, the question of how creativity' originates in diffusion has become increasingly important. The score matching perspective on diffusion has proven particularly fruitful for understanding how and why diffusion models generate images that remain plausible while differing significantly from their training images. In particular, as explained in (Kamb \& Ganguli, 2024) and others, e.g., (Ambrogioni, 2023), theory suggests that if our score matching were optimal, we would only be able to recover training samples through our diffusion process. However, as shown by Kamb \& Ganguli, (2024), in diffusion models where the score is parametrized by a simple CNN, the inductive biases of the CNN itself (translation equivariance and locality) allow the model to generate samples that globally do not match any training samples, but are rather patch-wise mosaics’. Notably, however, this theory does not extend to describe the role of self-attention in this process. In this work, we take a preliminary step in this direction to extend this theory to the case of diffusion models whose score is parametrized by a CNN with a final self-attention layer. We show that our theory suggests that self-attention will induce a globally image-consistent arrangement of local features beyond the patch-level in generated samples, and we verify this behavior empirically on a carefully crafted dataset.

[1564] GLANCE: Graph Logic Attention Network with Cluster Enhancement for Heterophilous Graph Representation Learning

Zhongtian Sun, Anoushka Harit, Alexandra Cristea, Christl A. Donnelly, Pietro Liò

Main category: cs.LG

TL;DR: GLANCE is a novel GNN framework that integrates logic-guided reasoning, dynamic graph refinement, and adaptive clustering to address limitations of traditional GNNs on heterophilous graphs.

Details

Motivation: Traditional GNNs struggle on heterophilous graphs where connected nodes differ in features or class labels, due to indiscriminate neighbor aggregation and insufficient higher-order structural pattern incorporation.

Method: GLANCE combines a logic layer for interpretable embeddings, multi-head attention-based edge pruning for denoising, and clustering mechanisms for capturing global patterns.

Result: Experimental results on benchmark datasets (Cornell, Texas, Wisconsin) show GLANCE achieves competitive performance and offers robust, interpretable solutions for heterophilous graphs.

Conclusion: GLANCE is a lightweight, adaptable framework uniquely suited for heterophilous graph challenges, providing both performance improvements and interpretability.

Abstract: Graph Neural Networks (GNNs) have demonstrated significant success in learning from graph-structured data but often struggle on heterophilous graphs, where connected nodes differ in features or class labels. This limitation arises from indiscriminate neighbor aggregation and insufficient incorporation of higher-order structural patterns. To address these challenges, we propose GLANCE (Graph Logic Attention Network with Cluster Enhancement), a novel framework that integrates logic-guided reasoning, dynamic graph refinement, and adaptive clustering to enhance graph representation learning. GLANCE combines a logic layer for interpretable and structured embeddings, multi-head attention-based edge pruning for denoising graph structures, and clustering mechanisms for capturing global patterns. Experimental results in benchmark datasets, including Cornell, Texas, and Wisconsin, demonstrate that GLANCE achieves competitive performance, offering robust and interpretable solutions for heterophilous graph scenarios. The proposed framework is lightweight, adaptable, and uniquely suited to the challenges of heterophilous graphs.

[1565] Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Hanchao Yu, Minjia Zhang, Klara Nahrstedt

Main category: cs.LG

TL;DR: Inference-time scaling techniques that benefit LLMs show limited effectiveness for RL-trained VLMs due to weak self-verification capabilities across visual and textual modalities.

Details

Motivation: To investigate whether inference-time scaling methods (like decoding-time scaling and self-refinement) that substantially improve reasoning in LLMs similarly benefit vision-language models, especially those fine-tuned with RL.

Method: Extensive evaluation of inference-time strategies including majority vote, best-of-N with self-verification, and analysis of RL-tuned model behaviors like the ‘A-ha moment’ across vision-language models.

Result: Majority vote enhances VLM performance but significantly outperforms verification-centric methods. RL-tuned VLMs exhibit weak self-verification across both visual and textual modalities, limiting inference-time scaling effectiveness.

Conclusion: Current RL-trained VLMs have fundamental limitations in self-verification capabilities that prevent inference-time scaling methods from achieving the same performance gains as seen in LLMs.

Abstract: Inference-time techniques such as decoding-time scaling and self-refinement have been shown to substantially improve reasoning in large language models (LLMs), driven by emergent self-correction and self-verification behaviors often elicited through reinforcement learning (RL). In this work, we investigate whether these inference-time scaling methods similarly benefit vision-language models (VLMs), especially those fine-tuned with RL. Through extensive evaluation, we find that while strategies like majority vote and best-of-N with self-verification enhance VLM performance, majority vote significantly outperforms verification-centric ones. Furthermore, inference time scaling behaviors commonly associated with RL-tuned models, such as the ‘A-ha moment,’ do not yield consistent performance gains. Our analysis identifies a key limitation: current RL-trained VLMs exhibit weak self-verification across both visual and textual modalities, limiting the effectiveness of inference-time scaling.

[1566] Mitigating Semantic Collapse in Generative Personalization with Test-Time Embedding Adjustment

Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung

Main category: cs.LG

TL;DR: The paper addresses semantic collapsing in generative personalization, where learned visual concepts dominate other concepts in multi-concept prompts, and proposes a training-free method to adjust embedding magnitude and direction at inference time to mitigate this issue.

Details

Motivation: Semantic collapsing reduces semantic richness in multi-concept prompts, leading to simplified output images that fail to capture intended concepts, caused by unconstrained optimization allowing embedding drift.

Method: A training-free method that adjusts the magnitude and direction of pre-trained embeddings at inference time to prevent semantic collapsing.

Result: The method significantly improves text-image alignment across different personalization methods and diverse use cases.

Conclusion: The proposed embedding adjustment technique effectively mitigates semantic collapsing in generative personalization without requiring retraining.

Abstract: In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like “a photo of $V$ wearing glasses and playing guitar” into simpler, less contextually rich forms such as “a photo of $V$” but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimisation, which allows the learned embedding $V$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at https://github.com/tuananhbui89/Embedding-Adjustment

[1567] Cooperative Sheaf Neural Networks

André Ribeiro, Ana Luiza Tenório, Juan Belieni, Amauri H. Souza, Diego Mesquita

Main category: cs.LG

TL;DR: The paper introduces Cooperative Sheaf Neural Networks (CSNNs) to enable cooperative behavior in sheaf diffusion by using directed graphs, addressing limitations of existing methods that lack message directionality.

Details

Motivation: To combine the benefits of sheaf diffusion (handling heterophilic data, avoiding oversmoothing) with cooperative message passing (flexible information diffusion), and overcome the limitation that existing sheaf diffusion methods cannot exhibit cooperative behavior due to lack of message directionality.

Method: Introduces cellular sheaves over directed graphs with in- and out-degree Laplacians, and proposes Cooperative Sheaf Neural Networks (CSNNs) that allow nodes to selectively attend to distant nodes while ignoring others.

Result: CSNNs show overall better performance compared to prior sheaf diffusion methods and cooperative graph neural networks, with theoretical analysis showing they can mitigate oversquashing.

Conclusion: Cooperative behavior can be achieved in sheaf diffusion through the proposed directed graph approach with CSNNs, overcoming limitations of existing methods and providing better performance.

Abstract: Sheaf diffusion has recently emerged as a promising design pattern for graph representation learning due to its inherent ability to handle heterophilic data and avoid oversmoothing. Meanwhile, cooperative message passing has also been proposed as a way to enhance the flexibility of information diffusion by allowing nodes to independently choose whether to propagate/gather information from/to neighbors. A natural question ensues: is sheaf diffusion capable of exhibiting this cooperative behavior? Here, we provide a negative answer to this question. In particular, we show that existing sheaf diffusion methods fail to achieve cooperative behavior due to the lack of message directionality. To circumvent this limitation, we introduce the notion of cellular sheaves over directed graphs and characterize their in- and out-degree Laplacians. We leverage our construction to propose Cooperative Sheaf Neural Networks (CSNNs). Theoretically, we characterize the receptive field of CSNN and show it allows nodes to selectively attend (listen) to arbitrarily far nodes while ignoring all others in their path, potentially mitigating oversquashing. Our experiments show that CSNN presents overall better performance compared to prior art on sheaf diffusion as well as cooperative graph neural networks.

[1568] Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Zhaomin Wu, Mingzhe Du, See-Kiong Ng, Bingsheng He

Main category: cs.LG

TL;DR: This paper investigates self-initiated deception in LLMs on benign prompts, proposing a framework with two statistical metrics to quantify deception likelihood based on psychological principles.

Details

Motivation: To address the underexplored risk of intentional deception in LLMs, moving beyond human-induced deception to study self-initiated deception in real-world interaction scenarios.

Method: Proposed a framework using Contact Searching Questions (CSQ) with two metrics: Deceptive Intention Score (measures bias toward hidden objectives) and Deceptive Behavior Score (measures inconsistency between internal belief and expressed output).

Result: Evaluation of 16 leading LLMs showed both deception metrics rise in parallel and escalate with task difficulty. Increasing model capacity doesn’t consistently reduce deception.

Conclusion: Self-initiated deception poses a significant challenge for LLM development, as it persists even with increased model capacity and escalates with task complexity.

Abstract: Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness critical. A significant and underexplored risk is intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective. Existing studies typically induce deception by explicitly setting a hidden objective through prompting or fine-tuning, which may not reflect real-world human-LLM interactions. Moving beyond such human-induced deception, we investigate LLMs’ self-initiated deception on benign prompts. To address the absence of ground truth, we propose a framework based on Contact Searching Questions~(CSQ). This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model’s bias toward a hidden objective. The second, the Deceptive Behavior Score, measures the inconsistency between the LLM’s internal belief and its expressed output. Evaluating 16 leading LLMs, we find that both metrics rise in parallel and escalate with task difficulty for most models. Moreover, increasing model capacity does not always reduce deception, posing a significant challenge for future LLM development.

[1569] JAX-MPM: A Learning-Augmented Differentiable Meshfree Framework for GPU-Accelerated Lagrangian Simulation and Geophysical Inverse Modeling

Honghui Du, QiZhi He

Main category: cs.LG

TL;DR: JAX-MPM is a differentiable meshfree solver using the material point method, implemented in JAX for GPU acceleration and automatic differentiation, enabling efficient forward and inverse modeling for geomechanics applications.

Details

Motivation: To leverage differentiable programming for scientific computing, enabling automatic differentiation through simulation pipelines and supporting both forward and inverse modeling with GPU acceleration.

Method: Uses a hybrid Eulerian-Lagrangian framework based on the material point method (MPM) implemented in JAX, capturing large deformations, frictional contact, and inelastic material behavior with GPU acceleration.

Result: Validated through 2D/3D benchmarks showing high performance (2.7M particles in 22-98 seconds for 1000 steps) and successful inverse modeling tasks like velocity reconstruction and friction estimation from sparse data.

Conclusion: JAX-MPM establishes a unified differentiable meshfree platform that advances fast physical simulation and data assimilation for complex solid and geophysical systems.

Abstract: Differentiable programming has emerged as a powerful paradigm in scientific computing, enabling automatic differentiation through simulation pipelines and naturally supporting both forward and inverse modeling. We present JAX-MPM, a general-purpose differentiable meshfree solver based on the material point method (MPM) and implemented in the modern JAX architecture. The solver adopts a hybrid Eulerian-Lagrangian framework to capture large deformations, frictional contact, and inelastic material behavior, with emphasis on geomechanics and geophysical hazard applications. Leveraging GPU acceleration and automatic differentiation, JAX-MPM enables efficient gradient-based optimization directly through its time-stepping solvers and supports joint training of physical models with deep learning to infer unknown system conditions and uncover hidden constitutive parameters. We validate JAX-MPM through a series of 2D and 3D benchmark simulations, including dam-break and granular collapse problems, demonstrating both numerical accuracy and GPU-accelerated performance. Results show that a high-resolution 3D granular cylinder collapse with 2.7 million particles completes 1000 time steps in approximately 22 seconds (single precision) and 98 seconds (double precision) on a single GPU. Beyond high-fidelity forward modeling, we demonstrate the framework’s inverse modeling capabilities through tasks such as velocity field reconstruction and the estimation of spatially varying friction from sparse data. In particular, JAX-MPM accommodates data assimilation from both Lagrangian (particle-based) and Eulerian (region-based) observations, and can be seamlessly coupled with neural network representations. These results establish JAX-MPM as a unified and scalable differentiable meshfree platform that advances fast physical simulation and data assimilation for complex solid and geophysical systems.

[1570] MoQE: Improve Quantization Model performance via Mixture of Quantization Experts

Jinhao Zhang, Yunquan Zhang, Boyang Zhang, Zeyu Liu, Daning Cheng

Main category: cs.LG

TL;DR: MoQE is a quantization inference framework that uses Mixture-of-Experts architecture to combine multiple quantization variants of a model, dynamically routing inputs to specialized quantization experts to reduce accuracy degradation while maintaining efficiency.

Details

Motivation: Quantization reduces model deployment costs but inevitably causes accuracy degradation. The paper aims to improve quantization model performance by leveraging multiple specialized quantization variants.

Method: Proposes MoQE framework that combines multiple quantization variants as ‘quantization experts’ and uses lightweight structure-aware router models to dynamically route inputs to the most suitable expert based on input characteristics.

Result: Experimental evaluations on ResNet, LLaMA, and Qwen models across ImageNet, WikiText, C4, and OpenWebText datasets show MoQE achieves performance comparable to state-of-the-art quantization models without significant inference latency increase.

Conclusion: MoQE effectively alleviates performance degradation in single quantization models through specialized quantization experts and dynamic routing, enabling efficient deployment while maintaining competitive performance.

Abstract: Quantization method plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts( abbr. MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance of quantization models. MoQE combines multiple quantization variants of one full-precision model as specialized “quantization experts” and dynamically routes input data to the most suitable expert based on its characteristics. MoQE alleviates the performance degradation commonly seen in single quantization models through specialization quantization expert models. We design lightweight, structure-aware router models tailored for both CV and NLP tasks. Experimental evaluations on ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to SOTA quantization model, without incurring significant increases in inference latency.

[1571] Discrete Diffusion Trajectory Alignment via Stepwise Decomposition

Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Yisong Yue, Stefano Ermon

Main category: cs.LG

TL;DR: Proposes an offline preference optimization method for discrete diffusion models that decomposes trajectory alignment into stepwise objectives by matching per-step posteriors, enabling efficient optimization compatible with arbitrary reward functions.

Details

Motivation: To improve discrete diffusion models for sequence data by aligning them with rewards, inspired by RL success in language models, while avoiding the inefficiency of applying rewards only on final outputs.

Method: Decomposes trajectory alignment into stepwise objectives by matching per-step posterior distributions, enabling offline preference optimization that is compatible with arbitrary reward functions.

Result: Achieves up to 12% improvement in predicted activity on DNA sequence design over RL baselines, and improves GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.

Conclusion: The proposed stepwise alignment framework provides an efficient and effective approach for optimizing discrete diffusion models across multiple domains including DNA, protein, and language modeling.

Abstract: Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.

[1572] Robust Deep Network Learning of Nonlinear Regression Tasks by Parametric Leaky Exponential Linear Units (LELUs) and a Diffusion Metric

Enda D. V. Bigarella

Main category: cs.LG

TL;DR: Proposes a smooth Leaky Exponential Linear Unit activation function with non-zero gradient to improve neural network performance on nonlinear regression tasks, addressing limitations of existing activation functions like ELU, SiLU, RELU, and Leaky-RELU.

Details

Motivation: To address limitations of existing activation functions - smooth functions with vanishing gradients (ELU, SiLU) have limited performance, while non-smooth functions (RELU, Leaky-RELU) cause discontinuity in trained models, impacting overfitting and sensitivity to model parameters.

Method: Introduces a smooth ‘Leaky Exponential Linear Unit’ activation function with non-zero gradient that can be trained, and proposes a novel diffusion-loss metric to evaluate model performance in terms of overfitting.

Result: Demonstrates improved performance with the proposed smooth activation function compared to existing alternatives like ELU, SiLU, RELU, and Leaky-RELU.

Conclusion: The smooth Leaky Exponential Linear Unit with non-zero gradient provides better performance for multidimensional nonlinear data regression by addressing both smoothness and gradient issues present in existing activation functions.

Abstract: This document proposes a parametric activation function (ac.f.) aimed at improving multidimensional nonlinear data regression. It is a established knowledge that nonlinear ac.f’s are required for learning nonlinear datasets. This work shows that smoothness and gradient properties of the ac.f. further impact the performance of large neural networks in terms of overfitting and sensitivity to model parameters. Smooth but vanishing-gradient ac.f’s such as ELU or SiLU (Swish) have limited performance and non-smooth ac.f’s such as RELU and Leaky-RELU further impart discontinuity in the trained model. Improved performance is demonstrated with a smooth “Leaky Exponential Linear Unit”, with non-zero gradient that can be trained. A novel diffusion-loss metric is also proposed to gauge the performance of the trained models in terms of overfitting.

[1573] Contrastive Representations for Temporal Reasoning

Alicja Ziarko, Michal Bortkiewicz, Michal Zawalski, Benjamin Eysenbach, Piotr Milos

Main category: cs.LG

TL;DR: CRTR introduces a negative sampling scheme to remove spurious features in temporal contrastive learning, enabling learned representations to support temporal reasoning for solving puzzles like Sokoban and Rubik’s Cube without external search algorithms.

Details

Motivation: To bridge the gap between perception and planning by developing representations that capture both perceptual and temporal structure, overcoming limitations of standard temporal contrastive learning that often fails due to reliance on spurious features.

Method: Combinatorial Representations for Temporal Reasoning (CRTR) uses a negative sampling scheme to provably remove spurious features and facilitate temporal reasoning in learned representations.

Result: CRTR achieves strong performance on domains with complex temporal structure like Sokoban and Rubik’s Cube. For Rubik’s Cube, it learns representations that generalize across all initial states and solves puzzles using fewer search steps than BestFS, though with longer solutions.

Conclusion: CRTR is the first method that efficiently solves arbitrary Rubik’s Cube states using only learned representations without relying on external search algorithms, demonstrating the emergence of temporal reasoning from representation learning.

Abstract: In classical AI, perception relies on learning state-based representations, while planning, which can be thought of as temporal reasoning over action sequences, is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Combinatorial Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik’s Cube. In particular, for the Rubik’s Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS, though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.

[1574] Warm Starts Accelerate Conditional Diffusion

Jonas Scholz, Richard E. Turner

Main category: cs.LG

TL;DR: Warm-Start Diffusion (WSD) accelerates conditional generation by using a deterministic model to provide an informed prior instead of starting from random noise, reducing required diffusion steps from hundreds to just 4-12.

Details

Motivation: Diffusion and flow-matching models are slow, requiring hundreds of function evaluations. There's a need to accelerate the generation process while maintaining quality.

Method: WSD uses a deterministic warm-start model to predict an informed prior distribution conditioned on input context, reducing the distance the generative process must traverse.

Result: WSD substantially outperforms standard diffusion, generating realistic samples with only 4-6 function evaluations and saturating performance with 10-12 steps.

Conclusion: WSD is a simple, effective method that accelerates conditional generation, works with any diffusion/flow matching algorithm, and synergizes with other fast sampling techniques.

Abstract: Generative models like diffusion and flow-matching create high-fidelity samples by progressively refining noise. The refinement process is notoriously slow, often requiring hundreds of function evaluations. We introduce Warm-Start Diffusion (WSD), a method that uses a simple, deterministic model to dramatically accelerate conditional generation by providing a better starting point. Instead of starting generation from an uninformed $N(\boldsymbol{0}, I)$ prior, our deterministic warm-start model predicts an informed prior $N(\hat{\boldsymbol{\mu}}_C, \text{diag}(\hat{\boldsymbol{\sigma}}^2_C))$, whose moments are conditioned on the input context $C$. This warm start substantially reduces the distance the generative process must traverse, and therefore the number of diffusion steps required, particularly when the context $C$ is strongly informative. WSD is applicable to any standard diffusion or flow matching algorithm, is orthogonal to and synergistic with other fast sampling techniques like efficient solvers, and is simple to implement. We test WSD in a variety of settings, and find that it substantially outperforms standard diffusion in the efficient sampling regime, generating realistic samples using only 4-6 function evaluations, and saturating performance with 10-12.

[1575] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang

Main category: cs.LG

TL;DR: RLVR’s potential is limited by depth (hardest solvable problems) and breadth (training instances per iteration). The paper introduces DARS to address depth neglect by re-weighting hard problems, and shows that scaling breadth with large batch sizes improves reasoning performance.

Details

Motivation: Current RLVR approaches like GRPO suffer from systematic bias that disproportionately weights medium-accuracy samples while neglecting low-accuracy instances crucial for pushing reasoning boundaries. Both depth and breadth dimensions are under-explored in existing methods.

Method: 1) DARS (Difficulty Adaptive Rollout Sampling): Re-weights hard problems through targeted multi-stage rollouts to increase positive rollouts for difficult instances. 2) Large-breadth training: Scales batch size aggressively and replaces PPO’s mini-batch iterations with full-batch updates over multiple epochs.

Result: DARS delivers consistent Pass@K gains without extra inference cost. Large-breadth training significantly enhances Pass@1 performance and sustains high token-level entropy, indicating continued exploration. DARS-B combines both approaches for simultaneous gains in Pass@K and Pass@1.

Conclusion: Breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR. Addressing both dimensions is key to unleashing the full reasoning power of RLVR, with DARS and large-breadth training providing complementary benefits.

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest problem a model can sample; Breadth-the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO’s mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.

[1576] FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data

Tao Feng, Haozhen Zhang, Zijie Lei, Pengrui Han, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jiaxuan You

Main category: cs.LG

TL;DR: LLMFusionBench is a benchmark for fusing multiple LLMs’ capabilities using log data, and FusionFactory is a framework with query, thought, and model-level fusion that outperforms individual LLMs.

Details

Motivation: The diversity of LLMs creates valuable multi-LLM log data that can be leveraged to fuse complementary capabilities, but practical fusion must be compatible with real-world serving scenarios and flexible across different pipeline stages.

Method: Proposed FusionFactory framework with three levels: query-level fusion via LLM routers, thought-level fusion using retrieved reasoning templates, and model-level fusion via distillation from top-ranked responses.

Result: FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks in LLMFusionBench, with optimal fusion configurations varying across different benchmarks.

Conclusion: Multi-LLM log data serves as a practical foundation for fusing diverse LLM capabilities, and the varying optimal configurations highlight the importance of adaptive fusion strategies.

Abstract: The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data. This naturally leads to the question of whether such logs can be fully leveraged to fuse LLMs’ complementary capabilities. Although prior work has explored various strategies for integrating multiple LLMs, we argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs (e.g., fine-tuning and inference stages). To this end, we introduce LLMFusionBench, a large-scale benchmark for LLM fusion that spans 14 tasks across five domains, with responses from 20 open-source LLMs (8B–671B) totaling 103M tokens. Building on LLMFusionBench, we propose FusionFactory, a systematic framework with three elaborated levels: (1) query-level fusion via tailored LLM routers, (2) thought-level fusion leveraging retrieved abstract reasoning templates, and (3) model-level fusion via distillation from top-ranked responses. Experiments show that FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with the optimal fusion configuration varying across benchmarks, highlighting the promise of multi-LLM log data as a practical foundation for fusing diverse LLM capabilities.

[1577] Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye

Main category: cs.LG

TL;DR: Training on the hardest examples (where base model fails) yields dramatic performance gains up to 47% in GRPO fine-tuning, while easy examples provide minimal improvements of 3-15%. Hard examples maintain learning signals throughout training.

Details

Motivation: Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting data procurement. The research investigates whether example difficulty affects GRPO training effectiveness.

Method: Compare selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks. Focus on training on the hardest 10% of examples where the base model fails most often.

Result: Training on hard examples yields 47% performance gains vs 3-15% for easy examples. Hard examples maintain outcome variance (mixed success/failure) that generates learning signals, while easy examples quickly converge to consistent success. Hard-trained models also show superior out-of-distribution generalization on AIME2025 benchmark.

Conclusion: When budget-constrained, prioritize collecting and annotating examples where the base model struggles, as these drive nearly all learning value in GRPO fine-tuning due to their maintained outcome variance.

Abstract: Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate whether example difficulty affects GRPO training effectiveness by comparing selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks. Training on the hardest 10% of examples (those where the base model fails most often) yields dramatic performance gains up to 47%, while easy examples produce minimal improvements of 3-15%. This occurs because GRPO requires outcome variance to generate learning signals; hard examples maintain mixed success/failure outcomes throughout training while easy examples quickly converge to consistent success, eliminating learning opportunities. Moreover, models trained on hard examples show superior out-of-distribution generalization, with only hard-trained models achieving meaningful gains on the AIME2025 benchmark. Our findings provide clear guidance: when budget-constrained, prioritize collecting and annotating examples where your base model struggles, as these drive nearly all learning value in GRPO fine-tuning

[1578] A Graph-in-Graph Learning Framework for Drug-Target Interaction Prediction

Yuehua Song, Yong Gao

Main category: cs.LG

TL;DR: A novel Graph-in-Graph (GiG) framework combining transductive and inductive learning for drug-target interaction prediction, outperforming existing methods.

Details

Motivation: Existing machine learning approaches struggle to effectively integrate diverse features of drugs, targets, and their interactions in drug-target interaction prediction.

Method: GiG model represents drug and target molecular structure graphs as meta-nodes in a drug-target interaction graph, enabling detailed exploration of intricate relationships through combined transductive and inductive learning.

Result: GiG significantly outperforms existing approaches across all evaluation metrics on a compiled benchmark of drug SMILES, protein sequences, and interaction data.

Conclusion: The integration of different learning paradigms and interaction data provides substantial benefits for accurate drug-target interaction prediction.

Abstract: Accurately predicting drug-target interactions (DTIs) is pivotal for advancing drug discovery and target validation techniques. While machine learning approaches including those that are based on Graph Neural Networks (GNN) have achieved notable success in DTI prediction, many of them have difficulties in effectively integrating the diverse features of drugs, targets and their interactions. To address this limitation, we introduce a novel framework to take advantage of the power of both transductive learning and inductive learning so that features at molecular level and drug-target interaction network level can be exploited. Within this framework is a GNN-based model called Graph-in-Graph (GiG) that represents graphs of drug and target molecular structures as meta-nodes in a drug-target interaction graph, enabling a detailed exploration of their intricate relationships. To evaluate the proposed model, we have compiled a special benchmark comprising drug SMILES, protein sequences, and their interaction data, which is interesting in its own right. Our experimental results demonstrate that the GiG model significantly outperforms existing approaches across all evaluation metrics, highlighting the benefits of integrating different learning paradigms and interaction data.

[1579] Speculative Safety-Aware Decoding

Xuekang Wang, Shengyu Zhu, Xueqi Cheng

Main category: cs.LG

TL;DR: SSD is a lightweight decoding-time method that enhances LLM safety against jailbreak attacks using speculative sampling and dynamic switching between utility and safety priorities, while also accelerating inference.

Details

Motivation: Jailbreak attacks continue to exploit LLM vulnerabilities despite alignment efforts, and tuning large models is resource-intensive with inconsistent performance guarantees.

Method: Uses speculative sampling with a small safety-aware model, quantifies jailbreak risks via match ratio, dynamically switches decoding schemes, and samples from combined distributions of original and small models.

Result: SSD successfully equips large models with desired safety properties, maintains helpfulness for benign queries, and accelerates inference time.

Conclusion: SSD provides an effective decoding-time approach to enhance LLM safety against jailbreak attacks while improving inference efficiency.

Abstract: Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses this desired property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes to prioritize utility or safety, to handle the challenge of different model capacities. The output token is then sampled from a new distribution that combines the distributions of the original and the small models. Experimental results show that SSD successfully equips the large model with the desired safety property, and also allows the model to remain helpful to benign queries. Furthermore, SSD accelerates the inference time, thanks to the speculative sampling design.

[1580] U-Cast: Learning Hierarchical Structures for High-Dimensional Time Series Forecasting

Juntong Ni, Shiyu Wang, Zewen Liu, Xiaoming Shi, Xinyue Zhong, Zhou Ye, Wei Jin

Main category: cs.LG

TL;DR: U-Cast is a novel architecture for High-Dimensional Time Series Forecasting that learns hierarchical channel structures using query-based attention and full-rank regularization, achieving superior accuracy and efficiency on the new Time-HD benchmark.

Details

Motivation: Traditional time series forecasting models struggle with high-dimensional datasets (thousands of channels) where complex hierarchical channel correlations exist, as existing approaches either ignore these interactions or fail to scale effectively.

Method: Proposed U-Cast uses channel-dependent forecasting with innovative query-based attention to learn latent hierarchical channel structures, and adds full-rank regularization during training to disentangle highly correlated channel representations.

Result: Theoretical analysis shows exploiting cross-channel information lowers forecasting risk. Experiments on the new Time-HD benchmark demonstrate U-Cast surpasses strong baselines in both accuracy and efficiency.

Conclusion: U-Cast and the Time-HD benchmark provide a solid foundation for future high-dimensional time series forecasting research, addressing the scalability and correlation modeling challenges in this domain.

Abstract: Time series forecasting (TSF) is a central problem in time series analysis. However, as the number of channels in time series datasets scales to the thousands or more, a scenario we define as High-Dimensional Time Series Forecasting (HDTSF), it introduces significant new modeling challenges that are often not the primary focus of traditional TSF research. HDTSF is challenging because the channel correlation often forms complex and hierarchical patterns. Existing TSF models either ignore these interactions or fail to scale as dimensionality grows. To address this issue, we propose U-Cast, a channel-dependent forecasting architecture that learns latent hierarchical channel structures with an innovative query-based attention. To disentangle highly correlated channel representation, U-Cast adds a full-rank regularization during training. We also release Time-HD, the first benchmark of large, diverse, high-dimensional datasets. Our theory shows that exploiting cross-channel information lowers forecasting risk, and experiments on Time-HD demonstrate that U-Cast surpasses strong baselines in both accuracy and efficiency. Together, U-Cast and Time-HD provide a solid basis for future HDTSF research.

[1581] Type-Compliant Adaptation Cascades: Adapting Programmatic LM Workflows to Data

Chu-Cheng Lin, Daiyi Peng, Yifeng Lu, Ming Zhang, Eugene Ie

Main category: cs.LG

TL;DR: TACs is a framework that treats LLM workflows as typed probabilistic programs, enabling gradient-based training and improving performance on structured tasks.

Details

Motivation: Current LLM composition methods are brittle and struggle with formal compliance for structured tasks, requiring a more robust approach.

Method: Recast workflow adaptation as learning typed probabilistic programs, treating the entire workflow as an unnormalized joint distribution for gradient-based training.

Result: Significant performance improvements on structured tasks: FinQA (12.0% to 24.7%), MGSM-SymPy (57.1% to 75.9%), MGSM (1.6% to 27.3%), MuSR (36.5% to 62.6%).

Conclusion: TACs provide a robust and theoretically grounded paradigm for developing reliable, task-compliant LLM systems.

Abstract: Reliably composing Large Language Models (LLMs) for complex, multi-step workflows remains a significant challenge. The dominant paradigm – optimizing discrete prompts in a pipeline – is notoriously brittle and struggles to enforce the formal compliance required for structured tasks. We introduce Type-Compliant Adaptation Cascades (TACs), a framework that recasts workflow adaptation as learning typed probabilistic programs. TACs treat the entire workflow, which is composed of parameter-efficiently adapted LLMs and deterministic logic, as an unnormalized joint distribution. This enables principled, gradient-based training even with latent intermediate structures. We provide theoretical justification for our tractable optimization objective, proving that the optimization bias vanishes as the model learns type compliance. Empirically, TACs significantly outperform state-of-the-art prompt-optimization baselines. Gains are particularly pronounced on structured tasks, improving FinQA from $12.0%$ to $24.7%$ for a Qwen 3 8B model, MGSM-SymPy from $57.1%$ to $75.9%$ for a Gemma 2 27B model, MGSM from $1.6%$ to $27.3%$, and MuSR from $36.5%$ to $62.6%$ for a Gemma 7B model. TACs offer a robust and theoretically grounded paradigm for developing reliable, task-compliant LLM systems.

[1582] Enhancing Stability of Physics-Informed Neural Network Training Through Saddle-Point Reformulation

Dmitry Bylinkin, Mikhail Aleksandrov, Savelii Chezhegov, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: PINNs face performance instability due to complex loss landscapes. The paper reformulates PINN training as a nonconvex-strongly concave saddle-point problem to address this issue.

Details

Motivation: Physics-informed neural networks (PINNs) have performance instability problems caused by the complex landscape of their loss function, which limits their effectiveness in applications.

Method: Reformulate PINN training as a nonconvex-strongly concave saddle-point problem and establish theoretical foundations for this approach.

Result: Extensive experimental evaluation shows the proposed method outperforms current state-of-the-art techniques across various tasks and architectures.

Conclusion: The saddle-point reformulation effectively addresses PINN performance instability and demonstrates superior performance compared to existing methods.

Abstract: Physics-informed neural networks (PINNs) have gained prominence in recent years and are now effectively used in a number of applications. However, their performance remains unstable due to the complex landscape of the loss function. To address this issue, we reformulate PINN training as a nonconvex-strongly concave saddle-point problem. After establishing the theoretical foundation for this approach, we conduct an extensive experimental study, evaluating its effectiveness across various tasks and architectures. Our results demonstrate that the proposed method outperforms the current state-of-the-art techniques.

[1583] What Matters in Data for DPO?

Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang

Main category: cs.LG

TL;DR: DPO performance depends more on chosen response quality than rejected response quality. Contrastive preference data helps primarily by improving chosen samples, and online DPO reduces to supervised fine-tuning on chosen responses.

Details

Motivation: Despite DPO's effectiveness for LLM alignment, the fundamental question of which preference data characteristics are most critical remains unanswered. The paper aims to systematically study how preference data distribution influences DPO performance.

Method: Combined theoretical analysis and extensive empirical experiments across diverse tasks. Theoretical analysis characterized optimal response distribution under DPO and revealed contrastiveness mechanisms. Also studied online DPO setting.

Result: Quality of chosen responses plays dominant role in DPO optimization, while rejected response quality has limited impact. Online DPO effectively reduces to supervised fine-tuning on chosen responses. Experiments consistently showed performance improvements from better chosen responses regardless of rejected response quality.

Conclusion: The study provides practical insights for constructing high-impact preference datasets for LLM alignment, explaining mechanisms behind widely adopted strategies and emphasizing the critical importance of chosen response quality over contrastiveness.

Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental question remains open: what characteristics of preference data are most critical for DPO performance? In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact. Our theoretical analysis characterizes the optimal response distribution under DPO and reveals how contrastiveness between responses helps primarily by improving the chosen samples. We further study an online DPO setting and show it effectively reduces to supervised fine-tuning on the chosen responses. Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing the on-policy data. Our results interpret the mechanism behind some widely adopted strategies and offer practical insights for constructing high-impact preference datasets for LLM alignment.

[1584] Neural Ordinary Differential Equations for Learning and Extrapolating System Dynamics Across Bifurcations

Eva van Tegelen, George van Voorn, Ioannis Athanasiadis, Peter van Heijster

Main category: cs.LG

TL;DR: Neural ODEs can learn and forecast bifurcations in dynamical systems from time-series data, even beyond training parameter ranges.

Details

Motivation: Existing machine learning methods for bifurcation analysis are limited to discrete-time approaches and local bifurcations, lacking continuous-time modeling capabilities.

Method: Use Neural Ordinary Differential Equations to learn parameter-dependent vector fields directly from time-series data.

Result: Neural ODEs successfully recover bifurcation structures and can forecast bifurcations beyond the parameter regions seen in training data.

Conclusion: Neural ODEs provide an effective data-driven framework for learning and predicting bifurcations in continuous-time dynamical systems.

Abstract: Forecasting system behaviour near and across bifurcations is crucial for identifying potential shifts in dynamical systems. While machine learning has recently been used to learn critical transitions and bifurcation structures from data, most studies remain limited as they exclusively focus on discrete-time methods and local bifurcations. To address these limitations, we use Neural Ordinary Differential Equations which provide a data-driven framework for learning system dynamics. Our results show that Neural Ordinary Differential Equations can recover underlying bifurcation structures directly from time-series data by learning parameter-dependent vector fields. Notably, we demonstrate that Neural Ordinary Differential Equations can forecast bifurcations even beyond the parameter regions represented in the training data. We demonstrate our approach on three test cases: the Lorenz system transitioning from non-chaotic to chaotic behaviour, the R"ossler system moving from chaos to period doubling, and a predator-prey model exhibiting collapse via a global bifurcation.

[1585] End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan

Main category: cs.LG

TL;DR: ZeroQAT is a zeroth-order optimization-based quantization-aware training framework that eliminates backpropagation, enabling efficient LLM quantization on resource-limited devices.

Details

Motivation: Existing PTQ methods suffer accuracy loss in low-bit scenarios, while traditional QAT has prohibitive memory costs due to backpropagation, limiting practicality for LLM deployment.

Method: Uses forward-only gradient estimation for zeroth-order optimization, supports both weight and activation quantization, and includes a lightweight variant that freezes and pre-quantizes most parameters to reduce memory usage.

Result: Outperforms PTQ and QAT baselines with significantly less memory, enables fine-tuning 13B models at 2-4 bits on single 8GB GPU, and allows fine-tuning 6.7B models on smartphones.

Conclusion: ZeroQAT provides a practical solution for end-to-end QAT on resource-limited edge devices, demonstrating superior performance and efficiency over existing quantization methods.

Abstract: Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their inability to fine-tune model parameters and often suffer significant accuracy loss in low-bit scenarios. Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs, limiting its practicality for LLM deployment. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. ZeroQAT leverages forward-only gradient estimation to eliminate backpropagation, substantially reducing computational and memory overhead while retaining the benefits of end-to-end optimization. We further introduce a lightweight variant of ZeroQAT for quantized fine-tuning, which freezes and pre-quantizes most parameters to further cut memory usage. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory. For example, ZeroQAT enables fine-tuning of a 13B model at extremely low bit-widths (e.g., 2-4 bits) on a single 8GB GPU, and even allows fine-tuning a 6.7B model on a OnePlus 12 smartphone, demonstrating its practicality for end-to-end QAT on resource-limited edge devices.

[1586] Uncertainty-driven Embedding Convolution

Sungjun Lim, Kangjun Noh, Youngjun Choi, Heeyoung Lee, Kyungwoo Song

Main category: cs.LG

TL;DR: UEC is an uncertainty-driven ensemble method that transforms deterministic embeddings into probabilistic ones, computes adaptive weights based on embedding uncertainty, and uses uncertainty-aware similarity scoring to improve performance and robustness.

Details

Motivation: Existing embedding ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting robustness and reliability in downstream applications.

Method: UEC transforms deterministic embeddings into probabilistic embeddings, computes adaptive ensemble weights based on embedding uncertainty using Bayes-optimal solution under surrogate loss, and employs uncertainty-aware similarity function that incorporates uncertainty into similarity scoring.

Result: Extensive experiments on diverse benchmarks show UEC consistently improves both performance and robustness through principled uncertainty modeling.

Conclusion: UEC provides a theoretically grounded and efficient approach that leverages uncertainty modeling to enhance embedding ensemble performance and robustness.

Abstract: Text embeddings are essential components in modern NLP pipelines. While numerous embedding models have been proposed, their performance varies across domains. This variability motivates the use of ensemble techniques to combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble weights based on embedding uncertainty, grounded in a Bayes-optimal solution under a surrogate loss. Additionally, UEC employs an uncertainty-aware similarity function that directly incorporates uncertainty into the similarity scoring, providing a theoretically grounded and efficient surrogate to distributional distances. Extensive experiments on diverse benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.

[1587] Probabilistic Consistency in Machine Learning and Its Connection to Uncertainty Quantification

Paul Patrone, Anthony Kearsley

Main category: cs.LG

TL;DR: The paper develops a level-set theory of classification that connects machine learning models to uncertainty quantification, showing that self-consistent ML models are equivalent to class-conditional probability distributions and deriving necessary conditions for valid probabilistic interpretations.

Details

Motivation: To address the black-box nature of ML models and the difficulty in quantifying confidence in predictions, by understanding how models are mathematical abstractions of training data and connecting this to uncertainty quantification.

Method: Develops a level-set theory of classification by analyzing binary Bayes optimal classifiers, parameterizing them in terms of prevalence, and deriving monotonicity and class-switching properties to deduce density ratios without direct access to boundary sets.

Result: Shows that Bayes classifiers satisfy important properties that enable constructing multiclass Bayes-optimal classifiers and estimating inherent uncertainty in class assignments. Derives normalization and self-consistency conditions equivalent to the law of total probability.

Conclusion: The analysis provides a framework for uncertainty quantification in ML through uncertainty propagation, establishing necessary conditions for ML models to have valid probabilistic interpretations and connecting classification theory to uncertainty quantification.

Abstract: Machine learning (ML) is often viewed as a powerful data analysis tool that is easy to learn because of its black-box nature. Yet this very nature also makes it difficult to quantify confidence in predictions extracted from ML models, and more fundamentally, to understand how such models are mathematical abstractions of training data. The goal of this paper is to unravel these issues and their connections to uncertainty quantification (UQ) by pursuing a line of reasoning motivated by diagnostics. In such settings, prevalence - i.e. the fraction of elements in class - is often of inherent interest. Here we analyze the many interpretations of prevalence to derive a level-set theory of classification, which shows that certain types of self-consistent ML models are equivalent to class-conditional probability distributions. We begin by studying the properties of binary Bayes optimal classifiers, recognizing that their boundary sets can be reinterpreted as level-sets of pairwise density ratios. By parameterizing Bayes classifiers in terms of the prevalence, we then show that they satisfy important monotonicity and class-switching properties that can be used to deduce the density ratios without direct access to the boundary sets. Moreover, this information is sufficient for tasks such as constructing the multiclass Bayes-optimal classifier and estimating inherent uncertainty in the class assignments. In the multiclass case, we use these results to deduce normalization and self-consistency conditions, the latter being equivalent to the law of total probability for classifiers. We also show that these are necessary conditions for arbitrary ML models to have valid probabilistic interpretations. Throughout we demonstrate how this analysis informs the broader task of UQ for ML via an uncertainty propagation framework.

[1588] GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping

Qifu Wen, Xi Zeng, Zihan Zhou, Shuaijun Liu, Mehdi Hosseinzadeh, Ningxin Su, Reza Rawassizadeh

Main category: cs.LG

TL;DR: GradES is a gradient-based early stopping method that monitors gradient changes in transformer components during fine-tuning, allowing individual projection matrices to stop updating when converged, eliminating validation passes and speeding up training while improving accuracy.

Details

Motivation: Traditional early stopping requires costly validation inference for large transformers. Different transformer components converge at varying rates during fine-tuning, suggesting potential for more efficient stopping strategies.

Method: Track magnitude of gradient changes in backpropagation for attention projections and Feed-Forward layer matrices. When a projection matrix’s gradient changes fall below threshold τ, exclude that matrix from further updates individually while allowing others to continue learning.

Result: Speeds up training time by 1.57-7.22× while improving generalization. Achieves 1.2% higher average accuracy in language tasks and 3.88% on multimodal benchmarks compared to traditional early stopping.

Conclusion: GradES provides an efficient alternative to traditional early stopping by leveraging component-wise convergence patterns, eliminating validation overhead while preventing overfitting and improving performance.

Abstract: Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose \textit{GradES}, a novel gradient-based early stopping approach that operates within transformer components (attention projections and Feed-Forward layer matrices). We found that different components converge at varying rates during fine-tuning for both language and vision-language models. \textit{GradES} tracks the magnitude of gradient changes in backpropagation for these matrices during training. When a projection matrix’s magnitude of gradient changes fall below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slow converging matrices to continue learning. \textit{GradES} speeds up training time by 1.57–7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy in language tasks and 3.88% on multimodal benchmarks.

[1589] Merging Memory and Space: A State Space Neural Operator

Nodens Koren, Samuel Lanthaler

Main category: cs.LG

TL;DR: SS-NO is a compact neural operator for time-dependent PDEs that extends state space models with adaptive damping and frequency modulation, achieving SOTA performance with fewer parameters.

Details

Motivation: To develop an efficient architecture for learning solution operators of time-dependent PDEs that can capture long-range dependencies while maintaining parameter efficiency.

Method: Extends structured state space models to joint spatiotemporal modeling with adaptive damping (stabilizes learning) and learnable frequency modulation (data-driven spectral selection). Includes theoretical connections between SSMs and neural operators.

Result: Achieves state-of-the-art performance on 1D Burgers’, Kuramoto-Sivashinsky equations, and 2D Navier-Stokes and compressible Euler flows with significantly fewer parameters. Factorized variant shows scalable performance on 2D problems.

Conclusion: SS-NO demonstrates effectiveness of damping and frequency learning in operator modeling, with lightweight factorization providing efficient large-scale PDE learning.

Abstract: We propose the State Space Neural Operator (SS-NO), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). Our formulation extends structured state space models (SSMs) to joint spatiotemporal modeling, introducing two key mechanisms: adaptive damping, which stabilizes learning by localizing receptive fields, and learnable frequency modulation, which enables data-driven spectral selection. These components provide a unified framework for capturing long-range dependencies with parameter efficiency. Theoretically, we establish connections between SSMs and neural operators, proving a universality theorem for convolutional architectures with full field-of-view. Empirically, SS-NO achieves state-of-the-art performance across diverse PDE benchmarks-including 1D Burgers’ and Kuramoto-Sivashinsky equations, and 2D Navier-Stokes and compressible Euler flows-while using significantly fewer parameters than competing approaches. A factorized variant of SS-NO further demonstrates scalable performance on challenging 2D problems. Our results highlight the effectiveness of damping and frequency learning in operator modeling, while showing that lightweight factorization provides a complementary path toward efficient large-scale PDE learning.

[1590] Differentially Private Clipped-SGD: High-Probability Convergence with Arbitrary Clipping Level

Saleh Vatan Khah, Savelii Chezhegov, Shahrokh Farahmand, Samuel Horváth, Eduard Gorbunov

Main category: cs.LG

TL;DR: First high-probability convergence analysis for DP-Clipped-SGD with fixed clipping level, showing faster convergence to optimal solution neighborhood under heavy-tailed noise while maintaining DP guarantees.

Details

Motivation: Existing gradient clipping analyses require increasing clipping thresholds, which conflicts with standard DP mechanisms like Gaussian mechanism that need fixed clipping levels.

Method: DP-Clipped-SGD with fixed clipping level, analyzed for both convex and non-convex smooth optimization under heavy-tailed noise with bounded central α-th moment (α ∈ (1,2]).

Result: Method converges to neighborhood of optimal solution with faster rate than existing approaches, with neighborhood size balanced against DP noise for refined privacy-convergence trade-off.

Conclusion: Fixed clipping level enables practical DP gradient clipping with improved convergence rates while maintaining privacy guarantees, bridging gap between theoretical analysis and practical DP mechanisms.

Abstract: Gradient clipping is a fundamental tool in Deep Learning, improving the high-probability convergence of stochastic first-order methods like SGD, AdaGrad, and Adam under heavy-tailed noise, which is common in training large language models. It is also a crucial component of Differential Privacy (DP) mechanisms. However, existing high-probability convergence analyses typically require the clipping threshold to increase with the number of optimization steps, which is incompatible with standard DP mechanisms like the Gaussian mechanism. In this work, we close this gap by providing the first high-probability convergence analysis for DP-Clipped-SGD with a fixed clipping level, applicable to both convex and non-convex smooth optimization under heavy-tailed noise, characterized by a bounded central $\alpha$-th moment assumption, $\alpha \in (1,2]$. Our results show that, with a fixed clipping level, the method converges to a neighborhood of the optimal solution with a faster rate than the existing ones. The neighborhood can be balanced against the noise introduced by DP, providing a refined trade-off between convergence speed and privacy guarantees.

[1591] Signals, Concepts, and Laws: Toward Universal, Explainable Time-Series Forecasting

Hongwei Ma, Junbin Gao, Minh-Ngoc Tran

Main category: cs.LG

TL;DR: DORIC is a domain-universal transformer for time-series forecasting that uses five self-supervised concepts and enforces first-principles constraints through ODE regularization.

Details

Motivation: Addressing the challenge of accurate, explainable, and physically credible forecasting for multivariate time-series with varying statistical properties across domains.

Method: Uses a Domain-Universal, ODE-Regularized, Interpretable-Concept Transformer with five self-supervised, domain-agnostic concepts and enforces differentiable residuals through first-principles constraints.

Result: Not specified in the abstract.

Conclusion: Not specified in the abstract.

Abstract: Accurate, explainable and physically credible forecasting remains a persistent challenge for multivariate time-series whose statistical properties vary across domains. We propose DORIC, a Domain-Universal, ODE-Regularized, Interpretable-Concept Transformer for Time-Series Forecasting that generates predictions through five self-supervised, domain-agnostic concepts while enforcing differentiable residuals grounded in first-principles constraints.

[1592] Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation

Hongze Sun, Wuque Cai, Duo Chen, Quan Tang, Shifeng Mao, Jiayi He, Zhenxing Wang, Yan Cui, Dezhong Yao, Daqing Guo

Main category: cs.LG

TL;DR: This paper proposes pruning and compensation strategies to create lightweight spiking Transformer models that reduce parameters and computational costs while maintaining performance.

Details

Motivation: Existing spiking Transformer models require substantial parameters and high computational costs, limiting deployment in resource-constrained environments.

Method: Combines synapse pruning (unstructured L1P for sparsity and structured DSP for low-rank) with synergistic learning-based compensation using enhanced sLIF neurons that learn through synaptic and intrinsic plasticity mechanisms.

Result: Extensive experiments show significant reduction in model size and computational overhead while maintaining competitive performance on benchmark datasets.

Conclusion: The proposed pruning and compensation strategies effectively construct efficient and high-performing spiking Transformer-based models.

Abstract: As a foundational architecture of artificial intelligence models, Transformer has been recently adapted to spiking neural networks with promising performance across various tasks. However, existing spiking Transformer~(ST)-based models require a substantial number of parameters and incur high computational costs, thus limiting their deployment in resource-constrained environments. To address these challenges, we propose combining synapse pruning with a synergistic learning-based compensation strategy to derive lightweight ST-based models. Specifically, two types of tailored pruning strategies are introduced to reduce redundancy in the weight matrices of ST blocks: an unstructured $\mathrm{L_{1}P}$ method to induce sparse representations, and a structured DSP method to induce low-rank representations. In addition, we propose an enhanced spiking neuron model, termed the synergistic leaky integrate-and-fire (sLIF) neuron, to effectively compensate for model pruning through synergistic learning between synaptic and intrinsic plasticity mechanisms. Extensive experiments on benchmark datasets demonstrate that the proposed methods significantly reduce model size and computational overhead while maintaining competitive performance. These results validate the effectiveness of the proposed pruning and compensation strategies in constructing efficient and high-performing ST-based models.

[1593] Semantic-Enhanced Time-Series Forecasting via Large Language Models

Hao Liu, Chun Yang, Zhang xiaoxing, Xiaobin Zhu

Main category: cs.LG

TL;DR: SE-LLM enhances LLMs for time series forecasting by embedding periodicity and anomaly characteristics into semantic space, and adds a plugin module to capture both long-term and short-term dependencies while reducing computational costs.

Details

Motivation: Existing LLM approaches for time series forecasting focus on token-level alignment but fail to bridge the intrinsic modality gap between linguistic knowledge and time series patterns, limiting semantic representation. Also, Transformers are good at long-range dependencies but weak at modeling short-term anomalies.

Method: Proposes Semantic-Enhanced LLM (SE-LLM) that embeds time series periodicity and anomalous characteristics into semantic space to enhance token embeddings. Also introduces a plugin module within self-attention to model both long-term and short-term dependencies while freezing the LLM and reducing token sequence dimensionality.

Result: Experiments demonstrate superior performance against state-of-the-art methods.

Conclusion: SE-LLM effectively bridges the modality gap between linguistic knowledge and time series patterns, activates LLMs’ potential for temporal analysis, and achieves better forecasting performance with reduced computational consumption.

Abstract: Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modal alignment, instead of bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, greatly limiting the semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that explores the inherent periodicity and anomalous characteristics of time series to embed into the semantic space to enhance the token embedding. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational consumption. Experiments demonstrate the superiority performance of our SE-LLM against the state-of-the-art (SOTA) methods.

[1594] CC-Time: Cross-Model and Cross-Modality Time Series Forecasting

Peng Chen, Yihang Wang, Yang Shu, Yunyao Cheng, Kai Zhao, Zhongwen Rao, Lujia Pan, Bin Yang, Chenjuan Guo

Main category: cs.LG

TL;DR: CC-Time is a novel approach that leverages pre-trained language models (PLMs) for time series forecasting by incorporating cross-modality learning and cross-model fusion to improve prediction accuracy.

Details

Motivation: Current PLM-based time series forecasting methods fail to achieve satisfactory prediction accuracy despite the strong sequential modeling power of language models.

Method: CC-Time uses cross-modality learning to model temporal dependency and channel correlations from both time series sequences and text descriptions, and cross-model fusion to integrate knowledge from PLMs and time series models.

Result: Extensive experiments on nine real-world datasets show CC-Time achieves state-of-the-art prediction accuracy in both full-data training and few-shot learning scenarios.

Conclusion: The proposed CC-Time framework successfully bridges the gap between PLMs and time series forecasting by addressing what time series features PLMs can model and whether PLMs alone are sufficient for time series modeling.

Abstract: With the success of pre-trained language models (PLMs) in various application fields beyond natural language processing, language models have raised emerging attention in the field of time series forecasting (TSF) and have shown great prospects. However, current PLM-based TSF methods still fail to achieve satisfactory prediction accuracy matching the strong sequential modeling power of language models. To address this issue, we propose Cross-Model and Cross-Modality Learning with PLMs for time series forecasting (CC-Time). We explore the potential of PLMs for time series forecasting from two aspects: 1) what time series features could be modeled by PLMs, and 2) whether relying solely on PLMs is sufficient for building time series models. In the first aspect, CC-Time incorporates cross-modality learning to model temporal dependency and channel correlations in the language model from both time series sequences and their corresponding text descriptions. In the second aspect, CC-Time further proposes the cross-model fusion block to adaptively integrate knowledge from the PLMs and time series model to form a more comprehensive modeling of time series patterns. Extensive experiments on nine real-world datasets demonstrate that CC-Time achieves state-of-the-art prediction accuracy in both full-data training and few-shot learning situations.

[1595] DHG-Bench: A Comprehensive Benchmark for Deep Hypergraph Learning

Fan Li, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, Xuemin Lin

Main category: cs.LG

TL;DR: DHG-Bench is the first comprehensive benchmark for Hypergraph Neural Networks (HNNs), evaluating 17 state-of-the-art algorithms across 22 datasets on node-, edge-, and graph-level tasks under unified settings.

Details

Motivation: Existing deep graph models focus on pairwise relationships but cannot capture higher-order interactions in real-world systems. While HNNs address this, inconsistent experimental protocols and limited evaluation in current toolkits hinder deeper understanding and development of HNN research.

Method: Systematically evaluates HNNs across four dimensions: effectiveness, efficiency, robustness, and fairness. Uses 17 state-of-the-art HNN algorithms on 22 diverse datasets spanning different task levels under unified experimental settings.

Result: Extensive experiments reveal both strengths and limitations of existing HNN algorithms, providing valuable insights for future research. The benchmark includes a comprehensive library for reproducible research.

Conclusion: DHG-Bench fills the gap in comprehensive HNN evaluation and provides a standardized benchmark to advance hypergraph learning research, with an open-source library available for community use.

Abstract: Deep graph models have achieved great success in network representation learning. However, their focus on pairwise relationships restricts their ability to learn pervasive higher-order interactions in real-world systems, which can be naturally modeled as hypergraphs. To tackle this issue, Hypergraph Neural Networks (HNNs) have garnered substantial attention in recent years. Despite the proposal of numerous HNNs, the absence of consistent experimental protocols and multi-dimensional empirical analysis impedes deeper understanding and further development of HNN research. While several toolkits for deep hypergraph learning (DHGL) have been introduced to facilitate algorithm evaluation, they provide only limited quantitative evaluation results and insufficient coverage of advanced algorithms, datasets, and benchmark tasks. To fill the gap, we introduce DHG-Bench, the first comprehensive benchmark for HNNs. Specifically, DHG-Bench systematically investigates the characteristics of HNNs in terms of four dimensions: effectiveness, efficiency, robustness, and fairness. We comprehensively evaluate 17 state-of-the-art HNN algorithms on 22 diverse datasets spanning node-, edge-, and graph-level tasks, under unified experimental settings. Extensive experiments reveal both the strengths and limitations of existing algorithms, offering valuable insights and directions for future research. Furthermore, to facilitate reproducible research, we have developed an easy-to-use library for training and evaluating different HNN methods. The DHG-Bench library is available at: https://github.com/Coco-Hut/DHG-Bench.

[1596] Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication

Maysam Behmanesh, Erkan Turan, Maks Ovsjanikov

Main category: cs.LG

TL;DR: A novel graph alignment framework that enhances node distinctiveness and enforces geometric consistency across latent spaces, outperforming existing unsupervised methods and generalizing to vision-language domains.

Details

Motivation: Existing unsupervised graph alignment methods suffer from node distinctiveness degradation due to GNN oversmoothing and latent space misalignment caused by structural noise, feature heterogeneity, and training instability.

Method: Dual-pass encoder combining low-pass and high-pass spectral filters for structure-aware discriminative embeddings, plus geometry-aware functional map module for bijective and isometric transformations between graph embeddings.

Result: Consistently outperforms existing unsupervised alignment baselines on graph benchmarks, with superior robustness to structural inconsistencies and challenging alignment scenarios. Also generalizes effectively to vision-language benchmarks.

Conclusion: The proposed framework successfully addresses key limitations in unsupervised graph alignment by simultaneously enhancing node distinctiveness and enforcing geometric consistency, demonstrating broad applicability across domains.

Abstract: Graph alignment, the problem of identifying corresponding nodes across multiple graphs, is fundamental to numerous applications. Most existing unsupervised methods embed node features into latent representations to enable cross-graph comparison without ground-truth correspondences. However, these methods suffer from two critical limitations: the degradation of node distinctiveness due to oversmoothing in GNN-based embeddings, and the misalignment of latent spaces across graphs caused by structural noise, feature heterogeneity, and training instability, ultimately leading to unreliable node correspondences. We propose a novel graph alignment framework that simultaneously enhances node distinctiveness and enforces geometric consistency across latent spaces. Our approach introduces a dual-pass encoder that combines low-pass and high-pass spectral filters to generate embeddings that are both structure-aware and highly discriminative. To address latent space misalignment, we incorporate a geometry-aware functional map module that learns bijective and isometric transformations between graph embeddings, ensuring consistent geometric relationships across different representations. Extensive experiments on graph benchmarks demonstrate that our method consistently outperforms existing unsupervised alignment baselines, exhibiting superior robustness to structural inconsistencies and challenging alignment scenarios. Additionally, comprehensive evaluation on vision-language benchmarks using diverse pretrained models shows that our framework effectively generalizes beyond graph domains, enabling unsupervised alignment of vision and language representations.

[1597] Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning

Yue Pei, Hongming Zhang, Chao Gao, Martin Müller, Mengxiao Zhu, Hao Sheng, Ziliang Chen, Liang Lin, Haogang Zhu

Main category: cs.LG

TL;DR: Doctor is a novel offline RL approach that combines supervised learning and temporal difference learning with a double-check mechanism to ensure better alignment between predicted actions and desired target returns.

Details

Motivation: Existing RvS-based transformers struggle to reliably align actual achieved returns with specified target returns, especially for underrepresented or extrapolated returns, while real-world applications need precise control over policy performance levels.

Method: Doctor jointly optimizes action prediction and value estimation, using a double-check mechanism where actions are sampled around desired target returns and then validated with value functions during inference.

Result: Doctor demonstrates stronger performance and tunable expertise on D4RL and EpiCare benchmarks, showing effectiveness across various tasks with improved return alignment.

Conclusion: The proposed Doctor approach successfully addresses the return alignment problem in offline RL by integrating SL and TD learning with a double-check mechanism, enabling more accurate control over policy performance levels.

Abstract: Offline reinforcement learning (RL) has achieved significant advances in domains such as robotic control, autonomous driving, and medical decision-making. Most existing methods primarily focus on training policies that maximize cumulative returns from a given dataset. However, many real-world applications require precise control over policy performance levels, rather than simply pursuing the best possible return. Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task, enabling the extraction of diverse policies by conditioning on different desired returns. Yet, existing RvS-based transformers, such as Decision Transformer (DT), struggle to reliably align the actual achieved returns with specified target returns, especially when interpolating within underrepresented returns or extrapolating beyond the dataset. To address this limitation, we propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL. Doctor integrates the strengths of supervised learning (SL) and temporal difference (TD) learning by jointly optimizing the action prediction and value estimation. During inference, Doctor introduces a double-check mechanism: actions are first sampled around the desired target returns and then validated with value functions. This ensures more accurate alignment between predicted actions and desired target returns. We evaluate Doctor on the D4RL and EpiCare benchmarks, demonstrating aligned control yields stronger performance and tunable expertise, showing its effectiveness in a wide range of tasks.

[1598] Learning to Route: Per-Sample Adaptive Routing for Multimodal Multitask Prediction

Marzieh Ajirak, Oded Bein, Ellen Rose Bowen, Dora Kanellopoulos, Avital Falk, Faith M. Gunning, Nili Solomonov, Logan Grosenick

Main category: cs.LG

TL;DR: A unified framework for adaptive routing in multitask, multimodal prediction that dynamically selects modality processing pathways and task-sharing strategies per sample, particularly for psychotherapy applications with structured assessments and unstructured notes.

Details

Motivation: Address data heterogeneity and task interactions in psychotherapy settings where structured assessments and unstructured clinician notes coexist with missing data and correlated outcomes, enabling personalized healthcare.

Method: Routing-based architecture with multiple modality paths (raw and fused text/numeric representations) that learns to route inputs through optimal expert combinations, with task-specific predictions from shared/independent heads trained end-to-end.

Result: Outperforms fixed multitask or single-task baselines on synthetic and real-world psychotherapy data for depression/anxiety prediction, with learned routing providing interpretable insights into modality relevance and task structure.

Conclusion: Enables per-subject adaptive information processing for personalized healthcare, potentially improving mental health outcomes, treatment precision, and cost-effectiveness through personalized intervention strategies.

Abstract: We propose a unified framework for adaptive routing in multitask, multimodal prediction settings where data heterogeneity and task interactions vary across samples. Motivated by applications in psychotherapy where structured assessments and unstructured clinician notes coexist with partially missing data and correlated outcomes, we introduce a routing-based architecture that dynamically selects modality processing pathways and task-sharing strategies on a per-sample basis. Our model defines multiple modality paths, including raw and fused representations of text and numeric features and learns to route each input through the most informative expert combination. Task-specific predictions are produced by shared or independent heads depending on the routing decision, and the entire system is trained end-to-end. We evaluate the model on both synthetic data and real-world psychotherapy notes predicting depression and anxiety outcomes. Our experiments show that our method consistently outperforms fixed multitask or single-task baselines, and that the learned routing policy provides interpretable insights into modality relevance and task structure. This addresses critical challenges in personalized healthcare by enabling per-subject adaptive information processing that accounts for data heterogeneity and task correlations. Applied to psychotherapy, this framework could improve mental health outcomes, enhance treatment assignment precision, and increase clinical cost-effectiveness through personalized intervention strategies.

[1599] ChartMaster: Advancing Chart-to-Code Generation with Real-World Charts and Chart Similarity Reinforcement Learning

Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, Xiaodong He

Main category: cs.LG

TL;DR: ChartMaster addresses chart-to-code generation challenges with ReChartPrompt dataset using real arXiv charts and ChartSimRL reinforcement learning with visual similarity rewards, achieving SOTA results comparable to GPT-4o.

Details

Motivation: Existing chart-to-code generation faces limited data diversity from synthetic datasets and difficulty maintaining visual consistency between generated and original charts.

Method: Proposes ReChartPrompt dataset using real-world arXiv charts and ChartSimRL RL algorithm with multimodal chart similarity reward combining attribute and visual similarity.

Result: ChartMaster achieves state-of-the-art results among 7B-parameter models and rivals GPT-4o on various chart-to-code benchmarks.

Conclusion: The combination of diverse real-world chart data and multimodal visual similarity rewards significantly improves chart-to-code generation performance and visual consistency.

Abstract: The chart-to-code generation task requires MLLMs to convert chart images into executable code. This task faces two main challenges: limited data diversity and the difficulty of maintaining visual consistency between generated charts and the original ones. Existing datasets mainly rely on synthetic seed data to prompt GPT models for code generation, resulting in homogeneous samples that limit model generalization to real-world chart styles. To address this, we propose ReChartPrompt, leveraging real-world, human-designed charts extracted from arXiv papers as prompts. By harnessing the rich content and diverse visual styles of arXiv charts, we construct ReChartPrompt-240K, a large-scale and highly diverse dataset that better reflects realistic chart variations. For the second challenge, although SFT improves code understanding by optimizing next-token prediction, it does not provide direct supervision on visual features. As a result, it often fails to guarantee that the generated charts visually match the original ones. To address this, we propose ChartSimRL, a GRPO-based reinforcement learning algorithm guided by a novel chart similarity reward. This reward consists of two components: attribute similarity, which measures the overlap of chart attributes like layout and color between the generated and original charts, and visual similarity, which evaluates overall visual features, including texture, using convolutional neural networks. Unlike traditional text-based rewards, our reward accounts for the multimodal nature of the chart-to-code generation task, significantly enhancing the model’s ability to accurately reproduce charts. Integrating ReChartPrompt and ChartSimRL, we develop the ChartMaster model, achieving SOTA results among 7B-parameter models and rivaling GPT-4o on various chart-to-code benchmarks. All resources are available at https://github.com/WentaoTan/ChartMaster.

[1600] Metis: Training Large Language Models with Advanced Low-Bit Quantization

Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang

Main category: cs.LG

TL;DR: Metis is a spectral-domain quantization framework that enables robust 4-bit training of large language models by addressing anisotropic singular value spectra through partitioning and independent quantization, achieving near-BF16 performance with minimal overhead.

Details

Motivation: The fundamental barrier to low-bit training of LLMs is anisotropy in singular value spectra of parameters, activations, and gradients, which causes quantization bias and severe spectral distortion that degrades training performance.

Method: Metis partitions anisotropic spectra into narrower sub-distributions for independent quantization to reduce errors and preserve spectral structure. It leverages preservation via sparsely random sampling and random projection to minimize decomposition overhead.

Result: On LLaMA-3 8B trained with 100B tokens, Metis enables W4A4G4 training with FP4 quantization, yielding only 0.4% training loss gap and 0.1% degradation in downstream accuracy relative to BF16. It surpasses Nvidia’s FP4 recipe with lower computational overhead.

Conclusion: Metis successfully addresses the fundamental challenge of anisotropic spectra in low-bit LLM training, enabling robust 4-bit quantization with minimal performance degradation and computational overhead compared to full-precision training.

Abstract: This work identifies anisotropy in the singular value spectra of parameters, activations, and gradients as the fundamental barrier to low-bit training of large language models (LLMs). These spectra are dominated by a small fraction of large singular values, inducing wide numerical ranges that cause quantization bias and severe spectral distortion, ultimately degrading training performance. This work presents Metis, a spectral-domain quantization framework that partitions anisotropic spectra into narrower sub-distributions for independent quantization, thereby reducing errors and preserving spectral structure. To minimize overhead, Metis leverages two key properties of the dominant spectral subspace: preservation via sparsely random sampling and preservation via random projection, reducing decomposition cost to a negligible level. On LLaMA-3 8B trained with 100B tokens, Metis enables robust W4A4G4 training with FP4 quantization of weights, activations, and gradients, yielding only a 0.4% training loss gap and a 0.1% degradation in downstream accuracy relative to BF16. Beyond matching BF16 fidelity, Metis also surpasses our implementation of Nvidia’s recently announced (yet to be publicly released) FP4 recipe, consistently achieving lower loss and higher downstream accuracy while incurring significantly lower computational overhead. The code implementation for Metis is available at: https://anonymous.4open.science/r/Metis-quantization-644B.

[1601] Differentiable Expectation-Maximisation and Applications to Gaussian Mixture Model Optimal Transport

Samuel Boïté, Eloi Tanguy, Julie Delon, Agnès Desolneux, Rémi Flamary

Main category: cs.LG

TL;DR: This paper presents and compares several differentiation strategies for the Expectation-Maximization (EM) algorithm, making it differentiable and enabling its integration into modern learning pipelines. The key application is using differentiable EM to compute the Mixture Wasserstein distance between Gaussian Mixture Models as a differentiable loss function.

Details

Motivation: The EM algorithm is widely used but typically treated as a non-differentiable black box, preventing its integration into modern learning pipelines that require end-to-end gradient propagation.

Method: The authors present and compare several differentiation strategies for EM, from full automatic differentiation to approximate methods. They apply this differentiable EM to compute the Mixture Wasserstein distance between Gaussian Mixture Models and introduce a novel unbalanced variant of this distance.

Result: Numerical experiments on barycentre computation, colour and style transfer, image generation, and texture synthesis demonstrate the versatility of the proposed approach across different settings.

Conclusion: The work enables EM to be used as a differentiable component in modern learning pipelines and provides theoretical justification for using Mixture Wasserstein distance with EM, along with practical applications in various imaging and machine learning tasks.

Abstract: The Expectation-Maximisation (EM) algorithm is a central tool in statistics and machine learning, widely used for latent-variable models such as Gaussian Mixture Models (GMMs). Despite its ubiquity, EM is typically treated as a non-differentiable black box, preventing its integration into modern learning pipelines where end-to-end gradient propagation is essential. In this work, we present and compare several differentiation strategies for EM, from full automatic differentiation to approximate methods, assessing their accuracy and computational efficiency. As a key application, we leverage this differentiable EM in the computation of the Mixture Wasserstein distance $\mathrm{MW}_2$ between GMMs, allowing $\mathrm{MW}_2$ to be used as a differentiable loss in imaging and machine learning tasks. To complement our practical use of $\mathrm{MW}_2$, we contribute a novel stability result which provides theoretical justification for the use of $\mathrm{MW}_2$ with EM, and also introduce a novel unbalanced variant of $\mathrm{MW}_2$. Numerical experiments on barycentre computation, colour and style transfer, image generation, and texture synthesis illustrate the versatility of the proposed approach in different settings.

[1602] Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

Viacheslav Sinii, Nikita Balagansky, Gleb Gerasimov, Daniil Laptev, Yaroslav Aksenov, Vadim Kurochkin, Alexey Gorbatovski, Boris Shaposhnikov, Daniil Gavrilov

Main category: cs.LG

TL;DR: Lightweight steering vectors trained with RL can match full fine-tuning performance while preserving interpretability. These vectors operate through different mechanisms across layers: last-layer acts as token bias, penultimate-layer uses MLP/unembedding, and middle layers filter non-English tokens.

Details

Motivation: To understand how reasoning training reshapes LLMs' internal computations and mechanisms behind lightweight steering vectors that can match fine-tuning performance.

Method: Used lightweight steering vectors inserted into residual stream, trained with reinforcement learning objective. Analyzed with logit-lens readouts and path-patching on two models. Also used SAE (sparse autoencoder) to isolate features.

Result: Steering vectors transfer across models, combine across layers, and concentrate on meaningful prompt segments. They match full fine-tuning performance while enabling interpretable analysis of internal computations.

Conclusion: These findings deepen understanding of how trained steering vectors shape computation and should inform future work in activation engineering and reasoning model studies.

Abstract: The mechanisms by which reasoning training reshapes LLMs’ internal computations remain unclear. We study lightweight steering vectors inserted into the base model’s residual stream and trained with a reinforcement-learning objective. These vectors match full fine-tuning performance while preserving the interpretability of small, additive interventions. Using logit-lens readouts and path-patching analyses on two models, we find that (i) the last-layer steering vector acts like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as “To” and “Step”; (ii) the penultimate-layer vector leaves attention patterns largely intact and instead operates through the MLP and unembedding, preferentially up-weighting process words and structure symbols; and (iii) middle layers de-emphasize non-English tokens. Next, we show that a SAE isolates features associated with correct generations. We also show that steering vectors (i) transfer to other models, (ii) combine across layers when trained in isolation, and (iii) concentrate magnitude on meaningful prompt segments under adaptive token-wise scaling. Taken together, these results deepen understanding of how trained steering vectors shape computation and should inform future work in activation engineering and the study of reasoning models.

[1603] Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng

Main category: cs.LG

TL;DR: AIGB-Pearl is a novel auto-bidding method that combines generative planning and policy optimization to overcome limitations of offline RL-based approaches by enabling safe exploration beyond static datasets.

Details

Motivation: Existing AI-Generated Bidding (AIGB) methods face performance bottlenecks due to their inability to explore beyond static offline datasets, limiting their effectiveness in real-world advertising systems.

Method: AIGB-Pearl integrates generative planning with policy optimization by constructing a trajectory evaluator to score generation quality and using a KL-Lipschitz-constrained score maximization scheme for safe generalization. It employs synchronous coupling technique to ensure model regularity.

Result: Extensive experiments on both simulated and real-world advertising systems demonstrate state-of-the-art performance, showing superior effectiveness compared to existing auto-bidding methods.

Conclusion: AIGB-Pearl successfully addresses the exploration limitation of offline AIGB methods through its novel integration of generative planning and policy optimization with provable safety guarantees, achieving superior performance in auto-bidding applications.

Abstract: Auto-bidding serves as a critical tool for advertisers to improve their advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static offline dataset. To address this, we propose {AIGB-Pearl} (\emph{{P}lanning with {E}valu{A}tor via RL}), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator for scoring generation quality and designing a provably sound KL-Lipschitz-constrained score maximization scheme to ensure safe and efficient generalization beyond the offline dataset. A practical algorithm incorporating the synchronous coupling technique is further devised to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.

[1604] Composable Score-based Graph Diffusion Model for Multi-Conditional Molecular Generation

Anjie Qiao, Zhen Wang, Chuan Chen, DeFu Lian, Enhong Chen

Main category: cs.LG

TL;DR: CSGD is a score-based graph diffusion model for controllable molecular generation that enables flexible multi-property control through composable guidance and probability calibration techniques.

Details

Motivation: Existing graph diffusion models struggle with multi-conditional molecular generation due to reliance on joint conditioning or continuous relaxations that compromise fidelity.

Method: Extends score matching to discrete graphs via concrete scores, introduces Composable Guidance (CoG) for fine-grained control over condition subsets, and Probability Calibration (PC) to mitigate train-test mismatches.

Result: Achieves state-of-the-art performance with 15.3% average improvement in controllability over prior methods while maintaining high validity and distributional fidelity across four molecular datasets.

Conclusion: Score-based modeling provides practical advantages for discrete graph generation and enables flexible, multi-property molecular design.

Abstract: Controllable molecular graph generation is essential for material and drug discovery, where generated molecules must satisfy diverse property constraints. While recent advances in graph diffusion models have improved generation quality, their effectiveness in multi-conditional settings remains limited due to reliance on joint conditioning or continuous relaxations that compromise fidelity. To address these limitations, we propose Composable Score-based Graph Diffusion model (CSGD), the first model that extends score matching to discrete graphs via concrete scores, enabling flexible and principled manipulation of conditional guidance. Building on this foundation, we introduce two score-based techniques: Composable Guidance (CoG), which allows fine-grained control over arbitrary subsets of conditions during sampling, and Probability Calibration (PC), which adjusts estimated transition probabilities to mitigate train-test mismatches. Empirical results on four molecular datasets show that CSGD achieves state-of-the-art performance, with a 15.3% average improvement in controllability over prior methods, while maintaining high validity and distributional fidelity. Our findings highlight the practical advantages of score-based modeling for discrete graph generation and its capacity for flexible, multi-property molecular design.

[1605] StefaLand: An Efficient Geoscience Foundation Model That Improves Dynamic Land-Surface Predictions

Nicholas Kraabel, Jiangtao Liu, Yuchen Bian, Daniel Kifer, Chaopeng Shen

Main category: cs.LG

TL;DR: StefaLand is a generative spatiotemporal earth foundation model that improves predictions for streamflow, soil moisture, and soil composition across diverse regions, outperforming state-of-the-art baselines while requiring less computational resources.

Details

Motivation: Traditional models struggle with spatial generalization due to limited observations and concept drift, while existing vision foundation models demand massive compute and are ill-suited for dynamic land-surface prediction.

Method: Uses a masked autoencoder backbone with location-aware architecture fusing static and time-series inputs, attribute-based representations to reduce compute, and residual fine-tuning adapters for enhanced transfer.

Result: Outperforms prior state-of-the-art on four tasks and five datasets, demonstrating ability to generalize across diverse, data-scarce regions and support broad land-surface applications.

Conclusion: StefaLand is the first geoscience land-surface foundation model that demonstrably improves dynamic land-surface interaction predictions and supports diverse downstream applications, achieving robust performance with academic-level compute.

Abstract: Stewarding natural resources, mitigating floods, droughts, wildfires, and landslides, and meeting growing demands require models that can predict climate-driven land-surface responses and human feedback with high accuracy. Traditional impact models, whether process-based, statistical, or machine learning, struggle with spatial generalization due to limited observations and concept drift. Recently proposed vision foundation models trained on satellite imagery demand massive compute and are ill-suited for dynamic land-surface prediction. We introduce StefaLand, a generative spatiotemporal earth foundation model centered on landscape interactions. StefaLand improves predictions on four tasks and five datasets: streamflow, soil moisture, and soil composition, compared to prior state-of-the-art. Results highlight its ability to generalize across diverse, data-scarce regions and support broad land-surface applications. The model builds on a masked autoencoder backbone that learns deep joint representations of landscape attributes, with a location-aware architecture fusing static and time-series inputs, attribute-based representations that drastically reduce compute, and residual fine-tuning adapters that enhance transfer. While inspired by prior methods, their alignment with geoscience and integration in one model enables robust performance on dynamic land-surface tasks. StefaLand can be pretrained and finetuned on academic compute yet outperforms state-of-the-art baselines and even fine-tuned vision foundation models. To our knowledge, this is the first geoscience land-surface foundation model that demonstrably improves dynamic land-surface interaction predictions and supports diverse downstream applications.

[1606] The Multi-Query Paradox in Zeroth-Order Optimization

Wei Lin, Qingyu Song, Hong Xu

Main category: cs.LG

TL;DR: This paper systematically analyzes the query allocation problem in zeroth-order optimization, showing that for simple averaging (ZO-Avg), single-query per iteration is optimal, while for the proposed Projection Alignment method (ZO-Align), full-subspace estimation with more queries is better.

Details

Motivation: Zeroth-order optimization faces a fundamental trade-off: under fixed query budget, queries per iteration and total iterations are inversely proportional. How to best allocate this budget between queries per iteration and number of iterations is an under-explored question.

Method: The paper analyzes two aggregation methods: simple averaging (ZO-Avg) and a new Projection Alignment method (ZO-Align) derived from local surrogate minimization. It derives convergence rates for both methods across strongly convex, convex, non-convex, and stochastic settings.

Result: A stark dichotomy is uncovered: ZO-Avg is query-inefficient with more than one query per iteration (single-query optimal), while ZO-Align performs better with more queries per iteration (full-subspace estimation optimal).

Conclusion: The multi-query problem reduces to choosing between two classic algorithms (single-query vs full-subspace), with the choice dictated entirely by the aggregation method used. Theoretical findings are validated by extensive experiments.

Abstract: Zeroth-order (ZO) optimization provides a powerful framework for problems where explicit gradients are unavailable and have to be approximated using only queries to function value. The prevalent single-query approach is simple, but suffers from high estimation variance, motivating a multi-query paradigm to improves estimation accuracy. This, however, creates a critical trade-off: under a fixed budget of queries (i.e. cost), queries per iteration and the total number of optimization iterations are inversely proportional to one another. How to best allocate this budget is a fundamental, under-explored question. This work systematically resolves this query allocation problem. We analyze two aggregation methods: the de facto simple averaging (ZO-Avg), and a new Projection Alignment method (ZO-Align) we derive from local surrogate minimization. By deriving convergence rates for both methods that make the dependence on the number of queries explicit across strongly convex, convex, non-convex, and stochastic settings, we uncover a stark dichotomy: For ZO-Avg, we prove that using more than one query per iteration is always query-inefficient, rendering the single-query approach optimal. On the contrary, ZO-Align generally performs better with more queries per iteration, resulting in a full-subspace estimation as the optimal approach. Thus, our work clarifies that the multi-query problem boils down to a choice not about an intermediate query size, but between two classic algorithms, a choice dictated entirely by the aggregation method used. These theoretical findings are also consistently validated by extensive experiments.

[1607] Joint Memory Frequency and Computing Frequency Scaling for Energy-efficient DNN Inference

Yunchu Han, Zhaojun Nan, Sheng Zhou, Zhisheng Niu

Main category: cs.LG

TL;DR: This paper proposes joint memory and computing frequency scaling to optimize DNN inference energy consumption on resource-constrained devices, addressing limitations of traditional DVFS approaches that focus only on computing frequency.

Details

Motivation: Traditional DVFS techniques only adjust computing frequency, ignoring memory frequency's significant impact on DNN inference time and energy consumption, leading to suboptimal performance on resource-constrained devices.

Method: The authors use a model-based and data-driven approach to investigate joint memory and computing frequency scaling effects, combined with fitting parameters from different DNN models, and validate through simulations in local and cooperative inference scenarios.

Result: Simulation results demonstrate that jointly scaling memory and computing frequency effectively reduces energy consumption in both local and cooperative DNN inference cases.

Conclusion: Joint memory and computing frequency scaling provides an effective approach for optimizing DNN inference energy efficiency on resource-constrained devices, outperforming traditional computing-only frequency scaling methods.

Abstract: Deep neural networks (DNNs) have been widely applied in diverse applications, but the problems of high latency and energy overhead are inevitable on resource-constrained devices. To address this challenge, most researchers focus on the dynamic voltage and frequency scaling (DVFS) technique to balance the latency and energy consumption by changing the computing frequency of processors. However, the adjustment of memory frequency is usually ignored and not fully utilized to achieve efficient DNN inference, which also plays a significant role in the inference time and energy consumption. In this paper, we first investigate the impact of joint memory frequency and computing frequency scaling on the inference time and energy consumption with a model-based and data-driven method. Then by combining with the fitting parameters of different DNN models, we give a preliminary analysis for the proposed model to see the effects of adjusting memory frequency and computing frequency simultaneously. Finally, simulation results in local inference and cooperative inference cases further validate the effectiveness of jointly scaling the memory frequency and computing frequency to reduce the energy consumption of devices.

[1608] RMT-KD: Random Matrix Theoretic Causal Knowledge Distillation

Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi

Main category: cs.LG

TL;DR: RMT-KD is a compression method using Random Matrix Theory for knowledge distillation to reduce network size while maintaining accuracy, achieving 80% parameter reduction with minimal performance loss.

Details

Motivation: Large deep learning models like BERT and ResNet are too computationally expensive to deploy at the edge due to their size and resource demands.

Method: Uses Random Matrix Theory to identify and preserve only informative directions in hidden representations through spectral analysis, applied layer by layer with self-distillation for stability.

Result: Achieves up to 80% parameter reduction with only 2% accuracy loss on GLUE, AG News, and CIFAR-10, delivering 2.8x faster inference and nearly halved power consumption.

Conclusion: RMT-KD establishes a mathematically grounded approach to network distillation that effectively compresses models for edge deployment.

Abstract: Large deep learning models such as BERT and ResNet achieve state-of-the-art performance but are costly to deploy at the edge due to their size and compute demands. We present RMT-KD, a compression method that leverages Random Matrix Theory (RMT) for knowledge distillation to iteratively reduce network size. Instead of pruning or heuristic rank selection, RMT-KD preserves only informative directions identified via the spectral properties of hidden representations. RMT-based causal reduction is applied layer by layer with self-distillation to maintain stability and accuracy. On GLUE, AG News, and CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy loss, delivering 2.8x faster inference and nearly halved power consumption. These results establish RMT-KD as a mathematically grounded approach to network distillation.

[1609] Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory

Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta

Main category: cs.LG

TL;DR: AI techniques (AlphaEvolve) improve combinatorial algorithm limits: near-optimal bounds for MAX-CUT/Independent Set on random graphs via extremal Ramanujan graphs, and better inapproximability results for MAX-k-CUT through new gadget reductions.

Details

Motivation: To explore whether AI can help discover new combinatorial structures that improve known limits on efficient algorithms, specifically for computational hardness problems.

Method: Use AlphaEvolve (LLM coding agent) to construct extremal Ramanujan graphs for improved lower bounds and discover new gadget reductions for inapproximability results. Also use AlphaEvolve to evolve faster verification procedures.

Result: Improved near-optimal bounds for MAX-CUT/Independent Set certification on random graphs (163-node Ramanujan graphs), and better inapproximability factors: MAX-4-CUT 0.987 (vs 0.9883 SOTA) and MAX-3-CUT 0.9649 (vs 0.9853 gadget-based). Verification speed improved up to 10,000×.

Conclusion: AI (AlphaEvolve) successfully helps discover combinatorial structures improving algorithm limits, with key innovation being AI-evolved verification procedures to overcome exponential verification costs.

Abstract: We explore whether techniques from AI can help discover new combinatorial structures that improve on known limits on efficient algorithms. Specifically, we use AlphaEvolve (an LLM coding agent) to study two settings: a) Average-case hardness for MAX-CUT and MAX-Independent Set: We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as $163$ nodes, using AlphaEvolve. Additionally, via analytical arguments we strengthen the upper bounds to settle the computational hardness of these questions up to an error in the third decimal place. b) Worst-case Hardness of Approximation for MAX-k-CUT: We obtain new inapproximability results, proving that it is NP-hard to approximate MAX-4-CUT and MAX-3-CUT within factors of $0.987$ and $0.9649$ respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of $0.9883$, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of $0.9853$, but falls short of improving the SOTA of $16/17$ that relies on a custom PCP, rather than a gadget reduction from “standard” H{\aa}stad-style PCPs. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (often requiring exponential time). In both settings above, our results were enabled by using AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by $10,000\times$). We conclude with a discussion of norms by which to assess the assistance from AI in developing proofs.

[1610] EigenTrack: Spectral Activation Feature Tracking for Hallucination and Out-of-Distribution Detection in LLMs and VLMs

Davide Ettori, Nastaran Darabi, Sina Tayebati, Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo, Amit Ranjan Trivedi

Main category: cs.LG

TL;DR: EigenTrack is a real-time detector that uses spectral geometry of hidden activations to identify hallucination and out-of-distribution errors in LLMs before surface errors appear.

Details

Motivation: Large language models are prone to hallucination and OOD errors, but existing detection methods have limitations - black/grey-box methods require multiple passes, while white-box detectors lack temporal context and global signal aggregation.

Method: Uses spectral geometry of hidden activations as a global signature, streaming covariance-spectrum statistics (entropy, eigenvalue gaps, KL divergence) into a lightweight recurrent classifier to track temporal shifts in representation structure.

Result: The method can detect hallucination and OOD drift before surface errors appear, requires only a single forward pass without resampling, preserves temporal context, aggregates global signals, and offers interpretable accuracy-latency trade-offs.

Conclusion: EigenTrack provides an interpretable, efficient approach for real-time detection of LLM errors that outperforms existing methods by leveraging spectral geometry and temporal context.

Abstract: Large language models (LLMs) offer broad utility but remain prone to hallucination and out-of-distribution (OOD) errors. We propose EigenTrack, an interpretable real-time detector that uses the spectral geometry of hidden activations, a compact global signature of model dynamics. By streaming covariance-spectrum statistics such as entropy, eigenvalue gaps, and KL divergence from random baselines into a lightweight recurrent classifier, EigenTrack tracks temporal shifts in representation structure that signal hallucination and OOD drift before surface errors appear. Unlike black- and grey-box methods, it needs only a single forward pass without resampling. Unlike existing white-box detectors, it preserves temporal context, aggregates global signals, and offers interpretable accuracy-latency trade-offs.

[1611] Self-Evolving LLMs via Continual Instruction Tuning

Jiazheng Kang, Le Huang, Cheng Hou, Zhe Zhao, Zhenxiang Yan, Chuan Shi, Ting Bai

Main category: cs.LG

TL;DR: MoE-CL is a parameter-efficient adversarial mixture-of-experts framework for continual instruction tuning of LLMs that uses dedicated LoRA experts per task and a shared LoRA expert with task-aware discriminator to balance knowledge retention and cross-task generalization.

Details

Motivation: Real-world industrial LLMs need continual learning to adapt to evolving tasks, but existing approaches suffer from catastrophic forgetting where training on new tasks degrades performance on earlier ones.

Method: Dual-expert design: dedicated LoRA expert per task for task-specific knowledge preservation, and shared LoRA expert for cross-task transfer with GAN-based task-aware discriminator to filter task-irrelevant noise.

Result: Extensive experiments on MTL5 and Tencent3 benchmarks show effectiveness, with real-world A/B testing on Tencent Video platform reducing manual review costs by 15.3%.

Conclusion: MoE-CL is practical for large-scale industrial deployment where continual adaptation and stable transfer are critical, supporting self-evolution of LLMs.

Abstract: In real-world industrial settings, large language models (LLMs) must learn continually to keep pace with diverse and evolving tasks, requiring self-evolution to refine knowledge under dynamic data distributions. However, existing continual learning (CL) approaches, such as replay and parameter isolation, often suffer from catastrophic forgetting: training on new tasks degrades performance on earlier ones by overfitting to the new distribution and weakening generalization.We propose MoE-CL, a parameter-efficient adversarial mixture-of-experts framework for industrial-scale, self-evolving continual instruction tuning of LLMs. MoE-CL uses a dual-expert design: (1) a dedicated LoRA expert per task to preserve task-specific knowledge via parameter independence, mitigating forgetting; and (2) a shared LoRA expert to enable cross-task transfer. To prevent transferring task-irrelevant noise through the shared pathway, we integrate a task-aware discriminator within a GAN. The discriminator encourages the shared expert to pass only task-aligned information during sequential training. Through adversarial learning, the shared expert acquires generalized representations that mimic the discriminator, while dedicated experts retain task-specific details, balancing knowledge retention and cross-task generalization and thereby supporting self-evolution.Extensive experiments on the public MTL5 benchmark and an industrial Tencent3 benchmark validate the effectiveness of MoE-CL for continual instruction tuning. In real-world A/B testing for content compliance review on the Tencent Video platform, MoE-CL reduced manual review costs by 15.3%. These results demonstrate that MoE-CL is practical for large-scale industrial deployment where continual adaptation and stable transfer are critical.

[1612] Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media

M. Giselle Fernández-Godino, Meir H. Shachar, Kevin Korner, Jonathan L. Belof, Mukul Kumar, Jonathan Lind, William J. Schill

Main category: cs.LG

TL;DR: A multi-field deep learning model (MSTM) that predicts shock wave dynamics in porous materials with high accuracy and significant speedup, enabling practical design optimization for planetary defense and fusion energy.

Details

Motivation: Predicting shock wave behavior in porous materials is crucial for planetary defense and inertial fusion energy, but capturing complex phenomena like pore collapse and localized heating has remained challenging despite recent advances.

Method: Developed a multi-field spatio-temporal deep learning model (MSTM) that unifies seven coupled fields (pressure, density, temperature, energy, material distribution, and two velocity components) into a single autoregressive surrogate trained on high-fidelity hydrocode data.

Result: MSTM captures nonlinear shock-driven dynamics across porous and architected configurations with mean errors of 1.4% and 3.2% respectively, while delivering over three orders of magnitude speedup compared to traditional methods.

Conclusion: This approach transforms previously intractable problems into tractable design studies, establishing a practical framework for optimizing meso-structured materials in planetary impact mitigation and inertial fusion energy applications.

Abstract: The ability to predict how shock waves traverse porous and architected materials is a key challenge in planetary defense and in the pursuit of inertial fusion energy. Yet capturing pore collapse, anomalous Hugoniot responses, and localized heating-phenomena that strongly influence asteroid deflection or fusion ignition has remained a major challenge despite recent advances in single-field and reduced representations. We introduce a multi-field spatio-temporal deep learning model (MSTM) that unifies seven coupled fields-pressure, density, temperature, energy, material distribution, and two velocity components into a single autoregressive surrogate. Trained on high-fidelity hydrocode data, MSTM captures nonlinear shock-driven dynamics across porous and architected configurations, achieving mean errors of 1.4% and 3.2% respectively, all while delivering over three orders of magnitude in speedup. This advance transforms problems once considered intractable into tractable design studies, establishing a practical framework for optimizing meso-structured materials in planetary impact mitigation and inertial fusion energy.

[1613] APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, Kangrui Du, Jialian Wu, Ximeng Sun, Jiang Liu, Qiaolin Yu, Hao Chen, Zicheng Liu, Emad Barsoum

Main category: cs.LG

TL;DR: APRIL is a reinforcement learning optimization method that addresses the computational bottleneck of rollout generation by over-provisioning requests, terminating once target responses are reached, and recycling incomplete responses, improving throughput by 22.5% on average and final accuracy by 2.1%.

Details

Motivation: RL training is computationally expensive with rollout generation accounting for over 90% of runtime, and efficiency is constrained by long-tail distribution of rollout response lengths where lengthy responses stall batches and leave GPUs underutilized.

Method: APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps, ensuring no rollouts are discarded while reducing GPU idle time.

Result: APRIL improves rollout throughput by 22.5% on average (up to 44%) across RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves 2.1% on average (up to 8%) higher final accuracy across tasks.

Conclusion: APRIL unifies system-level and algorithmic considerations to advance RL training efficiency, is framework and hardware agnostic, and has been integrated into the slime RL framework for deployment on both NVIDIA and AMD GPUs.

Abstract: Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community’s growing RL needs, numerous RL frameworks have been proposed. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by 22.5% on average (at most 44%) across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves 2.1% on average(at most 8%) higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems. Our codebase is available at https://github.com/RLsys-Foundation/APRIL

[1614] VQEzy: An Open-Source Dataset for Parameter Initialization in Variational Quantum Eigensolvers

Chi Zhang, Mengxin Zheng, Qian Lou, Hui Min Leung, Fan Chen

Main category: cs.LG

TL;DR: VQEzy is a large-scale dataset for VQE parameter initialization that addresses limitations of existing resources by providing comprehensive coverage across multiple domains with complete optimization trajectories.

Details

Motivation: Existing VQE initialization methods are limited by small, domain-restricted datasets that lack comprehensive coverage of Hamiltonians, ansatz circuits, and optimization trajectories.

Method: Created VQEzy dataset spanning three major domains and seven representative tasks, comprising 12,110 instances with full VQE specifications and complete optimization trajectories.

Result: Successfully built the first large-scale dataset for VQE parameter initialization that provides comprehensive coverage and is available online for research use.

Conclusion: VQEzy addresses critical data limitations in VQE research and will be continuously refined to support future optimization studies.

Abstract: Variational Quantum Eigensolvers (VQEs) are a leading class of noisy intermediate-scale quantum (NISQ) algorithms, whose performance is highly sensitive to parameter initialization. Although recent machine learning-based initialization methods have achieved state-of-the-art performance, their progress has been limited by the lack of comprehensive datasets. Existing resources are typically restricted to a single domain, contain only a few hundred instances, and lack complete coverage of Hamiltonians, ansatz circuits, and optimization trajectories. To overcome these limitations, we introduce VQEzy, the first large-scale dataset for VQE parameter initialization. VQEzy spans three major domains and seven representative tasks, comprising 12,110 instances with full VQE specifications and complete optimization trajectories. The dataset is available online, and will be continuously refined and expanded to support future research in VQE optimization.

[1615] Remote Sensing-Oriented World Model

Yuxi Lu, Biao Wu, Zhidong Li, Kunqi Li, Chenya Huang, Huacan Wang, Qizhen Lan, Ronghao Chen, Ling Chen, Bin Liang

Main category: cs.LG

TL;DR: This paper introduces the first framework for world modeling in remote sensing, formulating it as direction-conditioned spatial extrapolation and creating the RSWISE benchmark for evaluation. They also present RemoteBAGEL, a multimodal model that outperforms state-of-the-art baselines.

Details

Motivation: Existing world modeling approaches are limited to synthetic or constrained environments, while remote sensing applications urgently need spatial reasoning capabilities for disaster response and urban planning. The paper aims to bridge this gap by bringing world modeling to remote sensing contexts.

Method: The paper formulates remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate adjacent image tiles given a central observation and directional instruction. They develop the RSWISE benchmark with 1,600 evaluation tasks across four scenarios, and present RemoteBAGEL - a unified multimodal model fine-tuned on remote sensing data.

Result: Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on the RSWISE benchmark, showing superior performance in spatial extrapolation tasks for remote sensing applications.

Conclusion: The paper successfully establishes the first framework for world modeling in remote sensing, providing both a rigorous evaluation benchmark (RSWISE) and an effective model (RemoteBAGEL) that advances spatial reasoning capabilities for real-world remote sensing applications.

Abstract: World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. However, existing approaches are predominantly evaluated in synthetic environments or constrained scene settings, limiting their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications urgently require spatial reasoning capabilities for disaster response and urban planning. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT-4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. Afterwards, we present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.

[1616] SimpleFold: Folding Proteins is Simpler than You Think

Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Josh Susskind, Miguel Angel Bautista

Main category: cs.LG

TL;DR: SimpleFold introduces a flow-matching based protein folding model using only general-purpose transformer blocks, achieving competitive performance without domain-specific architectural designs.

Details

Motivation: To challenge the necessity of domain-specific architectural designs in protein folding models and explore whether general-purpose transformers can achieve similar performance.

Method: Uses standard transformer blocks with adaptive layers, trained via generative flow-matching objective with structural term, scaled to 3B parameters on 9M distilled protein structures and PDB data.

Result: Achieves competitive performance on standard folding benchmarks, strong ensemble prediction capabilities, and efficient deployment on consumer hardware.

Conclusion: SimpleFold demonstrates that complex domain-specific architectures are not necessary for state-of-the-art protein folding, opening alternative design possibilities.

Abstract: Protein folding models have achieved groundbreaking results typically via a combination of integrating domain knowledge into the architectural blocks and training pipelines. Nonetheless, given the success of generative models across different but related problems, it is natural to question whether these architectural designs are a necessary condition to build performant models. In this paper, we introduce SimpleFold, the first flow-matching based protein folding model that solely uses general purpose transformer blocks. Protein folding models typically employ computationally expensive modules involving triangular updates, explicit pair representations or multiple training objectives curated for this specific domain. Instead, SimpleFold employs standard transformer blocks with adaptive layers and is trained via a generative flow-matching objective with an additional structural term. We scale SimpleFold to 3B parameters and train it on approximately 9M distilled protein structures together with experimental PDB data. On standard folding benchmarks, SimpleFold-3B achieves competitive performance compared to state-of-the-art baselines, in addition SimpleFold demonstrates strong performance in ensemble prediction which is typically difficult for models trained via deterministic reconstruction objectives. Due to its general-purpose architecture, SimpleFold shows efficiency in deployment and inference on consumer-level hardware. SimpleFold challenges the reliance on complex domain-specific architectures designs in protein folding, opening up an alternative design space for future progress.

[1617] Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding

Main category: cs.LG

TL;DR: Current LLM unlearning methods are vulnerable to relearning attacks that can recover supposedly erased knowledge. StableUN addresses this by seeking stable parameter regions through bi-level feedback-guided optimization.

Details

Motivation: Conventional unlearning methods create sharp minima in loss landscape, making erased knowledge recoverable through minimal fine-tuning. This exposes a critical gap between apparent unlearning and actual knowledge removal.

Method: StableUN uses bi-level feedback-guided optimization with neighborhood-aware optimization. It integrates forgetting feedback (using adversarial perturbations) with remembering feedback to preserve utility, aligned through gradient projection.

Result: Experiments on WMDP and MUSE benchmarks show StableUN is significantly more robust against relearning and jailbreaking attacks while maintaining competitive utility performance.

Conclusion: StableUN provides a more robust unlearning framework that addresses the fundamental security vulnerability in current LLM unlearning methods by seeking stable parameter regions.

Abstract: Current LLM unlearning methods face a critical security vulnerability that undermines their fundamental purpose: while they appear to successfully remove sensitive or harmful knowledge, this ``forgotten" information remains precariously recoverable through relearning attacks. We identify that the root cause is that conventional methods optimizing the forgetting loss at individual data points will drive model parameters toward sharp minima in the loss landscape. In these unstable regions, even minimal parameter perturbations can drastically alter the model’s behaviors. Consequently, relearning attacks exploit this vulnerability by using just a few fine-tuning samples to navigate the steep gradients surrounding these unstable regions, thereby rapidly recovering knowledge that was supposedly erased. This exposes a critical robustness gap between apparent unlearning and actual knowledge removal. To address this issue, we propose StableUN, a bi-level feedback-guided optimization framework that explicitly seeks more stable parameter regions via neighborhood-aware optimization. It integrates forgetting feedback, which uses adversarial perturbations to probe parameter neighborhoods, with remembering feedback to preserve model utility, aligning the two objectives through gradient projection. Experiments on WMDP and MUSE benchmarks demonstrate that our method is significantly more robust against both relearning and jailbreaking attacks while maintaining competitive utility performance.

[1618] Subspace Clustering of Subspaces: Unifying Canonical Correlation Analysis and Subspace Clustering

Paris A. Karakasis, Nicholas D. Sidiropoulos

Main category: cs.LG

TL;DR: A novel framework for clustering tall matrices by their column spaces using Block Term Decomposition, with superior performance on hyperspectral imaging data.

Details

Motivation: Traditional subspace clustering methods vectorize data, but many real-world applications have matrix-structured data where column spaces contain meaningful structure that should be preserved for clustering.

Method: Based on Block Term Decomposition of a third-order tensor constructed from input matrices, enabling joint estimation of cluster memberships and partially shared subspaces.

Result: Achieves superior clustering accuracy and robustness on hyperspectral imaging datasets, especially under high noise and interference compared to existing methods.

Conclusion: The framework shows strong potential for high-dimensional applications where structure exists beyond individual data vectors, preserving matrix structure leads to better clustering performance.

Abstract: We introduce a novel framework for clustering a collection of tall matrices based on their column spaces, a problem we term Subspace Clustering of Subspaces (SCoS). Unlike traditional subspace clustering methods that assume vectorized data, our formulation directly models each data sample as a matrix and clusters them according to their underlying subspaces. We establish conceptual links to Subspace Clustering and Generalized Canonical Correlation Analysis (GCCA), and clarify key differences that arise in this more general setting. Our approach is based on a Block Term Decomposition (BTD) of a third-order tensor constructed from the input matrices, enabling joint estimation of cluster memberships and partially shared subspaces. We provide the first identifiability results for this formulation and propose scalable optimization algorithms tailored to large datasets. Experiments on real-world hyperspectral imaging datasets demonstrate that our method achieves superior clustering accuracy and robustness, especially under high noise and interference, compared to existing subspace clustering techniques. These results highlight the potential of the proposed framework in challenging high-dimensional applications where structure exists beyond individual data vectors.

[1619] PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation

Alexandre Piché, Ehsan Kamalloo, Rafael Pardinas, Xiaoyin Chen, Dzmitry Bahdanau

Main category: cs.LG

TL;DR: PipelineRL is a novel RL approach for LLMs that enables concurrent data generation and model training with in-flight weight updates, achieving ~2x faster learning while maintaining high data freshness.

Details

Motivation: Scaling RL methods for LLMs is challenging due to the difficulty in maintaining high AI accelerator utilization without generating stale off-policy data that harms RL algorithms.

Method: Uses concurrent asynchronous data generation and model training with novel in-flight weight updates, allowing LLM generation engine to receive updated model weights with minimal interruption during token sequence generation.

Result: Achieves approximately 2x faster learning compared to conventional RL baselines on long-form reasoning tasks using 128 H100 GPUs, while maintaining highly on-policy training data.

Conclusion: PipelineRL provides a superior trade-off between hardware efficiency and data on-policyness for LLM training, with scalable open-source implementation released.

Abstract: Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by the novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately $\sim 2x$ faster learning compared to conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.

[1620] Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy

Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

Main category: cs.LG

TL;DR: TimeRCD is a novel foundation model for time series anomaly detection that uses Relative Context Discrepancy (RCD) pre-training instead of reconstruction, enabling better zero-shot detection of subtle anomalies by identifying contextual shifts between adjacent time windows.

Details

Motivation: Current foundation models for time series anomaly detection rely on reconstruction-based objectives, which suffer from fundamental limitations: they struggle to identify subtle anomalies and often misinterpret complex normal patterns, leading to high false negative and false positive rates.

Method: TimeRCD uses a Relative Context Discrepancy (RCD) pre-training paradigm where the model is trained to detect significant discrepancies between adjacent time windows rather than reconstructing inputs. It employs a standard Transformer architecture and is pre-trained on a large-scale synthetic corpus with token-level anomaly labels.

Result: Extensive experiments show that TimeRCD significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot time series anomaly detection across diverse datasets.

Conclusion: The RCD paradigm establishes a new effective path for building robust and generalizable foundation models for time series anomaly detection, overcoming the limitations of reconstruction-based approaches.

Abstract: Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains a major challenge. Prevailing foundation models for TSAD predominantly rely on reconstruction-based objectives, which suffer from a fundamental objective mismatch: they struggle to identify subtle anomalies while often misinterpreting complex normal patterns, leading to high rates of false negatives and positives. To overcome these limitations, we introduce \texttt{TimeRCD}, a novel foundation model for TSAD built upon a new pre-training paradigm: Relative Context Discrepancy (RCD). Instead of learning to reconstruct inputs, \texttt{TimeRCD} is explicitly trained to identify anomalies by detecting significant discrepancies between adjacent time windows. This relational approach, implemented with a standard Transformer architecture, enables the model to capture contextual shifts indicative of anomalies that reconstruction-based methods often miss. To facilitate this paradigm, we develop a large-scale, diverse synthetic corpus with token-level anomaly labels, providing the rich supervisory signal necessary for effective pre-training. Extensive experiments demonstrate that \texttt{TimeRCD} significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot TSAD across diverse datasets. Our results validate the superiority of the RCD paradigm and establish a new, effective path toward building robust and generalizable foundation models for time series anomaly detection.

[1621] Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin

Main category: cs.LG

TL;DR: The paper introduces Effective Span Dimension (ESD), a new complexity measure for spectral algorithms with learned kernels, showing minimax risk scales as σ²K for sequence models and connecting feature learning to generalization improvements.

Details

Motivation: To understand generalization in spectral algorithms when kernels are learned from data, moving beyond traditional fixed-kernel theories that don't account for adaptive feature learning.

Method: Introduces Effective Span Dimension (ESD) as an alignment-sensitive complexity measure, analyzes sequence models and over-parameterized gradient flow, extends framework to linear models and RKHS regression, and validates with numerical experiments.

Result: Proves that for sequence models with ESD at most K, minimax excess risk scales as σ²K, and shows gradient flow can reduce ESD, connecting feature learning to generalization improvements.

Conclusion: ESD provides a novel framework for understanding generalization in spectral algorithms with learned kernels, bridging adaptive feature learning and provable generalization benefits beyond traditional fixed-kernel approaches.

Abstract: We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $\sigma^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $\sigma^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

[1622] $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization

Yuandong Tian

Main category: cs.LG

TL;DR: The paper proposes a mathematical framework called Li₂ that explains grokking (delayed generalization) in 2-layer nonlinear networks through three stages: lazy learning, independent feature learning, and interactive feature learning.

Details

Motivation: To understand the mathematical principles behind grokking behavior - why and how features emerge during training, and what conditions lead to delayed generalization in complex structured inputs.

Method: Developed the Li₂ framework analyzing gradient dynamics in 2-layer networks. Studied how backpropagated gradients carry label information, enabling independent feature learning that follows gradient ascent of an energy function.

Result: The framework explains how features emerge as local maxima of an energy function, reveals the roles of hyperparameters (weight decay, learning rate, sample sizes), and provides scaling laws for memorization and generalization.

Conclusion: The Li₂ framework provides a principled understanding of grokking phenomena, explains why optimizers like Muon work effectively, and can be extended to multi-layer architectures.

Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework to characterize what kind of features will emerge, how and in which conditions it happens from training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) \underline{\textbf{L}}azy learning, (II) \underline{\textbf{i}}ndependent feature learning and (III) \underline{\textbf{i}}nteractive feature learning. At the lazy learning stage, top layer overfits to random hidden representation and the model appears to memorize. Thanks to lazy learning and weight decay, the \emph{backpropagated gradient} $G_F$ from the top layer now carries information about the target label, with a specific structure that enables each hidden node to learn their representation \emph{independently}. Interestingly, the independent dynamics follows exactly the \emph{gradient ascent} of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change on sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on missing features that need to be learned. Our study sheds lights on roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of memorization and generalization, and reveals the underlying cause why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.

[1623] Efficiently Attacking Memorization Scores

Tue Do, Varun Chandrasekaran, Daniel Alabi

Main category: cs.LG

TL;DR: The paper demonstrates that memorization-based influence estimators can be adversarially manipulated through practical attacks using pseudoinverse calculations, revealing vulnerabilities in data attribution methods.

Details

Motivation: To investigate whether influence estimation tools used for understanding model behavior and data attribution can be adversarially manipulated, especially in applications like data valuation and responsible machine learning.

Method: Proposed a practical attack method using pseudoinverse of input that requires only black-box access to model outputs, with modest computational overhead. Validated across various image classification tasks.

Result: Empirical validation shows state-of-the-art influence proxies are vulnerable to targeted score manipulations. Theoretical analysis reveals conditions where memorization scores are inherently fragile under adversarial perturbations.

Conclusion: The findings highlight critical vulnerabilities in influence-based attribution methods and suggest the need for robust defenses against such adversarial manipulations.

Abstract: Influence estimation tools – such as memorization scores – are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks for producing highly memorized samples as highly sensitive queries in the regime where a trained algorithm is accurate. Our attack (calculating the pseudoinverse of the input) is practical, requiring only black-box access to model outputs and incur modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulations. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://github.com/tuedo2/MemAttack

[1624] Auto-Regressive U-Net for Full-Field Prediction of Shrinkage-Induced Damage in Concrete

Liya Gaynutdinova, Petr Havlásek, Ondřej Rokoš, Fleur Hendriks, Martin Doškář

Main category: cs.LG

TL;DR: Deep learning approach using auto-regressive U-Net and CNN to predict time-dependent full-field damage in concrete and forecast mechanical properties, enabling efficient damage progression assessment and concrete mix optimization.

Details

Motivation: To reduce the computational load traditionally associated with full-field damage evaluations in concrete and gain insights into how aggregate properties affect shrinkage and stiffness reduction.

Method: Uses auto-regressive U-Net model to predict scalar damage field evolution given microstructural geometry and shrinkage profile, with CNN using damage estimations to forecast mechanical properties like shrinkage and residual stiffness.

Result: The dual-network architecture demonstrates high computational efficiency and robust predictive performance on synthesised datasets, enabling continuous assessment of damage progression.

Conclusion: This approach can help optimize concrete mix designs by understanding aggregate property effects, leading to improved durability and reduced internal damage.

Abstract: This paper introduces a deep learning approach for predicting time-dependent full-field damage in concrete. The study uses an auto-regressive U-Net model to predict the evolution of the scalar damage field in a unit cell given microstructural geometry and evolution of an imposed shrinkage profile. By sequentially using the predicted damage output as input for subsequent predictions, the model facilitates the continuous assessment of damage progression. Complementarily, a convolutional neural network (CNN) utilises the damage estimations to forecast key mechanical properties, including observed shrinkage and residual stiffness. The proposed dual-network architecture demonstrates high computational efficiency and robust predictive performance on the synthesised datasets. The approach reduces the computational load traditionally associated with full-field damage evaluations and is used to gain insights into the relationship between aggregate properties, such as shape, size, and distribution, and the effective shrinkage and reduction in stiffness. Ultimately, this can help to optimize concrete mix designs, leading to improved durability and reduced internal damage.

[1625] Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning

Naicheng He, Kaicheng Guo, Arjun Prakash, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris

Main category: cs.LG

TL;DR: Deep neural networks lose plasticity in continual learning due to Hessian spectral collapse, where meaningful curvature directions vanish. The paper introduces τ-trainability framework and proposes regularization methods to maintain plasticity.

Details

Motivation: To understand why deep neural networks lose plasticity in continual learning and fail to learn new tasks without parameter reinitialization, focusing on the phenomenon of Hessian spectral collapse.

Method: Introduces τ-trainability framework, analyzes Kronecker factored approximation of Hessian, and proposes two regularization enhancements: maintaining high effective feature rank and applying L2 penalties.

Result: Experiments on continual supervised and reinforcement learning tasks show that combining the two proposed regularizers effectively preserves plasticity in deep neural networks.

Conclusion: Hessian spectral collapse causes plasticity loss in continual learning, and the proposed regularization methods successfully address this issue by maintaining trainability through feature rank preservation and L2 penalties.

Abstract: We investigate why deep neural networks suffer from loss of plasticity in deep continual learning, failing to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at new-task initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. To characterize the necessary condition for successful training, we introduce the notion of $\tau$-trainability and show that current plasticity preserving algorithms can be unified under this framework. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying L2 penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.

[1626] d2: Improved Techniques for Training Reasoning Diffusion Language Models

Guanghan Wang, Yair Schiff, Gilad Turok, Volodymyr Kuleshov

Main category: cs.LG

TL;DR: d2 is a reasoning framework for masked diffusion language models that uses a novel policy gradient algorithm to improve reasoning abilities through reinforcement learning, achieving state-of-the-art performance on logical and math reasoning tasks.

Details

Motivation: While diffusion language models show competitive text generation performance, their reasoning abilities need improvement through reinforcement learning approaches.

Method: Developed a new policy gradient algorithm that leverages masking properties to estimate sampling trajectory likelihoods, with estimators that trade computation for accuracy, particularly effective for DLMs supporting any-order likelihood estimation.

Result: Significantly outperforms previous diffusion reasoning frameworks using only RL (no supervised fine-tuning), achieving SOTA performance on Countdown, Sudoku, GSM8K, and MATH500 benchmarks.

Conclusion: The d2 framework demonstrates that efficient diffusion-based reasoning is achievable through proper likelihood estimation techniques and policy gradient methods, establishing a new approach for enhancing DLM reasoning capabilities.

Abstract: While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on properties of masking to accurately estimate the likelihoods of sampling trajectories. Our estimators trade off computation for approximation accuracy in an analytically tractable manner, and are particularly effective for DLMs that support any-order likelihood estimation. We characterize and study this property in popular DLMs and show that it is key for efficient diffusion-based reasoning. Empirically, d2 significantly improves over previous diffusion reasoning frameworks using only RL (without relying on supervised fine-tuning), and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500).

cs.MA

[1627] Game-Theoretic Understandings of Multi-Agent Systems with Multiple Objectives

Yue Wang

Main category: cs.MA

TL;DR: The paper introduces Multi-Objective Markov Games (MOMG) for multi-agent reinforcement learning with diverse objectives, proposes Pareto-Nash Equilibrium as the solution concept, and develops efficient algorithms to compute solutions without requiring new samples for different preferences.

Details

Motivation: In practical multi-agent systems, agents have diverse objectives creating complex strategic trade-offs, as each agent's performance across multiple criteria depends on joint actions of all agents.

Method: Introduces Multi-Objective Markov Game (MOMG) framework and Pareto-Nash Equilibrium (PNE) solution concept. Proves PNE existence and establishes equivalence with Nash Equilibria of linearly scalarized games. Develops online learning algorithms and a two-phase preference-free algorithm that decouples exploration from planning.

Result: Proves existence of Pareto-Nash Equilibrium and establishes theoretical equivalence with Nash Equilibria of corresponding linearly scalarized games. Develops algorithms that can compute PNE for any given preference profile without collecting new samples.

Conclusion: The framework enables efficient characterization of the entire Pareto-Nash front, addressing the computational challenges of multi-objective multi-agent systems while providing tractable solution methods.

Abstract: In practical multi-agent systems, agents often have diverse objectives, which makes the system more complex, as each agent’s performance across multiple criteria depends on the joint actions of all agents, creating intricate strategic trade-offs. To address this, we introduce the Multi-Objective Markov Game (MOMG), a framework for multi-agent reinforcement learning with multiple objectives. We propose the Pareto-Nash Equilibrium (PNE) as the primary solution concept, where no agent can unilaterally improve one objective without sacrificing performance on another. We prove existence of PNE, and establish an equivalence between the PNE and the set of Nash Equilibria of MOMG’s corresponding linearly scalarized games, enabling solutions of MOMG by transferring to a standard single-objective Markov game. However, we note that computing a PNE is theoretically and computationally challenging, thus we propose and study weaker but more tractable solution concepts. Building on these foundations, we develop online learning algorithm that identify a single solution to MOMGs. Furthermore, we propose a two-phase, preference-free algorithm that decouples exploration from planning. Our algorithm enables computation of a PNE for any given preference profile without collecting new samples, providing an efficient methodological characterization of the entire Pareto-Nash front.

[1628] Situational Awareness for Safe and Robust Multi-Agent Interactions Under Uncertainty

Benjamin Alcorn, Eman Hammad

Main category: cs.MA

TL;DR: This paper proposes a multi-agent system model where autonomous agents observe their environment, predict other agents’ future actions using an estimation algorithm, and make optimal decisions while managing uncertainty through risk analysis.

Details

Motivation: To address two key problems in multi-agent systems: predicting intentions of non-coordinating agents for behavior prediction, and achieving objectives under resource constraints without significant performance sacrifice.

Method: Developed a model where agents observe environment within safety radius, determine surrounding agent states, estimate future actions, and act optimally. Uses estimation algorithm based on historical trajectory when observations are unavailable, with risk analysis to manage uncertainty.

Result: The proposed approach was validated using two learning-based decision making frameworks: reinforcement learning and game theoretic algorithms.

Conclusion: The study presents a comprehensive approach for multi-agent systems that combines observation, prediction, and risk-managed decision making, validated through multiple learning frameworks.

Abstract: Multi-agent systems are prevalent in a wide range of domains including power systems, vehicular networks, and robotics. Two important problems to solve in these types of systems are how the intentions of non-coordinating agents can be determined to predict future behavior and how the agents can achieve their objectives under resource constraints without significantly sacrificing performance. To study this, we develop a model where an autonomous agent observes the environment within a safety radius of observation, determines the state of a surrounding agent of interest (within the observation radius), estimates future actions to be taken, and acts in an optimal way. In the absence of observations, agents are able to utilize an estimation algorithm to predict the future actions of other agents based on historical trajectory. The use of the proposed estimation algorithm introduces uncertainty, which is managed via risk analysis. The proposed approach in this study is validated using two different learning-based decision making frameworks: reinforcement learning and game theoretic algorithms.

[1629] PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features

Lingyao Li, Haolun Wu, Zhenkun Li, Jiabei Hu, Yu Wang, Xiaoshan Huang, Wenyue Hua, Wenqian Wang

Main category: cs.MA

TL;DR: PartnerMAS is a hierarchical multi-agent framework that improves high-dimensional decision-making by decomposing evaluation into planner, specialized, and supervisor agents, achieving 10-15% higher match rates than single-agent or debate-based approaches.

Details

Motivation: High-dimensional decision-making tasks like business partner selection involve evaluating large candidate pools with heterogeneous features, but single-agent or debate-style LLM systems struggle with scalability and consistency in such settings.

Method: A hierarchical multi-agent framework with three layers: Planner Agent designs strategies, Specialized Agents perform role-specific assessments, and Supervisor Agent integrates outputs. Evaluated on a curated benchmark dataset of venture capital co-investments.

Result: Across 140 cases, PartnerMAS consistently outperformed single-agent and debate-based multi-agent baselines, achieving up to 10-15% higher match rates. Analysis showed planners respond to domain-informed prompts, specialists provide complementary feature coverage, and supervisors play important aggregation roles.

Conclusion: Structured collaboration among LLM agents generates more robust outcomes than scaling individual models, highlighting PartnerMAS as a promising framework for high-dimensional decision-making in data-rich domains.

Abstract: High-dimensional decision-making tasks, such as business partner selection, involve evaluating large candidate pools with heterogeneous numerical, categorical, and textual features. While large language models (LLMs) offer strong in-context reasoning capabilities, single-agent or debate-style systems often struggle with scalability and consistency in such settings. We propose PartnerMAS, a hierarchical multi-agent framework that decomposes evaluation into three layers: a Planner Agent that designs strategies, Specialized Agents that perform role-specific assessments, and a Supervisor Agent that integrates their outputs. To support systematic evaluation, we also introduce a curated benchmark dataset of venture capital co-investments, featuring diverse firm attributes and ground-truth syndicates. Across 140 cases, PartnerMAS consistently outperforms single-agent and debate-based multi-agent baselines, achieving up to 10–15% higher match rates. Analysis of agent reasoning shows that planners are most responsive to domain-informed prompts, specialists produce complementary feature coverage, and supervisors play an important role in aggregation. Our findings demonstrate that structured collaboration among LLM agents can generate more robust outcomes than scaling individual models, highlighting PartnerMAS as a promising framework for high-dimensional decision-making in data-rich domains.

[1630] CORRECT: COndensed eRror RECognition via knowledge Transfer in multi-agent systems

Yifan Yu, Moyan Li, Shaoyuan Xu, Jinmiao Fu, Xinhai Hou, Fan Lai, Bryan Wang

Main category: cs.MA

TL;DR: CORRECT is a lightweight, training-free framework that uses cached error schemata to recognize and transfer failure patterns in multi-agent systems, improving error localization without expensive retraining.

Details

Motivation: Multi-agent systems face challenges in error recognition due to complex coordination and error propagation, with minor errors escalating into task failures that are costly to debug.

Method: CORRECT leverages an online cache of distilled error schemata to recognize and transfer knowledge of failure structures across new requests, enabling targeted error localization at inference time without retraining.

Result: Experiments across seven diverse MAS applications show CORRECT improves step-level error localization up to 19.8% over existing methods with near-zero overhead, narrowing the gap between automated and human-level error recognition.

Conclusion: CORRECT provides an effective framework for error recognition in multi-agent systems through cache-based reuse of error patterns, achieving significant improvements in error localization while maintaining low computational overhead.

Abstract: Multi-agent systems (MAS) are increasingly capable of tackling complex real-world tasks, yet their reliance on inter-agent coordination, tool use, and long-horizon reasoning makes error recognition particularly challenging. Minor errors can propagate across agents, escalating into task failures while producing long, intertwined execution trajectories that impose significant costs for both human developers and automated systems to debug and analyze. Our key insight is that, despite surface differences in failure trajectories (e.g., logs), MAS errors often recur with similar structural patterns. This paper presents CORRECT, the first lightweight, training-free framework that leverages an online cache of distilled error schemata to recognize and transfer knowledge of failure structures across new requests. This cache-based reuse allows LLMs to perform targeted error localization at inference time, avoiding the need for expensive retraining while adapting to dynamic MAS deployments in subseconds. To support rigorous study in this domain, we also introduce CORRECT-Error, a large-scale dataset of over 2,000 annotated trajectories collected through a novel error-injection pipeline guided by real-world distributions, and further validated through human evaluation to ensure alignment with natural failure patterns. Experiments across seven diverse MAS applications show that CORRECT improves step-level error localization up to 19.8% over existing advances while at near-zero overhead, substantially narrowing the gap between automated and human-level error recognition.

[1631] MAS$^2$: Self-Generative, Self-Configuring, Self-Rectifying Multi-Agent Systems

Kun Wang, Guibin Zhang, ManKit Ye, Xinyu Deng, Dongxia Wang, Xiaobin Hu, Jinyang Guo, Yang Liu, Yufei Guo

Main category: cs.MA

TL;DR: MAS² introduces a recursive self-generation paradigm where multi-agent systems autonomously architect bespoke agent systems for diverse problems, achieving significant performance gains over state-of-the-art methods without excessive token costs.

Details

Motivation: Current automatic multi-agent systems follow a rigid "generate-once-and-deploy" paradigm, making them brittle and ill-prepared for real-world dynamism and uncertainty. The authors aim to transcend this limitation through autonomous system generation.

Method: A “generator-implementer-rectifier” tri-agent team that dynamically composes and adaptively rectifies target agent systems in response to real-time task demands, trained using Collaborative Tree Optimization.

Result: Achieves performance gains up to 19.6% over state-of-the-art MAS in complex scenarios like deep research and code generation, with superior cross-backbone generalization (up to 15.1% improvement with unseen LLMs) while maintaining cost-effectiveness.

Conclusion: MAS² demonstrates that recursive self-generation enables multi-agent systems to autonomously create specialized systems for diverse problems, achieving substantial performance improvements while staying on the Pareto frontier of cost-performance trade-offs.

Abstract: The past two years have witnessed the meteoric rise of Large Language Model (LLM)-powered multi-agent systems (MAS), which harness collective intelligence and exhibit a remarkable trajectory toward self-evolution. This paradigm has rapidly progressed from manually engineered systems that require bespoke configuration of prompts, tools, roles, and communication protocols toward frameworks capable of automated orchestration. Yet, dominant automatic multi-agent systems, whether generated by external modules or a single LLM agent, largely adhere to a rigid \textit{generate-once-and-deploy}'' paradigm, rendering the resulting systems brittle and ill-prepared for the dynamism and uncertainty of real-world environments. To transcend this limitation, we introduce MAS$^2$, a paradigm predicated on the principle of recursive self-generation: a multi-agent system that autonomously architects bespoke multi-agent systems for diverse problems. Technically, we devise a \textit{generator-implementer-rectifier}’’ tri-agent team capable of dynamically composing and adaptively rectifying a target agent system in response to real-time task demands. Collaborative Tree Optimization is proposed to train and specialize these meta-agents. Extensive evaluation across seven benchmarks reveals that MAS$^2$ achieves performance gains of up to $19.6%$ over state-of-the-art MAS in complex scenarios such as deep research and code generation. Moreover, MAS$^2$ exhibits superior cross-backbone generalization, effectively leveraging previously unseen LLMs to yield improvements of up to $15.1%$. Crucially, these gains are attained without incurring excessive token costs, as MAS$^2$ consistently resides on the Pareto frontier of cost-performance trade-offs. The source codes are available at https://github.com/yeyeyeah2/MAS2.

[1632] MARLIN: Multi-Agent Reinforcement Learning with Murmuration Intelligence and LLM Guidance for Reservoir Management

Heming Fu, Guojun Xiong, Jian Li, Shan Lin

Main category: cs.MA

TL;DR: MARLIN is a decentralized reservoir management framework that combines multi-agent reinforcement learning with bio-inspired flocking rules and LLM guidance to handle uncertainties in water systems, improving coordination while reducing computational complexity.

Details

Motivation: Climate change intensifies extreme weather events and water disasters, creating cascading uncertainties in reservoir networks that traditional centralized approaches cannot handle effectively due to exponential computational complexity and poor uncertainty management.

Method: MARLIN integrates bio-inspired alignment, separation, and cohesion rules from starling murmurations with multi-agent reinforcement learning. It uses a decentralized approach where individual reservoirs make local decisions while achieving global coordination, with LLM providing real-time reward shaping for environmental adaptation.

Result: Experiments on USGS data show MARLIN improves uncertainty handling by 23%, reduces computation by 35%, accelerates flood response by 68%, and exhibits super-linear coordination with complexity scaling only 5.4x from 400 to 10,000 nodes.

Conclusion: MARLIN demonstrates potential for disaster prevention and community protection through intelligent, scalable water resource management that effectively handles real-world uncertainties while maintaining computational efficiency.

Abstract: As climate change intensifies extreme weather events, water disasters pose growing threats to global communities, making adaptive reservoir management critical for protecting vulnerable populations and ensuring water security. Modern water resource management faces unprecedented challenges from cascading uncertainties propagating through interconnected reservoir networks. These uncertainties, rooted in physical water transfer losses and environmental variability, make precise control difficult. For example, sending 10 tons downstream may yield only 8-12 tons due to evaporation and seepage. Traditional centralized optimization approaches suffer from exponential computational complexity and cannot effectively handle such real-world uncertainties, while existing multi-agent reinforcement learning (MARL) methods fail to achieve effective coordination under uncertainty. To address these challenges, we present MARLIN, a decentralized reservoir management framework inspired by starling murmurations intelligence. Integrating bio-inspired alignment, separation, and cohesion rules with MARL, MARLIN enables individual reservoirs to make local decisions while achieving emergent global coordination. In addition, a LLM provides real-time reward shaping signals, guiding agents to adapt to environmental changes and human-defined preferences. Experiments on real-world USGS data show that MARLIN improves uncertainty handling by 23%, cuts computation by 35%, and accelerates flood response by 68%, exhibiting super-linear coordination, with complexity scaling 5.4x from 400 to 10,000 nodes. These results demonstrate MARLIN’s potential for disaster prevention and protecting communities through intelligent, scalable water resource management.

[1633] Asymmetric Information Enhanced Mapping Framework for Multirobot Exploration based on Deep Reinforcement Learning

Jiyu Cheng, Junhui Fan, Xiaolei Li, Paul L. Rosin, Yibin Li, Wei Zhang

Main category: cs.MA

TL;DR: AIM-Mapping is an asymmetric information enhanced mapping framework for multi-robot collaborative exploration that uses privilege information during training to improve environment representation and decision-making.

Details

Motivation: Efficient collaborative exploration of unknown environments remains challenging despite advances in multirobot technologies, requiring better utilization of training information.

Method: Uses asymmetric actor-critic training with privilege information for performance evaluation, combines feature encoding with topological maps based on geometric distance, and employs topological graph matching for goal assignment.

Result: Experiments in Gibson simulation environments show significant performance improvement compared to existing methods.

Conclusion: The proposed asymmetric information enhanced mapping framework effectively improves multi-robot collaborative exploration performance through better utilization of training information and topological representations.

Abstract: Despite the great development of multirobot technologies, efficiently and collaboratively exploring an unknown environment is still a big challenge. In this paper, we propose AIM-Mapping, a Asymmetric InforMation Enhanced Mapping framework. The framework fully utilizes the privilege information in the training process to help construct the environment representation as well as the supervised signal in an asymmetric actor-critic training framework. Specifically, privilege information is used to evaluate the exploration performance through an asymmetric feature representation module and a mutual information evaluation module. The decision-making network uses the trained feature encoder to extract structure information from the environment and combines it with a topological map constructed based on geometric distance. Utilizing this kind of topological map representation, we employ topological graph matching to assign corresponding boundary points to each robot as long-term goal points. We conduct experiments in real-world-like scenarios using the Gibson simulation environments. It validates that the proposed method, when compared to existing methods, achieves great performance improvement.

[1634] Cost-Aware Opinion Dynamics in Multi-Agents Systems under Malicious Agent Influence

Yuhan Suo, Kaiyuan Chen, Yuanqing Xia, Xudong Zhao, Shuo Wang, Runqi Chai

Main category: cs.MA

TL;DR: This paper addresses security vulnerabilities in multi-agent systems where malicious agents cannot be immediately removed, proposing a Boomerang Effect-inspired approach that balances security costs with convergence speed.

Details

Motivation: In MASs where malicious agents persist, standard averaging consensus mechanisms lack sufficient resistance, making systems vulnerable to harmful deviations. The need exists for approaches that maintain resilience while managing practical trade-offs.

Method: Leverages the Boomerang Effect from sociology to make normal agents reject malicious inputs, analyzes additional costs from Boomerang-style fusion, and proposes a cost-aware evolution rate adjustment mechanism.

Result: Multi-robot simulations show the mechanism suppresses excess costs while maintaining resilience to extremist disruptions and ensuring stable convergence.

Conclusion: The proposed approach enables MAS to efficiently develop in an ethical order by balancing security measures with practical convergence requirements.

Abstract: In many MASs, links to malicious agents cannot be severed immediately. Under these conditions, averaging-only consensus mechanisms typically lack sufficient resistance, leaving the system vulnerable to harmful deviations. To address this challenge, this brief leverages the Boomerang Effect from sociology, which drives normal agents to firmly reject malicious inputs, although this strategy may appear overly cautious. Thus, this brief emphasizes the necessity of acknowledging the resulting trade-off between cost and convergence speed in practice. To address this, the additional costs induced by Boomerang-style fusion is analyzed and a cost aware evolution rate adjustment mechanism is proposed. Multi-robot simulations demonstrate that this mechanism suppresses excess costs while maintaining resilience to extremist disruptions and ensuring stable convergence, enabling MAS to efficiently develop in a ethical order.

[1635] Learning Large-Scale Competitive Team Behaviors with Mean-Field Interactions and Online Opponent Modeling

Bhavini Jeloka, Yue Guan, Panagiotis Tsiotras

Main category: cs.MA

TL;DR: MF-MAPPO is a mean-field extension of PPO for zero-sum team games that combines intra-team cooperation with inter-team competition, enabling scalable deployment to thousands of agents.

Details

Motivation: Existing MARL algorithms struggle to scale to large agent populations, and current mean-field frameworks focus only on fully cooperative or purely competitive settings, leaving a gap for mixed cooperative-competitive scenarios.

Method: MF-MAPPO employs a shared actor and minimally informed critic per team, trained directly on finite-population simulators. It extends to partially observable settings through gradient-regularized training.

Result: MF-MAPPO outperforms existing methods in large-scale benchmarks including battlefield tasks and population-based rock-paper-scissors games, exhibiting complex heterogeneous behaviors.

Conclusion: Combining mean-field theory with MARL techniques enables effective scaling to large populations in mixed cooperative-competitive settings, demonstrating the viability of MF-MAPPO for realistic scenarios with thousands of agents.

Abstract: While multi-agent reinforcement learning (MARL) has been proven effective across both collaborative and competitive tasks, existing algorithms often struggle to scale to large populations of agents. Recent advancements in mean-field (MF) theory provide scalable solutions by approximating population interactions as a continuum, yet most existing frameworks focus exclusively on either fully cooperative or purely competitive settings. To bridge this gap, we introduce MF-MAPPO, a mean-field extension of PPO designed for zero-sum team games that integrate intra-team cooperation with inter-team competition. MF-MAPPO employs a shared actor and a minimally informed critic per team and is trained directly on finite-population simulators, thereby enabling deployment to realistic scenarios with thousands of agents. We further show that MF-MAPPO naturally extends to partially observable settings through a simple gradient-regularized training scheme. Our evaluation utilizes large-scale benchmark scenarios using our own testing simulation platform for MF team games (MFEnv), including offense-defense battlefield tasks as well as variants of population-based rock-paper-scissors games that admit analytical solutions, for benchmarking. Across these benchmarks, MF-MAPPO outperforms existing methods and exhibits complex, heterogeneous behaviors, demonstrating the effectiveness of combining mean-field theory and MARL techniques at scale.

[1636] When Is Diversity Rewarded in Cooperative Multi-Agent Learning?

Michael Amir, Matteo Bettini, Amanda Prorok

Main category: cs.MA

TL;DR: The paper studies when heterogeneous teams outperform homogeneous ones in multi-agent task allocation, analyzing reward design principles and developing a MARL algorithm to find scenarios where diversity is beneficial.

Details

Motivation: To provide a principled explanation for when diverse specialists in teams surpass homogeneous teams, particularly in multi-agent task allocation problems, and understand what reward structures incentivize heterogeneity.

Method: 1) Theoretical analysis of reward operators’ curvature to determine when heterogeneity increases reward; 2) Developed Heterogeneity Gain Parameter Search (HetGPS) algorithm using multi-agent reinforcement learning to find scenarios where heterogeneity is advantageous.

Result: Proved that the curvature of reward aggregation operators determines whether heterogeneity can increase reward, with broad reward families collapsing to simple convexity tests. HetGPS successfully rediscovers reward regimes predicted by theory to maximize heterogeneity advantage.

Conclusion: The research provides theoretical and computational tools to understand when behavioral diversity delivers measurable benefits, connecting theoretical insights about reward design to practical multi-agent reinforcement learning applications.

Abstract: The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, we study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the $N$ agents’ effort allocations on individual tasks to a task score, and an outer operator that merges the $M$ task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneity Gain Parameter Search (HetGPS), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Across different environments, we show that HetGPS rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HetGPS and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.

[1637] Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

Brennen A. Hill, Mant Koh En Wei, Thangavel Jishnuanandh

Main category: cs.MA

TL;DR: Comparison of learned vs engineered communication in multi-agent systems shows world model-based approach outperforms emergent communication in complex scenarios.

Details

Motivation: To determine whether communication protocols should be engineered or learned end-to-end in multi-agent reinforcement learning for robust coordination under partial observability.

Method: Proposed two communication strategies: Learned Direct Communication (LDC) that learns protocol end-to-end, and Intention Communication using engineered world model (ITGM) to simulate future states and communicate plan summaries.

Result: World model-based approach demonstrated superior performance, sample efficiency, and scalability compared to emergent communication as task complexity increased.

Conclusion: Integrating structured, predictive models into MARL agents enables more effective goal-driven coordination than purely learned communication protocols.

Abstract: Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end, with agents generating messages and actions concurrently. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), to simulate future states. Agents then communicate a summary of this plan. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.

cs.MM

[1638] XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System

Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Zicheng Zhang, Jinliang Han, Guangtao Zhai

Main category: cs.MM

TL;DR: XGC-AVis is a multi-agent framework that enhances audio-video temporal alignment in MLLMs through perception, planning, execution, and reflection stages. The paper also introduces XGC-AVQuiz, the first benchmark for assessing MLLMs’ understanding in real-world and AI-generated scenarios.

Details

Motivation: Current multimodal large models (MLLMs) struggle with audio-video temporal alignment and quality perception tasks. There is a need for better frameworks to improve these capabilities and comprehensive benchmarks to evaluate them.

Method: Proposed XGC-AVis framework with 4-stage multi-agent approach (perception, planning, execution, reflection). Created XGC-AVQuiz benchmark with 2,685 QA pairs across 20 tasks, featuring AIGC scenario expansion and quality perception dimensions.

Result: Experimental results show current MLLMs struggle with quality perception and temporal alignment. XGC-AVis improves these capabilities without requiring additional training, validated on two benchmarks.

Conclusion: The proposed framework effectively enhances MLLMs’ audio-video alignment capabilities, and the new benchmark provides comprehensive evaluation for both real-world and AI-generated content understanding.

Abstract: In this paper, we propose XGC-AVis, a multi-agent framework that enhances the audio-video temporal alignment capabilities of multimodal large models (MLLMs) and improves the efficiency of retrieving key video segments through 4 stages: perception, planning, execution, and reflection. We further introduce XGC-AVQuiz, the first benchmark aimed at comprehensively assessing MLLMs’ understanding capabilities in both real-world and AI-generated scenarios. XGC-AVQuiz consists of 2,685 question-answer pairs across 20 tasks, with two key innovations: 1) AIGC Scenario Expansion: The benchmark includes 2,232 videos, comprising 1,102 professionally generated content (PGC), 753 user-generated content (UGC), and 377 AI-generated content (AIGC). These videos cover 10 major domains and 53 fine-grained categories. 2) Quality Perception Dimension: Beyond conventional tasks such as recognition, localization, and reasoning, we introduce a novel quality perception dimension. This requires MLLMs to integrate low-level sensory capabilities with high-level semantic understanding to assess audio-visual quality, synchronization, and coherence. Experimental results on XGC-AVQuiz demonstrate that current MLLMs struggle with quality perception and temporal alignment tasks. XGC-AVis improves these capabilities without requiring additional training, as validated on two benchmarks.

[1639] OnomatoGen: Onomatopoeia Generation with the Alpha-Channel in Manga

Takara Taniguchi, Wataru Shimoda, Kota Yamaguchi, Hideki Nakayama

Main category: cs.MM

TL;DR: OnomatoGen is a method for stylizing plain text into onomatopoeic expressions for manga generation, addressing the unique visual properties of onomatopoeia that differ from typical text stylization.

Details

Motivation: Onomatopoeia is crucial for textual messaging in manga but has been overlooked in manga generation research. It has unique visual characteristics like shape, size, and placement variations that reflect scene intensity and mood.

Method: Proposed OnomatoGen, a system that transforms plain text into stylized onomatopoeic expressions suitable for manga.

Result: Empirical evidence shows that onomatopoeia generation has distinct properties different from typical text stylization methods, and OnomatoGen effectively stylizes plain text in an onomatopoeia style.

Conclusion: The paper successfully addresses the gap in onomatopoeia generation for manga and demonstrates the effectiveness of OnomatoGen in creating stylized onomatopoeic expressions.

Abstract: Onomatopoeia is an important element for textual messaging in manga. Unlike character dialogue in manga, onomatopoeic expressions are visually stylized, with variations in shape, size, and placement that reflect the scene’s intensity and mood. Despite its important role, onomatopoeia has not received much attention in manga generation. In this paper, we focus on onomatopoeia generation and propose OnomatoGen, which stylizes plain text to an onomatopoeic style. We empirically show the unique properties of onomatopoeia generation, which differ from typical text stylization methods, and that OnomatoGen can effectively stylize plain text in an onomatopoeia style.

[1640] Nagare Media Engine: A System for Cloud- and Edge-Native Network-based Multimedia Workflows

Matthias Neugebauer

Main category: cs.MM

TL;DR: This paper presents nagare media engine, an open source implementation of the ISO/IEC 23090-8 NBMP standard for distributed multimedia workflow systems, built on Kubernetes for cloud and edge deployment.

Details

Motivation: Traditional multimedia workflows that ran on single machines are now complex distributed systems, requiring new approaches for describing and implementing them effectively.

Method: Developed an open source research prototype called nagare media engine that implements the ISO/IEC 23090-8 Network-Based Media Processing (NBMP) standard, built on the Kubernetes platform.

Result: Created a cloud- and edge-native solution that meets modern requirements for multimedia workflow systems, providing a standards-based approach to distributed media processing.

Conclusion: The nagare media engine successfully addresses the challenges of implementing complex distributed multimedia workflows through a standards-based approach using modern container orchestration technology.

Abstract: Before media playback is possible, live and video-on-demand content alike usually undergoes various operations described as tasks within a multimedia workflow. Where previously ingest, transcode, packaging and delivery tasks might have run on a single machine, today’s workflows are significantly more complex distributed systems. Describing and implementing multimedia workflows is challenging and requires new approaches. A standards-based multimedia workflow system is described in ISO/IEC 23090-8 Network-Based Media Processing (NBMP) developed by MPEG. This technical report discusses details of nagare media engine, our open source research prototype implementation of NBMP. Built upon the Kubernetes platform, nagare media engine provides a cloud- and edge-native solution that meets today’s requirements for multimedia workflow systems.

[1641] IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing

Zeyang Song, Shimin Zhang, Yuhong Chou, Jibin Wu, Haizhou Li

Main category: cs.MM

TL;DR: IML-Spikeformer is a spiking Transformer architecture that achieves competitive speech recognition performance while reducing energy consumption by 4.6x compared to conventional ANNs, using input-aware multi-level spike mechanisms and hierarchical attention modules.

Details

Motivation: SNNs offer energy efficiency but struggle with large-scale speech tasks due to high training overhead from multi-timestep spike firing and lack of specialized architectures for speech processing.

Method: Proposes IML-Spikeformer with Input-aware Multi-Level Spike (IMLS) mechanism that simulates multi-timestep firing in single timestep, and HD-RepSSA module with Re-parameterized Spiking Self-Attention and Hierarchical Decay Mask for precise attention and multi-scale temporal modeling.

Result: Achieves 6.0% WER on AiShell-1 and 3.4% WER on Librispeech-960, comparable to ANN transformers while reducing theoretical inference energy by 4.64x and 4.32x respectively.

Conclusion: IML-Spikeformer advances scalable SNN architectures for large-scale speech processing, demonstrating competitive performance with significant energy efficiency improvements.

Abstract: Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome the issues, we introduce Input-aware Multi-Level Spikeformer, i.e. IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a Re-parameterized Spiking Self-Attention (RepSSA) module with a Hierarchical Decay Mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers while reducing theoretical inference energy consumption by 4.64$\times$ and 4.32$\times$ respectively. IML-Spikeformer marks an advance of scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency. Our source code and model checkpoints are publicly available at github.com/Pooookeman/IML-Spikeformer.

eess.AS

[1642] PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos

Ke Gu, Zhicong Wu, Peng Bai, Sitong Qiao, Zhiqi Jiang, Junchen Lu, Xiaodong Shi, Xinyuan Qian

Main category: eess.AS

TL;DR: PerformSinger is a multimodal singing voice synthesis framework that uses lip cues from video to enable duration-free synthesis, achieving state-of-the-art performance.

Details

Motivation: Existing SVS models rely on phoneme-level durations, limiting practical application and overlooking the complementary role of visual information in duration prediction.

Method: Uses parallel multi-branch multimodal encoders, feature fusion with adapter and fusion blocks, progressive fusion strategy in aligned semantic space, duration and variational prediction network, mel-spectrogram decoder and vocoder.

Result: Extensive experiments demonstrate state-of-the-art performance in both subjective and objective evaluations. Created a novel SVS dataset with synchronized video streams and phoneme-level annotations.

Conclusion: PerformSinger enables high-quality duration-free singing voice synthesis by incorporating visual information, with code and dataset to be publicly available.

Abstract: Existing singing voice synthesis (SVS) models largely rely on fine-grained, phoneme-level durations, which limits their practical application. These methods overlook the complementary role of visual information in duration prediction.To address these issues, we propose PerformSinger, a pioneering multimodal SVS framework, which incorporates lip cues from video as a visual modality, enabling high-quality “duration-free” singing voice synthesis. PerformSinger comprises parallel multi-branch multimodal encoders, a feature fusion module, a duration and variational prediction network, a mel-spectrogram decoder and a vocoder. The fusion module, composed of adapter and fusion blocks, employs a progressive fusion strategy within an aligned semantic space to produce high-quality multimodal feature representations, thereby enabling accurate duration prediction and high-fidelity audio synthesis. To facilitate the research, we design, collect and annotate a novel SVS dataset involving synchronized video streams and precise phoneme-level manual annotations. Extensive experiments demonstrate the state-of-the-art performance of our proposal in both subjective and objective evaluations. The code and dataset will be publicly available.

[1643] Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation

Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee, Kwanghoon Sohn

Main category: eess.AS

TL;DR: The paper proposes Audio-Centric Query Generation and Sound-Aware Ordinal Counting loss to address visual bias in audiovisual instance segmentation, improving performance on AVISeg benchmark.

Details

Motivation: Existing methods suffer from visual bias due to uniform additive fusion preventing query specialization to different sound sources, and visual-only training objectives allowing queries to converge to arbitrary salient objects.

Method: Audio-Centric Query Generation using cross-attention enables each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding. Sound-Aware Ordinal Counting loss explicitly supervises sounding object numbers through ordinal regression with monotonic consistency constraints.

Result: Experiments on AVISeg benchmark show consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA.

Conclusion: Query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.

Abstract: Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, while visual-only training objectives allow queries to converge to arbitrary salient objects. We propose Audio-Centric Query Generation using cross-attention, enabling each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding. Additionally, we introduce Sound-Aware Ordinal Counting (SAOC) loss that explicitly supervises sounding object numbers through ordinal regression with monotonic consistency constraints, preventing visual-only convergence during training. Experiments on AVISeg benchmark demonstrate consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.

[1644] Index-MSR: A high-efficiency multimodal fusion framework for speech recognition

Jinming Chen, Lu Wang, Zheshu Song, Wei Deng

Main category: eess.AS

TL;DR: Index-MSR is a multimodal speech recognition framework that uses video text cues (subtitles, slides) to improve ASR accuracy, reducing substitution errors by 20-50%

Details

Motivation: Current ASR systems struggle with domain-specific terminology and short utterances lacking semantic coherence, leading to degraded recognition performance

Method: Proposes a Multimodal Fusion Decoder (MFD) that incorporates text-related information from videos into speech recognition through cross-modal integration

Result: Achieves state-of-the-art accuracy on both in-house subtitle dataset and public AVSR dataset, with substitution errors reduced by 20-50%

Conclusion: The approach efficiently exploits text-related cues from video to improve speech recognition accuracy, showing strong potential for applications requiring strict audio-text synchronization like audio translation

Abstract: Driven by large scale datasets and LLM based architectures, automatic speech recognition (ASR) systems have achieved remarkable improvements in accuracy. However, challenges persist for domain-specific terminology, and short utterances lacking semantic coherence, where recognition performance often degrades significantly. In this work, we present Index-MSR, an efficient multimodal speech recognition framework. At its core is a novel Multimodal Fusion Decoder (MFD), which effectively incorporates text-related information from videos (e.g., subtitles and presentation slides) into the speech recognition. This cross-modal integration not only enhances overall ASR accuracy but also yields substantial reductions in substitution errors. Extensive evaluations on both an in-house subtitle dataset and a public AVSR dataset demonstrate that Index-MSR achieves sota accuracy, with substitution errors reduced by 20,50%. These results demonstrate that our approach efficiently exploits text-related cues from video to improve speech recognition accuracy, showing strong potential in applications requiring strict audio text synchronization, such as audio translation.

[1645] Unsupervised Speech Enhancement using Data-defined Priors

Dominik Klement, Matthew Maciejewski, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

Main category: eess.AS

TL;DR: Proposes a dual-branch encoder-decoder architecture for unsupervised speech enhancement using adversarial training with unpaired clean speech and noise data, achieving comparable performance to leading methods while highlighting the importance of clean speech data selection.

Details

Motivation: Addresses the gap between training and testing phases in supervised methods that rely on synthetic noisy speech data, which is difficult to collect at scale in real-world conditions.

Method: Uses a dual-branch encoder-decoder architecture that separates input into clean speech and residual noise, with adversarial training imposing priors on each branch using unpaired datasets of clean speech and optionally noise.

Result: Achieves performance comparable to leading unsupervised speech enhancement approaches and demonstrates that performance appears overly optimistic when in-domain clean speech data are used for prior definition.

Conclusion: The proposed unsupervised method effectively addresses the data collection challenge while revealing the critical impact of clean speech data selection on enhancement performance, cautioning against using in-domain data for prior definition.

Abstract: The majority of deep learning-based speech enhancement methods require paired clean-noisy speech data. Collecting such data at scale in real-world conditions is infeasible, which has led the community to rely on synthetically generated noisy speech. However, this introduces a gap between the training and testing phases. In this work, we propose a novel dual-branch encoder-decoder architecture for unsupervised speech enhancement that separates the input into clean speech and residual noise. Adversarial training is employed to impose priors on each branch, defined by unpaired datasets of clean speech and, optionally, noise. Experimental results show that our method achieves performance comparable to leading unsupervised speech enhancement approaches. Furthermore, we demonstrate the critical impact of clean speech data selection on enhancement performance. In particular, our findings reveal that performance may appear overly optimistic when in-domain clean speech data are used for prior definition – a practice adopted in previous unsupervised speech enhancement studies.

[1646] BFA: Real-time Multilingual Text-to-speech Forced Alignment

Abdul Rehman, Jingyao Cai, Jian-Jun Zhang, Xiaosong Yang

Main category: eess.AS

TL;DR: Bournemouth Forced Aligner (BFA) is a fast speech alignment system that combines CUPE with CTC decoding, achieving 240x speedup over MFA while providing fine-grained boundary prediction with silence modeling.

Details

Motivation: To develop a faster forced alignment system that can enable interactive speech applications by overcoming the speed limitations of existing aligners like Montreal Forced Aligner.

Method: Combines Contextless Universal Phoneme Encoder (CUPE) with CTC-based decoder, introduces explicit modeling of inter-phoneme gaps and silences, and uses hierarchical decoding strategies for fine-grained boundary prediction.

Result: Achieves competitive recall relative to MFA at relaxed tolerance levels, predicts both onset and offset boundaries, processes speech up to 240x faster than MFA, enabling faster than real-time alignment.

Conclusion: BFA’s combination of speed and silence-aware alignment opens opportunities for interactive speech applications previously constrained by slow aligners.

Abstract: We present Bournemouth Forced Aligner (BFA), a system that combines a Contextless Universal Phoneme Encoder (CUPE) with a connectionist temporal classification (CTC)based decoder. BFA introduces explicit modelling of inter-phoneme gaps and silences and hierarchical decoding strategies, enabling fine-grained boundary prediction. Evaluations on TIMIT and Buckeye corpora show that BFA achieves competitive recall relative to Montreal Forced Aligner at relaxed tolerance levels, while predicting both onset and offset boundaries for richer temporal structure. BFA processes speech up to 240x faster than MFA, enabling faster than real-time alignment. This combination of speed and silence-aware alignment opens opportunities for interactive speech applications previously constrained by slow aligners.

[1647] AI-Assisted Music Production: A User Study on Text-to-Music Models

Francesca Ronchini, Luca Comanducci, Simone Marcucci, Fabio Antonacci

Main category: eess.AS

TL;DR: Case study on how text-to-music models impact music production workflows, revealing challenges and opportunities through user studies.

Details

Motivation: Text-to-music models have revolutionized music creation but their integration into musicians' workflows remains underexplored.

Method: User study with participants producing tracks using custom tool combining TTM and source separation models, followed by semi-structured interviews and thematic analysis.

Result: Revealed key challenges, opportunities, and ethical considerations in TTM integration into music production workflows.

Conclusion: Findings provide insights into the transformative potential of TTMs in music production and the challenges in their real-world integration.

Abstract: Text-to-music models have revolutionized the creative landscape, offering new possibilities for music creation. Yet their integration into musicians workflows remains underexplored. This paper presents a case study on how TTM models impact music production, based on a user study of their effect on producers creative workflows. Participants produce tracks using a custom tool combining TTM and source separation models. Semi-structured interviews and thematic analysis reveal key challenges, opportunities, and ethical considerations. The findings offer insights into the transformative potential of TTMs in music production, as well as challenges in their real-world integration.

[1648] AudioFuse: Unified Spectral-Temporal Learning via a Hybrid ViT-1D CNN Architecture for Robust Phonocardiogram Classification

Md. Saiful Bari Siddiqui, Utsab Saha

Main category: eess.AS

TL;DR: AudioFuse fuses spectrogram and raw waveform representations for phonocardiogram classification, achieving state-of-the-art performance and superior domain shift robustness without large-scale pre-training.

Details

Motivation: Biomedical audio signals like PCGs contain diagnostic information in both spectral and temporal domains, but standard spectrograms compromise phase information and temporal precision while 1D waveforms lack rich spectral features.

Method: Proposed AudioFuse architecture that simultaneously learns from both representations using a custom wide-and-shallow Vision Transformer for spectrograms and a shallow 1D CNN for raw waveforms to mitigate overfitting.

Result: Achieved competitive ROC-AUC of 0.8608 on PhysioNet 2016 dataset, outperforming spectrogram (0.8066) and waveform (0.8223) baselines. Showed superior robustness on PASCAL dataset with ROC-AUC 0.7181 while spectrogram baseline collapsed to 0.4873.

Conclusion: Fusing complementary representations provides strong inductive bias for creating efficient, generalizable classifiers without requiring large-scale pre-training.

Abstract: Biomedical audio signals, such as phonocardiograms (PCG), are inherently rhythmic and contain diagnostic information in both their spectral (tonal) and temporal domains. Standard 2D spectrograms provide rich spectral features but compromise the phase information and temporal precision of the 1D waveform. We propose AudioFuse, an architecture that simultaneously learns from both complementary representations to classify PCGs. To mitigate the overfitting risk common in fusion models, we integrate a custom, wide-and-shallow Vision Transformer (ViT) for spectrograms with a shallow 1D CNN for raw waveforms. On the PhysioNet 2016 dataset, AudioFuse achieves a state-of-the-art competitive ROC-AUC of 0.8608 when trained from scratch, outperforming its spectrogram (0.8066) and waveform (0.8223) baselines. Moreover, it demonstrates superior robustness to domain shift on the challenging PASCAL dataset, maintaining an ROC-AUC of 0.7181 while the spectrogram baseline collapses (0.4873). Fusing complementary representations thus provides a strong inductive bias, enabling the creation of efficient, generalizable classifiers without requiring large-scale pre-training.

[1649] LORT: Locally Refined Convolution and Taylor Transformer for Monaural Speech Enhancement

Junyu Wang, Zizhen Lin, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

Main category: eess.AS

TL;DR: LORT is a novel speech enhancement architecture that combines spatial-channel enhanced Taylor Transformer with locally refined convolution, achieving competitive performance with only 0.96M parameters.

Details

Motivation: To achieve superior speech enhancement performance while maintaining low parameter count and computational complexity, addressing the challenge of efficient and robust speech enhancement.

Method: Proposes LORT architecture with Taylor multi-head self-attention (T-MSA) enhanced with spatial-channel enhancement attention (SCEA) for global modeling, and locally refined convolution (LRC) blocks for local detail capture. Uses U-Net-like encoder-decoder structure with only 16 output channels.

Result: Achieves competitive or superior performance to state-of-the-art models on VCTK+DEMAND and DNS Challenge datasets with only 0.96M parameters.

Conclusion: LORT is an effective solution for real-world speech enhancement applications with limited computational resources, demonstrating high performance with minimal parameters.

Abstract: Achieving superior enhancement performance while maintaining a low parameter count and computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates spatial-channel enhanced Taylor Transformer and locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time-frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder-decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules using alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized through a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves competitive or superior performance to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.

[1650] AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines

Cancan Li, Fei Su, Juan Liu, Hui Bu, Yulong Wan, Hongbin Suo, Ming Li

Main category: eess.AS

TL;DR: AISHELL6-Whisper is a large-scale Chinese mandarin audio-visual whisper speech dataset with 30 hours each of whisper and normal speech, plus synchronized facial videos. An AVSR baseline using Whisper-Flamingo framework achieves 4.13% CER for whisper speech and 1.11% for normal speech.

Details

Motivation: Chinese mandarin audio-visual whisper speech recognition development is hindered by lack of large-scale datasets. Whisper speech is crucial for privacy in sensitive communications, medical patients under vocal restraint, and noise-sensitive environments.

Method: Proposed an audio-visual speech recognition baseline based on Whisper-Flamingo framework, integrating parallel training strategy to align embeddings across speech types, and employing projection layer to adapt to whisper speech’s spectral properties.

Result: Achieved Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech in the test set. Established new state-of-the-art results on the wTIMIT benchmark.

Conclusion: The AISHELL6-Whisper dataset and AVSR baseline successfully address the data scarcity issue for Chinese mandarin whisper speech recognition, achieving strong performance and advancing the field.

Abstract: Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications but also for providing a critical communication bridge for patients under vocal restraint and enabling discrete interaction in noise-sensitive environments. The development of Chinese mandarin audio-visual whisper speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset, featuring 30 hours each of whisper speech and parallel normal speech, with synchronized frontal facial videos. Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types, and employs a projection layer to adapt to whisper speech’s spectral properties. The model achieves a Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech in the test set of our dataset, and establishes new state-of-the-art results on the wTIMIT benchmark. The dataset and the AVSR baseline codes are open-sourced at https://zutm.github.io/AISHELL6-Whisper.

[1651] Reasoning Beyond Majority Vote: An Explainable SpeechLM Framework for Speech Emotion Recognition

Bo-Hao Su, Hui-Ying Shih, Jinchuan Tian, Jiatong Shi, Chi-Chun Lee, Carlos Busso, Shinji Watanabe

Main category: eess.AS

TL;DR: Proposes an explainable Speech Language Model framework for Speech Emotion Recognition that generates both emotion labels and natural-language rationales grounded in lexical/acoustic cues, using teacher LLM-generated rationales as supervision.

Details

Motivation: Traditional SER uses majority-voted labels which mask subjectivity, neglect minority annotations, and limit interpretability. Need for transparent predictions that explain why emotions are assigned.

Method: Frames SER as generative reasoning task: model produces transcript, then outputs emotion label + concise rationale. Uses reasoning-capable teacher LLM to generate rationales as intermediate supervision during fine-tuning, combined with majority labels.

Result: Model maintains improvements over zero-shot SpeechLM baselines on MSP-Podcast v1.12. Produces rationales that human evaluators find plausible and well grounded. Preserves competitive performance while enhancing explainability.

Conclusion: Incorporating rationale supervision offers practical path toward interpretable SER without sacrificing predictive quality, addressing limitations of majority-voting approaches.

Abstract: Speech Emotion Recognition (SER) is typically trained and evaluated on majority-voted labels, which simplifies benchmarking but masks subjectivity and provides little transparency into why predictions are made. This neglects valid minority annotations and limits interpretability. We propose an explainable Speech Language Model (SpeechLM) framework that frames SER as a generative reasoning task. Given an utterance, the model first produces a transcript, then outputs both an emotion label and a concise natural-language rationale grounded in lexical and acoustic cues. Rationales are generated by a reasoning-capable teacher LLM and used as intermediate supervision, combined with majority labels during fine-tuning. Unlike prior work primarily focused on boosting classification accuracy, we aim to enhance explainability while preserving competitive performance. To this end, we complement majority-label metrics with annotator-aware scoring that credits matches with any annotator label. On MSP-Podcast v1.12, our model maintains improvements over zero-shot SpeechLM baselines, and produces rationales that human evaluators find plausible and well grounded. This demonstrates that incorporating rationale supervision offers a practical path toward interpretable SER without sacrificing predictive quality.

[1652] SynthCloner: Synthesizer Preset Conversion via Factorized Codec with ADSR Envelope Control

Jeng-Yue Liu, Ting-Chao Hsu, Yen-Tung Yeh, Li Su, Yi-Hsuan Yang

Main category: eess.AS

TL;DR: SynthCloner is a factorized codec model that disentangles audio into ADSR envelope, timbre, and content attributes for expressive synthesizer preset conversion with independent control.

Details

Motivation: Existing timbre transfer methods offer limited control over envelope shaping, and public synthesizer datasets lack diverse coverage of timbres and ADSR envelopes.

Method: Proposed SynthCloner factorized codec model that separates audio into three attributes: ADSR envelope, timbre, and content. Also introduced SynthCAT dataset with 250 timbres, 120 ADSR envelopes, and 100 MIDI sequences.

Result: SynthCloner outperforms baselines on both objective and subjective metrics while enabling independent attribute control.

Conclusion: The approach successfully addresses preset conversion challenges by disentangling audio attributes and provides a comprehensive dataset for synthesizer research.

Abstract: Electronic synthesizer sounds are controlled by presets, parameters settings that yield complex timbral characteristics and ADSR envelopes, making preset conversion particularly challenging. Recent approaches to timbre transfer often rely on spectral objectives or implicit style matching, offering limited control over envelope shaping. Moreover, public synthesizer datasets rarely provide diverse coverage of timbres and ADSR envelopes. To address these gaps, we present SynthCloner, a factorized codec model that disentangles audio into three attributes: ADSR envelope, timbre, and content. This separation enables expressive synthesizer preset conversion with independent control over these three attributes. Additionally, we introduce SynthCAT, a new synthesizer dataset with a task-specific rendering pipeline covering 250 timbres, 120 ADSR envelopes, and 100 MIDI sequences. Experiments show that SynthCloner outperforms baselines on both objective and subjective metrics, while enabling independent attribute control. The code, model checkpoint, and audio examples are available at https://buffett0323.github.io/synthcloner/.

[1653] Code-switching Speech Recognition Under the Lens: Model- and Data-Centric Perspectives

Hexin Liu, Haoyang Zhang, Qiquan Zhang, Xiangyu Zhang, Dongyuan Shi, Eng Siong Chng, Haizhou Li

Main category: eess.AS

TL;DR: This paper analyzes code-switching automatic speech recognition (CS-ASR) from model-centric and data-centric perspectives, addressing challenges of language confusion, accent bias, and data scarcity through algorithmic comparisons and novel data augmentation methods including a prompting strategy called SECT.

Details

Motivation: CS-ASR faces challenges from language confusion due to intra-sentence switching, accent bias blurring phonetic boundaries, and scarcity of annotated code-switching data, even when constituent languages are high-resource.

Method: Systematic analysis of CS-ASR comparing state-of-the-art algorithmic methods (language-specific processing, multi-task learning), TTS data augmentation with varied textual characteristics and accents, and a novel prompting strategy SECT that simplifies equivalence constraint theory to guide LLMs in generating valid code-switching text.

Result: SECT outperforms existing methods in ASR performance and linguistic quality, generating code-switching text that more closely resembles real-world code-switching. When used with TTS to create speech-text pairs, SECT effectively improves CS-ASR performance.

Conclusion: Effective CS-ASR requires strategies carefully aligned with the specific linguistic characteristics of code-switching data, with both model-centric and data-centric approaches playing crucial roles in addressing the unique challenges of code-switching scenarios.

Abstract: Code-switching automatic speech recognition (CS-ASR) presents unique challenges due to language confusion introduced by spontaneous intra-sentence switching and accent bias that blurs the phonetic boundaries. Although the constituent languages may be individually high-resource, the scarcity of annotated code-switching data further compounds these challenges. In this paper, we systematically analyze CS-ASR from both model-centric and data-centric perspectives. By comparing state-of-the-art algorithmic methods, including language-specific processing and auxiliary language-aware multi-task learning, we discuss their varying effectiveness across datasets with different linguistic characteristics. On the data side, we first investigate TTS as a data augmentation method. By varying the textual characteristics and speaker accents, we analyze the impact of language confusion and accent bias on CS-ASR. To further mitigate data scarcity and enhance textual diversity, we propose a prompting strategy by simplifying the equivalence constraint theory (SECT) to guide large language models (LLMs) in generating linguistically valid code-switching text. The proposed SECT outperforms existing methods in ASR performance and linguistic quality assessments, generating code-switching text that more closely resembles real-world code-switching text. When used to generate speech-text pairs via TTS, SECT proves effective in improving CS-ASR performance. Our analysis of both model- and data-centric methods underscores that effective CS-ASR requires strategies to be carefully aligned with the specific linguistic characteristics of the code-switching data.

[1654] Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance

Runwu Shi, Kai Li, Chang Li, Jiang Wang, Sihan Tan, Kazuhiro Nakadai

Main category: eess.AS

TL;DR: This paper proposes an unsupervised speech separation method using diffusion models with speaker-embedding guidance to maintain temporal speaker consistency.

Details

Motivation: Traditional supervised speech separation systems rely on synthetic data that may not reflect real-world conditions. The authors revisit source-model paradigm to address limitations of unconditional diffusion models that lack speaker-level conditioning.

Method: Train diffusion generative model on anechoic speech and formulate separation as diffusion inverse problem. Propose speaker-embedding guidance to maintain speaker coherence and separation-oriented solver for speech separation.

Result: The proposed strategies effectively enhance performance on unsupervised source-model-based speech separation, as confirmed by extensive experimental results.

Conclusion: The speaker-embedding guidance and separation-oriented solver successfully address temporal speaker inconsistency in diffusion-based speech separation, providing an effective unsupervised alternative to supervised methods.

Abstract: Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems typically rely on synthetic data pipelines, which may not reflect real-world conditions. Instead, we revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning, they can capture local acoustic structure but produce temporally inconsistent speaker identities in separated sources. To address this limitation, we propose Speaker-Embedding guidance that, during the reverse diffusion process, maintains speaker coherence within each separated track while driving embeddings of different speakers further apart. In addition, we propose a new separation-oriented solver tailored for speech separation, and both strategies effectively enhance performance on the challenging task of unsupervised source-model-based speech separation, as confirmed by extensive experimental results. Audio samples and code are available at https://runwushi.github.io/UnSepDiff_demo.

[1655] Assessing speech quality metrics for evaluation of neural audio codecs under clean speech conditions

Wolfgang Mack, Nezih Topaloglu, Laura Lechler, Ivana Balić, Alexandra Craciun, Mansur Yesilbursa, Kamil Wojcicki

Main category: eess.AS

TL;DR: Evaluation of 45 objective speech-quality metrics for neural codecs, finding neural-based metrics like scoreq and utmos correlate best with subjective scores, with non-intrusive metrics saturating at high quality levels.

Details

Motivation: To determine which objective speech-quality metrics provide reliable quality estimates for neural codecs, as it's often unclear which metrics to trust.

Method: Evaluated 45 objective metrics by correlating their scores with subjective listening scores for clean speech across 17 different codec conditions.

Result: Neural-based metrics (scoreq and utmos) achieved the highest Pearson correlations with subjective scores. Non-intrusive metrics tend to saturate at high subjective quality levels.

Conclusion: Neural-based objective metrics provide the most reliable quality estimates for neural codecs, while non-intrusive metrics have limitations at high quality ranges.

Abstract: Objective speech-quality metrics are widely used to assess codec performance. However, for neural codecs, it is often unclear which metrics provide reliable quality estimates. To address this, we evaluated 45 objective metrics by correlating their scores with subjective listening scores for clean speech across 17 codec conditions. Neural-based metrics such as scoreq and utmos achieved the highest Pearson correlations with subjective scores. Further analysis across different subjective quality ranges revealed that non-intrusive metrics tend to saturate at high subjective quality levels.

[1656] ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark

Yun Chen, Qi Chen, Zheqi Dai, Arshdeep Singh, Philip J. B. Jackson, Mark D. Plumbley

Main category: eess.AS

TL;DR: The paper introduces ISSE, an Instruction-guided Speech Style Editing Dataset with nearly 400 hours of speech and over 100,000 source-target pairs aligned with detailed textual editing instructions, enabling more flexible and controllable speech style editing.

Details

Motivation: Existing speech style editing approaches depend on explicit labels or reference audio, limiting flexibility and scalability. Recent methods using natural language descriptions are constrained by oversimplified instructions and coarse style control.

Method: Built a systematic instructed speech data generation pipeline using large language models, expressive text-to-speech, and voice conversion technologies to construct high-quality paired samples. Trained an instruction-guided autoregressive speech model on the ISSE dataset.

Result: Experimental results show that ISSE enables accurate, controllable, and generalizable speech style editing compared to other datasets, achieving better instruction adherence, timbre preservation, and content consistency.

Conclusion: The ISSE dataset addresses limitations of existing approaches by providing diverse and detailed textual editing instructions, enabling more flexible and scalable speech style editing while preserving linguistic content and speaker identity.

Abstract: Speech style editing refers to modifying the stylistic properties of speech while preserving its linguistic content and speaker identity. However, most existing approaches depend on explicit labels or reference audio, which limits both flexibility and scalability. More recent attempts to use natural language descriptions remain constrained by oversimplified instructions and coarse style control. To address these limitations, we introduce an Instruction-guided Speech Style Editing Dataset (ISSE). The dataset comprises nearly 400 hours of speech and over 100,000 source-target pairs, each aligned with diverse and detailed textual editing instructions. We also build a systematic instructed speech data generation pipeline leveraging large language model, expressive text-to-speech and voice conversion technologies to construct high-quality paired samples. Furthermore, we train an instruction-guided autoregressive speech model on ISSE and evaluate it in terms of instruction adherence, timbre preservation, and content consistency. Experimental results demonstrate that ISSE enables accurate, controllable, and generalizable speech style editing compared to other datasets. The project page of ISSE is available at https://ychenn1.github.io/ISSE/.

[1657] Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

Tianrui Wang, Haoyu Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Ziyang Ma, Zikang Huang, Guanrou Yang, Xiaobao Wang, Eng Siong Chng, Xie Chen, Longbiao Wang, Jianwu Dang

Main category: eess.AS

TL;DR: WeSCon is a self-training framework that enables word-level control of emotion and speaking rate in zero-shot TTS models without requiring datasets with intra-sentence emotional transitions.

Details

Motivation: Existing emotional TTS research is limited to utterance-level expression and lacks word-level control capabilities, facing challenges in modeling multi-emotion transitions and data scarcity for intra-sentence emotional variation.

Method: Proposes a self-training framework with transition-smoothing strategy, dynamic speed control mechanism, dynamic emotional attention bias, and multi-round inference process to enable word-level expressive control in pretrained TTS models.

Result: WeSCon effectively overcomes data scarcity and achieves state-of-the-art performance in word-level emotional expression control while preserving the original TTS model’s zero-shot synthesis capabilities.

Conclusion: The proposed framework successfully enables fine-grained word-level control of emotion and speaking rate in TTS systems without requiring specialized datasets, representing a significant advancement in expressive speech synthesis.

Abstract: While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.

[1658] Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing

Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen

Main category: eess.AS

TL;DR: A zero-shot source tracing framework using SSL-AASIST for attack classification, with zero-shot (cosine/Siamese) and few-shot (MLP/Siamese) backend scoring. Few-shot works better in closed-set, zero-shot in open-set scenarios.

Details

Motivation: To develop a source tracing framework inspired by speaker verification that can handle both closed-set and open-set scenarios effectively.

Method: Adapt SSL-AASIST system for attack classification with disjoint training/verification attacks. Use zero-shot (cosine similarity, Siamese) and few-shot (MLP, Siamese) backend scoring approaches.

Result: In closed-set: few-shot MLP (15.11% EER) and Siamese (18.44% EER) beat zero-shot cosine (27.14% EER). In open-set: zero-shot cosine (21.70% EER) outperforms few-shot MLP (22.65% EER) and Siamese (27.40% EER).

Conclusion: Few-shot learning excels in closed-set scenarios while zero-shot approaches are more effective for open-set source tracing tasks.

Abstract: We propose a novel zero-shot source tracing framework inspired by advances in speaker verification. Specifically, we adapt the SSL-AASIST system for attack classification, ensuring that the attacks used for training are disjoint from those used to form fingerprint-trial pairs. For backend scoring in attack verification, we explore both zero-shot approaches (cosine similarity and Siamese) and few-shot approaches (MLP and Siamese). Experiments on our recently introduced STOPA dataset suggest that few-shot learning provides advantages in the closed-set scenario, while zero-shot approaches perform better in the open-set scenario. In closed-set trials, few-shot Siamese and MLP achieve equal error rates (EER) of 18.44% and 15.11%, compared to 27.14% for zero-shot cosine scoring. Conversely, in open-set trials, zero-shot cosine scoring reaches 21.70%, outperforming few-shot Siamese and MLP at 27.40% and 22.65%, respectively.

[1659] SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

Xingchen Li, Hanke Xie, Ziqian Wang, Zihan Zhang, Longshuai Xiao, Lei Xie

Main category: eess.AS

TL;DR: SenSE is a generative universal speech enhancement method that integrates semantic information from language models into flow-matching-based enhancement to address semantic ambiguity and improve speech quality under severe distortions.

Details

Motivation: Existing generative speech enhancement methods lack awareness of high-level semantic information, causing semantic ambiguity and acoustic discontinuities. Humans comprehend corrupted speech using semantic priors, suggesting semantics are crucial for enhancement.

Method: Proposes SenSE framework that uses a semantic-aware speech language model to capture semantics from degraded speech and generate semantic tokens. Includes semantic guidance mechanism to integrate semantic information into flow-matching-based enhancement, and prompt guidance using reference utterances to maintain speaker similarity.

Result: Demonstrates high perceptual quality and substantially improved speech fidelity on benchmark datasets while maintaining strong robustness under severe distortions.

Conclusion: Integrating semantic information through language models effectively mitigates semantic ambiguity in speech enhancement and improves performance under challenging distortion conditions.

Abstract: Generative universal speech enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. Diffusion- or flow-based generative models are capable of producing enhanced speech with high quality and fidelity. However, they typically achieve speech enhancement by learning an acoustic feature mapping from degraded speech to clean speech, while lacking awareness of high-level semantic information. This deficiency tends to cause semantic ambiguity and acoustic discontinuities in the enhanced speech. In contrast, humans can often comprehend heavily corrupted speech by relying on semantic priors, suggesting that semantics play a crucial role in speech enhancement. Therefore, in this paper, we propose SenSE, which leverages a language model to capture the semantic information of distorted speech and effectively integrates it into a flow-matching-based speech enhancement framework. Specifically, we introduce a semantic-aware speech language model to capture the semantics of degraded speech and generate semantic tokens. We then design a semantic guidance mechanism that incorporates semantic information into the flow-matching-based speech enhancement process, effectively mitigating semantic ambiguity. In addition, we propose a prompt guidance mechanism, which leverages a short reference utterance to alleviate the loss of speaker similarity under severe distortion conditions. The results of several benchmark data sets demonstrate that SenSE not only ensures high perceptual quality but also substantially improves speech fidelity while maintaining strong robustness under severe distortions. Codes and demos are available.

[1660] Deep Learning-Based Prediction of Energy Decay Curves from Room Geometry and Material Properties

Imran Muhammad, Gerald Schuller

Main category: eess.AS

TL;DR: Deep learning framework using LSTM networks to predict room energy decay curves from geometry and absorption data, achieving accurate estimation of acoustic parameters like EDT, T20, and C50.

Details

Motivation: Enable robust room acoustics analysis and reliable estimation of key acoustic parameters through accurate prediction of energy decay curves.

Method: Used 6000 shoebox rooms with realistic dimensions, source-receiver placements, and frequency-dependent wall absorptions. Simulated room impulse responses using Pyroomacoustics, computed target EDCs, and trained LSTM network on normalized room features.

Result: Achieved close agreement between predicted and target EDCs with EDT MAE 0.017 s, T20 MAE 0.021 s. Model generalizes across diverse rooms.

Conclusion: The approach supports efficient room-acoustics modeling for early-stage design and real-time applications.

Abstract: Accurate prediction of energy decay curves (EDCs) enables robust analysis of room acoustics and reliable estimation of key parameters. We present a deep learning framework that predicts EDCs directly from room geometry and surface absorption. A dataset of 6000 shoebox rooms with realistic dimensions, source-receiver placements, and frequency-dependent wall absorptions was synthesized. For each configuration we simulate room impulse responses (RIRs) using Pyroomacoustics and compute target EDCs. Normalized room features are provided to a long short-term memory (LSTM) network that maps configuration to EDC. Performance is evaluated with mean absolute error (MAE) and root mean square error (RMSE) over time. We further derive early decay time (EDT), reverberation time (T20), and clarity index (C50) from predicted and target EDCs; close agreement is observed (e.g., EDT MAE 0.017 s, T20 MAE 0.021 s). The approach generalizes across diverse rooms and supports efficient room-acoustics modeling for early-stage design and real-time applications.

[1661] VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

Main category: eess.AS

TL;DR: VSSFlow unifies video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks in a single flow-matching framework using cross-attention for video conditions and self-attention for speech transcripts, achieving state-of-the-art performance through joint learning.

Details

Motivation: Current approaches treat V2S and VisualTTS as separate tasks with complex training stages, lacking a unified framework that can handle both heterogeneous video and transcript conditions efficiently.

Method: VSSFlow uses a flow-matching framework with condition aggregation mechanism leveraging cross-attention for ambiguous video conditions and self-attention for deterministic speech transcripts, enabling end-to-end joint training.

Result: VSSFlow surpasses state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, benefiting from learned general audio prior that accelerates convergence and enhances conditional generation.

Conclusion: Unified generative models like VSSFlow demonstrate critical potential for handling multiple audio generation tasks effectively through shared representations and joint learning.

Abstract: Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, are conventionally addressed as separate tasks, with limited exploration to unify them within a signle framework. Recent attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layer exhibit different inductive biases in the process of introducing condition. Therefore, VSSFlow leverages these inductive biases to effectively handle different representations: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from the end-to-end joint learning process for sound and speech generation without extra designs on training stages. Detailed analysis attributes it to the learned general audio prior shared between tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses the state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.

[1662] Room Impulse Response Prediction with Neural Networks: From Energy Decay Curves to Perceptual Validation

Imran Muhammad, Gerald Schuller

Main category: eess.AS

TL;DR: Neural network framework predicts room impulse responses from room parameters using energy decay curves and reverse-differentiation, achieving perceptually indistinguishable results from reference simulations.

Details

Motivation: Conventional room impulse response simulations and measurements are computationally expensive and time-consuming, creating need for efficient prediction methods for room acoustics and audio applications.

Method: Uses neural network to predict energy decay curves from room dimensions, material absorption, and source-receiver positions, then reconstructs RIRs via reverse-differentiation. Trained on large dataset from acoustic simulations with realistic parameters.

Result: Objective evaluation shows low RMSE for EDCs and good correlation/MSE/spectral similarity for RIRs. MUSHRA listening test confirms no significant perceptual differences between predicted and reference RIRs.

Conclusion: Proposed framework provides accurate and perceptually reliable RIR predictions, offering scalable solution for practical acoustic modeling and audio rendering applications.

Abstract: Prediction of room impulse responses (RIRs) is essential for room acoustics, spatial audio, and immersive applications, yet conventional simulations and measurements remain computationally expensive and time-consuming. This work proposes a neural network framework that predicts energy decay curves (EDCs) from room dimensions, material absorption coefficients, and source-receiver positions, and reconstructs corresponding RIRs via reverse-differentiation. A large training dataset was generated using room acoustic simulations with realistic geometries, frequency-dependent absorption, and diverse source-receiver configurations. Objective evaluation employed root mean squared error (RMSE) and a custom loss for EDCs, as well as correlation, mean squared error (MSE), spectral similarity for reconstructed RIRs. Perceptual validation through a MUSHRA listening test confirmed no significant perceptual differences between predicted and reference RIRs. The results demonstrate that the proposed framework provides accurate and perceptually reliable RIR predictions, offering a scalable solution for practical acoustic modeling and audio rendering applications.

[1663] SAGA-SR: Semantically and Acoustically Guided Audio Super-Resolution

Jaekwon Im, Juhan Nam

Main category: eess.AS

TL;DR: SAGA-SR is a versatile audio super-resolution model that combines semantic and acoustic guidance to upsample audio from 4-32 kHz to 44.1 kHz, achieving state-of-the-art performance across speech, music, and sound effects.

Details

Motivation: Existing diffusion-based audio SR methods often fail to produce semantically aligned outputs and struggle with consistent high-frequency reconstruction across diverse audio domains.

Method: Uses a DiT backbone trained with flow matching objective, conditioned on text and spectral roll-off embeddings to provide effective semantic and acoustic guidance.

Result: Robustly upsamples audio from arbitrary input sampling rates (4-32 kHz) to 44.1 kHz, achieving state-of-the-art performance in both objective and subjective evaluations across all test cases.

Conclusion: SAGA-SR effectively combines semantic and acoustic guidance to overcome limitations of existing methods and deliver superior audio super-resolution performance across diverse domains.

Abstract: Versatile audio super-resolution (SR) aims to predict high-frequency components from low-resolution audio across diverse domains such as speech, music, and sound effects. Existing diffusion-based SR methods often fail to produce semantically aligned outputs and struggle with consistent high-frequency reconstruction. In this paper, we propose SAGA-SR, a versatile audio SR model that combines semantic and acoustic guidance. Based on a DiT backbone trained with a flow matching objective, SAGA-SR is conditioned on text and spectral roll-off embeddings. Due to the effective guidance provided by its conditioning, SAGA-SR robustly upsamples audio from arbitrary input sampling rates between 4 kHz and 32 kHz to 44.1 kHz. Both objective and subjective evaluations show that SAGA-SR achieves state-of-the-art performance across all test cases. Sound examples and code for the proposed model are available online.

[1664] TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes

Adriana Stan, David Combei, Dan Oneata, Horia Cucu

Main category: eess.AS

TL;DR: A training-free kNN-based approach using pre-trained SSL models achieves high accuracy in audio deepfake model attribution (0.93 F1-score) and strong out-of-domain detection (0.84 F1-score).

Details

Motivation: Deepfake detection has high accuracy but identifying the exact source/model behind deepfakes remains understudied, especially for audio deepfake model attribution.

Method: Training-free green AI approach using k-Nearest Neighbors with pre-trained self-supervised learning models, requiring no additional training.

Result: Achieved 0.93 F1-score across five deepfake datasets for model attribution and 0.84 F1-score for out-of-domain detection of unseen models.

Conclusion: The proposed training-free kNN method effectively solves audio deepfake model attribution and shows strong generalization to unseen models, providing a green AI solution.

Abstract: Deepfake detection has gained significant attention across audio, text, and image modalities, with high accuracy in distinguishing real from fake. However, identifying the exact source–such as the system or model behind a deepfake–remains a less studied problem. In this paper, we take a significant step forward in audio deepfake model attribution or source tracing by proposing a training-free, green AI approach based entirely on k-Nearest Neighbors (kNN). Leveraging a pre-trained self-supervised learning (SSL) model, we show that grouping samples from the same generator is straightforward–we obtain an 0.93 F1-score across five deepfake datasets. The method also demonstrates strong out-of-domain (OOD) detection, effectively identifying samples from unseen models at an F1-score of 0.84. We further analyse these results in a multi-dimensional approach and provide additional insights. All code and data protocols used in this work are available in our open repository: https://github.com/adrianastan/tada/.

[1665] Unmasking real-world audio deepfakes: A data-centric approach

David Combei, Adriana Stan, Dan Oneata, Nicolas Müller, Horia Cucu

Main category: eess.AS

TL;DR: The paper introduces a real-world audio deepfake dataset (AI4T) and demonstrates that data-centric approaches (curation, pruning, augmentation) significantly improve detection performance over complex models.

Details

Motivation: Existing deepfake detection systems are evaluated on scientific datasets, creating a gap with real-world deepfakes that pose significant challenges to current models.

Method: Data-centric paradigm using dataset curation, pruning, and augmentation strategies to improve model robustness and generalization, rather than increasing model complexity.

Result: Achieved 55% relative reduction in EER on In-the-Wild dataset (absolute EER 1.7%) and 63% reduction on the new AI4T real-world deepfake dataset.

Conclusion: Data-centric approaches have transformative potential for enhancing deepfake detection in real-world applications, with code and data made publicly available.

Abstract: The growing prevalence of real-world deepfakes presents a critical challenge for existing detection systems, which are often evaluated on datasets collected just for scientific purposes. To address this gap, we introduce a novel dataset of real-world audio deepfakes. Our analysis reveals that these real-world examples pose significant challenges, even for the most performant detection models. Rather than increasing model complexity or exhaustively search for a better alternative, in this work we focus on a data-centric paradigm, employing strategies like dataset curation, pruning, and augmentation to improve model robustness and generalization. Through these methods, we achieve a 55% relative reduction in EER on the In-the-Wild dataset, reaching an absolute EER of 1.7%, and a 63% reduction on our newly proposed real-world deepfakes dataset, AI4T. These results highlight the transformative potential of data-centric approaches in enhancing deepfake detection for real-world applications. Code and data available at: https://github.com/davidcombei/AI4T.

[1666] SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding

Bingsong Bai, Qihang Lu, Wenbing Yang, Zihan Sun, Yueran Hou, Peilei Jia, Songbai Pu, Ruibo Fu, Yingming Gao, Ya Li, Jun Gao

Main category: eess.AS

TL;DR: Proposed automated framework for generating large-scale paralinguistic data, creating SynParaSpeech dataset with 6 categories and 118.75 hours of precisely timestamped data from natural conversations.

Details

Motivation: Existing methods rely on proprietary datasets, while public resources have issues with incomplete speech, inaccurate timestamps, and limited real-world relevance for paralinguistic sounds like laughter and sighs.

Method: Developed automated framework for generating large-scale paralinguistic data, applied to construct SynParaSpeech dataset with precise timestamps derived from natural conversational speech.

Result: Created SynParaSpeech dataset with 6 paralinguistic categories containing 118.75 hours of data with precise timestamps, all from natural conversational speech.

Conclusion: Introduced first automated method for large-scale paralinguistic datasets and released SynParaSpeech corpus, advancing speech generation through natural paralinguistic synthesis and improving paralinguistic event detection for speech understanding.

Abstract: Paralinguistic sounds, like laughter and sighs, are crucial for synthesizing more realistic and engaging speech. However, existing methods typically depend on proprietary datasets, while publicly available resources often suffer from incomplete speech, inaccurate or missing timestamps, and limited real-world relevance. To address these problems, we propose an automated framework for generating large-scale paralinguistic data and apply it to construct the SynParaSpeech dataset. The dataset comprises 6 paralinguistic categories with 118.75 hours of data and precise timestamps, all derived from natural conversational speech. Our contributions lie in introducing the first automated method for constructing large-scale paralinguistic datasets and releasing the SynParaSpeech corpus, which advances speech generation through more natural paralinguistic synthesis and enhances speech understanding by improving paralinguistic event detection. The dataset and audio samples are available at https://github.com/ShawnPi233/SynParaSpeech.

eess.IV

[1667] VIRTUS-FPP: Virtual Sensor Modeling for Fringe Projection Profilometry in NVIDIA Isaac Sim

Adam Haroon, Anush Lakshman, Badrinath Balasubramaniam, Beiwen Li

Main category: eess.IV

TL;DR: VIRTUS-FPP is a physics-based virtual sensor modeling framework for fringe projection profilometry (FPP) built in NVIDIA Isaac Sim, enabling end-to-end modeling from calibration to reconstruction with full mathematical fidelity to structured light principles.

Details

Motivation: Traditional FPP faces limitations including complex calibration requirements, bulky system footprint, and sensitivity to environmental conditions, which this framework aims to overcome through virtual simulation.

Method: Leverages physics-based rendering and programmable sensing capabilities of NVIDIA Isaac Sim to create comprehensive virtual sensor modeling, including virtual calibration and digital twin replication of physical FPP systems.

Result: The framework accurately models optical phenomena critical to FPP and achieves results comparable to real-world systems, with validation through quantitative comparison against ground truth geometry and correspondence between virtual and real-world measurements.

Conclusion: VIRTUS-FPP significantly accelerates real-world FPP system development by enabling rapid virtual prototyping before physical implementation, offering unprecedented flexibility for system configuration, sensor prototyping, and environmental control.

Abstract: Fringe projection profilometry (FPP) has been established as a high-accuracy 3D reconstruction method capable of achieving sub-pixel accuracy. However, this technique faces significant constraints due to complex calibration requirements, bulky system footprint, and sensitivity to environmental conditions. To address these limitations, we present VIRTUS-FPP, the first comprehensive physics-based virtual sensor modeling framework for FPP built in NVIDIA Isaac Sim. By leveraging the physics-based rendering and programmable sensing capabilities of simulation, our framework enables end-to-end modeling from calibration to reconstruction with full mathematical fidelity to the underlying principles of structured light. We conduct comprehensive virtual calibration and validate our system’s reconstruction accuracy through quantitative comparison against ground truth geometry. Additionally, we demonstrate the ability to model the virtual system as a digital twin by replicating a physical FPP system in simulation and validating correspondence between virtual and real-world measurements. Experimental results demonstrate that VIRTUS-FPP accurately models optical phenomena critical to FPP and achieves results comparable to real-world systems while offering unprecedented flexibility for system configuration, sensor prototyping, and environmental control. This framework significantly accelerates the development of real-world FPP systems by enabling rapid virtual prototyping before physical implementation.

[1668] Explainable Deep Learning for Cataract Detection in Retinal Images: A Dual-Eye and Knowledge Distillation Approach

MohammadReza Abbaszadeh Bavil Soflaei, Karim SamadZamini

Main category: eess.IV

TL;DR: Deep learning pipeline achieves high-accuracy cataract detection from retinal images using transformers and knowledge-distilled lightweight models, with explainable AI showing focus on medically relevant features.

Details

Motivation: Cataract is a leading cause of visual impairment worldwide, and early detection from retinal imaging is critical for timely intervention, especially in resource-limited settings.

Method: Evaluated CNNs, transformers, lightweight architectures, and knowledge-distilled models on Ocular Disease Recognition dataset with 5000 patients’ fundus photos. Developed dual-eye Siamese variant of distilled MobileNetV3 that integrates information from both eyes.

Result: Swin-Base Transformer achieved 98.58% accuracy and 0.9836 F1-score. Distilled MobileNetV3 reached 98.42% accuracy with greatly reduced computational cost. Dual-eye Siamese variant achieved 98.21% accuracy. Grad-CAM showed models focused on medically significant features like lens opacity and central blur.

Conclusion: Accurate, interpretable cataract detection is achievable even with lightweight models, supporting potential clinical integration in resource-limited settings.

Abstract: Cataract remains a leading cause of visual impairment worldwide, and early detection from retinal imaging is critical for timely intervention. We present a deep learning pipeline for cataract classification using the Ocular Disease Recognition dataset, containing left and right fundus photographs from 5000 patients. We evaluated CNNs, transformers, lightweight architectures, and knowledge-distilled models. The top-performing model, Swin-Base Transformer, achieved 98.58% accuracy and an F1-score of 0.9836. A distilled MobileNetV3, trained with Swin-Base knowledge, reached 98.42% accuracy and a 0.9787 F1-score with greatly reduced computational cost. The proposed dual-eye Siamese variant of the distilled MobileNet, integrating information from both eyes, achieved an accuracy of 98.21%. Explainability analysis using Grad-CAM demonstrated that the CNNs concentrated on medically significant features, such as lens opacity and central blur. These results show that accurate, interpretable cataract detection is achievable even with lightweight models, supporting potential clinical integration in resource-limited settings

[1669] Achieving Fair Skin Lesion Detection through Skin Tone Normalization and Channel Pruning

Zihan Wei, Tapabrata Chakraborti

Main category: eess.IV

TL;DR: Proposes ITA Loss-based skin tone normalization and meta learning-based joint channel pruning to improve fairness in skin lesion classification without significant accuracy degradation.

Details

Motivation: Deep learning models for skin lesion classification exhibit bias toward demographic attributes like race, age, and gender, and current bias mitigation methods either degrade accuracy or only address single attributes.

Method: Uses ITA for skin tone normalization and data augmentation, combined with adaptable meta learning-based joint channel pruning with nested optimization loops for finding critical channels.

Result: Experiments on ISIC2019 dataset show improved fairness on multiple sensitive attributes without significant accuracy degradation.

Conclusion: The method effectively addresses bias in skin lesion classification while maintaining accuracy, though pruning adds computational cost during training.

Abstract: Recent works have shown that deep learning based skin lesion image classification models trained on unbalanced dataset can exhibit bias toward protected demographic attributes such as race, age,and gender. Current bias mitigation methods usually either achieve high level of fairness with the degradation of accuracy, or only improve the model fairness on a single attribute. Additionally usually most bias mitigation strategies are either pre hoc through data processing or post hoc through fairness evaluation, instead of being integrated into the model learning itself. To solve these existing drawbacks, we propose a new Individual Typology Angle (ITA) Loss-based skin tone normalization and data augmentation method that directly feeds into an adaptable meta learning-based joint channel pruning framework. In skin tone normalization, ITA is used to estimate skin tone type and adjust automatically to target tones for dataset balancing. In the joint channel pruning framework, two nested optimization loops are used to find critical channels.The inner optimization loop finds and prunes the local critical channels by weighted soft nearest neighbor loss, and the outer optimization loop updates the weight of each attribute using group wise variance loss on meta-set. Experiments conducted in the ISIC2019 dataset validate the effectiveness of our method in simultaneously improving the fairness of the model on multiple sensitive attributes without significant degradation of accuracy. Finally, although the pruning mechanism adds some computational cost during training phase, usually training is done off line. More importantly,

[1670] Consistency Models as Plug-and-Play Priors for Inverse Problems

Merve Gülle, Junno Yun, Yaşar Utku Alçalar, Mehmet Akçakaya

Main category: eess.IV

TL;DR: The paper proposes PnP-CM, a plug-and-play framework that integrates consistency models as proximal operators for solving inverse problems, achieving high-quality reconstructions in just 2-4 neural function evaluations.

Details

Motivation: Existing diffusion-based inverse problem solvers are slow, and current CM-based approaches either require task-specific training or have slow convergence, making them unsuitable for large-scale problems.

Method: Reinterpret consistency models as proximal operators and integrate them into PnP-ADMM framework with conjugate gradient acceleration, noise injection, and momentum.

Result: PnP-CM achieves high-quality reconstructions in 2-4 NFEs across various inverse problems (inpainting, super-resolution, deblurring, MRI), outperforming comparable CM-based approaches.

Conclusion: The proposed PnP-CM framework effectively solves inverse problems with fast convergence and minimal NFEs, demonstrating practical utility for real-world applications.

Abstract: Diffusion models have found extensive use in solving numerous inverse problems. Such diffusion inverse problem solvers aim to sample from the posterior distribution of data given the measurements, using a combination of the unconditional score function and an approximation of the posterior related to the forward process. Recently, consistency models (CMs) have been proposed to directly predict the final output from any point on the diffusion ODE trajectory, enabling high-quality sampling in just a few NFEs. CMs have also been utilized for inverse problems, but existing CM-based solvers either require additional task-specific training or utilize data fidelity operations with slow convergence, not amenable to large-scale problems. In this work, we reinterpret CMs as proximal operators of a prior, enabling their integration into plug-and-play (PnP) frameworks. We propose a solver based on PnP-ADMM, which enables us to leverage the fast convergence of conjugate gradient method. We further accelerate this with noise injection and momentum, dubbed PnP-CM, and show it maintains the convergence properties of the baseline PnP-ADMM. We evaluate our approach on a variety of inverse problems, including inpainting, super-resolution, Gaussian deblurring, and magnetic resonance imaging (MRI) reconstruction. To the best of our knowledge, this is the first CM trained for MRI datasets. Our results show that PnP-CM achieves high-quality reconstructions in as few as 4 NFEs, and can produce meaningful results in 2 steps, highlighting its effectiveness in real-world inverse problems while outperforming comparable CM-based approaches.

[1671] Enhanced Quality Aware-Scalable Underwater Image Compression

Linwei Zhu, Junhao Zhu, Xu Zhang, Huan Zhang, Ye Li, Runmin Cong, Sam Kwong

Main category: eess.IV

TL;DR: A scalable underwater image compression framework that simultaneously performs compression and enhancement using base and enhancement layers with sparse coefficients and dual-branch filtering.

Details

Motivation: Underwater imaging faces challenges of limited bandwidth and severe distortion in aquatic environments, requiring simultaneous compression and enhancement solutions.

Method: Two-layer framework: Base Layer uses sparse coefficients for compression and shared enhancement dictionary; Enhancement Layer uses dual-branch filtering (rough filtering + detail refinement) for residual redundancy removal and quality improvement.

Result: Outperforms state-of-the-art methods on five large-scale underwater image datasets in terms of Underwater Image Quality Measure (UIQM).

Conclusion: The proposed scheme effectively addresses underwater imaging challenges by integrating compression and enhancement in a scalable framework with superior performance.

Abstract: Underwater imaging plays a pivotal role in marine exploration and ecological monitoring. However, it faces significant challenges of limited transmission bandwidth and severe distortion in the aquatic environment. In this work, to achieve the target of both underwater image compression and enhancement simultaneously, an enhanced quality-aware scalable underwater image compression framework is presented, which comprises a Base Layer (BL) and an Enhancement Layer (EL). In the BL, the underwater image is represented by controllable number of non-zero sparse coefficients for coding bits saving. Furthermore, the underwater image enhancement dictionary is derived with shared sparse coefficients to make reconstruction close to the enhanced version. In the EL, a dual-branch filter comprising rough filtering and detail refinement branches is designed to produce a pseudo-enhanced version for residual redundancy removal and to improve the quality of final reconstruction. Extensive experimental results demonstrate that the proposed scheme outperforms the state-of-the-art works under five large-scale underwater image datasets in terms of Underwater Image Quality Measure (UIQM).

[1672] Untangling Vascular Trees for Surgery and Interventional Radiology

Guillaume Houry, Tom Boeken, Stéphanie Allassonnière, Jean Feydy

Main category: eess.IV

TL;DR: A method to create 2D planar maps of 3D vascular networks that preserves topology, length, and curvature for catheter navigation assistance.

Details

Motivation: The diffusion of minimally invasive endovascular interventions requires visualization methods for complex vascular networks to aid catheter navigation.

Method: Algorithm that takes 3D digital angiography as input and produces 2D vessel maps using optimized morphological filters and a recursive embedding algorithm preserving global orientation.

Result: Produces faithful 2D maps of patient’s vessels within seconds, demonstrated on brain, pelvic and knee artery networks from peroperative images.

Conclusion: Simplifies device choice for interventions, reduces navigation failure risk, enables anatomical studies on branching patterns, and code is available as open-source.

Abstract: The diffusion of minimally invasive, endovascular interventions motivates the development of visualization methods for complex vascular networks. We propose a planar representation of blood vessel trees which preserves the properties that are most relevant to catheter navigation: topology, length and curvature. Taking as input a three-dimensional digital angiography, our algorithm produces a faithful two-dimensional map of the patient’s vessels within a few seconds. To this end, we propose optimized implementations of standard morphological filters and a new recursive embedding algorithm that preserves the global orientation of the vascular network. We showcase our method on peroperative images of the brain, pelvic and knee artery networks. On the clinical side, our method simplifies the choice of devices prior to and during the intervention. This lowers the risk of failure during navigation or device deployment and may help to reduce the gap between expert and common intervention centers. From a research perspective, our method simulates the cadaveric display of artery trees from anatomical dissections. This opens the door to large population studies on the branching patterns and tortuosity of fine human blood vessels. Our code is released under the permissive MIT license as part of the scikit-shapes Python library (https://scikit-shapes.github.io ).

[1673] On the Impact of LiDAR Point Cloud Compression on Remote Semantic Segmentation

Tiago de S. Fernandes, Ricardo L. de Queiroz

Main category: eess.IV

TL;DR: This paper analyzes the impact of LiDAR point cloud compression on semantic segmentation performance for autonomous vehicles, finding that high segmentation quality requires 0.6 MB/s for G-PCC and 2.8 MB/s for L3C2 compression.

Details

Motivation: To understand how point cloud compression affects remote cloud-based segmentation performance in smart city autonomous vehicle frameworks, and to estimate necessary bandwidth requirements for infrastructure planning.

Method: Developed a new distortion metric, tested two MPEG compression algorithms (GPCC and L3C2) and two semantic segmentation algorithms (2DPASS and PVKD) on the Semantic KITTI dataset.

Result: High segmentation quality requires communication throughput of approximately 0.6 MB/s for G-PCC and 2.8 MB/s for L3C2 compression.

Conclusion: The bandwidth requirements identified are crucial for planning infrastructure resources to support autonomous navigation in smart city environments.

Abstract: Autonomous vehicles rely on LiDAR sensors to generate 3D point clouds for accurate segmentation and object detection. In a context of a smart city framework, we would like to understand the effect that transmission (compression) can have on remote (cloud) segmentation, instead of local processing. In this short paper, we try to understand the impact of point cloud compression on semantic segmentation performance and to estimate the necessary bandwidth requirements. We developed a new (suitable) distortion metric to evaluate such an impact. Two of MPEG’s compression algorithms (GPCC and L3C2) and two leading semantic segmentation algorithms (2DPASS and PVKD) were tested over the Semantic KITTI dataset. Results indicate that high segmentation quality requires communication throughput of approximately 0.6 MB/s for G-PCC and 2.8 MB/s for L3C2. These results are important in order to plan infrastructure resources for autonomous navigation.

Md. Saiful Bari Siddiqui, Mohammed Imamul Hassan Bhuiyan

Main category: eess.IV

TL;DR: S$^3$F-Net is a dual-branch framework that combines spatial CNN with spectral analysis using Fourier transforms for medical image classification, achieving state-of-the-art performance across multiple datasets.

Details

Motivation: Convolutional Neural Networks focus on single domain spatial features, which are inefficient at capturing global patterns and fail to model frequency-domain characteristics in medical images.

Method: Proposed S$^3$F-Net with dual branches: deep spatial CNN and shallow spectral encoder (SpectraNet) using SpectralFilter layer that applies learnable filters to Fourier spectrum via element-wise multiplication for global receptive field.

Result: Consistently outperforms spatial-only baseline with up to 5.13% accuracy improvement, achieves 98.76% accuracy on BRISC2025 and 93.11% on Chest X-Ray Pneumonia dataset. Explainability shows dynamic branch reliance based on input pathology.

Conclusion: The dual-domain approach combining spatial and spectral representations is a powerful and generalizable paradigm for medical image analysis.

Abstract: Convolutional Neural Networks have become a cornerstone of medical image analysis due to their proficiency in learning hierarchical spatial features. However, this focus on a single domain is inefficient at capturing global, holistic patterns and fails to explicitly model an image’s frequency-domain characteristics. To address these challenges, we propose the Spatial-Spectral Summarizer Fusion Network (S$^3$F-Net), a dual-branch framework that learns from both spatial and spectral representations simultaneously. The S$^3$F-Net performs a fusion of a deep spatial CNN with our proposed shallow spectral encoder, SpectraNet. SpectraNet features the proposed SpectralFilter layer, which leverages the Convolution Theorem by applying a bank of learnable filters directly to an image’s full Fourier spectrum via a computation-efficient element-wise multiplication. This allows the SpectralFilter layer to attain a global receptive field instantaneously, with its output being distilled by a lightweight summarizer network. We evaluate S$^3$F-Net across four medical imaging datasets spanning different modalities to validate its efficacy and generalizability. Our framework consistently and significantly outperforms its strong spatial-only baseline in all cases, with accuracy improvements of up to 5.13%. With a powerful Bilinear Fusion, S$^3$F-Net achieves a SOTA competitive accuracy of 98.76% on the BRISC2025 dataset. Concatenation Fusion performs better on the texture-dominant Chest X-Ray Pneumonia dataset, achieving 93.11% accuracy, surpassing many top-performing, much deeper models. Our explainability analysis also reveals that the S$^3$F-Net learns to dynamically adjust its reliance on each branch based on the input pathology. These results verify that our dual-domain approach is a powerful and generalizable paradigm for medical image analysis.

[1675] Foundation Model-Based Adaptive Semantic Image Transmission for Dynamic Wireless Environments

Fangyu Liu, Peiwen Jiang, Wenjin Wang, Chao-Kai Wen, Shi Jin, Jun Zhang

Main category: eess.IV

TL;DR: A foundation model-based adaptive semantic image transmission system that decomposes images into semantic maps and compressed representations, uses task-adaptive precoding with scenario-specific channel estimation, and employs diffusion models for robust reconstruction in dynamic wireless environments.

Details

Motivation: Existing semantic transmission methods overlook varying importance of semantic components for specific tasks and insufficiently exploit wireless domain knowledge, leading to limited robustness under dynamic channel conditions.

Method: Decomposes images into semantic segmentation maps and compressed representations, uses task-adaptive precoding with radio resource allocation based on semantic importance, constructs channel estimation knowledge maps using conditional diffusion models, and employs diffusion models for image reconstruction.

Result: Outperforms existing approaches on BDD100K dataset in perceptual quality (SSIM, LPIPS, FID), task-specific accuracy (IoU), and transmission efficiency under multi-scenario channels.

Conclusion: Effectively integrates task-aware semantic decomposition, scenario-adaptive channel estimation, and diffusion-based reconstruction for robust semantic transmission in dynamic wireless environments.

Abstract: Foundation model-based semantic transmission has recently shown great potential in wireless image communication. However, existing methods exhibit two major limitations: (i) they overlook the varying importance of semantic components for specific downstream tasks, and (ii) they insufficiently exploit wireless domain knowledge, resulting in limited robustness under dynamic channel conditions. To overcome these challenges, this paper proposes a foundation model-based adaptive semantic image transmission system for dynamic wireless environments, such as autonomous driving. The proposed system decomposes each image into a semantic segmentation map and a compressed representation, enabling task-aware prioritization of critical objects and fine-grained textures. A task-adaptive precoding mechanism then allocates radio resources according to the semantic importance of extracted features. To ensure accurate channel information for precoding, a channel estimation knowledge map (CEKM) is constructed using a conditional diffusion model that integrates user position, velocity, and sparse channel samples to train scenario-specific lightweight estimators. At the receiver, a conditional diffusion model reconstructs high-quality images from the received semantic features, ensuring robustness against channel impairments and partial data loss. Simulation results on the BDD100K dataset with multi-scenario channels generated by QuaDRiGa demonstrate that the proposed method outperforms existing approaches in terms of perceptual quality (SSIM, LPIPS, FID), task-specific accuracy (IoU), and transmission efficiency. These results highlight the effectiveness of integrating task-aware semantic decomposition, scenario-adaptive channel estimation, and diffusion-based reconstruction for robust semantic transmission in dynamic wireless environments.

[1676] A University of Texas Medical Branch Case Study on Aortic Calcification Detection

Eric Walser, Peter McCaffrey, Kal Clark, Nicholas Czarnek

Main category: eess.IV

TL;DR: UTMB partnered with Zauron Labs to use AI tools for detecting aortic calcifications in chest radiographs, identifying significant miscoding and misdiagnosis rates that led to missed revenue and improved patient care.

Details

Motivation: Aortic calcifications are often underreported despite their importance for cardiovascular disease prognosis, leading to missed clinical care opportunities and revenue.

Method: Used Zauron’s AI tools including a high-performing image model (AUC = 0.938) and fine-tuned Llama 3.2 language model to retrospectively analyze imaging and report data from 3,988 patients (5,000 exams).

Result: Found 495 patients (12.4%) with aortic calcifications not properly coded for reimbursement and 84 patients (2.1%) with missed aortic calcifications during initial review, representing $314k in missed annual revenue.

Conclusion: UTMB adopted Zauron’s Guardian Pro software system-wide to ensure accurate AI-enhanced peer review and coding, improving both patient care and financial outcomes.

Abstract: This case study details The University of Texas Medical Branch (UTMB)’s partnership with Zauron Labs, Inc. to enhance detection and coding of aortic calcifications (ACs) using chest radiographs. ACs are often underreported despite their significant prognostic value for cardiovascular disease, and UTMB partnered with Zauron to apply its advanced AI tools, including a high-performing image model (AUC = 0.938) and a fine-tuned language model based on Meta’s Llama 3.2, to retrospectively analyze imaging and report data. The effort identified 495 patients out of 3,988 unique patients assessed (5,000 total exams) whose reports contained indications of aortic calcifications that were not properly coded for reimbursement (12.4% miscode rate) as well as an additional 84 patients who had aortic calcifications that were missed during initial review (2.1% misdiagnosis rate). Identification of these patients provided UTMB with the potential to impact clinical care for these patients and pursue $314k in missed annual revenue. These findings informed UTMB’s decision to adopt Zauron’s Guardian Pro software system-wide to ensure accurate, AI-enhanced peer review and coding, improving both patient care and financial solvency. This study is covered under University of Texas Health San Antonio’s Institutional Review Board Study ID 00001887.

[1677] Non-Invasive Detection of PROState Cancer with Novel Time-Dependent Diffusion MRI and AI-Enhanced Quantitative Radiological Interpretation: PROS-TD-AI

Baltasar Ramos, Cristian Garrido, Paulette Narv’aez, Santiago Gelerstein Claro, Haotian Li, Rafael Salvador, Constanza V’asquez-Venegas, Iv’an Gallegos, Yi Zhang, V’ictor Casta~neda, Cristian Acevedo, Dan Wu, Gonzalo C’ardenas, Camilo G. Sotomayor

Main category: eess.IV

TL;DR: This study protocol evaluates an AI-enhanced time-dependent diffusion MRI software (PROSTDAI) for improved prostate cancer diagnosis, comparing it against current standard PI-RADS v2.1 using MRI-guided biopsy validation.

Details

Motivation: Multiparametric MRI has limitations including false positives/negatives and interobserver variability in prostate cancer diagnosis. Time-dependent diffusion MRI shows promise for better tissue characterization and distinguishing clinically significant from insignificant cancer.

Method: Prospective evaluation of a home-developed AI-enhanced TDD-MRI software (PROSTDAI) in routine diagnostic care, comparing it against PI-RADS v2.1 and validating results with MRI-guided prostate biopsy.

Result: The study protocol outlines the rationale but does not present actual results as this is a protocol paper describing the planned evaluation methodology.

Conclusion: Combining TDD-derived metrics with machine learning may provide more robust, zone-specific risk prediction with improved accuracy and less dependence on reader training compared to current standard-of-care.

Abstract: Prostate cancer (PCa) is the most frequently diagnosed malignancy in men and the eighth leading cause of cancer death worldwide. Multiparametric MRI (mpMRI) has become central to the diagnostic pathway for men at intermediate risk, improving de-tection of clinically significant PCa (csPCa) while reducing unnecessary biopsies and over-diagnosis. However, mpMRI remains limited by false positives, false negatives, and moderate to substantial interobserver agreement. Time-dependent diffusion (TDD) MRI, a novel sequence that enables tissue microstructure characterization, has shown encouraging preclinical performance in distinguishing clinically significant from insignificant PCa. Combining TDD-derived metrics with machine learning may provide robust, zone-specific risk prediction with less dependence on reader training and improved accuracy compared to current standard-of-care. This study protocol out-lines the rationale and describes the prospective evaluation of a home-developed AI-enhanced TDD-MRI software (PROSTDAI) in routine diagnostic care, assessing its added value against PI-RADS v2.1 and validating results against MRI-guided prostate biopsy.

[1678] Adaptive Source-Channel Coding for Multi-User Semantic and Data Communications

Kai Yuan, Dongxu Li, Jianhao Huang, Han Zhang, Chuan Huang

Main category: eess.IV

TL;DR: Proposes a multi-user adaptive source-channel coding (MU-ASCC) framework for simultaneous semantic and data communication tasks in downlink MISO systems, achieving better performance than conventional single-task schemes.

Details

Motivation: Address challenges in multi-user semantic and data communication systems including heterogeneous tasks, diverse channel conditions, and digital compatibility requirements.

Method: Uses data-regression to approximate E2E distortions, formulates weighted-sum distortion minimization problem, and develops alternating optimization with subgradient descent and uplink-downlink duality.

Result: Simulation results show MU-ASCC achieves simultaneous improvements in both data recovery and semantic task performance compared to SSCC and DJSCC schemes.

Conclusion: The proposed MU-ASCC framework effectively handles multi-user semantic and data communication with adaptive optimization of source-channel coding, power allocation, and beamforming.

Abstract: This paper considers a multi-user semantic and data communication (MU-SemDaCom) system, where a base station (BS) simultaneously serves users with different semantic and data tasks through a downlink multi-user multiple-input single-output (MU-MISO) channel. The coexistence of heterogeneous communication tasks, diverse channel conditions, and the requirements for digital compatibility poses significant challenges to the efficient design of MU-SemDaCom systems. To address these issues, we propose a multi-user adaptive source-channel coding (MU-ASCC) framework that adaptively optimizes deep neural network (DNN)-based source coding, digital channel coding, and superposition broadcasting. First, we employ a data-regression method to approximate the end-to-end (E2E) semantic and data distortions, for which no closed-form expressions exist. The obtained logistic formulas decompose the E2E distortion as the addition of the source and channel distortion terms, in which the logistic parameter variations are task-dependent and jointly determined by both the DNN and channel parameters. Then, based on the derived formulas, we formulate a weighted-sum E2E distortion minimization problem that jointly optimizes the source-channel coding rates, power allocation, and beamforming vectors for both the data and semantic users. Finally, an alternating optimization (AO) framework is developed, where the adaptive rate optimization is solved using the subgradient descent method, while the joint power and beamforming is addressed via the uplink-downlink duality (UDD) technique. Simulation results demonstrate that, compared with the conventional separate source-channel coding (SSCC) and deep joint source-channel coding (DJSCC) schemes that are designed for a single task, the proposed MU-ASCC scheme achieves simultaneous improvements in both the data recovery and semantic task performance.

[1679] ReCon-GS: Continuum-Preserved Guassian Streaming for Fast and Compact Reconstruction of Dynamic Scenes

Jiaye Fu, Qiankun Gao, Chengxiang Wen, Yanmin Wu, Siwei Ma, Jiaqi Zhang, Jian Zhang

Main category: eess.IV

TL;DR: ReCon-GS is a storage-aware framework for online free-viewpoint video reconstruction that improves training efficiency by 15%, achieves superior rendering quality, and reduces memory requirements by over 50% compared to state-of-the-art methods.

Details

Motivation: To address challenges in online FVV reconstruction including slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands.

Method: Uses multi-level Anchor Gaussians in density-adaptive fashion to capture geometric deformations, dynamic hierarchy reconfiguration for motion expressiveness and temporal consistency, and storage-aware optimization for adjustable density trade-offs.

Result: Improves training efficiency by approximately 15%, achieves superior FVV synthesis quality with enhanced robustness and stability, and reduces memory requirements by over 50% at equivalent rendering quality.

Conclusion: ReCon-GS effectively addresses key challenges in online dynamic scene reconstruction by providing a storage-aware framework that balances reconstruction fidelity and memory usage while maintaining high performance.

Abstract: Online free-viewpoint video (FVV) reconstruction is challenged by slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands. To address these challenges, we propose the Reconfigurable Continuum Gaussian Stream, dubbed ReCon-GS, a novel storage-aware framework that enables high fidelity online dynamic scene reconstruction and real-time rendering. Specifically, we dynamically allocate multi-level Anchor Gaussians in a density-adaptive fashion to capture inter-frame geometric deformations, thereby decomposing scene motion into compact coarse-to-fine representations. Then, we design a dynamic hierarchy reconfiguration strategy that preserves localized motion expressiveness through on-demand anchor re-hierarchization, while ensuring temporal consistency through intra-hierarchical deformation inheritance that confines transformation priors to their respective hierarchy levels. Furthermore, we introduce a storage-aware optimization mechanism that flexibly adjusts the density of Anchor Gaussians at different hierarchy levels, enabling a controllable trade-off between reconstruction fidelity and memory usage. Extensive experiments on three widely used datasets demonstrate that, compared to state-of-the-art methods, ReCon-GS improves training efficiency by approximately 15% and achieves superior FVV synthesis quality with enhanced robustness and stability. Moreover, at equivalent rendering quality, ReCon-GS slashes memory requirements by over 50% compared to leading state-of-the-art methods.

[1680] Wavelet-Assisted Mamba for Satellite-Derived Sea Surface Temperature Super-Resolution

Wankun Chen, Feng Gao, Yanhai Gan, Jingchao Cao, Junyu Dong, Qian Du

Main category: eess.IV

TL;DR: Proposes WMSR framework using wavelet-assisted Mamba for SST super-resolution, achieving state-of-the-art performance through low-frequency state space modeling and high-frequency enhancement.

Details

Motivation: High-resolution SST data is crucial for climate monitoring but challenging to obtain due to physical imaging limitations. Mamba-based SSMs show promise for long-range dependency modeling but haven't been applied to SST super-resolution.

Method: WMSR framework with two key components: LFSSM using 2D-SSM for global information capture and temperature preservation, and HFEM using pixel difference convolution for high-frequency feature correction and texture enhancement.

Result: Comprehensive experiments on three SST datasets demonstrate superior performance over state-of-the-art methods.

Conclusion: WMSR effectively addresses SST super-resolution challenges by leveraging Mamba’s global modeling capabilities and wavelet-based frequency decomposition, with code and datasets made publicly available.

Abstract: Sea surface temperature (SST) is an essential indicator of global climate change and one of the most intuitive factors reflecting ocean conditions. Obtaining high-resolution SST data remains challenging due to limitations in physical imaging, and super-resolution via deep neural networks is a promising solution. Recently, Mamba-based approaches leveraging State Space Models (SSM) have demonstrated significant potential for long-range dependency modeling with linear complexity. However, their application to SST data super-resolution remains largely unexplored. To this end, we propose the Wavelet-assisted Mamba Super-Resolution (WMSR) framework for satellite-derived SST data. The WMSR includes two key components: the Low-Frequency State Space Module (LFSSM) and High-Frequency Enhancement Module (HFEM). The LFSSM uses 2D-SSM to capture global information of the input data, and the robust global modeling capabilities of SSM are exploited to preserve the critical temperature information in the low-frequency component. The HFEM employs the pixel difference convolution to match and correct the high-frequency feature, achieving accurate and clear textures. Through comprehensive experiments on three SST datasets, our WMSR demonstrated superior performance over state-of-the-art methods. Our codes and datasets will be made publicly available at https://github.com/oucailab/WMSR.

[1681] A Novel Preprocessing Unit for Effective Deep Learning based Classification and Grading of Diabetic Retinopathy

Pranoti Nage, Sanjay Shitole

Main category: eess.IV

TL;DR: A framework for early detection of diabetic retinopathy and diabetic macular edema using preprocessing with novel AVDS filter, segmentation with improved Mask RCNN, and classification with SSA-VGG-16.

Details

Motivation: Early detection of diabetic retinopathy is crucial for timely intervention to prevent vision loss and manage diabetic complications effectively.

Method: Three-stage framework: preprocessing (fuzzy filtering, non-linear diffusion, AVDS filter), segmentation (Improved Mask RCNN), and classification (SSA-VGG-16 with self-spatial attention).

Result: The method was evaluated on IDRiD and MESSIDOR datasets, with Hamming distance achieving better contrast and Euclidean distance showing less error with high PSNR.

Conclusion: The proposed framework effectively captures global contextual relationships and critical spatial regions, improving accuracy and robustness in DR and DME detection and grading.

Abstract: Early detection of diabetic retinopathy (DR) is crucial as it allows for timely intervention, preventing vision loss and enabling effective management of diabetic complications. This research performs detection of DR and DME at an early stage through the proposed framework which includes three stages: preprocessing, segmentation, feature extraction, and classification. In the preprocessing stage, noise filtering is performed by fuzzy filtering, artefact removal is performed by non-linear diffusion filtering, and the contrast improvement is performed by a novel filter called Adaptive Variable Distance Speckle (AVDS) filter. The AVDS filter employs four distance calculation methods such as Euclidean, Bhattacharya, Manhattan, and Hamming. The filter adaptively chooses a distance method which produces the highest contrast value amongst all 3 methods. From the analysis, hamming distance method was found to achieve better results for contrast and Euclidean distance showing less error value with high PSNR. The segmentation stage is performed using Improved Mask-Regional Convolutional Neural Networks (Mask RCNN). In the final stage, feature extraction and classification using novel Self-Spatial Attention infused VGG-16 (SSA-VGG-16), which effectively captures both global contextual relationships and critical spatial regions within retinal images, thereby improving the accuracy and robustness of DR and DME detection and grading. The effectiveness of the proposed method is assessed using two distinct datasets: IDRiD and MESSIDOR.

[1682] Few-shot Personalized Saliency Prediction Based on Interpersonal Gaze Patterns

Yuya Moroto, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Main category: eess.IV

TL;DR: A few-shot personalized saliency prediction method that uses interpersonal gaze patterns and tensor-based regression to predict individual visual attention from limited eye-tracking data.

Details

Motivation: Personalized saliency maps capture individual visual preferences but are hard to predict due to complex gaze patterns and limited individual eye-tracking data. Leveraging data from other people can help overcome these challenges.

Method: Uses image selection to gather diverse gaze patterns from other persons and preserves structural information of personalized saliency maps through tensor-based regression.

Result: Experimental results show that both image selection for diverse gaze patterns and structural preservation through tensor regression are beneficial for few-shot personalized saliency prediction.

Conclusion: The proposed approach effectively addresses the challenges of personalized saliency prediction from limited data by leveraging interpersonal gaze patterns and maintaining structural information.

Abstract: This study proposes a few-shot personalized saliency prediction method that leverages interpersonal gaze patterns. Unlike general saliency maps, personalized saliency maps (PSMs) capture individual visual attention and provide insights into individual visual preferences. However, predicting PSMs is challenging because of the complexity of gaze patterns and the difficulty of collecting extensive eye-tracking data from individuals. An effective strategy for predicting PSMs from limited data is the use of eye-tracking data from other persons. To efficiently handle the PSMs of other persons, this study focuses on the selection of images to acquire eye-tracking data and the preservation of the structural information of PSMs. In the proposed method, these images are selected such that they bring more diverse gaze patterns to persons, and structural information is preserved using tensor-based regression. The experimental results demonstrate that these two factors are beneficial for few-shot PSM prediction.

[1683] Chronic Obstructive Pulmonary Disease Prediction Using Deep Convolutional Network

Shahran Rahman Alve, Muhammad Zawad Mahmud, Samiha Islam, Mohammad Monirujjaman Khan

Main category: eess.IV

TL;DR: A deep CNN-based system for detecting COPD from respiratory sounds achieves 96% accuracy using acoustic features like MFCCs and Mel-spectrograms, with severity classification.

Details

Motivation: Growing need for automated tools to support clinicians in diagnosing respiratory diseases due to limited trained personnel and rising patient loads.

Method: Deep CNN approach using Librosa-extracted acoustic features (MFCCs, Mel-spectrogram, Chroma variants) with 10-fold cross-validation on ICBHI database.

Result: 96% accuracy with cross-validation, 90% without cross-validation, outperforming existing methods in COPD detection and severity classification.

Conclusion: The proposed network shows strong potential as a practical clinical tool for automated respiratory disease diagnosis and severity assessment.

Abstract: Artificial intelligence and deep learning are increasingly applied in the clinical domain, particularly for early and accurate disease detection using medical imaging and sound. Due to limited trained personnel, there is a growing demand for automated tools to support clinicians in managing rising patient loads. Respiratory diseases such as cancer and diabetes remain major global health concerns requiring timely diagnosis and intervention. Auscultation of lung sounds, combined with chest X-rays, is an established diagnostic method for respiratory illness. This study presents a Deep Convolutional Neural Network (CNN)-based approach for the analysis of respiratory sound data to detect Chronic Obstructive Pulmonary Disease (COPD). Acoustic features extracted with the Librosa library, including Mel-Frequency Cepstral Coefficients (MFCCs), Mel-Spectrogram, Chroma, Chroma (Constant Q), and Chroma CENS, were used in training. The system also classifies disease severity as mild, moderate, or severe. Evaluation on the ICBHI database achieved 96% accuracy using 10-fold cross-validation and 90% accuracy without cross-validation. The proposed network outperforms existing methods, demonstrating potential as a practical tool for clinical deployment.

[1684] Freqformer: Frequency-Domain Transformer for 3-D Reconstruction and Quantification of Human Retinal Vasculature

Lingyun Wang, Bingjie Wang, Jay Chhablani, Jose Alain Sahel, Shaohua Pi

Main category: eess.IV

TL;DR: Freqformer is a Transformer-based model that achieves accurate 3D reconstruction of retinal vasculature from single OCTA scans using a dual-branch architecture combining spatial context and frequency-domain enhancement.

Details

Motivation: To enable accurate 3D reconstruction and quantitative analysis of human retinal vasculature from single OCTA scans, overcoming limitations of conventional methods.

Method: Dual-branch Transformer architecture with Transformer layer for global spatial context and complex-valued frequency-domain module for adaptive frequency enhancement, trained on single depth-plane OCTA images using volumetrically merged OCTA as ground truth.

Result: Freqformer outperformed existing CNN and Transformer methods, achieving superior image metrics and strong correlation with merged volumes on vascular quantification metrics. 2D approach proved more efficient than 3D counterparts without performance loss.

Conclusion: Freqformer reliably generates high-definition 3D retinal microvasculature from single-scan OCTA, enabling precise vascular quantification comparable to standard volumetric merging methods with excellent generalization capability.

Abstract: Objective: To achieve accurate 3-D reconstruction and quantitative analysis of human retinal vasculature from a single optical coherence tomography angiography (OCTA) scan. Methods: We introduce Freqformer, a novel Transformer-based model featuring a dual-branch architecture that integrates a Transformer layer for capturing global spatial context with a complex-valued frequency-domain module designed for adaptive frequency enhancement. Freqformer was trained using single depth-plane OCTA images, utilizing volumetrically merged OCTA as the ground truth. Performance was evaluated quantitatively through 2-D and 3-D image quality metrics. 2-D networks and their 3-D counterparts were compared to assess the differences between enhancing volume slice by slice and enhancing it by 3-D patches. Furthermore, 3-D quantitative vascular metrics were conducted to quantify human retinal vasculature. Results: Freqformer substantially outperformed existing convolutional neural networks and Transformer-based methods, achieving superior image metrics. Importantly, the enhanced OCTA volumes show strong correlation with the merged volumes on vascular segment count, density, length, and flow index, further underscoring its reliability for quantitative vascular analysis. 3-D counterparts did not yield additional gains in image metrics or downstream 3-D vascular quantification but incurred nearly an order-of-magnitude longer inference time, supporting our 2-D slice-wise enhancement strategy. Additionally, Freqformer showed excellent generalization capability on larger field-of-view scans, surpassing the quality of conventional volumetric merging methods. Conclusion: Freqformer reliably generates high-definition 3-D retinal microvasculature from single-scan OCTA, enabling precise vascular quantification comparable to standard volumetric merging methods.

[1685] UNIR-Net: A Novel Approach for Restoring Underwater Images with Non-Uniform Illumination Using Synthetic Data

Ezequiel Perez-Zarate, Chunxiao Liu, Oscar Ramos-Soto, Diego Oliva, Marco Perez-Cisneros

Main category: eess.IV

TL;DR: UNIR-Net is a deep learning network that effectively restores underwater images affected by non-uniform illumination by integrating illumination enhancement, attention mechanisms, visual refinement, and contrast correction.

Details

Motivation: Underwater images often suffer from non-uniform illumination that degrades visual quality and usability in marine applications. Existing methods struggle with complex illumination patterns, and learning-based approaches lack targeted datasets for training.

Method: Proposed UNIR-Net with multiple components: illumination enhancement, attention mechanisms, visual refinement, and contrast correction. Also introduced the PUNI dataset specifically designed for training models under non-uniform illumination conditions.

Result: Experimental results on PUNI and NUID datasets show UNIR-Net achieves superior performance in both quantitative metrics and visual outcomes. It also improves downstream tasks like underwater semantic segmentation.

Conclusion: UNIR-Net effectively addresses non-uniform illumination restoration in underwater images and demonstrates practical relevance for marine applications. The code is publicly available.

Abstract: Restoring underwater images affected by non-uniform illumination (NUI) is essential to improve visual quality and usability in marine applications. Conventional methods often fall short in handling complex illumination patterns, while learning-based approaches face challenges due to the lack of targeted datasets. To address these limitations, the Underwater Non-uniform Illumination Restoration Network (UNIR-Net) is proposed. UNIR-Net integrates multiple components, including illumination enhancement, attention mechanisms, visual refinement, and contrast correction, to effectively restore underwater images affected by NUI. In addition, the Paired Underwater Non-uniform Illumination (PUNI) dataset is introduced, specifically designed for training and evaluating models under NUI conditions. Experimental results on PUNI and the large-scale real-world Non-Uniform Illumination Dataset (NUID) show that UNIR-Net achieves superior performance in both quantitative metrics and visual outcomes. UNIR-Net also improves downstream tasks such as underwater semantic segmentation, highlighting its practical relevance. The code of this method is available at https://github.com/xingyumex/UNIR-Net

[1686] Reconstruct Anything Model: a lightweight foundation model for computational imaging

Matthieu Terris, Samuel Hurault, Maxime Song, Julian Tachella

Main category: eess.IV

TL;DR: A novel non-iterative, lightweight architecture for solving various imaging inverse problems that incorporates forward operator knowledge without unrolling, handles arbitrary image sizes/channels, and can adapt to unseen problems with minimal fine-tuning.

Details

Motivation: Existing methods have limitations: iterative methods are computationally costly with suboptimal performance, while unrolled architectures are problem-specific and require expensive training.

Method: Proposed a non-iterative architecture that incorporates knowledge about the forward operator (acquisition physics and noise parameters) without relying on unrolling, trained to solve various inverse problems and handle arbitrary image sizes/channels.

Result: Demonstrated state-of-the-art performance across medical imaging, low-photon imaging, and microscopy applications. The model can adapt to unseen problems with few fine-tuning steps in self-supervised manner.

Conclusion: The proposed method offers an efficient alternative to existing approaches, providing strong performance across diverse imaging inverse problems with minimal adaptation requirements.

Abstract: Most existing learning-based methods for solving imaging inverse problems can be roughly divided into two classes: iterative algorithms, such as plug-and-play and diffusion methods leveraging pretrained denoisers, and unrolled architectures that are trained end-to-end for specific imaging problems. Iterative methods in the first class are computationally costly and often yield suboptimal reconstruction performance, whereas unrolled architectures are generally problem-specific and require expensive training. In this work, we propose a novel non-iterative, lightweight architecture that incorporates knowledge about the forward operator (acquisition physics and noise parameters) without relying on unrolling. Our model is trained to solve a wide range of inverse problems, such as deblurring, magnetic resonance imaging, computed tomography, inpainting, and super-resolution, and handles arbitrary image sizes and channels, such as grayscale, complex, and color data. The proposed model can be easily adapted to unseen inverse problems or datasets with a few fine-tuning steps (up to a few images) in a self-supervised way, without ground-truth references. Throughout a series of experiments, we demonstrate state-of-the-art performance from medical imaging to low-photon imaging and microscopy. Our code is available at https://github.com/matthieutrs/ram.

[1687] RAM-W1K: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis

Songxiao Yang, Haolin Wang, Yao Fu, Ye Tian, Tamotsu Kamishima, Masayuki Ikebe, Yafei Ou, Masatoshi Okutomi

Main category: eess.IV

TL;DR: This paper presents the first public multi-task dataset for wrist bone analysis in conventional radiography, focusing on rheumatoid arthritis diagnosis with instance segmentation and bone erosion scoring.

Details

Motivation: Limited CAD research for wrist RA due to annotation challenges: complex wrist anatomy with small bones and narrow joints, and disease progression altering bone morphology requiring rheumatology expertise.

Method: Created a dataset of 1048 wrist radiographs from 388 patients across 4 medical centers, with pixel-level instance segmentation annotations for 618 images and SvdH bone erosion scores for 800 images.

Result: Established the first public resource for wrist bone instance segmentation, providing annotations for 618 images and bone erosion scores for 800 images from multiple medical centers.

Conclusion: This dataset lowers barriers for wrist RA research and can support various RA-related tasks including joint space narrowing quantification, bone erosion detection, and other wrist-related applications like fracture localization.

Abstract: Rheumatoid arthritis (RA) is a common autoimmune disease that has been the focus of research in computer-aided diagnosis (CAD) and disease monitoring. In clinical settings, conventional radiography (CR) is widely used for the screening and evaluation of RA due to its low cost and accessibility. The wrist is a critical region for the diagnosis of RA. However, CAD research in this area remains limited, primarily due to the challenges in acquiring high-quality instance-level annotations. (i) The wrist comprises numerous small bones with narrow joint spaces, complex structures, and frequent overlaps, requiring detailed anatomical knowledge for accurate annotation. (ii) Disease progression in RA often leads to osteophyte, bone erosion (BE), and even bony ankylosis, which alter bone morphology and increase annotation difficulty, necessitating expertise in rheumatology. This work presents a multi-task dataset for wrist bone in CR, including two tasks: (i) wrist bone instance segmentation and (ii) Sharp/van der Heijde (SvdH) BE scoring, which is the first public resource for wrist bone instance segmentation. This dataset comprises 1048 wrist conventional radiographs of 388 patients from four medical centers, with pixel-level instance segmentation annotations for 618 images and SvdH BE scores for 800 images. This dataset can potentially support a wide range of research tasks related to RA, including joint space narrowing (JSN) progression quantification, BE detection, bone deformity evaluation, and osteophyte detection. It may also be applied to other wrist-related tasks, such as carpal bone fracture localization. We hope this dataset will significantly lower the barrier to research on wrist RA and accelerate progress in CAD research within the RA-related domain.

[1688] Can General-Purpose Omnimodels Compete with Specialists? A Case Study in Medical Image Segmentation

Yizhe Zhang, Qiang Chen, Tao Zhou

Main category: eess.IV

TL;DR: Omnimodels show task-dependent performance in medical image segmentation - they match specialists on hard cases for polyp and breast tumor segmentation but lag behind in retinal vessel segmentation, suggesting complementary roles rather than universal replacement.

Details

Motivation: To investigate whether general-purpose omnimodels can perform as well as specialized models in knowledge-intensive domains like medical image segmentation.

Method: Comparative study analyzing zero-shot performance of Gemini omnimodel vs domain-specific deep learning models on three medical segmentation tasks (polyp, retinal vessel, breast tumor), focusing on easiest and hardest cases based on specialist model performance.

Result: Specialist models excel on easy samples for polyp and breast tumor segmentation, but omnimodels show greater robustness on hard cases where specialists fail catastrophically. For retinal vessel segmentation, specialists maintain superior performance across all cases. Omnimodels may have higher sensitivity to subtle features.

Conclusion: Current omnimodels are not yet universal replacements for specialists, but their unique strengths suggest potential complementary roles, particularly for enhancing robustness on challenging edge cases.

Abstract: The emergence of powerful, general-purpose omnimodels capable of processing diverse data modalities has raised a critical question: can these jack-of-all-trades'' systems perform on par with highly specialized models in knowledge-intensive domains? This work investigates this question within the high-stakes field of medical image segmentation. We conduct a comparative study analyzing the zero-shot performance of a state-of-the-art omnimodel (Gemini, the Nano Banana’’ model) against domain-specific deep learning models on three distinct tasks: polyp (endoscopy), retinal vessel (fundus), and breast tumor segmentation (ultrasound). Our study focuses on performance at the extremes by curating subsets of the easiest'' and hardest’’ cases based on the specialist models’ accuracy. Our findings reveal a nuanced and task-dependent landscape. For polyp and breast tumor segmentation, specialist models excel on easy samples, but the omnimodel demonstrates greater robustness on hard samples where specialists fail catastrophically. Conversely, for the fine-grained task of retinal vessel segmentation, the specialist model maintains superior performance across both easy and hard cases. Intriguingly, qualitative analysis suggests omnimodels may possess higher sensitivity, identifying subtle anatomical features missed by human annotators. Our results indicate that while current omnimodels are not yet a universal replacement for specialists, their unique strengths suggest a potential complementary role with specialist models, particularly in enhancing robustness on challenging edge cases.

[1689] Frequency-Aware Ensemble Learning for BraTS 2025 Pediatric Brain Tumor Segmentation

Yuxiao Yi, Qingyao Zhuang, Zhi-Qin John Xu

Main category: eess.IV

TL;DR: Ensemble method combining nnU-Net, Swin UNETR, and HFF-Net with adjustable initialization, transfer learning, and frequency domain decomposition for pediatric brain tumor segmentation, achieving competitive Dice scores across tumor subregions.

Details

Motivation: Pediatric brain tumor segmentation is challenging due to rarity and heterogeneity of these malignancies, but remains critical for clinical diagnosis and treatment planning.

Method: Ensemble approach integrating nnU-Net with adjustable initialization scales, Swin UNETR with transfer learning from BraTS 2021 pre-trained models, and HFF-Net with frequency domain decomposition to separate low-frequency tissue contours from high-frequency texture details.

Result: Final ensemble achieved Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT) respectively for different tumor subregions.

Conclusion: The proposed ensemble method effectively addresses pediatric brain tumor segmentation challenges through complementary model architectures and specialized techniques, demonstrating strong performance across multiple tumor subregions.

Abstract: Pediatric brain tumor segmentation presents unique challenges due to the rarity and heterogeneity of these malignancies, yet remains critical for clinical diagnosis and treatment planning. We propose an ensemble approach integrating nnU-Net, Swin UNETR, and HFF-Net for the BraTS-PED 2025 challenge. Our method incorporates three key extensions: adjustable initialization scales for optimal nnU-Net complexity control, transfer learning from BraTS 2021 pre-trained models to enhance Swin UNETR’s generalization on pediatric dataset, and frequency domain decomposition for HFF-Net to separate low-frequency tissue contours from high-frequency texture details. Our final ensemble combines nnU-Net ($\gamma=0.7$), fine-tuned Swin UNETR, and HFF-Net, achieving Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT), respectively.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Are you sure? Measuring models bias in content moderation through uncertainty

[2] AccessEval: Benchmarking Disability Bias in Large Language Models

[3] RAR$^2$: Retrieval-Augmented Medical Reasoning via Thought-Driven Retrieval

[4] TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?

[5] Multi-Modal Sentiment Analysis with Dynamic Attention Fusion

[6] Enabling Approximate Joint Sampling in Diffusion LMs

[7] Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

[8] MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions

[9] ML2B: Multi-Lingual ML Benchmark For AutoML

[10] ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

[11] EditGRPO: Reinforcement Learning with Post -Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

[12] Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

[13] ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

[14] Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems

[15] Towards Generalizable Implicit In-Context Learning with Attention Routing

[16] The Bias is in the Details: An Assessment of Cognitive Bias in LLMs

[17] Lexicon-Enriched Graph Modeling for Arabic Document Readability Prediction

[18] HEART: Emotionally-driven test-time scaling of Language Models

[19] Infusing Theory of Mind into Socially Intelligent LLM Agents

[20] Extract-0: A Specialized Language Model for Document Information Extraction

[21] HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition

[22] Large language models management of medications: three performance analyses

[23] LLMs Behind the Scenes: Enabling Narrative Scene Illustration

[24] What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?

[25] Emergent morpho-phonological representations in self-supervised speech models

[26] Same Content, Different Representations: A Controlled Study for Table QA

[27] ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning

[28] DM-Codec: Distilling Multimodal Representations for Speech Tokenization

[29] AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts

[30] Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

[31] Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate

[32] Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

[33] Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks

[34] From Evidence to Trajectory: Abductive Reasoning Path Synthesis for Training Retrieval-Augmented Generation Agents

[35] The Geometry of Creative Variability: How Credal Sets Expose Calibration Gaps in Language Models

[36] d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

[37] How to Make Large Language Models Generate 100% Valid Molecules?

[38] Non-Collaborative User Simulators for Tool Agents

[39] Tagging the Thought: Unlocking Personalization Reasoning via Reinforcement Learning

[40] Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models

[41] Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

[42] Pretraining LLM with Latent Thoughts in Continuous Space

[43] Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

[44] Estimating the strength and timing of syntactic structure building in naturalistic reading

[45] From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

[46] Global Beats, Local Tongue: Studying Code Switching in K-pop Hits on Billboard Charts

[47] Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2

[48] PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

[49] A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks

[50] Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

[51] Fin-ExBERT: User Intent based Text Extraction in Financial Context using Graph-Augmented BERT and trainable Plugin

[52] A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

[53] Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces

[54] Learning to Reason in Structured In-context Environments with Reinforcement Learning

[55] C-Evolve: Consensus-based Evolution for Prompt Groups

[56] Dual-Space Smoothness for Robust and Balanced LLM Unlearning

[57] MedCritical: Enhancing Medical Reasoning in Small Language Models via Self-Collaborative Correction

[58] Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

[59] CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

[60] Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT

[61] Train Once, Answer All: Many Pretraining Experiments for the Cost of One

[62] No Loss, No Gain: Gated Refinement and Adaptive Compression for Prompt Optimization

[63] Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation

[64] Comparison of Scoring Rationales Between Large Language Models and Human Raters

[65] Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models

[66] Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models

[67] Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review

[68] The Impact of Role Design in In-Context Learning for Large Language Models

[69] AraS2P: Arabic Speech-to-Phonemes System

[70] From Human Annotation to Automation: LLM-in-the-Loop Active Learning for Arabic Sentiment Analysis

[71] On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

[72] Automatic Speech Recognition for Greek Medical Dictation

[73] Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales

[74] Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks

[75] LLM Hallucination Detection: HSAD

[76] Timber: Training-free Instruct Model Refining with Base via Effective Rank

[77] Fast Thinking for Large Language Models